## Extended Practice: Importances and Coefficients
---
* ### Ingrid Arbieto Nelson


The following practice assignment is much longer than a typical practice assignment.
* You may skip this assignment if you feel comfortable with what you have learned thus far.
* Note: while the target grades (G1-G3) are different, all of the features from this data set are the same as those from the Student Performance lessons. They require the same preprocessing steps as you've seen in the previous lessons.

### Task
For this assignment, we will be using the alternative version of the student performance dataset that we've been exploring in the lessons this week. You will create a model to predict the students' final grades (G3), but using the Math grades version of the data. The features are the same as the dataset used in the lessons, but the G1, G2, and G3 columns are the students' grades for Math instead of Portuguese.
* First, preprocess the data.
   * A: Perform train-test-split with G3 as the target.
   * B: Use a ColumnTransformer with the required preprocessing steps
      * Drop any unnecessary binary categories using the drop='if_binary' argument for OneHotEncoder.
      * Don't forget to add verbose_feature_names_out=False
   * C: Create DataFrame versions of your X_train and X_test data using the correct feature names.
* Second, fit a tree-based model of your choice (one that produces feature importances).
   * A: Evaluate its performance on the training and test data.
   * B: Extract and visualize the feature importances determined by the model.
   * C: Answer: what were the top 5 most important features?
* Third, apply sklearn's permutation_importance.
   * A: Visualize the permutation importances.
   * B: Answer: are the top 5 most important features by permutation importance the same as the top 5 according to the model's built-in importances?
* Fourth, fit a sklearn LinearRegression model.
   * A: Evaluate its performance on the training and test data.
   * B: Visualize the model's 15 largest coefficients (by absolute value).
   * C: Select the 3 largest coefficients (by absolute value) and explain what they mean and what insights they might provide.

### The Data

Student Performance - Math
   * [Share URL](https://docs.google.com/spreadsheets/d/1EbTcrapgIgMETN5H9Khw9N92k4OLN1Zu/edit#gid=326611786)
   * Direct Link:
https://docs.google.com/spreadsheets/d/e/2PACX-1vS6xDKNpWkBBdhZSqepy48bXo55QnRv1Xy6tXTKYzZLMPjZozMfYhHQjAcC8uj9hQ/pub?output=xlsx

* Note: the dataset is an Excel document, so you will need to pass sheet_name='student-mat' to pd.read_excel.
* Original Source & Data Dictionary:
https://archive.ics.uci.edu/ml/datasets/student+performance

## Importances & Coefficients

### Imports

In [None]:
## standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer


## Models & evaluation metrics
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.inspection import permutation_importance
import joblib

## setting random state for reproducibility
SEED = 321
np.random.seed(SEED)
plt.style.use(('ggplot','tableau-colorblind10'))

import warnings
warnings.filterwarnings('ignore')

In [None]:
## set pandas to display more columns
pd.set_option('display.max_columns',50)

### Load Data

In [None]:
file = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS6xDKNpWkBBdhZSqepy48bXo55QnRv1Xy6tXTKYzZLMPjZozMfYhHQjAcC8uj9hQ/pub?output=xlsx"
df = pd.read_excel(file, sheet_name='student-mat')
df.info()

In [None]:
df.head()

### PreProcessing

In [None]:
## assign X and y
y = df['G3']
X = df.drop(columns='G3')

## train-test-split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=SEED)
X_train.head()

In [None]:
## make pipelines for categorical vs numeric data
cat_sel = make_column_selector(dtype_include='object')
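cat_pipe = make_pipeline(SimpleImputer(strategy='constant',
                                       fill_value='MISSING'),
                         ## NOTE: scikit-learn 1.2 renamed `sparse` to `sparse_output`;
                         ## use whichever name your installed version accepts
                         OneHotEncoder(drop='if_binary', sparse=False)
                        )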

num_sel = make_column_selector(dtype_include='number')
num_pipe = make_pipeline(SimpleImputer(strategy='mean'))


## make the preprocessing column transformer with verbose_feature_names_out=False
preprocessor = make_column_transformer((num_pipe, num_sel),
                                       (cat_pipe,cat_sel),
                                      verbose_feature_names_out=False)
preprocessor

In [None]:
## fit column transformer and run get_feature_names_out
preprocessor.fit(X_train)
feature_names = preprocessor.get_feature_names_out()


X_train_df = pd.DataFrame(preprocessor.transform(X_train), 
                          columns = feature_names, index = X_train.index)

X_test_df = pd.DataFrame(preprocessor.transform(X_test), 
                          columns = feature_names, index = X_test.index)
X_test_df.head(3)

### Fit & Evaluate Tree Based Model

In [None]:
def evaluate_regression(model, X_train,y_train, X_test, y_test): 
    """Evaluates a scikit-learn-compatible regression model using r-squared and RMSE

    Args:
        model (Regressor): Regression Model with a .predict method
        X_train (DataFrame): Training Features
        y_train (Series): Training Target
        X_test (DataFrame): Test Features
        y_test (Series): Test Target
    """

    ## Training Data
    y_pred_train = model.predict(X_train)
    r2_train = metrics.r2_score(y_train, y_pred_train)
    ## take the square root of the MSE directly; the `squared` argument of
    ## mean_squared_error is deprecated in newer scikit-learn releases
    rmse_train = np.sqrt(metrics.mean_squared_error(y_train, y_pred_train))
    
    print(f"Training Data:\tR^2= {r2_train:.2f}\tRMSE= {rmse_train:.2f}")
        
    
    ## Test Data
    y_pred_test = model.predict(X_test)
    r2_test = metrics.r2_score(y_test, y_pred_test)
    rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test))
    
    print(f"Test Data:\tR^2= {r2_test:.2f}\tRMSE= {rmse_test:.2f}")
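
As a quick check on what the two reported metrics mean, the sketch below computes R^2 and RMSE by hand with NumPy. The four grades and predictions are hypothetical values picked so the arithmetic is easy to follow; they are not taken from the dataset.

```python
import numpy as np

## hypothetical grades and predictions (made up for this illustration)
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 17.0])

residuals = y_true - y_pred
ss_res = np.sum(residuals ** 2)                    # 1 + 0 + 1 + 1 = 3
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # 9 + 1 + 1 + 9 = 20

r2_manual = 1 - ss_res / ss_tot                    # 1 - 3/20 = 0.85
rmse_manual = np.sqrt(np.mean(residuals ** 2))     # sqrt(0.75), about 0.866

print(f"R^2= {r2_manual:.2f}\tRMSE= {rmse_manual:.2f}")
```

R^2 is the share of the target's variance that the predictions account for, while RMSE is a typical prediction error expressed in the target's own units (grade points here).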

In [None]:
## pass the seed so the forest (and its importances) are reproducible
reg = RandomForestRegressor(random_state=SEED)
reg.fit(X_train_df,y_train)
evaluate_regression(reg, X_train_df, y_train, X_test_df,y_test)

In [None]:
feature_importance = pd.Series(reg.feature_importances_, index=feature_names,
                        name='Random Forest Feature Importances')
feature_importance.head()

### Extract & Plot Feature Importances

In [None]:
ax = feature_importance.sort_values().tail(10).plot(kind='barh',figsize=(4,6))
ax.set(ylabel='Feature Name',xlabel='Feature Importance',
       title='Top 10 Most Important Features');

#### What were your 5 most important features?

* G2
* absences
* age
* studytime
* health

### Extract & Plot Permutation Feature Importances

In [None]:
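## pass the seed so the shuffles (and the resulting importances) are reproducible
r = permutation_importance(reg, X_train_df, y_train, n_repeats=5,
                           random_state=SEED)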
r.keys()
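
What `permutation_importance` measures can be sketched by hand: shuffle one column at a time, re-score the model, and record how much the score drops relative to the unshuffled baseline. The helper below is a simplified NumPy illustration of that idea, not scikit-learn's implementation; `manual_permutation_importance`, `toy_predict`, and the toy data are all hypothetical.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 score: 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def manual_permutation_importance(predict, X, y, n_repeats=5, seed=321):
    """Mean drop in R^2 after shuffling each column of X in turn."""
    rng = np.random.default_rng(seed)
    baseline = r2(y, predict(X))
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_perm = X.copy()
            ## shuffling breaks the link between this column and the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops[col, rep] = baseline - r2(y, predict(X_perm))
    return drops.mean(axis=1)

def toy_predict(X):
    """Stand-in 'model': uses column 0 heavily, column 1 a little, column 2 never."""
    return 3 * X[:, 0] + 0.5 * X[:, 1]

## toy data whose target is exactly what the stand-in model predicts
toy_rng = np.random.default_rng(0)
X_toy = toy_rng.normal(size=(200, 3))
y_toy = toy_predict(X_toy)

toy_importances = manual_permutation_importance(toy_predict, X_toy, y_toy)
print(toy_importances)
```

The column the stand-in model never uses gets an importance of exactly zero, because shuffling it cannot change any prediction. The heavily used column shows the largest drop in score.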

In [None]:
## can make the mean importances into a series
perm_importances = pd.Series(r['importances_mean'],index=X_train_df.columns,
                           name = 'permutation importance')
perm_importances.head()

In [None]:
ax = perm_importances.sort_values().tail(10).plot(kind='barh',figsize=(4,6))
ax.set(ylabel='Feature Name',xlabel='Permutation Importance',
       title='Top 10 Most Important Features: Permutation Importance');

#### What were your 5 most permutation important features?

* G2
* absences
* age
* studytime
* health

## Linear Regression

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train_df,y_train)
evaluate_regression(lin_reg, X_train_df, y_train, X_test_df,y_test)

In [None]:
coeffs = pd.Series(lin_reg.coef_, index=feature_names,
                        name='Coefficients')
coeffs.loc['intercept'] = lin_reg.intercept_
coeffs.head()
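
When explaining coefficients, it helps to remember that each one is the change in the predicted target for a one-unit increase in that feature, with every other feature held fixed. The sketch below checks this on made-up data, using `np.linalg.lstsq` as a stand-in for `LinearRegression`; the data, the true weights (1.5 and -0.5), and all `toy_*` names are assumptions for illustration.

```python
import numpy as np

## hypothetical noise-free data: y = 2.0 + 1.5*x0 - 0.5*x1
toy_rng = np.random.default_rng(0)
X_toy = toy_rng.normal(size=(50, 2))
y_toy = 2.0 + 1.5 * X_toy[:, 0] - 0.5 * X_toy[:, 1]

## ordinary least squares with an extra column of ones for the intercept
A = np.column_stack([X_toy, np.ones(len(X_toy))])
beta, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
toy_coefs, toy_intercept = beta[:2], beta[2]

def toy_linear_predict(row):
    """Prediction of the fitted linear model for a single row."""
    return row @ toy_coefs + toy_intercept

## raise feature 0 by one unit while holding feature 1 fixed
row = np.array([0.3, -1.2])
bumped = row + np.array([1.0, 0.0])
delta = toy_linear_predict(bumped) - toy_linear_predict(row)

print(toy_coefs, toy_intercept, delta)
```

The change in the prediction, `delta`, matches the coefficient on feature 0. For a one-hot encoded column, the same reading becomes the predicted difference between belonging to that category and not belonging to it.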

In [None]:
## Plot all of the coefficients (the intercept is excluded)
ax = coeffs.drop('intercept').sort_values().plot(kind='barh',figsize=(5,10))
ax.axvline(0,color='k')
ax.set_title('All Coefficients');

In [None]:
## rank the coeffs by absolute value (excluding the intercept) and select the top_n
top_n=15
coeff_rank = coeffs.drop('intercept').abs().rank().sort_values(ascending=False)
top_n_features = coeff_rank.head(top_n)
coeffs_plot = coeffs.loc[top_n_features.index].sort_values()
coeffs_plot
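
Selecting by absolute value matters because a large negative coefficient is as influential as a large positive one. As a small illustration with hypothetical coefficient values, the sketch below orders the labels by magnitude and then looks up the original signed values for plotting.

```python
import pandas as pd

## hypothetical coefficients, including one large negative value
toy_coeffs = pd.Series({'a': -3.0, 'b': 0.5, 'c': 2.0, 'd': -0.1})

## order the labels by absolute value, then look up the signed values
top_labels = toy_coeffs.abs().sort_values(ascending=False).head(2).index
toy_top = toy_coeffs.loc[top_labels].sort_values()
print(toy_top)
```

Here `a` is kept even though it is the smallest value by sign, because its magnitude is the largest.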

In [None]:
## sort features and keep top_n and set title
ax = coeffs_plot.sort_values().plot(kind='barh',figsize=(5,6))
ax.axvline(0,color='k');
ax.set(title = f"Top {top_n} Largest Coefficients",ylabel="Feature Name",
      xlabel='Coefficient');

In [None]:
def plot_coeffs(coeffs, top_n=None,  figsize=(4,5), intercept=False, 
                annotate=False, ha='left',va='center', size=12, xytext=(4,0),
                  textcoords='offset points'):
    """Plots the top_n coefficients from a Series, with optional annotations."""
    ## drop the intercept unless it was explicitly requested
    if (not intercept) and ('intercept' in coeffs.index):
        coeffs = coeffs.drop('intercept')
    if top_n is None:
        ## sort all features by value and set title
        plot_vals = coeffs.sort_values()
        title = "All Coefficients - Sorted by Value"
    else:
        ## rank the coeffs by absolute value and select the top_n
        coeff_rank = coeffs.abs().rank().sort_values(ascending=False)
        top_n_features = coeff_rank.head(top_n)
        ## sort the selected coefficients by signed value and set title
        plot_vals = coeffs.loc[top_n_features.index].sort_values()
        title = f"Top {top_n} Largest Coefficients"
    ## plotting the selected coefficients
    ax = plot_vals.plot(kind='barh', figsize=figsize)
    ax.set(xlabel='Coefficient', 
           ylabel='Feature Names', 
           title=title)
    ax.axvline(0, color='k')
    
    if annotate:
        annotate_hbars(ax, ha=ha,va=va,size=size,xytext=xytext,
                       textcoords=textcoords)
    ## return ax in case want to continue to update/modify figure
    return ax

def annotate_hbars(ax, ha='left',va='center',size=12,  xytext=(4,0),
                  textcoords='offset points'):
    """Labels each horizontal bar in ax with its value (3 decimal places).
    Labels for negative bars are anchored at x=0 so they stay readable."""
    for bar in ax.patches:
    
        ## calculate center of bar
        bar_ax = bar.get_y() + bar.get_height()/2
        ## get the value to annotate
        val = bar.get_width()
        if val < 0:
            val_pos = 0
        else:
            val_pos = val
        # ha and va stand for the horizontal and vertical alignment
        ax.annotate(f"{val:.3f}", (val_pos,bar_ax), ha=ha,va=va,size=size,
                        xytext=xytext, textcoords=textcoords)

In [None]:
plot_coeffs(coeffs,top_n=15,intercept=False,annotate=True);

In [None]:
ax = plot_coeffs(coeffs,top_n=15)
annotate_hbars(ax)