# Week 17, Lecture 02: CodeAlong


## **Part 2: Explaining Models with Model Explainers**

### Lesson Objectives

- By the end of this lesson, students will be able to:
    - Load variables and models from a joblib file into a new notebook.
    - Apply permutation importance.
    - Apply SHAP analysis.
    - Visualize global and local explanations.


### Continuing with Life Expectancy Prediction

> Task Inspired by: https://medium.com/@shanzehhaji/using-a-linear-regression-model-to-predict-life-expectancy-de3aef66ac21

- Kaggle Dataset on Life Expectancy:
    - https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

In [1]:
## Our standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as miss

## Preprocessing tools
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

## Models & evaluation metrics
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor


## setting random state for reproducibility
SEED = 321
np.random.seed(SEED)
## Matplotlib style
fav_style = ('ggplot','tableau-colorblind10')
fav_context  ={'context':'notebook', 'font_scale':1.1}
plt.style.use(fav_style)
sns.set_context(**fav_context)
plt.rcParams['savefig.transparent'] = False
plt.rcParams['savefig.bbox'] = 'tight'


import joblib, os

In [2]:
## Importing Custom Functions
import sys,os
# sys.path.append(os.path.abspath("../"))
%load_ext autoreload
%autoreload 2
from CODE import data_enrichment as de

### Functionized Code From Part 1

In [3]:
def evaluate_regression(model, X_train,y_train, X_test, y_test,for_slides=True): 
    """Evaluates a scikit-learn regression model using R-squared, MAE, and RMSE.
    If for_slides is True, displays the results as a styled DataFrame;
    otherwise prints one line of metrics per split."""
    
    ## Training Data
    y_pred_train = model.predict(X_train)
    r2_train = metrics.r2_score(y_train, y_pred_train)
    ## RMSE via np.sqrt: the `squared` argument was removed in newer scikit-learn
    rmse_train = np.sqrt(metrics.mean_squared_error(y_train, y_pred_train))
    mae_train = metrics.mean_absolute_error(y_train, y_pred_train)
    

    ## Test Data
    y_pred_test = model.predict(X_test)
    r2_test = metrics.r2_score(y_test, y_pred_test)
    rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test))
    mae_test = metrics.mean_absolute_error(y_test, y_pred_test)
    
    if for_slides:
        df_version =[['Split','R^2','MAE','RMSE']]
        df_version.append(['Train',r2_train, mae_train, rmse_train])
        df_version.append(['Test',r2_test, mae_test, rmse_test])
        df_results = pd.DataFrame(df_version[1:], columns=df_version[0])
        df_results = df_results.round(2)
        display(df_results.style.hide(axis='index').format(precision=2, thousands=','))
        
    else: 
        print(f"Training Data:\tR^2 = {r2_train:,.2f}\tRMSE = {rmse_train:,.2f}\tMAE = {mae_train:,.2f}")
        print(f"Test Data:\tR^2 = {r2_test:,.2f}\tRMSE = {rmse_test:,.2f}\tMAE = {mae_test:,.2f}")

def get_coefficients(lin_reg):
    coeffs = pd.Series(lin_reg.coef_, index= lin_reg.feature_names_in_)
    coeffs.loc['intercept'] = lin_reg.intercept_
    return coeffs

def plot_coefficients(coeffs, sort_values=True, top_n=None, figsize=(6,4),
                     title="Linear Regression Coefficients", xlabel='Coefficient'):
    """Plots a Series of coefficients as a horizontal bar chart, with options to sort
    and to keep only the top_n coefficients (by absolute value)"""
        
    if top_n is not None:
        ## Keep the top_n coefficients with the largest absolute values
        top_idx = coeffs.abs().sort_values(ascending=False).head(top_n).index
        coeffs = coeffs.loc[top_idx]
        
    if sort_values:
        coeffs = coeffs.sort_values()

        
        
    ax = coeffs.plot(kind='barh', figsize=figsize)
    ax.axvline(0, color='k')
    ax.set(xlabel=xlabel, title=title);
    plt.show()
    return ax


def get_importances(rf_reg):
    importances = pd.Series(rf_reg.feature_importances_, index= rf_reg.feature_names_in_)
    return importances


def plot_importances(importances, sort_values=True, top_n=None, figsize=(6,4),
                     title="Feature Importance", xlabel='Importance'):
    if sort_values:
        importances = importances.sort_values()
        
    if top_n is not None:
        importances = importances.tail(top_n)
        
        
    ax = importances.plot(kind='barh', figsize=figsize)
    ax.axvline(0, color='k')
    ax.set(xlabel=xlabel, title=title);
    plt.show()
    return ax

## 🕹️ Loading Objects from a Joblib

In [4]:
## Load the joblib file stored in the models folder
fname = "Models/wk1-lect01-codealong.joblib"

loaded = joblib.load(fname)
# Preview the contents of the loaded joblib objects
loaded.keys()

dict_keys(['X_train', 'y_train', 'X_test', 'y_test', 'preprocessor', 'LinearRegression', 'RandomForestRegressor'])
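
The keys above suggest that Part 1 bundled its data splits, preprocessor, and models into a single dictionary before saving. A minimal sketch of that round trip, assuming a dictionary layout like the one shown; the file name `demo-bundle.joblib`, the toy data, and every `_demo` name are hypothetical:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-ins for the real splits and model (hypothetical data)
X_demo = pd.DataFrame({"Schooling": [9.9, 11.1, 16.4, 14.4]})
y_demo = pd.Series([55.0, 60.0, 80.0, 75.0], name="Life expectancy")
model_demo = LinearRegression().fit(X_demo, y_demo)

# Bundle everything into one dictionary and save it with joblib
to_save = {"X_train": X_demo, "y_train": y_demo, "LinearRegression": model_demo}
demo_path = os.path.join(tempfile.mkdtemp(), "demo-bundle.joblib")
joblib.dump(to_save, demo_path)

# Load it back and unpack by key, as in the cells that follow
reloaded = joblib.load(demo_path)
print(list(reloaded.keys()))
```

Storing one dictionary keeps related objects together, so the loading notebook only needs a single file path.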

In [5]:
## Saving the loaded objects as separate variables

X_train = loaded['X_train']
X_test = loaded['X_test']
y_train = loaded['y_train']
y_test = loaded['y_test']

preprocessor = loaded['preprocessor']
lin_reg_final = loaded['LinearRegression']
rf_reg = loaded['RandomForestRegressor']

X_train.head()

Unnamed: 0,Year,Status,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
2414,2010,Developing,359.0,27,,0.0,,0,,41,,,,4.0,1562.239346,167192.0,,,0.0,0.0
1576,2009,Developing,491.0,36,1.16,7.915071,93.0,21,17.3,56,93.0,1.18,93.0,14.9,42.737966,1471462.0,6.9,6.8,0.415,9.9
2559,2009,Developing,174.0,10,0.3,35.716138,93.0,0,35.9,12,93.0,5.93,93.0,0.2,666.345857,7472819.0,3.7,3.8,0.601,11.1
1524,2013,Developed,178.0,0,15.04,1968.816817,93.0,35,61.4,0,93.0,6.59,93.0,0.1,15712.82376,2957689.0,2.7,2.7,0.834,16.4
33,2014,Developing,11.0,21,0.01,54.237318,95.0,0,58.4,24,95.0,7.21,95.0,0.1,547.8517,39113313.0,6.0,5.8,0.741,14.4


> Let's evaluate our models to confirm they were saved correctly.

In [6]:
## Use our evaluate_regression function to evaluate the linear regression

evaluate_regression(lin_reg_final,X_train,y_train, X_test, y_test)

Feature names unseen at fit time:
- Status
Feature names seen at fit time, yet now missing:
- Status_Developing



ValueError: could not convert string to float: 'Developing'

> ***Q: What is my error?***

### Re-Creating X_train_df & X_test_df

In [7]:
## Get feature names from already-fit preprocessor

feature_names = preprocessor.get_feature_names_out()

## Use the preprocessor to transform X_train into X_train_df
X_train_df = pd.DataFrame(preprocessor.transform(X_train),
                          index=X_train.index,
                          columns=feature_names)


## Use the preprocessor to transform X_test into X_test_df
X_test_df = pd.DataFrame(preprocessor.transform(X_test),
                         index=X_test.index,
                         columns=feature_names)
X_train_df.head()


Unnamed: 0,Year,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling,Status_Developing
2414,2010.0,359.0,27.0,4.580078,0.0,81.438929,0.0,38.113041,41.0,82.490834,5.931982,82.451879,4.0,1562.239346,167192.0,4.811567,4.819493,0.0,0.0,1.0
1576,2009.0,491.0,36.0,1.16,7.915071,93.0,21.0,17.3,56.0,93.0,1.18,93.0,14.9,42.737966,1471462.0,6.9,6.8,0.415,9.9,1.0
2559,2009.0,174.0,10.0,0.3,35.716138,93.0,0.0,35.9,12.0,93.0,5.93,93.0,0.2,666.345857,7472819.0,3.7,3.8,0.601,11.1,1.0
1524,2013.0,178.0,0.0,15.04,1968.816817,93.0,35.0,61.4,0.0,93.0,6.59,93.0,0.1,15712.82376,2957689.0,2.7,2.7,0.834,16.4,0.0
33,2014.0,11.0,21.0,0.01,54.237318,95.0,0.0,58.4,24.0,95.0,7.21,95.0,0.1,547.8517,39113313.0,6.0,5.8,0.741,14.4,1.0


### Evaluating Our LinearRegression

In [8]:
## Use our evaluate_regression function to evaluate the linear regression
evaluate_regression(lin_reg_final, X_train_df, y_train, X_test_df, y_test)

Split,R^2,MAE,RMSE
Train,0.82,3.01,4.03
Test,0.82,3.05,4.12


### Evaluating Our Random Forest

In [9]:
## evaluate the random forest
evaluate_regression(rf_reg,X_train_df,y_train, X_test_df, y_test)

Split,R^2,MAE,RMSE
Train,0.99,0.45,0.71
Test,0.97,1.11,1.75


## 🕹️ Permutation Importance

In [10]:
from sklearn.inspection import permutation_importance

### RandomForest Permutation Importance

>  Apply permutation importance to the random forest

In [11]:
## run permutation_importance on the rf using the test data and random_state=SEED
result = permutation_importance(rf_reg, X_test_df, y_test, random_state=SEED)
result.keys()

In [None]:
## save the average importances as a Series



In [None]:
# Use our plot_importances function, but change title to "Permutation Importance"
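
The three steps above (run `permutation_importance`, save the mean importances as a Series, then plot) can be sketched end to end on synthetic data. The dataset and every `_demo` name below are hypothetical; only the `permutation_importance` call mirrors what the cells ask for:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

SEED = 321
rng = np.random.default_rng(SEED)

# Synthetic data: only "signal" drives the target; the noise columns are irrelevant
X_demo = pd.DataFrame(rng.normal(size=(300, 3)),
                      columns=["signal", "noise_a", "noise_b"])
y_demo = 5 * X_demo["signal"] + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=SEED)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=SEED).fit(X_tr, y_tr)

# Shuffle each test column n_repeats times and record the drop in score
result_demo = permutation_importance(rf_demo, X_te, y_te,
                                     n_repeats=5, random_state=SEED)

# The mean importances line up with the column order of X_te
perm_importances_demo = pd.Series(result_demo["importances_mean"],
                                  index=X_te.columns,
                                  name="permutation importance")
print(perm_importances_demo.sort_values(ascending=False))
```

In the notebook, a Series like this could be passed to `plot_importances` with `title="Permutation Importance"`.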


> Permutation Importance Can Be Applied to ANY Model!

### LinearRegression Permutation Importance

In [None]:
## run permutation_importance on the lin_reg using the test data and random_state=SEED


In [None]:
## Make into a series called perm_importances_linreg
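
Permutation importance works with any model because it only needs predictions and a score. A hand-written sketch of the idea applied to a linear regression follows; `manual_permutation_importance` is a hypothetical helper written for illustration, not scikit-learn's implementation, and the data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def manual_permutation_importance(model, X, y, n_repeats=5, seed=321):
    """Mean drop in R^2 after shuffling each column (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, model.predict(X))
    drops = {}
    for col in X.columns:
        scores = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffle one column to break its link with the target
            X_shuffled[col] = rng.permutation(X_shuffled[col].to_numpy())
            scores.append(r2_score(y, model.predict(X_shuffled)))
        drops[col] = baseline - np.mean(scores)
    return pd.Series(drops)


# Synthetic data: the target depends on "signal" only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 2)), columns=["signal", "noise"])
y_demo = 4 * X_demo["signal"] + rng.normal(scale=0.3, size=200)

lin_demo = LinearRegression().fit(X_demo, y_demo)
perm_importances_linreg_demo = manual_permutation_importance(lin_demo, X_demo, y_demo)
print(perm_importances_linreg_demo)
```

A feature the model relies on produces a large score drop when shuffled, while an irrelevant feature leaves the score almost unchanged.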


# 🕹️ Global Model Explanations

##  Applying Shap

In [None]:
# Import and init shap
import shap
shap.initjs()


### Gather a small random sample to speed the process

In [None]:
# Take a sample of the training data (nsamples=500, use random_state=SEED)
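
The `nsamples` wording in the comment suggests `shap.sample`; the sketch below swaps in plain pandas sampling to stay dependency-free, and shows how the matching targets could be pulled by index. The data and `_demo` names are hypothetical:

```python
import numpy as np
import pandas as pd

SEED = 321
rng = np.random.default_rng(SEED)

# Hypothetical stand-ins for the processed training features and target
X_train_demo = pd.DataFrame(rng.normal(size=(2000, 4)),
                            columns=["f0", "f1", "f2", "f3"])
y_train_demo = pd.Series(rng.normal(size=2000), index=X_train_demo.index)

# Draw 500 rows reproducibly, then select the matching targets by index label
X_shap_demo = X_train_demo.sample(n=500, random_state=SEED)
y_shap_demo = y_train_demo.loc[X_shap_demo.index]

print(X_shap_demo.shape, y_shap_demo.shape)
```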



## <center> SHAP Model Usage

![shap usage by model](Data/SHAPcheatsheet.png)

### Create SHAP explainers for LinearRegression and RandomForestRegressor

In [None]:
# Instantiate a Model Explainer with the model

## Get shap values from the explainer


In [None]:
## Create an explainer for the lin_reg

## get shap values for linreg
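
To build intuition for what a linear-model explainer returns: when features are treated as independent, the SHAP value of a feature for a linear model should reduce to its coefficient times the feature's distance from the background mean. The sketch below computes that by hand with NumPy and scikit-learn rather than calling shap, so it illustrates the idea and is not the library's code path; all data and names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(321)

# Hypothetical background sample with two features on different scales
X_bg = pd.DataFrame(rng.normal(loc=[10, 2], scale=[3, 1], size=(300, 2)),
                    columns=["feat_a", "feat_b"])
y_bg = 2.0 * X_bg["feat_a"] - 1.5 * X_bg["feat_b"] + rng.normal(scale=0.2, size=300)
lin_demo = LinearRegression().fit(X_bg, y_bg)

# Base value: the model's average prediction over the background sample
base_value = lin_demo.predict(X_bg).mean()

# Contribution of each feature: coefficient * (value - background mean)
contributions = (X_bg - X_bg.mean()) * lin_demo.coef_

# Base value plus a row's contributions rebuilds that row's prediction
reconstructed = base_value + contributions.sum(axis=1)
print(np.allclose(reconstructed, lin_demo.predict(X_bg)))
```

This additivity (base value plus contributions equals the prediction) is the property that force plots visualize for a single row.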


### Summary Bar Plots

The `show=False` flag will allow us to manipulate the figure using matplotlib.

In [None]:
## create a summary plot (bar version)


In [None]:
## create a summary plot (bar version)


In [None]:
## create a summary plot (bar version)


In [None]:
## create a summary plot (dot/default version)


In [None]:
## create a summary plot (dot/default version)


> So why is our LinReg predicting a high life expectancy when infant deaths are high?

## Local Explanations

In [None]:
## Making a version of the SHAP variables with a 0-based integer index
# so that it matches the index for the shap_values
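
The reason for the reset: SHAP values come back as a plain array addressed by position, while a sampled DataFrame keeps its original row labels. A small sketch with hypothetical values and `_demo` names:

```python
import numpy as np
import pandas as pd

# Hypothetical sample that kept its original (non-consecutive) row labels
X_shap_demo = pd.DataFrame({"infant deaths": [27, 36, 10]},
                           index=[2414, 1576, 2559])
y_shap_demo = pd.Series([52.0, 48.5, 71.3], index=X_shap_demo.index)

# Stand-in for SHAP output: one row per sample, addressed by position only
shap_array_demo = np.arange(3).reshape(3, 1)

# Reset to a 0-based index so row label i matches row i of the array
X_shap_local_demo = X_shap_demo.reset_index(drop=True)
y_shap_local_demo = y_shap_demo.reset_index(drop=True)

row = 1
print(X_shap_local_demo.loc[row], shap_array_demo[row])
```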



### Finding a Meaningful  Example to Explain

- Let's find the example with the most infant deaths.

In [None]:
# what is the max/range of infant deaths


In [None]:
## saving the index of the most deaths


In [None]:
# checking the feature values for selected example


In [None]:
## what was the actual life expectancy?
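
The four lookups above can be sketched with pandas alone. The toy frame below is hypothetical and assumes a 0-based index, as set up in the previous section:

```python
import pandas as pd

# Hypothetical 0-based features and targets
X_local_demo = pd.DataFrame({"infant deaths": [0, 21, 1800, 36],
                             "Schooling": [16.4, 14.4, 11.0, 9.9]})
y_local_demo = pd.Series([81.0, 75.0, 66.0, 48.0], name="Life expectancy")

# Range of infant deaths in the sample
deaths = X_local_demo["infant deaths"]
print(deaths.agg(["min", "max"]))

# Index label of the row with the most infant deaths
idx_most_deaths = deaths.idxmax()

# Feature values and actual target for that row
print(X_local_demo.loc[idx_most_deaths])
print(y_local_demo.loc[idx_most_deaths])
```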


## Shap Force Plots

### Force Plot - Linear Reg explanation

In [None]:


## plotting example force plot for most inf.deaths (from linreg)



### Force Plot - RandomForest explanation

In [None]:
## plotting example force plot for most inf.deaths (from rf)





> ***Q: What do you notice when comparing the lin reg and rf reg force plots?***

# LIME

In [None]:
from lime.lime_tabular import LimeTabularExplainer

## Create a lime explainer with X_shap_local with regression as the mode



In [None]:
## Use the lime explainer to explain the selected example used above 




___
# APPENDIX

### Global Force Plots

In [None]:
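## Global force plot for the linear regression
## NOTE: pair explainer_linreg with its own SHAP values (assumed name:
## shap_values_linreg), not the random forest's shap_values
shap.force_plot(explainer_linreg.expected_value,shap_values_linreg.values,X_shap_local)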

In [None]:
shap.force_plot(explainer.expected_value,shap_values.values,X_shap_local)