
BloomTech Data Science

*Unit 2, Sprint 3, Module 4*

---

# Model Interpretation

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Continue to iterate on your project: data cleaning, exploratory visualization, feature engineering, modeling.
- [ ] Make at least 1 partial dependence plot to explain your model.
- [ ] Make at least 1 Shapley force plot to explain an individual prediction.

If you aren't ready to make these plots with your own dataset, you can practice these objectives with any dataset you've worked with previously. Example solutions are available for Partial Dependence Plots with the Tanzania Waterpumps dataset, and Shapley force plots with the Titanic dataset.

Please be aware that **multi-class classification** will result in multiple Partial Dependence Plots (one for each class), and multiple sets of Shapley Values (one for each class).

## Stretch Goals

#### Partial Dependence Plots
- [ ] Make multiple PDPs with 1 feature in isolation.
- [ ] Make multiple PDPs with 2 features in interaction.
- [ ] Use Plotly to make a 3D PDP.
- [ ] Make PDPs with categorical feature(s). Use Ordinal Encoder, outside of a pipeline, to encode your data first. If the categories have a natural ordering, take the time to encode them in that order rather than with arbitrary integers. Then use the encoded data with pdpbox, and show readable category names on your plot instead of integer category codes.
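
The one-feature and two-feature goals above can be prototyped before committing to a plotting library. The sketch below is only an illustration: it assumes scikit-learn's `sklearn.inspection.partial_dependence` rather than pdpbox, and the synthetic data, the random forest, and the feature indices are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

# Synthetic data (illustrative only): y depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.05, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# One feature in isolation: average prediction as feature 0 sweeps a grid.
pd_single = partial_dependence(
    model, X, features=[0], kind="average", grid_resolution=20
)
single_curve = pd_single["average"][0]  # shape: (n_grid_points,)

# Two features in interaction: average prediction over a 2D grid.
pd_pair = partial_dependence(
    model, X, features=[0, 1], kind="average", grid_resolution=10
)
pair_surface = pd_pair["average"][0]  # shape: (n_grid_0, n_grid_1)

print("1-feature curve shape:", single_curve.shape)
print("2-feature surface shape:", pair_surface.shape)
```

The 1D curve can be passed to any line plot, and the 2D surface to a heatmap or a 3D surface plot.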

#### SHAP Values
- [ ] Make Shapley force plots to explain at least 4 individual predictions.
    - If your project is Binary Classification, you can do a True Positive, True Negative, False Positive, False Negative.
    - If your project is Regression, you can do a high prediction with low error, a low prediction with low error, a high prediction with high error, and a low prediction with high error.
- [ ] Use Shapley values to display verbal explanations of individual predictions.
- [ ] Use the SHAP library for other visualization types.

The [SHAP repo](https://github.com/slundberg/shap) has examples for many visualization types, including:

- Force Plot, individual predictions
- Force Plot, multiple predictions
- Dependence Plot
- Summary Plot
- Summary Plot, Bar
- Interaction Values
- Decision Plots

We just did the first type during the lesson. The [Kaggle microcourse](https://www.kaggle.com/dansbecker/advanced-uses-of-shap-values) shows two more. Experiment and see what you can learn!
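
To build intuition for what a force plot displays, and for the "verbal explanations" stretch goal, Shapley values can be computed by brute force for a tiny model. The sketch below is not the SHAP library's algorithm: `exact_shapley`, `toy_model`, the feature names, and the all-zeros baseline are hypothetical, and "absent" features are simply replaced by the baseline value.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one row; absent features take the baseline value."""
    n = len(x)
    phi = np.zeros(n)

    def value(subset):
        # Prediction when only the features in `subset` take their real values.
        z = baseline.copy()
        idx = np.array(subset, dtype=int)
        z[idx] = x[idx]
        return predict(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


def toy_model(z):
    # Hypothetical model with one interaction term.
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]


feature_names = ["rooms", "age", "income"]  # hypothetical names
x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = exact_shapley(toy_model, x, baseline)

# Verbal explanation, largest contribution first.
for name, contribution in sorted(zip(feature_names, phi), key=lambda p: -abs(p[1])):
    direction = "raised" if contribution > 0 else "lowered"
    print(f"{name} {direction} the prediction by {abs(contribution):.2f}")
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the same additive property a force plot visualizes. The loop enumerates every feature subset, so this approach is practical only for a handful of features.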

In [None]:
%%capture
!pip install category_encoders==2.*

In [None]:
from google.colab import files
uploaded = files.upload()

Saving demand.csv to demand.csv
Saving supply.csv to supply.csv


In [None]:
# Import libraries

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

In [None]:
# Pipeline
from sklearn.pipeline import make_pipeline
# Encoders
from category_encoders import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
# Boosted models
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

In [None]:
supply = pd.read_csv("supply.csv")
demand = pd.read_csv("demand.csv")

In [None]:
supply.head()

Unnamed: 0,DATE,CSUSHPISA,MSACSR,PERMIT,TLRESCONS,EVACANTUSQ176N
0,01-01-2003,129.321,4.2,1806.333333,421328.6667,14908
1,01-04-2003,131.756,3.833333333,1837.666667,429308.6667,15244
2,01-07-2003,135.013,3.633333333,1937.333333,458890.0,15614
3,01-10-2003,138.8356667,3.966666667,1972.333333,491437.3333,15654
4,01-01-2004,143.2986667,3.7,1994.666667,506856.3333,15895


In [None]:
supply.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   DATE            82 non-null     object
 1   CSUSHPISA       82 non-null     object
 2   MSACSR          82 non-null     object
 3   PERMIT          82 non-null     object
 4   TLRESCONS       82 non-null     object
 5   EVACANTUSQ176N  82 non-null     object
dtypes: object(6)
memory usage: 4.0+ KB


In [None]:
demand.head()

Unnamed: 0,DATE,CSUSHPISA,MORTGAGE30US,UMCSENT,INTDSRUSM193N,MSPUS,GDP
0,01-01-2003,129.321,5.840769,79.966667,2.25,186000,11174.129
1,01-04-2003,131.756,5.506923,89.266667,2.166667,191800,11312.766
2,01-07-2003,135.013,6.033846,89.3,2.0,191900,11566.669
3,01-10-2003,138.835667,5.919286,91.966667,2.0,198800,11772.234
4,01-01-2004,143.298667,5.5975,98.0,2.0,212700,11923.447


In [None]:
demand.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   DATE           81 non-null     object 
 1   CSUSHPISA      80 non-null     float64
 2   MORTGAGE30US   81 non-null     float64
 3   UMCSENT        81 non-null     float64
 4   INTDSRUSM193N  74 non-null     float64
 5   MSPUS          81 non-null     int64  
 6   GDP            81 non-null     float64
dtypes: float64(5), int64(1), object(1)
memory usage: 4.6+ KB


In [None]:
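# Merge dataframes

# Note: the raw DATE strings (01-01-2003, 01-04-2003, ...) appear to be day-first quarterly dates.
# Without an explicit format, pandas reads them month-first (e.g. 2003-01-04). Both frames are
# parsed the same way, so the merge keys still line up.
supply['DATE'] = pd.to_datetime(supply['DATE'])
demand['DATE'] = pd.to_datetime(demand['DATE'])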

supply = supply.sort_values('DATE')
demand = demand.sort_values('DATE')

supply_demand = pd.merge(supply, demand, on='DATE', suffixes=('_supply', '_demand'))

supply_demand.dropna(subset=['MSACSR', 'PERMIT', 'TLRESCONS', 'EVACANTUSQ176N', 'MORTGAGE30US', 'GDP', 'UMCSENT'], inplace=True)
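
The `DATE` strings in the previews above (`01-01-2003`, `01-04-2003`, `01-07-2003`, ...) look like quarterly dates written day-first. The sketch below, on a small hand-made Series, shows how the default parse and an explicit `format` differ. The assumption that the data is quarterly is inferred from the pattern, not confirmed by the source.

```python
import pandas as pd

raw = pd.Series(["01-01-2003", "01-04-2003", "01-07-2003", "01-10-2003"])

# Default parsing reads these ambiguous strings as MM-DD-YYYY.
month_first = pd.to_datetime(raw)

# An explicit format reads them as DD-MM-YYYY (quarter starts).
day_first = pd.to_datetime(raw, format="%d-%m-%Y")

print(month_first.dt.month.tolist())  # [1, 1, 1, 1]
print(day_first.dt.month.tolist())    # [1, 4, 7, 10]
```

The distinction matters if the dates are later used for resampling, plotting, or time-based features.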

In [None]:
imputer = SimpleImputer(strategy='mean')
supply_demand['INTDSRUSM193N'] = imputer.fit_transform(supply_demand[['INTDSRUSM193N']])

supply_demand = supply_demand.reset_index(drop=True)

In [None]:
supply_demand.head()

Unnamed: 0,DATE,CSUSHPISA_supply,MSACSR,PERMIT,TLRESCONS,EVACANTUSQ176N,CSUSHPISA_demand,MORTGAGE30US,UMCSENT,INTDSRUSM193N,MSPUS,GDP
0,2003-01-01,129.321,4.2,1806.333333,421328.6667,14908,129.321,5.840769,79.966667,2.25,186000,11174.129
1,2003-01-04,131.756,3.833333333,1837.666667,429308.6667,15244,131.756,5.506923,89.266667,2.166667,191800,11312.766
2,2003-01-07,135.013,3.633333333,1937.333333,458890.0,15614,135.013,6.033846,89.3,2.0,191900,11566.669
3,2003-01-10,138.8356667,3.966666667,1972.333333,491437.3333,15654,138.835667,5.919286,91.966667,2.0,198800,11772.234
4,2004-01-01,143.2986667,3.7,1994.666667,506856.3333,15895,143.298667,5.5975,98.0,2.0,212700,11923.447


In [None]:
supply_demand.drop('CSUSHPISA_supply', axis=1, inplace=True)

supply_demand.rename(columns={'CSUSHPISA_demand': 'CSUSHPISA'}, inplace=True)
supply_demand['CSUSHPISA'] = supply_demand['CSUSHPISA'].fillna(supply_demand['CSUSHPISA'].mean())

In [None]:
# Only numeric columns can be correlated; DATE and the supply columns (still stored as strings here) are skipped.
correlation = supply_demand.corr(numeric_only=True)['CSUSHPISA']
correlation_table = pd.DataFrame(correlation).reset_index()
correlation_table.columns = ['Factors', 'Correlation with CSUSHPISA']
print(correlation_table)

         Factors  Correlation with CSUSHPISA
0      CSUSHPISA                    1.000000
1   MORTGAGE30US                   -0.215379
2        UMCSENT                   -0.096213
3  INTDSRUSM193N                    0.102608
4          MSPUS                    0.907924
5            GDP                    0.823877


In [None]:
supply_demand['DATE'] = pd.to_datetime(supply_demand['DATE'])
supply_demand.set_index('DATE', inplace=True)

supply_demand['MSACSR'] = pd.to_numeric(supply_demand['MSACSR'], errors='coerce')
supply_demand['PERMIT'] = pd.to_numeric(supply_demand['PERMIT'], errors='coerce')
supply_demand['TLRESCONS'] = pd.to_numeric(supply_demand['TLRESCONS'], errors='coerce')
supply_demand['EVACANTUSQ176N'] = pd.to_numeric(supply_demand['EVACANTUSQ176N'], errors='coerce')
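
With `errors='coerce'`, any string that cannot be parsed silently becomes `NaN`, so it can be worth counting the missing values right after the conversion. A minimal sketch on made-up data (the `"."` placeholder is a hypothetical missing-value marker, not something confirmed in `supply.csv`):

```python
import pandas as pd

# Hypothetical column stored as strings, with one unparseable entry.
col = pd.Series(["4.2", "3.83", ".", "3.97"])

numeric = pd.to_numeric(col, errors="coerce")

print(numeric.dtype)         # float64
print(numeric.isna().sum())  # 1
```

Any `NaN` values introduced at this point appear after the earlier `dropna` call, so they would need to be handled by a later imputation step.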

In [None]:
supply_demand

DATE,MSACSR,PERMIT,TLRESCONS,EVACANTUSQ176N,CSUSHPISA,MORTGAGE30US,UMCSENT,INTDSRUSM193N,MSPUS,GDP
2003-01-01,4.200000,1806.333333,421328.6667,14908,129.321000,5.840769,79.966667,2.250000,186000,11174.129
2003-01-04,3.833333,1837.666667,429308.6667,15244,131.756000,5.506923,89.266667,2.166667,191800,11312.766
2003-01-07,3.633333,1937.333333,458890.0000,15614,135.013000,6.033846,89.300000,2.000000,191900,11566.669
2003-01-10,3.966667,1972.333333,491437.3333,15654,138.835667,5.919286,91.966667,2.000000,198800,11772.234
2004-01-01,3.700000,1994.666667,506856.3333,15895,143.298667,5.597500,98.000000,2.000000,212700,11923.447
...,...,...,...,...,...,...,...,...,...,...
2022-01-01,6.233333,1864.000000,910611.0000,15166,290.868000,3.822308,63.133333,1.961712,433100,24740.480
2022-01-04,8.700000,1734.666667,947300.3333,15286,303.422667,5.266154,57.866667,1.961712,449300,25248.476
2022-01-07,9.566667,1610.666667,910346.0000,15306,301.726333,5.623077,56.100000,1.961712,468000,25723.941
2022-01-10,9.200000,1455.333333,870620.6667,14554,297.896667,6.664615,58.800000,1.961712,479500,26137.992


In [None]:
features = ['MSACSR', 'PERMIT', 'TLRESCONS', 'EVACANTUSQ176N', 'MORTGAGE30US', 'GDP', 'UMCSENT', 'INTDSRUSM193N', 'MSPUS']
target = 'CSUSHPISA'

In [None]:
X_train, X_test, y_train, y_test = train_test_split(supply_demand[features], supply_demand[target], test_size=0.2, random_state=42)
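
Because the rows are ordered in time, a random split mixes later observations into the training set. One alternative worth comparing is a chronological split, where the most recent rows are held out. The sketch below uses a made-up quarterly frame, and the column names and the 80/20 cutoff are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly data indexed by date.
dates = pd.date_range("2003-01-01", periods=20, freq="QS")
df = pd.DataFrame({"x": np.arange(20.0), "y": np.arange(20.0) * 2}, index=dates)

# Hold out the most recent 20% of rows instead of a random sample.
df = df.sort_index()
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

print(len(train), len(test))
print(train.index.max() < test.index.min())
```

This keeps every test row strictly later than every training row, which mirrors how the model would be used to predict future values.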

In [None]:
# CSUSHPISA is continuous, so accuracy does not apply; the baseline is the error from always predicting the training mean.
print('The baseline RMSE is', y_train.std(ddof=0))


In [None]:
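from xgboost import XGBRegressor

# CSUSHPISA is a continuous index, so fit a regressor rather than a classifier.
# All features are numeric, so no categorical encoder is needed.
model_xgb = make_pipeline(
    SimpleImputer(strategy='mean'),
    XGBRegressor(random_state=42, n_estimators=75, n_jobs=-1)
)
model_xgb.fit(X_train, y_train);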
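
The target here, `CSUSHPISA`, is continuous, so an error metric such as mean absolute error is a more natural yardstick than accuracy. The sketch below shows one way to compare a fitted pipeline against a mean baseline. It uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in, so the numbers it prints say nothing about the housing data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic regression data (illustrative only), with a few missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 100 + 20 * X[:, 0] - 10 * X[:, 1] + rng.normal(0, 1, size=200)
X[rng.random(X.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    GradientBoostingRegressor(random_state=42),
)
pipe.fit(X_tr, y_tr)

# Baseline: always predict the training mean.
baseline_mae = mean_absolute_error(y_te, np.full_like(y_te, y_tr.mean()))
model_mae = mean_absolute_error(y_te, pipe.predict(X_te))

print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Model MAE:    {model_mae:.2f}")
```

A model that cannot beat the mean baseline on held-out data is not yet worth interpreting with PDPs or Shapley values.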