### Shapash Model Overview
https://shapash.readthedocs.io/en/latest/

##### In this tutorial you will:
Learn how to create a Shapash SmartPredictor to make predictions and get local explanations in production, using a simple use case.

This tutorial walks through each step, from training the model to deploying a Shapash SmartPredictor. A more detailed tutorial covers the SmartPredictor object in depth.

Contents:

- Build a Regressor
- Compile Shapash SmartExplainer
- From Shapash SmartExplainer to SmartPredictor
- Save the Shapash SmartPredictor object to a pickle file
- Make a prediction

# Example 1

Video walkthrough of the following code: https://www.youtube.com/watch?v=-66Wt3IHb9U

In [1]:
import seaborn as sns
df=sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [2]:
# Split the dataset into the target (y) and the features (X)
y=df['tip']
X=df[df.columns.difference(['tip'])]
X.head()

Unnamed: 0,day,sex,size,smoker,time,total_bill
0,Sun,Female,2,No,Dinner,16.99
1,Sun,Male,3,No,Dinner,10.34
2,Sun,Male,3,No,Dinner,21.01
3,Sun,Male,2,No,Dinner,23.68
4,Sun,Female,4,No,Dinner,24.59


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.3 KB


In [4]:
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
X = X.copy()
# Replace each categorical column with its integer category codes
for col in ['day', 'sex', 'smoker', 'time']:
    X[col] = X[col].cat.codes

In [5]:
X

Unnamed: 0,day,sex,size,smoker,time,total_bill
0,3,1,2,1,1,16.99
1,3,0,3,1,1,10.34
2,3,0,3,1,1,21.01
3,3,0,2,1,1,23.68
4,3,1,4,1,1,24.59
...,...,...,...,...,...,...
239,2,0,3,1,1,29.03
240,2,1,2,0,1,27.18
241,2,0,2,0,1,22.67
242,2,0,2,1,1,17.82
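The integer codes shown above come from the pandas categorical dtype: each code is the position of the label in the column's `cat.categories`. Once the labels are overwritten, that mapping is lost unless it is saved first. The sketch below illustrates this on a toy column; the `toy` frame and its category order are assumptions chosen for illustration, not values read from the tips data.

```python
import pandas as pd

# Toy stand-in for one categorical column (hypothetical data, not the tips dataset)
toy = pd.DataFrame({
    "day": pd.Categorical(["Sun", "Sat", "Sun", "Fri"],
                          categories=["Thur", "Fri", "Sat", "Sun"])
})

# Record the mapping code -> label before overwriting the column
day_labels = dict(enumerate(toy["day"].cat.categories))

# Replace the labels with their integer codes, as in the cell above
toy["day"] = toy["day"].cat.codes

print(day_labels)
print(toy["day"].tolist())
```

Keeping a dictionary such as `day_labels` for each encoded column makes it possible to translate the codes back into labels when reading the explanations later.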


In [6]:
# Train/test split, then fit a random forest regressor on the training set
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.75,random_state=1)
regressor = RandomForestRegressor(n_estimators=200).fit(X_train,y_train)

#### Let's Understand Our Model With Shapash
In this section, we use the SmartExplainer object from Shapash.

- It lets users understand how the model works on the specified data.
- This object should be used only during the data mining step. Shapash provides another object, the SmartPredictor, for deployment.


In [None]:
!pip install shapash

In [8]:
from shapash.explainer.smart_explainer import SmartExplainer
xpl = SmartExplainer()
xpl.compile(
    x=X_test,
    model=regressor,
)

xpl

Backend: Shap TreeExplainer


<shapash.explainer.smart_explainer.SmartExplainer at 0x7f622fdb3890>

#### Let's Understand the Results of Your Trained Model
We can then easily get a first summary of the explanations of the model results.

- Here, we keep the 3 features that contribute most to each prediction.
- Feature names can be given a more readable wording for operational use (see `features_dict` in Example 2).

In [11]:
app = xpl.run_app(title_story='Tips Dataset')

INFO:numexpr.utils:NumExpr defaulting to 2 threads.
Dash is running on http://0.0.0.0:8050/

INFO:root:Your Shapash application run on http://47a62ae84b0a:8050/
INFO:root:Use the method .kill() to down your app.
INFO:shapash.webapp.smart_app:Dash is running on http://0.0.0.0:8050/

 * Serving Flask app "shapash.webapp.smart_app" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
INFO:werkzeug: * Running on http://0.0.0.0:8050/ (Press CTRL+C to quit)


In [12]:
xpl.to_pandas(max_contrib=3).head()

Unnamed: 0,pred,feature_1,value_1,contribution_1,feature_2,value_2,contribution_2,feature_3,value_3,contribution_3
67,1.48755,total_bill,3.07,-1.0928,day,2,-0.263028,size,1,-0.1198
243,2.47635,total_bill,18.78,-0.2365,day,0,-0.111353,smoker,1,-0.108559
206,3.48225,total_bill,26.59,0.468972,size,3,0.0953703,sex,0,-0.0383996
122,2.2203,total_bill,14.26,-0.518983,day,0,-0.152207,size,2,-0.0422378
89,3.1858,total_bill,21.16,0.390149,day,0,-0.0893861,size,2,-0.0257139
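In the summary above, the three retained features appear in order of decreasing absolute contribution. The following sketch reproduces that selection with plain pandas on made-up numbers. It is an illustration of the idea only, not Shapash's implementation; `top_contributions` and the `contrib` frame are hypothetical.

```python
import pandas as pd

def top_contributions(contrib: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    """Return, per row, the k features with the largest absolute contribution."""
    rows = []
    for idx, row in contrib.iterrows():
        # Rank features by decreasing absolute contribution and keep the first k
        order = row.abs().sort_values(ascending=False).index[:k]
        record = {}
        for rank, feat in enumerate(order, start=1):
            record[f"feature_{rank}"] = feat
            record[f"contribution_{rank}"] = row[feat]
        rows.append(pd.Series(record, name=idx))
    return pd.DataFrame(rows)

# Hypothetical contributions for two predictions
contrib = pd.DataFrame(
    {
        "total_bill": [-1.09, 0.47],
        "day": [-0.26, 0.01],
        "size": [-0.12, 0.10],
        "sex": [0.02, -0.04],
    },
    index=[67, 206],
)

summary = top_contributions(contrib, k=3)
print(summary)
```

Ranking on the absolute value keeps strong negative effects in the summary alongside strong positive ones.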


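In [None]:
# Switch from the exploration object (SmartExplainer) to the deployment object (SmartPredictor)
predictor = xpl.to_smartpredictor()

In [None]:
# Save the SmartPredictor to a pickle file
predictor.save('./predictor.pkl')

In [None]:
# Reload the SmartPredictor from disk, as a production service would
from shapash.utils.load_smartpredictor import load_smartpredictor
predictor_load = load_smartpredictor('./predictor.pkl')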
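The save and load steps above write and read a pickle file. As a generic sketch of that round trip using only the standard library (this is not Shapash's internal code, and the `artifact` dictionary is a hypothetical stand-in for a fitted object):

```python
import os
import pickle
import tempfile

# Stand-in for a fitted object (hypothetical; a SmartPredictor would take its place)
artifact = {"features": ["day", "sex", "size"], "max_contrib": 3}

path = os.path.join(tempfile.mkdtemp(), "predictor.pkl")

# Serialize the object to disk
with open(path, "wb") as f:
    pickle.dump(artifact, f)

# Read it back, as a separate service would at start-up
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == artifact)
```

Unpickling can execute arbitrary code, so only load pickle files that come from a trusted source.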

#### Make a prediction with your SmartPredictor
To make new predictions and summarize the local explainability of your model on new datasets, use the add_input method of the SmartPredictor.

- The add_input method is the first step: it adds a dataset for prediction and explainability.
- It checks the structure of the dataset, and of the predictions and contributions if they are specified.
- It applies the preprocessing specified at initialisation and reorders the features to match the order used by the model (see the documentation of this method).
- In API mode, this method can handle dictionary data, such as that received from a GET or a POST request.
- The x input of add_input doesn't have to be encoded when a preprocessing was specified: add_input applies it. In this example no preprocessing was passed to compile, so X is supplied already encoded.
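In API mode, a request payload typically arrives as a JSON object. The sketch below shows one way such a dictionary could be checked and turned into a one-row frame in the column order used by the model. It only illustrates the idea and does not describe how `add_input` works internally; `payload_to_frame`, `expected_columns` and the sample payload are hypothetical.

```python
import pandas as pd

def payload_to_frame(payload: dict, expected_columns: list) -> pd.DataFrame:
    """Build a one-row DataFrame from a request payload, in model column order."""
    missing = [c for c in expected_columns if c not in payload]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Selecting by expected_columns both reorders the columns and drops extra keys
    return pd.DataFrame([payload])[expected_columns]

expected_columns = ["day", "sex", "size", "smoker", "time", "total_bill"]
payload = {
    "total_bill": 16.99, "size": 2, "day": 3, "sex": 1,
    "smoker": 1, "time": 1, "request_id": "abc",
}

row = payload_to_frame(payload, expected_columns)
print(row)
```

A frame built this way could then be passed as `x` to `add_input`.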

In [None]:
# Note: y holds the observed tips, not model predictions; it is passed here as ypred
predictor_load.add_input(x=X, ypred=y)

In [None]:
detailed_contributions = predictor_load.detail_contributions()


In [None]:
detailed_contributions.head()

Unnamed: 0,tip,day,sex,size,smoker,time,total_bill
0,1.01,0.05889,0.107123,-0.021913,-0.032742,-0.005985,0.183477
1,1.66,0.083062,-0.05877,0.003739,-0.062743,-0.018027,-1.230111
2,3.5,0.08489,-0.021616,-0.005301,-0.008053,0.006988,0.42884
3,3.31,0.125816,-0.021922,-0.043787,0.022929,0.007135,0.074177
4,3.61,0.048222,0.058356,-0.053819,-0.006406,-0.006721,-0.319232


#### Summarize explainability of the predictions
- Use the summarize method to summarize the local explainability of your predictions.
- This summary can be configured with the modify_mask method so that the explanations meet your operational needs.

In [None]:
# Keep at most 3 contributions per prediction in the summary
predictor_load.modify_mask(max_contrib=3)

In [None]:
explanation = predictor_load.summarize()

In [None]:
explanation.head()

Unnamed: 0,tip,feature_1,value_1,contribution_1,feature_2,value_2,contribution_2,feature_3,value_3,contribution_3
0,1.01,total_bill,16.99,0.183477,sex,1.0,0.107123,day,3.0,0.05889
1,1.66,total_bill,10.34,-1.230111,day,3.0,0.083062,smoker,1.0,-0.062743
2,3.5,total_bill,21.01,0.42884,day,3.0,0.08489,sex,0.0,-0.021616
3,3.31,day,3.0,0.125816,total_bill,23.68,0.074177,size,2.0,-0.043787
4,3.61,total_bill,24.59,-0.319232,sex,1.0,0.058356,size,4.0,-0.053819


# Example 2

https://github.com/MAIF/shapash/blob/master/tutorial/tutorial01-Shapash-Overview-Launch-WebApp.ipynb

In [12]:
!pip install category_encoders

Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/44/57/fcef41c248701ee62e8325026b90c432adea35555cbc870aff9cfba23727/category_encoders-2.2.2-py2.py3-none-any.whl (80kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.2.2


In [13]:
import pandas as pd
from category_encoders import OrdinalEncoder
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor


In [16]:
from shapash.data.data_loader import data_loading
house_df, house_dict = data_loading('house_prices')

y_df=house_df['SalePrice'].to_frame()
X_df=house_df[house_df.columns.difference(['SalePrice'])]

house_df.head()

Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,GarageType,GarageYrBlt,GarageFinish,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,2-Story 1946 & Newer,Residential Low Density,8450,Paved,Regular,Near Flat/Level,"All public Utilities (E,G,W,& S)",Inside lot,Gentle slope,College Creek,Normal,Normal,Single-family Detached,Two story,7,5,2003,2003,Gable,Standard (Composite) Shingle,Vinyl Siding,Vinyl Siding,Brick Face,196.0,Good,Average/Typical,Poured Contrete,Good (90-99 inches),Typical - slight dampness allowed,No Exposure/No Basement,Good Living Quarters,706,Unfinished/No Basement,0,150,856,Gas forced warm air furnace,Excellent,Yes,Standard Circuit Breakers & Romex,856,854,0,1710,1,0,2,1,3,1,Good,8,Typical Functionality,0,Attached to home,2003.0,Rough Finished,548,Typical/Average,Typical/Average,Paved,0,61,0,0,0,0,0,2,2008,Warranty Deed - Conventional,Normal Sale,208500
2,1-Story 1946 & Newer All Styles,Residential Low Density,9600,Paved,Regular,Near Flat/Level,"All public Utilities (E,G,W,& S)",Frontage on 2 sides of property,Gentle slope,Veenker,Adjacent to feeder street,Normal,Single-family Detached,One story,6,8,1976,1976,Gable,Standard (Composite) Shingle,Metal Siding,Metal Siding,,0.0,Average/Typical,Average/Typical,Cinder Block,Good (90-99 inches),Typical - slight dampness allowed,Good Exposure,Average Living Quarters,978,Unfinished/No Basement,0,284,1262,Gas forced warm air furnace,Excellent,Yes,Standard Circuit Breakers & Romex,1262,0,0,1262,0,1,2,0,3,1,Typical/Average,6,Typical Functionality,1,Attached to home,1976.0,Rough Finished,460,Typical/Average,Typical/Average,Paved,298,0,0,0,0,0,0,5,2007,Warranty Deed - Conventional,Normal Sale,181500
3,2-Story 1946 & Newer,Residential Low Density,11250,Paved,Slightly irregular,Near Flat/Level,"All public Utilities (E,G,W,& S)",Inside lot,Gentle slope,College Creek,Normal,Normal,Single-family Detached,Two story,7,5,2001,2002,Gable,Standard (Composite) Shingle,Vinyl Siding,Vinyl Siding,Brick Face,162.0,Good,Average/Typical,Poured Contrete,Good (90-99 inches),Typical - slight dampness allowed,Mimimum Exposure,Good Living Quarters,486,Unfinished/No Basement,0,434,920,Gas forced warm air furnace,Excellent,Yes,Standard Circuit Breakers & Romex,920,866,0,1786,1,0,2,1,3,1,Good,6,Typical Functionality,1,Attached to home,2001.0,Rough Finished,608,Typical/Average,Typical/Average,Paved,0,42,0,0,0,0,0,9,2008,Warranty Deed - Conventional,Normal Sale,223500
4,2-Story 1945 & Older,Residential Low Density,9550,Paved,Slightly irregular,Near Flat/Level,"All public Utilities (E,G,W,& S)",Corner lot,Gentle slope,Crawford,Normal,Normal,Single-family Detached,Two story,7,5,1915,1970,Gable,Standard (Composite) Shingle,Wood Siding,Wood Shingles,,0.0,Average/Typical,Average/Typical,Brick & Tile,Typical (80-89 inches),Good,No Exposure/No Basement,Average Living Quarters,216,Unfinished/No Basement,0,540,756,Gas forced warm air furnace,Good,Yes,Standard Circuit Breakers & Romex,961,756,0,1717,1,0,1,0,3,1,Good,7,Typical Functionality,1,Detached from home,1998.0,Unfinished/No Garage,642,Typical/Average,Typical/Average,Paved,0,35,272,0,0,0,0,2,2006,Warranty Deed - Conventional,Abnormal Sale,140000
5,2-Story 1946 & Newer,Residential Low Density,14260,Paved,Slightly irregular,Near Flat/Level,"All public Utilities (E,G,W,& S)",Frontage on 2 sides of property,Gentle slope,Northridge,Normal,Normal,Single-family Detached,Two story,8,5,2000,2000,Gable,Standard (Composite) Shingle,Vinyl Siding,Vinyl Siding,Brick Face,350.0,Good,Average/Typical,Poured Contrete,Good (90-99 inches),Typical - slight dampness allowed,Average Exposure,Good Living Quarters,655,Unfinished/No Basement,0,490,1145,Gas forced warm air furnace,Excellent,Yes,Standard Circuit Breakers & Romex,1145,1053,0,2198,1,0,2,1,4,1,Good,9,Typical Functionality,1,Attached to home,2000.0,Rough Finished,836,Typical/Average,Typical/Average,Paved,192,84,0,0,0,0,0,12,2008,Warranty Deed - Conventional,Normal Sale,250000


In [17]:
from category_encoders import OrdinalEncoder

categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']

encoder = OrdinalEncoder(
    cols=categorical_features,
    handle_unknown='ignore',
    return_df=True).fit(X_df)

X_df=encoder.transform(X_df)

Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=1)
regressor = LGBMRegressor(n_estimators=200).fit(Xtrain,ytrain)


In [18]:
from shapash.explainer.smart_explainer import SmartExplainer
xpl = SmartExplainer(features_dict=house_dict)  # Optional: maps feature names to readable labels

xpl.compile(
    x=Xtest,
    model=regressor,
    preprocessing=encoder  # Optional: lets the compile step use the encoder's inverse_transform method
)

app = xpl.run_app(title_story='House Prices')

Backend: Shap TreeExplainer

INFO:numexpr.utils:NumExpr defaulting to 2 threads.
Dash is running on http://0.0.0.0:8050/

INFO:root:Your Shapash application run on http://47a62ae84b0a:8050/
INFO:root:Use the method .kill() to down your app.
INFO:shapash.webapp.smart_app:Dash is running on http://0.0.0.0:8050/

 * Serving Flask app "shapash.webapp.smart_app" (lazy loading)
 * Environment: production

In [19]:
app.kill()
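The `features_dict` passed to the SmartExplainer in this example associates technical column names with readable labels. As a rough sketch of what such a mapping does to column names, using plain pandas rather than Shapash (the two dictionary entries and the `frame` values are assumptions for illustration, not an excerpt read from `house_dict`):

```python
import pandas as pd

# Assumed excerpt of a features dictionary: technical name -> readable label
features_dict = {
    "GrLivArea": "Ground living area square feet",
    "OverallQual": "Overall material and finish of the house",
}

frame = pd.DataFrame({"GrLivArea": [1710], "OverallQual": [7], "LotArea": [8450]})

# Columns without an entry in the dictionary keep their technical name
labelled = frame.rename(columns=features_dict)
print(list(labelled.columns))
```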

In [20]:
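summary_df = xpl.to_pandas(
    max_contrib=3,   # Maximum number of features to show per prediction
    threshold=5000,  # Hide contributions whose absolute value is below this bound
)
xpl.save('./xpl.pkl')  # Save the SmartExplainer to a pickle file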

In [21]:
summary_df.head()

Unnamed: 0,pred,feature_1,value_1,contribution_1,feature_2,value_2,contribution_2,feature_3,value_3,contribution_3
259,209141.256921,Ground living area square feet,1792,13710.4,Overall material and finish of the house,7,12776.3,Total square feet of basement area,963,-5103.03
268,178734.474531,Ground living area square feet,2192,29747.0,Overall material and finish of the house,5,-26151.3,Overall condition of the house,8,9190.84
289,113950.84457,Overall material and finish of the house,5,-24730.0,Ground living area square feet,900,-16342.6,Total square feet of basement area,882,-5922.64
650,74957.162142,Overall material and finish of the house,4,-33927.7,Ground living area square feet,630,-23234.4,Total square feet of basement area,630,-11687.9
1234,135305.2435,Overall material and finish of the house,5,-25445.7,Ground living area square feet,1188,-11476.6,Condition of sale,Abnormal Sale,-5071.82
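Every contribution in the summary above has an absolute value above the threshold of 5000. The effect of such a threshold can be sketched with pandas on made-up numbers. This is an illustration under stated assumptions, not Shapash's implementation: the `contrib` values are hypothetical, and whether Shapash treats a contribution exactly equal to the threshold as kept or hidden is not checked here.

```python
import pandas as pd

# Hypothetical contributions for two houses, in the unit of the target
contrib = pd.DataFrame(
    {
        "GrLivArea": [13710.4, -16342.6],
        "OverallQual": [12776.3, -24730.0],
        "LotArea": [1200.0, -300.5],
    },
    index=[259, 289],
)

threshold = 5000

# Keep contributions whose magnitude reaches the threshold; mask the others as NaN
masked = contrib.where(contrib.abs() >= threshold)

print(masked)
print(masked.notna().sum(axis=1))
```

Combining a threshold with `max_contrib` means a row can show fewer than `max_contrib` features when only a few contributions are large enough.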
