# Explainer examples

This notebook shows how you can use the `Explainer` object for interactive visualization in your Jupyter notebook.

All this plotting functionality gets called by the `ExplainerDashboard` to construct the interactive dashboard.

# Google Colab link:

[https://colab.research.google.com/github/oegedijk/explainerdashboard/blob/master/explainer_examples.ipynb](https://colab.research.google.com/github/oegedijk/explainerdashboard/blob/master/explainer_examples.ipynb)

Uncomment and run to install explainerdashboard:

In [1]:
#!pip install explainerdashboard

# notebook properties

Display multiple outputs per cell:

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# ClassifierExplainer:

## train model

In [3]:
from sklearn.ensemble import RandomForestClassifier
from explainerdashboard.datasets import titanic_survive

X_train, y_train, X_test, y_test = titanic_survive()

model = RandomForestClassifier(n_estimators=50, max_depth=5)
model.fit(X_train, y_train)

RandomForestClassifier(max_depth=5, n_estimators=50)

## build explainer

In [4]:
from explainerdashboard import ClassifierExplainer
from explainerdashboard.datasets import titanic_names, feature_descriptions

_, test_names = titanic_names() # names of passengers 

explainer = ClassifierExplainer(model, X_test, y_test, 
                                cats=['Sex', 'Deck', 'Embarked'],
                                idxs=test_names, 
                                descriptions=feature_descriptions,
                                target='Survival',
                                labels=['Not survived', 'Survived'])

Note: shap=='guess' so guessing for RandomForestClassifier shap='tree'...
Detected RandomForestClassifier model: Changing class type to RandomForestClassifierExplainer...
Note: model_output=='probability', so assuming that raw shap output of RandomForestClassifier is in probability space...
Generating self.shap_explainer = shap.TreeExplainer(model)


## Importances

Get a dataframe of the mean absolute shap value per feature, keeping only features with a value of at least the cutoff of 0.01:

In [5]:
explainer.get_mean_abs_shap_df(cutoff=0.01)

Calculating shap values...


Unnamed: 0,Feature,MEAN_ABS_SHAP
0,Sex,0.184124
1,Deck,0.052336
2,PassengerClass,0.042172
3,Fare,0.030872
4,Embarked,0.018526
5,Age,0.013807
6,No_of_parents_plus_children_on_board,0.011674
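As a rough illustration of what this table summarizes, the sketch below averages the absolute values of a small, made-up matrix of shap values per feature and then applies a cutoff. The numbers and the helper name `mean_abs_shap` are hypothetical and are not part of the library.

```python
# Hypothetical shap values: one row per observation, one column per feature.
feature_names = ["Sex", "Fare", "Age"]
shap_values = [
    [0.20, -0.03, 0.005],
    [-0.18, 0.04, -0.004],
    [0.16, -0.02, 0.006],
]

def mean_abs_shap(values, names, cutoff=0.0):
    """Return (feature, mean |shap|) pairs at or above cutoff, largest first."""
    n_rows = len(values)
    means = {
        name: sum(abs(row[j]) for row in values) / n_rows
        for j, name in enumerate(names)
    }
    kept = [(name, m) for name, m in means.items() if m >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

importances = mean_abs_shap(shap_values, feature_names, cutoff=0.01)
print(importances)
```

In this toy data the third feature averages 0.005, which is below the cutoff, so it is left out of the result.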


Get permutation importances (the decrease in the metric when a particular feature is randomly permuted):

In [7]:
explainer.get_permutation_importances_df(topx=5)

Unnamed: 0,Feature,Importance,Score
0,Sex,0.218207,0.670747
5,PassengerClass,0.03759,0.851364
1,Deck,0.025402,0.863553
3,Fare,0.015964,0.872991
2,Embarked,0.009977,0.878977
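In the table above, `Importance` and `Score` add up to the same value in every row (about 0.889), which is consistent with importance being the baseline metric minus the metric after permuting the feature. The sketch below shows that idea on toy data with a hypothetical stand-in model and plain accuracy as the metric. It is not the library's implementation.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, col, seed=0):
    """Drop in accuracy after shuffling the values of one column."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    # Rebuild the rows with only the chosen column replaced by shuffled values.
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return baseline - accuracy(model, X_perm, y)

# Toy data: column 0 fully determines the label, column 1 is noise.
X = [[i % 2, (i * 7) % 5] for i in range(200)]
y = [row[0] for row in X]

def model(row):
    """Hypothetical model that only looks at column 0."""
    return row[0]

imp_signal = permutation_importance(model, X, y, col=0)
imp_noise = permutation_importance(model, X, y, col=1)
print(imp_signal, imp_noise)
```

Shuffling the column the model relies on costs accuracy, while shuffling the ignored column changes nothing.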


### Plot mean absolute shap importances:

In [8]:
explainer.plot_importances(kind='shap', topx=5)

### Permutation importances showing top 6

In [10]:
explainer.plot_importances(kind='permutation', topx=6)

## detailed shap summary

Only show top 10 features, group onehot-encoded categorical features:

In [11]:
explainer.plot_shap_detailed(topx=10)

## interaction importances

### mean absolute shap interaction values for interactions with 'Sex' 
- the direct effect is usually the largest
- in this case PassengerClass shows the biggest interaction with gender

In [12]:
explainer.plot_interactions_importance('Sex', topx=5)

Calculating shap interaction values...
Reminder: TreeShap computational complexity is O(TLD^2), where T is the number of trees, L is the maximum number of leaves in any tree and D the maximal depth of any tree. So reducing these will speed up the calculation.
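To get a feel for the reminder above, the following sketch plugs numbers into the T·L·D² expression. It ignores constant factors and assumes the worst case of a full binary tree, where the number of leaves is at most 2^D. The function name `treeshap_cost` is made up for this illustration.

```python
def treeshap_cost(n_trees, max_leaves, max_depth):
    """Relative cost proportional to T * L * D^2 (constant factors ignored)."""
    return n_trees * max_leaves * max_depth ** 2

# The forest in this notebook: 50 trees of depth 5, so at most 2**5 leaves each.
base = treeshap_cost(n_trees=50, max_leaves=2 ** 5, max_depth=5)

# Same number of trees, but depth 10 (at most 2**10 leaves each).
deeper = treeshap_cost(n_trees=50, max_leaves=2 ** 10, max_depth=10)

ratio = deeper / base
print(base, deeper, ratio)
```

Under these assumptions, doubling the depth from 5 to 10 raises the upper bound on the cost by a factor of 128, which is why shallow trees keep the interaction calculations fast.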


In [13]:
explainer.plot_interactions_importance('Fare', topx=5)

### Detailed shap interactions summary:

In [14]:
explainer.plot_interactions_detailed("Sex")

## Contributions

In [15]:
index = 0 # explain prediction for first row of X_test
explainer.contrib_df(index, topx=8)

Unnamed: 0,col,contribution,value,cumulative,base
0,_BASE,0.393227,,0.393227,0.0
1,Sex,0.279251,female,0.672478,0.393227
2,PassengerClass,0.069782,1,0.74226,0.672478
3,Fare,0.060863,71.2833,0.803123,0.74226
4,Deck,0.042137,C,0.845261,0.803123
5,Embarked,0.039712,Cherbourg,0.884973,0.845261
6,Age,-0.015436,38.0000,0.869537,0.884973
7,No_of_parents_plus_children_on_board,0.006844,0,0.876381,0.869537
8,No_of_siblings_plus_spouses_on_board,0.005441,1,0.881822,0.876381
9,_REST,0.0,,0.881822,0.881822
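The `cumulative` column is a running sum of the `contribution` column, starting from the `_BASE` value. The sketch below reproduces it from the numbers in the table above. Small differences in the last digit come from the rounding of the displayed values.

```python
from itertools import accumulate

# (feature, contribution) pairs copied from the table above.
contributions = [
    ("_BASE", 0.393227),
    ("Sex", 0.279251),
    ("PassengerClass", 0.069782),
    ("Fare", 0.060863),
    ("Deck", 0.042137),
    ("Embarked", 0.039712),
    ("Age", -0.015436),
    ("No_of_parents_plus_children_on_board", 0.006844),
    ("No_of_siblings_plus_spouses_on_board", 0.005441),
    ("_REST", 0.0),
]

# Running total: base value plus each contribution in turn.
cumulative = list(accumulate(c for _, c in contributions))

for (name, contrib), total in zip(contributions, cumulative):
    print(f"{name:40s} {contrib:+.6f} -> {total:.6f}")

final_prediction = cumulative[-1]
```

The last running total is the predicted probability for this passenger, roughly 0.88.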


In [16]:
explainer.plot_contributions(index, topx=8)

In [17]:
name = test_names[6] # explain prediction for a passenger by name
print(name)
explainer.plot_contributions(name)

Vestrom, Miss. Hulda Amanda Adolfina


In [18]:
explainer.plot_contributions(name, topx=10, sort='high-to-low', orientation='horizontal')

## Shap dependence plots

In [19]:
explainer.plot_dependence("Age")

In [21]:
explainer.plot_dependence("Embarked")

### color by sex

In [20]:
explainer.plot_dependence("Age", color_col="Sex")

In [22]:
explainer.plot_dependence("Embarked", color_col="Sex")

### Highlight particular index

In [23]:

explainer.plot_dependence("Age", color_col="Sex", highlight_index=5)

## Shap interactions plots

In [24]:
explainer.plot_interaction("Sex", "PassengerClass")

In [25]:
explainer.plot_interaction("PassengerClass", "Sex")

In [26]:
explainer.plot_interaction("Fare", "Age", highlight_index=name)

In [27]:
explainer.plot_interaction("Age", "Fare")

## partial dependence plots (pdp)

### Plot average general partial dependence plot with ice lines for specific observations

In [28]:
explainer.plot_pdp("Fare")

In [31]:
explainer.plot_pdp("Deck", sort='alphabet')

In [32]:
explainer.plot_pdp("Deck", sort='freq')

### highlight pdp for specific observation

In [33]:
name = test_names[5]
print(name)
explainer.plot_pdp("Fare", name)

Saundercock, Mr. William Henry


### with default parameters:

In [34]:
explainer.plot_pdp("Age", index=5, drop_na=True, sample=100,
                    gridlines=100, gridpoints=10)

### adjusting parameters:

- `drop_na=False` keeps rows whose value equals `self.na_fill` (-999 by default) instead of dropping them
- `sample=200` randomly samples 200 rows for calculating the average
- `gridlines=10` displays 10 individual (ice) lines
- `gridpoints=50` takes 50 points along the x axis to calculate the lines

In [35]:
explainer.plot_pdp("Age", index=5, drop_na=False, sample=200,
                    gridlines=10, gridpoints=50)
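To make the roles of `sample`, `gridlines` and `gridpoints` more concrete, here is a minimal sketch of a partial dependence calculation on toy data. The model, the data and the evenly spaced grid are assumptions made for this illustration, and the library may choose its grid differently.

```python
import random

def toy_model(row):
    """Hypothetical model: prediction rises with age, higher for class 1."""
    age, pclass = row
    return 0.01 * age + (0.3 if pclass == 1 else 0.0)

def pdp_sketch(model, X, col, sample=100, gridlines=10, gridpoints=5, seed=0):
    """Return the grid, the averaged curve and a few individual (ice) lines."""
    rng = random.Random(seed)
    rows = rng.sample(X, min(sample, len(X)))
    values = [row[col] for row in rows]
    lo, hi = min(values), max(values)
    # Evenly spaced grid between the smallest and largest sampled value.
    grid = [lo + (hi - lo) * i / (gridpoints - 1) for i in range(gridpoints)]
    # One ice line per sampled row: predictions as the feature sweeps the grid.
    ice = [
        [model(row[:col] + [g] + row[col + 1:]) for g in grid]
        for row in rows
    ]
    # The partial dependence curve is the average over all ice lines.
    average = [sum(line[i] for line in ice) / len(ice) for i in range(gridpoints)]
    return grid, average, ice[:gridlines]

# Toy data: [age, passenger class].
X = [[float(20 + i % 40), 1 + i % 3] for i in range(300)]
grid, average, lines = pdp_sketch(toy_model, X, col=0,
                                  sample=100, gridlines=10, gridpoints=5)
print(grid)
print(average)
```

A larger `sample` gives a smoother average at the cost of more predictions, `gridpoints` controls the resolution along the x axis, and `gridlines` only limits how many individual lines are drawn.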

## Classification validation plots:

In [36]:
explainer.metrics(cutoff=0.8)

Calculating prediction probabilities...


{'accuracy': 0.75,
 'precision': 1.0,
 'recall': 0.3150684931506849,
 'f1': 0.4791666666666667,
 'roc_auc_score': 0.8889548053068708,
 'pr_auc_score': 0.8594408754346096,
 'log_loss': 0.415704739488635}
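The combination of perfect precision and low recall above is typical for a high cutoff: only the most confident predictions are labeled positive. The sketch below shows the effect on a handful of made-up probabilities. Whether the library compares with `>=` or `>` at the cutoff is an assumption here.

```python
def classification_metrics(y_true, y_prob, cutoff=0.5):
    """Accuracy, precision and recall after thresholding probabilities."""
    y_pred = [1 if p >= cutoff else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.85, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

default_metrics = classification_metrics(y_true, y_prob, cutoff=0.5)
strict_metrics = classification_metrics(y_true, y_prob, cutoff=0.8)
print(default_metrics)
print(strict_metrics)
```

Raising the cutoff from 0.5 to 0.8 removes the one false positive in this toy data, so precision rises to 1.0, but two true positives are now missed and recall drops to 0.5.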

In [37]:
explainer.prediction_result_df(test_names[3])

Unnamed: 0,label,probability
0,Not survived,0.523
1,Survived*,0.477


### confusion matrix

In [38]:
explainer.plot_confusion_matrix(cutoff=0.5, normalized=False, binary=True)

#### For multiclass classifiers, `binary=False` would display e.g. a 3x3 confusion matrix
- in this case it's a binary classifier, so binary=False makes no difference

### precision plot
- if the classifier works well the predicted probability should be the same as the observed probability per bin, so we would expect a nice straight line from 0 to 1

#### based on bin size:

In [39]:
explainer.plot_precision(bin_size=0.1)
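The idea behind the precision plot can be sketched in a few lines: group the predictions into bins of predicted probability and compare the mean predicted probability in each bin with the observed fraction of positives. The data and the helper `precision_by_bin` below are hypothetical, and the library's exact binning rules may differ.

```python
def precision_by_bin(y_true, y_prob, bin_size=0.1):
    """Mean predicted probability and observed positive rate per bin."""
    n_bins = round(1 / bin_size)
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # A probability of exactly 1.0 goes into the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((label, prob))
    result = []
    for idx, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        result.append({
            "bin_start": idx * bin_size,
            "predicted": sum(p for _, p in members) / len(members),
            "observed": sum(l for l, _ in members) / len(members),
            "count": len(members),
        })
    return result

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1, 1, 1]
y_prob = [0.05, 0.12, 0.18, 0.55, 0.52, 0.58, 0.91, 0.95]

calibration = precision_by_bin(y_true, y_prob, bin_size=0.25)
for row in calibration:
    print(row)
```

For a well-calibrated classifier the `predicted` and `observed` values would be close in every bin, which is the straight line described above.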

#### based on quantiles, showing all classes, adding in a cutoff value

In [40]:
explainer.plot_precision(quantiles=10, cutoff=0.75, multiclass=True)

### Cumulative precision

In [41]:
explainer.plot_cumulative_precision()

### lift curve

In [42]:
explainer.plot_lift_curve(cutoff=None, percentage=False, round=2)

In [43]:
explainer.plot_lift_curve(cutoff=0.75, percentage=True, round=2)

### Plot classification:

In [44]:
explainer.plot_classification()

In [45]:
explainer.plot_classification(cutoff=0.75, percentage=False)

### ROC AUC Curve

In [46]:
explainer.plot_roc_auc(cutoff=0.75)

### Plot PR AUC

In [47]:
explainer.plot_pr_auc(cutoff=0.25)

# RegressionExplainer

In [48]:
from explainerdashboard.datasets import titanic_fare
from sklearn.ensemble import RandomForestRegressor

X_train, y_train, X_test, y_test = titanic_fare()

model = RandomForestRegressor(n_estimators=50, max_depth=5)
model.fit(X_train, y_train)

train_names, test_names = titanic_names()

RandomForestRegressor(max_depth=5, n_estimators=50)

In [49]:
from explainerdashboard.datasets import titanic_fare, titanic_names, feature_descriptions
from explainerdashboard import RegressionExplainer

train_names, test_names = titanic_names()

explainer = RegressionExplainer(model, X_test, y_test, 
                                cats=['Sex', 'Deck', 'Embarked'], 
                                idxs=test_names, 
                                target='Fare',
                                descriptions=feature_descriptions,
                                units="$")

Note: shap=='guess' so guessing for RandomForestRegressor shap='tree'...
Changing class type to RandomForestRegressionExplainer...
Generating self.shap_explainer = shap.TreeExplainer(model)


## Importances

### Mean absolute shap importances:

In [51]:
explainer.plot_importances(kind='shap', topx=5, round=3)

### Permutation importances,  showing top 4

In [52]:
explainer.plot_importances(kind='permutation', topx=4, round=3)

Calculating importances...


## detailed shap summary

In [53]:
explainer.plot_shap_detailed(topx=10)

## interaction importances

### mean absolute shap interaction values for interactions with 'Sex' 
- the direct effect is usually the largest
- in this case PassengerClass shows the biggest interaction with gender

In [54]:
explainer.plot_interactions_importance('Sex', topx=5)

Calculating shap interaction values...
Reminder: TreeShap computational complexity is O(TLD^2), where T is the number of trees, L is the maximum number of leaves in any tree and D the maximal depth of any tree. So reducing these will speed up the calculation.


In [55]:
explainer.plot_interactions_importance('Age', topx=5)

### Detailed shap interactions summary:

In [56]:
explainer.plot_interactions_detailed("Sex")

## Contributions

In [58]:
index = 0 # explain prediction for first row of X_test
explainer.plot_contributions(index, topx=5, round=2)

In [59]:
name = test_names[0] # explain prediction for a passenger by name
print(name)
explainer.plot_contributions(name, sort='low-to-high', orientation='horizontal')

Cumings, Mrs. John Bradley (Florence Briggs Thayer)


## Shap dependence plots

In [60]:
explainer.plot_dependence("Age")

### color by sex

In [61]:
explainer.plot_dependence("Age", color_col="Sex")

### Highlight particular index

In [63]:

explainer.plot_dependence("Deck", color_col="Sex", highlight_index=5)

## Shap interactions plots

In [64]:
explainer.plot_interaction("Sex", "PassengerClass")

In [65]:
explainer.plot_interaction("PassengerClass", "Sex")

In [66]:
explainer.plot_interaction("PassengerClass", "Age", highlight_index=5)

## partial dependence plots (pdp)

### Plot average general partial dependence plot with ice lines for specific observations

In [67]:
explainer.plot_pdp("PassengerClass")

In [68]:
explainer.plot_pdp("Deck")

### highlight pdp for specific observation

In [69]:
name = test_names[17]
print(name)
explainer.plot_pdp("PassengerClass", name)

Hood, Mr. Ambrose Jr


### with default parameters:

In [70]:
explainer.plot_pdp("PassengerClass", index=17, drop_na=True, sample=100,
                    gridlines=100, gridpoints=10)

### adjusting parameters:

- `drop_na=False` keeps rows whose value equals `self.na_fill` (-999 by default) instead of dropping them
- `sample=200` randomly samples 200 rows for calculating the average
- `gridlines=10` displays 10 individual (ice) lines
- `gridpoints=50` takes 50 points along the x axis to calculate the lines

In [71]:
explainer.plot_pdp("PassengerClass", index=17, drop_na=False, sample=200,
                    gridlines=10, gridpoints=50)

## Regression validation plots:

In [72]:
explainer.metrics()

Calculating predictions...


{'root_mean_squared_error': 26.86549079230947,
 'mean_absolute_error': 12.4349403655276,
 'R-squared': 0.5060120132123052}
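For reference, the three metrics reported above follow their standard definitions. The sketch below computes them on a few made-up values using only the standard library. It is meant as a reminder of the formulas, not as the library's code.

```python
import math

def regression_metrics(y_true, y_pred):
    """RMSE, MAE and R-squared from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "root_mean_squared_error": math.sqrt(ss_res / n),
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "R-squared": 1 - ss_res / ss_tot,
    }

# Hypothetical observed and predicted fares.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

toy_metrics = regression_metrics(y_true, y_pred)
print(toy_metrics)
```

An RMSE that is much larger than the MAE, as in the output above (about 26.9 versus 12.4), points to a few large errors dominating the squared term.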

In [73]:
explainer.prediction_result_df(test_names[3])

Calculating residuals...


Unnamed: 0,Unnamed: 1,Fare
0,Predicted,18.776 $
1,Observed,11.133 $
2,Residual,-7.642 $


### predicted vs actual

In [74]:
explainer.plot_predicted_vs_actual()

In [75]:
explainer.plot_predicted_vs_actual(log_x=True, log_y=True)

### plot residuals

In [76]:
explainer.plot_residuals()

In [77]:
explainer.plot_residuals(vs_actual=True, residuals='ratio')

In [78]:
explainer.plot_residuals(vs_actual=True, residuals='log-ratio')


divide by zero encountered in log
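This is the warning NumPy emits when it takes the log of zero. The `prediction_result_df` output earlier shows that the residual is observed minus predicted (11.133 - 18.776 is about -7.642). Assuming the ratio variants are observed / predicted and log(observed / predicted), an observed fare of 0 would produce exactly this warning. The sketch below uses hypothetical values to show the three variants and where the log breaks down.

```python
import math

def residuals(observed, predicted, kind="difference"):
    """Residuals as difference, ratio or log-ratio (definitions assumed)."""
    if kind == "difference":
        return [o - p for o, p in zip(observed, predicted)]
    if kind == "ratio":
        return [o / p for o, p in zip(observed, predicted)]
    if kind == "log-ratio":
        # log(0) is undefined; return -inf for zero observations, which is
        # the value NumPy produces alongside its divide-by-zero warning.
        return [
            math.log(o / p) if o > 0 else float("-inf")
            for o, p in zip(observed, predicted)
        ]
    raise ValueError(f"unknown kind: {kind}")

# First pair taken from the table above, second pair is a made-up free ticket.
observed = [11.133, 0.0]
predicted = [18.776, 9.5]

diff = residuals(observed, predicted, kind="difference")
log_ratio = residuals(observed, predicted, kind="log-ratio")
print(diff)
print(log_ratio)
```

If the infinite values get in the way, one option is to filter out rows with an observed value of 0 before plotting the log-ratio.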



### residuals vs specific feature

In [79]:
explainer.plot_residuals_vs_feature("Age")

# RandomForestExplainer

For RandomForest models, the class type gets recast to either a `RandomForestClassifierExplainer` or a `RandomForestRegressionExplainer`, which provide some additional functionality to visualize the individual trees in the RandomForest.

In [80]:
from sklearn.ensemble import RandomForestClassifier
from explainerdashboard.datasets import titanic_survive, titanic_names, feature_descriptions

X_train, y_train, X_test, y_test = titanic_survive()
train_names, test_names = titanic_names()

model = RandomForestClassifier(n_estimators=50, max_depth=5)
model.fit(X_train, y_train)

explainer = ClassifierExplainer(model, X_test, y_test, 
                                cats=['Sex', 'Deck', 'Embarked'], 
                                idxs=test_names)

RandomForestClassifier(max_depth=5, n_estimators=50)

Note: shap=='guess' so guessing for RandomForestClassifier shap='tree'...
Detected RandomForestClassifier model: Changing class type to RandomForestClassifierExplainer...
Note: model_output=='probability', so assuming that raw shap output of RandomForestClassifier is in probability space...
Generating self.shap_explainer = shap.TreeExplainer(model)


In [81]:
name = test_names[20]
print(name)  # passenger at position 20 in test_names
explainer.plot_trees(name, highlight_tree=20)

Christmann, Mr. Emil


In [82]:
explainer.decisionpath_df(tree_idx=20, index=name)

Calculating ShadowDecTree for each individual decision tree...


Unnamed: 0,node_id,average,feature,value,split,direction,left,right,diff
0,0,0.380608,Sex_male,1.0,0.5,right,0.701195,0.197727,-0.182881
1,24,0.197727,Deck_C,0.0,0.5,left,0.184019,0.407407,-0.013708
2,25,0.184019,Deck_A,0.0,0.5,left,0.179361,0.5,-0.004658
3,26,0.179361,Age,29.0,45.5,left,0.203911,0.0,0.024549
4,27,0.203911,No_of_parents_plus_children_on_board,0.0,0.5,left,0.167785,0.383333,-0.036125
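The columns of this table fit together in a simple way: at each node the observation's `value` is compared with the `split`, the path goes left when the value is at or below the split (the scikit-learn convention) and right otherwise, and `diff` is the change in the node average along that step. The sketch below replays the path with the numbers copied from the table. Differences in the last digit come from rounding in the displayed output.

```python
# Rows copied from the table above:
# (feature, value, split, average of left child, average of right child)
path = [
    ("Sex_male", 1.0, 0.5, 0.701195, 0.197727),
    ("Deck_C", 0.0, 0.5, 0.184019, 0.407407),
    ("Deck_A", 0.0, 0.5, 0.179361, 0.5),
    ("Age", 29.0, 45.5, 0.203911, 0.0),
    ("No_of_parents_plus_children_on_board", 0.0, 0.5, 0.167785, 0.383333),
]

average = 0.380608  # average at the root node
steps = []
for feature, value, split, left, right in path:
    # scikit-learn trees send a sample left when value <= threshold.
    direction = "left" if value <= split else "right"
    new_average = left if direction == "left" else right
    steps.append((feature, direction, new_average - average))
    average = new_average

for feature, direction, diff in steps:
    print(f"{feature:40s} {direction:5s} {diff:+.6f}")
print("final average:", average)
```

Adding up the `diff` values moves the prediction of this tree from the root average of about 0.38 to the final node average of about 0.17.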


In [83]:
explainer.decisionpath_summary_df(tree_idx=5, index=name)

Unnamed: 0,Feature,Condition,Adjustment,New Prediction
0,,,Starting average,37.48%
1,Sex_female,0.0 < 0.5,-18.06%,19.43%
2,No_of_siblings_plus_spouses_on_board,0.0 < 3.0,+0.8%,20.23%
3,Fare,8.05 < 26.268750190734863,-7.34%,12.89%
4,Deck_Unkown,1.0 >= 0.5,-0.47%,12.42%
5,Embarked_Cherbourg,0.0 < 0.5,-2.23%,10.19%
6,,,Final Prediction,10.19%


## decision_path
- this graphic is generated by dtreeviz
- See https://explained.ai/decision-tree-viz/index.html for more of the thinking behind this visualization
- dtreeviz generates an SVG file that gets saved to disk
- but you need a working installation of graphviz for this to work
- Would be nice to turn this into a react component!


In [85]:
explainer.decisiontree(tree_idx=5, index=name)

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices