This notebook applies `XGBoost` and `LogisticRegressionCV` to the Mushroom Classification Dataset ([link](https://www.openml.org/d/24)) and explains their predictions using the Break Down method.

In [0]:
import pandas as pd
import xgboost as xgb
import sklearn
import sklearn.model_selection  # submodule used by train_test_split below
import numpy as np
import matplotlib.pyplot as plt
import dalex
import plotly

url = 'https://www.openml.org/data/get_csv/24/dataset_24_mushroom.arff'
np.random.seed(0)

In [3]:
# download, clean and show data
data = pd.read_csv(url).applymap(lambda s: s.replace("'", ""))
data

Unnamed: 0,cap-shape,cap-surface,cap-color,bruises%3F,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,class
0,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u,p
1,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g,e
2,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m,e
3,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u,p
4,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g,e
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,k,s,n,f,n,a,c,b,y,e,?,s,s,o,o,p,o,o,p,b,c,l,e
8120,x,s,n,f,n,a,c,b,y,e,?,s,s,o,o,p,n,o,p,b,v,l,e
8121,f,s,n,f,n,a,c,b,n,e,?,s,s,o,o,p,o,o,p,b,c,l,e
8122,k,y,n,f,y,f,c,n,b,t,?,s,k,w,w,p,w,o,e,w,v,l,p


In [4]:
# performing one hot encoding as all variables are categorical 
X = pd.get_dummies(data.drop(columns='class'))

# mushroom being poisonous is target variable
y = data['class'] == 'p'

# visualize input data
X

Unnamed: 0,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_f,cap-surface_g,cap-surface_s,cap-surface_y,cap-color_b,cap-color_c,cap-color_e,cap-color_g,cap-color_n,cap-color_p,cap-color_r,cap-color_u,cap-color_w,cap-color_y,bruises%3F_f,bruises%3F_t,odor_a,odor_c,odor_f,odor_l,odor_m,odor_n,odor_p,odor_s,odor_y,gill-attachment_a,gill-attachment_f,gill-spacing_c,gill-spacing_w,gill-size_b,gill-size_n,gill-color_b,gill-color_e,gill-color_g,...,stalk-color-below-ring_n,stalk-color-below-ring_o,stalk-color-below-ring_p,stalk-color-below-ring_w,stalk-color-below-ring_y,veil-type_p,veil-color_n,veil-color_o,veil-color_w,veil-color_y,ring-number_n,ring-number_o,ring-number_t,ring-type_e,ring-type_f,ring-type_l,ring-type_n,ring-type_p,spore-print-color_b,spore-print-color_h,spore-print-color_k,spore-print-color_n,spore-print-color_o,spore-print-color_r,spore-print-color_u,spore-print-color_w,spore-print-color_y,population_a,population_c,population_n,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,...,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,1,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,...,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1,1,0,0,0,0,...,0,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,...,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
8120,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,...,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
8121,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,0,0,...,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0
8122,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,1,1,0,0,...,0,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0


In [5]:
# 1. For the selected data set, train at least one tree-based ensemble model (random forest, gbm, catboost or any other boosting) 
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.2)
model = xgb.XGBClassifier().fit(X_train, y_train)

# reducing number of features to 20 by selecting most important features according to xgboost
most_important_features = X.columns[np.argsort(-model.feature_importances_)[:20]]
X_train = X_train[most_important_features]
X_test  = X_test[most_important_features]
model = xgb.XGBClassifier().fit(X_train, y_train)
print('selected features:', X_train.columns)

selected features: Index(['stalk-root_c', 'odor_n', 'stalk-root_r', 'odor_a', 'odor_l',
       'bruises%3F_f', 'stalk-color-below-ring_y', 'cap-color_y',
       'spore-print-color_r', 'stalk-surface-below-ring_y', 'odor_p', 'odor_f',
       'gill-size_b', 'spore-print-color_w', 'population_c',
       'stalk-surface-above-ring_k', 'gill-color_b',
       'stalk-surface-below-ring_f', 'cap-surface_f', 'cap-color_w'],
      dtype='object')


In [6]:
# 2. for some selected observation from this dataset, calculate the model predictions for model (1)
x1 = X_test.sample()
prediction = model.predict_proba(x1)
prediction

array([[0.9967704 , 0.00322963]], dtype=float32)

In [7]:
# 3. for an observation selected in (2), calculate the decomposition of model prediction using SHAP, Break Down or both (packages for R: DALEX, iml, packages for python: shap, dalex, piBreakDown).
explainer = dalex.Explainer(model, X_train, y_train)
breakdown = dalex.BreakDown()
breakdown.fit(explainer, x1)
breakdown.plot()

Preparation of a new explainer is initiated

  -> label             : not specified, model's class taken instead!
  -> data              : 6499 rows 20 cols
  -> target variable   :  Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 6499 values
  -> predict function  : <function yhat.<locals>.<lambda> at 0x7fac510aad90> will be used
  -> predicted values  : min = 0.0002348676, mean = 0.48357436, max = 0.99986565
  -> residual function : difference between y and yhat
  -> residuals         : min = -0.038753804, mean = 3.849652e-05, max = 0.11504048
  -> model_info        : package xgboost

A new explainer has been created!


![alt text](https://students.mimuw.edu.pl/~mb384493/plots/newplot.png)

In [8]:
# 4. find two observations in the data set, such that they have different most important variables (e.g. age and gender are the most important for observation A, but race and class for observation B)
# first two variables after the intercept row for the observation from (2)
features1 = set(breakdown.result.iloc[1:3].variable_name)
# sample test observations until the first two variables differ
while True:
    x2 = X_test.sample()
    breakdown = dalex.BreakDown()
    breakdown.fit(explainer, x2)
    features2 = set(breakdown.result.iloc[1:3].variable_name)
    if features1 != features2:
        print(features1, features2)
        breakdown.plot()
        break

{'odor_l', 'odor_a'} {'odor_n', 'odor_l'}


![alt text](https://students.mimuw.edu.pl/~mb384493/plots/newplot1.png)

In [9]:
# 5. select one variable and find two observations in the data set such that for one observation this variable has a positive effect and for the other a negative effect
feature = 'odor_n'

x_positive_effect = X_test.loc[2149]
breakdown.fit(explainer, x_positive_effect)
breakdown.result
contribution = breakdown.result.set_index('variable_name').loc[feature].contribution
print(f'feature: {feature} value: {x_positive_effect[feature]} contribution: {contribution}')
breakdown.plot()

x_negative_effect = X_test.loc[3756]
breakdown.fit(explainer, x_negative_effect)
breakdown.result
contribution = breakdown.result.set_index('variable_name').loc[feature].contribution
print(f'feature: {feature} value: {x_negative_effect[feature]} contribution: {contribution}')
breakdown.plot()

feature: odor_n value: 0 contribution: 0.3966618478298187


feature: odor_n value: 1 contribution: -0.5178038477897644


![alt text](https://students.mimuw.edu.pl/~mb384493/plots/newplot2.png)
![alt text](https://students.mimuw.edu.pl/~mb384493/plots/newplot3.png)

In [10]:
# 6. train a second model (of any class, neural nets, linear, other boosting) and find an observation for which BD/shap attributions are different between the models

import sklearn.linear_model
model2 = sklearn.linear_model.LogisticRegressionCV()
model2.fit(X_train, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

In [11]:
explainer2 = dalex.Explainer(model2, X_train, y_train)

#x3 = X_test.sample()
x3 = X_test.loc[6283:6283]
print(x3.index)

breakdown.fit(explainer, x3)
breakdown.plot()

breakdown.fit(explainer2, x3)
breakdown.plot()

Preparation of a new explainer is initiated

  -> label             : not specified, model's class taken instead!
  -> data              : 6499 rows 20 cols
  -> target variable   :  Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 6499 values
  -> predict function  : <function yhat.<locals>.<lambda> at 0x7fac4d4a6268> will be used
  -> predicted values  : min = 0.00012908499170542404, mean = 0.4836128825520352, max = 0.9999995435001882
  -> residual function : difference between y and yhat
  -> residuals         : min = -0.11113481338018112, mean = -1.9034570995308692e-08, max = 0.38274214904625015
  -> model_info        : package sklearn

A new explainer has been created!
Int64Index([6283], dtype='int64')


![alt text](https://students.mimuw.edu.pl/~mb384493/plots/newplot4.png)
![alt text](https://students.mimuw.edu.pl/~mb384493/plots/newplot5.png)

In [0]:
# 7. Comment on the results for points (4), (5) and (6)

Ad 4. Both Break Down plots show that the odor variable is important. The order in which the odor indicators appear is fairly arbitrary, owing to the random nature of ensemble models and the high correlation between the odor indicators.
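
The sensitivity to ordering can be reproduced on a toy example. The sketch below is not the `dalex` implementation: `break_down` is a hypothetical helper that fixes the variables of one observation one at a time, in a given order, and records the change in the mean prediction. The three-row background data, the `toy_model` rule and the column names are made up for illustration; they only mimic mutually exclusive one-hot odor indicators.

```python
import numpy as np

# Hypothetical one-hot columns (exactly one of them is 1 in every row).
COLUMNS = ["odor_a", "odor_f", "odor_n"]
background = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [0, 0, 1]], dtype=float)

def toy_model(data):
    # Made-up rule: "poisonous" unless the odor_a or odor_n indicator is set.
    return (1 - data[:, 0]) * (1 - data[:, 2])

def break_down(predict, data, x, order):
    """Fix the variables of x one by one and record the change in mean prediction."""
    data = data.copy()
    previous = predict(data).mean()
    contributions = {}
    for j in order:
        data[:, j] = x[j]              # set column j to the observation's value
        current = predict(data).mean()
        contributions[COLUMNS[j]] = current - previous
        previous = current
    return contributions

x = np.array([0.0, 0.0, 1.0])          # an observation with odor "n"
a_first = break_down(toy_model, background, x, order=[0, 1, 2])
n_first = break_down(toy_model, background, x, order=[2, 0, 1])
print(a_first)
print(n_first)
```

In both orders the contributions sum to the same total (the prediction minus the mean prediction), but `odor_n` receives -2/3 when it is fixed last and -1/3 when it is fixed first, while `odor_a` receives +1/3 or 0. With dependent indicators, which one gets the credit depends on the order.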

Ad 5. The explanations seem consistent: odor is one of the most important variables, so the values 0 and 1 of `odor_n` should have opposite effects. However, the large contribution (-0.52) of `odor_n = 1` in the negative example is balanced by other factors. A better model might be obtained with more careful feature preparation.
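
The opposite signs are easiest to see in the purely additive case. For an additive score (for example the logit of a logistic regression), fixing a variable changes the mean score by its coefficient times the difference between the fixed value and the variable's mean, so the values 0 and 1 give contributions of opposite sign whenever the indicator's mean lies strictly between 0 and 1. The coefficient and the share of `odor_n = 1` below are assumed values for illustration, not numbers taken from the fitted models, and the XGBoost model is not additive, so this is only an intuition.

```python
# Assumed, illustrative values (not estimated from the mushroom data).
coefficient = -2.0      # hypothetical weight of odor_n on the logit scale
mean_odor_n = 0.4       # hypothetical share of observations with odor_n = 1

def additive_contribution(value):
    # Change in the mean additive score when odor_n is fixed to `value`.
    return coefficient * (value - mean_odor_n)

contribution_when_0 = additive_contribution(0)
contribution_when_1 = additive_contribution(1)
print(contribution_when_0, contribution_when_1)
```

Under these assumptions `odor_n = 0` pushes the score up and `odor_n = 1` pushes it down, which matches the direction of the two contributions printed in step 5.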

Ad 6. Both models assign high importance to odor and agree that `odor_f=1` has a positive influence on the prediction. However, the odor indicators are highly correlated and, due to the random nature of ensembling, they all appear at the top of the `XGBClassifier` breakdown. `LogisticRegressionCV` fares slightly better thanks to regularization, which is a good way to handle dependent variables.
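
The effect of an L2 penalty on dependent variables can be illustrated with an extreme case: two identical columns. The sketch below does not use `LogisticRegressionCV`; `fit_logistic_l2` is a hypothetical plain gradient-descent fit of an L2-penalized logistic regression, and the synthetic data, penalty strength and step size are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.integers(0, 2, size=n).astype(float)   # a binary indicator
flip = rng.random(n) < 0.1                      # 10% label noise
y = np.where(flip, 1 - z, z)                    # target mostly equal to z
X = np.column_stack([z, z])                     # two perfectly dependent columns

def fit_logistic_l2(X, y, penalty, steps=5000, learning_rate=0.5):
    """Gradient descent on mean log-loss + penalty / 2 * ||w||^2 (intercept unpenalized)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= learning_rate * (X.T @ (p - y) / len(y) + penalty * w)
        b -= learning_rate * np.mean(p - y)
    return w, b

w, b = fit_logistic_l2(X, y, penalty=0.1)
print(w)
```

The penalty makes the solution unique and splits the weight equally between the two copies; without it, the data determine only the sum `w[0] + w[1]`, so the individual weights (and any attribution based on them) would be arbitrary.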