<a href="https://colab.research.google.com/github/pmontman/pub-choicemodels/blob/main/nb/tuto_08_lime.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 8: Using LIME to explain Machine Learning models




In this tutorial we will connect 'black box' machine learning models to the 
linear multinomial logit.

Specifically, we will use the more interpretable multinomial logit to explain 'parts' of a black box model. The idea is brilliant, and quite recent!

What we do is:
 * Simulate a 'what-if' scenario that is close to a particular observation that we want to explain.
 * Use the 'black box model' as ground truth: the predictions of the model for the what-if scenario are taken as the true choices.
 * Fit an explainable linear model to this 'what-if' scenario, and then interpret it.

We will use the Swissmetro dataset, fit an ML model (see Tutorial 7!) and then explain a particular observation with a multinomial logit.

---
---

# Preparing the environment
*The preparation and dataset loading code is given to the students*

In [1]:
!pip install biogeme



Load the packages, feel free to change the names.

In [2]:
import pandas  as pd
import numpy as np
import matplotlib.pyplot as plt

import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
import biogeme.expressions as exp
import biogeme.tools as tools

---
---

# Auxiliary functions

The first function takes the dictionary of utilities, a pandas dataframe, and the name of the column that contains the observed choices. It returns the biogeme object with the model and the estimated 'results' object (the one from which we get the parameter values, likelihoods, etc.).
We have added the dictionary with the utilities to the biogeme object, in case we use it later.

In [3]:
def qbus_estimate_bgm(V, pd_df, tgtvar_name, modelname='bgmdef'):
    # All alternatives are treated as available for every observation
    av_auto = V.copy()
    for key, value in av_auto.items():
        av_auto[key] = 1
    bgm_db = db.Database(modelname + '_db', pd_df)
    # Expose the dataset columns as global Biogeme variables
    globals().update(bgm_db.variables)
    logprob = models.loglogit(V, av_auto, bgm_db.variables[tgtvar_name])
    bgm_model = bio.BIOGEME(bgm_db, logprob)
    # Keep the utilities in the biogeme object in case we need them later
    bgm_model.utility_dic = V.copy()
    return bgm_model, bgm_model.estimate()

The next function calculates the predictions of a biogeme object that was estimated with `qbus_estimate_bgm`. The output is the array of choice probabilities, which can be used to calculate accuracies and confusion matrices and to evaluate what-if scenarios.

In [4]:
def qbus_simulate_bgm(qbus_bgm_model, betas, pred_pd_df):
    # All alternatives are treated as available for every observation
    av_auto = qbus_bgm_model.utility_dic.copy()
    for key, value in av_auto.items():
        av_auto[key] = 1

    # One logit probability expression per alternative
    targets = qbus_bgm_model.utility_dic.copy()
    for key, value in targets.items():
        targets[key] = models.logit(qbus_bgm_model.utility_dic, av_auto, key)

    bgm_db = db.Database('simul', pred_pd_df)
    globals().update(bgm_db.variables)
    bgm_pred_model = bio.BIOGEME(bgm_db, targets)
    # Evaluate the probabilities at the given parameter values
    simulatedValues = bgm_pred_model.simulate(betas)
    return simulatedValues

The function `qbus_calc_accu_confusion` calculates the accuracy and the confusion matrix given the predicted choice probabilities, a pandas dataset, and the name of the column that contains the actual choices in that dataset.

In [5]:
def qbus_calc_accu_confusion(sim_probs, pd_df, choice_var):
  which_max = sim_probs.idxmax(axis=1)
  data = {'y_Actual':   pd_df[choice_var],
          'y_Predicted': which_max
        }

  df = pd.DataFrame(data, columns=['y_Actual','y_Predicted'])
  confusion_matrix = pd.crosstab(df['y_Actual'], df['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])
  accu = np.mean(which_max == pd_df[choice_var])
  return accu, confusion_matrix 
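
To make the logic of `qbus_calc_accu_confusion` concrete, here is a minimal sketch on made-up numbers. The names `toy_probs` and `toy_data` and all values in them are hypothetical, used only for illustration: the predicted alternative is the column with the highest probability, and the accuracy is the share of rows where it matches the observed choice.

```python
import numpy as np
import pandas as pd

# Hypothetical choice probabilities: four observations, three alternatives.
toy_probs = pd.DataFrame(
    {1: [0.7, 0.2, 0.1, 0.3],
     2: [0.2, 0.5, 0.3, 0.4],
     3: [0.1, 0.3, 0.6, 0.3]}
)
# Hypothetical observed choices for the same four observations.
toy_data = pd.DataFrame({'CHOICE': [1, 2, 3, 3]})

# Predicted alternative = column label with the highest probability.
toy_pred = toy_probs.idxmax(axis=1)

# Share of correct predictions and cross-tabulation of actual vs predicted.
toy_accu = np.mean(toy_pred == toy_data['CHOICE'])
toy_confusion = pd.crosstab(toy_data['CHOICE'], toy_pred,
                            rownames=['Actual'], colnames=['Predicted'])
print(toy_accu)
print(toy_confusion)
```

Keep in mind that `pd.crosstab` only creates columns for alternatives that are predicted at least once, so a confusion matrix can have fewer columns than rows.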

The next function calculates the likelihood ratio test with a bit less code than the default biogeme function. The arguments are the results objects of the two models to be compared. The first is the more complex model and the second is the reference model (**the order is important!**). The third argument is the significance level for the test.

In [6]:
def qbus_likeli_ratio_test_bgm(results_complex, results_reference, signif_level):
  return tools.likelihood_ratio_test( (results_complex.data.logLike, results_complex.data.nparam),
                                     (results_reference.data.logLike, results_reference.data.nparam), signif_level)

The next function just updates the globals so that the column names of the dataset can be used as Biogeme variables when we write the utility functions.

In [7]:
def qbus_update_globals_bgm(pd_df):
   globals().update(db.Database('tmp_bg_bgm_for_glob', pd_df).variables)

# Exercise: Try a baseline multinomial logit, decision tree and multilayer perceptron models on the Swissmetro dataset and compare the results.

Load the dataset as usual

In [8]:
swissmetro = pd.read_csv('http://transp-or.epfl.ch/data/swissmetro.dat', sep='\t')

Clean the dataset as instructed in Biogeme's example

In [9]:
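# Keep only observations with a valid choice; copy to avoid modifying a view
swissmetro = swissmetro.loc[swissmetro['CHOICE'] != 0, :].copy()
# Travel cost is zero for holders of the annual season ticket (GA)
swissmetro['TRAIN_CO_GA'] = swissmetro['TRAIN_CO'] * (swissmetro['GA'] == 0)
swissmetro['SM_CO_GA'] = swissmetro['SM_CO'] * (swissmetro['GA'] == 0)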

In [10]:
swissmetro = swissmetro.drop(['TRAIN_AV', 'SM_AV', 'CAR_AV', 'ID'], axis=1)

Fit a base MNL model

In [11]:
qbus_update_globals_bgm(swissmetro)

In [12]:
from sklearn.model_selection import train_test_split
sw_train, sw_test = train_test_split(swissmetro, test_size = 0.25, random_state = 3840)

In [13]:
ASC_CAR = exp.Beta ( 'ASC_CAR' ,0, None , None ,0)
ASC_TRAIN = exp.Beta ( 'ASC_TRAIN' ,0, None , None ,0)
ASC_SM = exp.Beta ( 'ASC_SM' ,0, None , None ,1)
B_TIME = exp.Beta ( 'B_TIME' ,0, None , None ,0)
B_COST = exp.Beta ( 'B_COST' ,0, None , None ,0)
B_MALE_TR = exp.Beta( 'B_MALE_TR', 0, None, None, 0)
B_MALE_SM = exp.Beta( 'B_MALE_SM', 0, None, None, 0)
B_MALE_CAR = exp.Beta( 'B_MALE_CAR', 0, None, None, 0)
B_INCOME_TR = exp.Beta( 'B_INCOME_TR', 0, None, None, 0)
B_INCOME_SM = exp.Beta( 'B_INCOME_SM', 0, None, None, 0)
B_INCOME_CAR = exp.Beta( 'B_INCOME_CAR', 0, None, None, 0)

V1 = ASC_TRAIN + B_TIME * TRAIN_TT + B_COST * TRAIN_CO_GA + B_MALE_TR*MALE + B_INCOME_TR*INCOME
V2 = ASC_SM + B_TIME * SM_TT + B_COST * SM_CO_GA + B_MALE_SM*MALE + B_INCOME_SM*INCOME
V3 = ASC_CAR + B_TIME * CAR_TT + B_COST * CAR_CO + B_MALE_CAR*MALE+ B_INCOME_CAR*INCOME

V_sw = {1: V1, 2:V2, 3:V3}

In [14]:
model_sw, results_sw = qbus_estimate_bgm(V_sw, sw_train, 'CHOICE', 'swismmetro_mnl')

In [15]:
results_sw.getEstimatedParameters()

Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
ASC_CAR,-1.256474,0.086796,-14.476208,0.0,0.089057,-14.108591,0.0
ASC_TRAIN,-0.428599,0.092994,-4.608901,4e-06,0.09751,-4.395431,1.1e-05
B_COST,-0.004257,0.000328,-12.989409,0.0,0.00036,-11.837294,0.0
B_INCOME_CAR,-0.189247,0.019507,-9.701569,0.0,0.019964,-9.479468,0.0
B_INCOME_SM,-0.216977,0.016702,-12.990706,0.0,0.017415,-12.45921,0.0
B_INCOME_TR,-0.318761,0.022958,-13.884293,0.0,0.026131,-12.198604,0.0
B_MALE_CAR,0.582749,0.045574,12.786766,0.0,0.048761,11.951064,0.0
B_MALE_SM,-0.04555,0.036766,-1.238902,0.215382,0.037668,-1.209239,0.226571
B_MALE_TR,-0.92458,0.046589,-19.84548,0.0,0.048017,-19.255273,0.0
B_TIME,-0.003403,0.000332,-10.246657,0.0,0.000358,-9.497965,0.0


Accuracy in the train set

In [16]:
sw_train_sim = qbus_simulate_bgm(model_sw, results_sw.getBetaValues(), sw_train)
qbus_calc_accu_confusion(sw_train_sim, sw_train, 'CHOICE')

(0.5610150516233362, Predicted     2    3
 Actual              
 1          1023   80
 2          4411  215
 3          2211   99)

Accuracy in the test set

In [17]:
sw_test_sim = qbus_simulate_bgm(model_sw, results_sw.getBetaValues(), sw_test)
qbus_calc_accu_confusion(sw_test_sim, sw_test, 'CHOICE')

(0.5809701492537314, Predicted     2   3
 Actual             
 1           290  30
 2          1517  73
 3           730  40)

# Decision tree

In [18]:
# Import the models (the random forest is kept only as a commented-out alternative)
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Alternative: a random forest with 1000 decision trees
#dec_tree = RandomForestClassifier(n_estimators = 1000, random_state = 42, max_features= None)

# The model we actually use: a single decision tree with maximum depth 14
dec_tree = DecisionTreeClassifier(max_features=None, max_depth=14, random_state=3840)

In [19]:
dec_tree.fit(sw_train.drop('CHOICE', axis=1), pd.get_dummies(sw_train['CHOICE']));

In [20]:
# Use the predict method on the test data
predictions = dec_tree.predict(sw_test.drop('CHOICE', axis=1))
predictions

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       ...,
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0]], dtype=uint8)

In [21]:
dec_tree_sim = pd.DataFrame(predictions, columns=[1, 2, 3])

In [22]:
dec_tree_sim.index = sw_test.index

And finally, we can compute the accuracy. We see that it is considerably higher than that of the MNL model.

In [23]:
qbus_calc_accu_confusion(dec_tree_sim, sw_test, 'CHOICE')

(0.7067164179104478, Predicted    1     2    3
 Actual                   
 1          173   129   18
 2          154  1225  211
 3           40   234  496)

# Neural Network

In [24]:
from sklearn.neural_network import MLPClassifier

neurnet = MLPClassifier(hidden_layer_sizes = (256),
                        activation='logistic',  max_iter=12000, random_state=3840)

In [25]:
neurnet.fit(sw_train.drop('CHOICE', axis=1), pd.get_dummies(sw_train['CHOICE']));

In [26]:
# Note: from here on, sw_test refers to the training set
sw_test = sw_train

In [27]:
# Use the predict method (sw_test now points to the training data)
predictions = neurnet.predict(sw_test.drop('CHOICE', axis=1))
predictions

array([[0, 1, 0],
       [0, 0, 1],
       [0, 1, 0],
       ...,
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 0]])

In [28]:
neurnet_sim = pd.DataFrame(predictions, columns=[1, 2, 3])

In [29]:
neurnet_sim.index = sw_test.index

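The accuracy of the neural network is also higher than that of the MNL (but remember that it is a special case of mixed logit!). Note that `sw_test` was reassigned to `sw_train` above, so this accuracy is measured on the training set and is not directly comparable with the test-set accuracies of the other models.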

In [30]:
qbus_calc_accu_confusion(neurnet_sim, sw_test, 'CHOICE')

(0.7706182360990173, Predicted    1     2     3
 Actual                    
 1          788   275    40
 2          508  3638   480
 3          198   343  1769)

# LIME


1) Select your observation of interest for which you want to have an explanation of its black box prediction.
2) Perturb your dataset and get the black box predictions for these new points.
3) Train an interpretable model on the dataset with the variations.
4) Explain the prediction of the original by interpreting the local model.

*Material from [Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/lime.html) (Christoph Molnar).*
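
The four steps above can be sketched on a toy problem before we apply them to Swissmetro. This is only an illustrative sketch under assumptions: `black_box` is a hypothetical function standing in for an ML model, the surrogate is an ordinary least-squares line rather than a multinomial logit, and all perturbed points are weighted equally (LIME as usually described also weights them by their proximity to the observation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black box: a nonlinear function of two features.
def black_box(X):
    return np.sin(X[:, 0]) + X[:, 1] ** 2

# 1) Observation of interest.
x0 = np.array([0.0, 1.0])

# 2) Perturb around x0 and take the black box outputs as ground truth.
X_neigh = x0 + rng.normal(scale=0.1, size=(500, 2))
y_neigh = black_box(X_neigh)

# 3) Fit an interpretable linear surrogate on the perturbed sample.
design = np.column_stack([np.ones(len(X_neigh)), X_neigh - x0])
coef, *_ = np.linalg.lstsq(design, y_neigh, rcond=None)

# 4) Interpret: coef[1:] approximate the local slopes of the black box at x0.
print(coef)
```

In this toy case the slopes should be close to the true local derivatives at `x0`, which are 1 for the first feature and 2 for the second.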

The observation we want to explain

In [68]:
observ = sw_test.iloc[[0,]]
observ

Unnamed: 0,GROUP,SURVEY,SP,PURPOSE,FIRST,TICKET,WHO,LUGGAGE,AGE,MALE,INCOME,GA,ORIGIN,DEST,TRAIN_TT,TRAIN_CO,TRAIN_HE,SM_TT,SM_CO,SM_HE,SM_SEATS,CAR_TT,CAR_CO,CHOICE,TRAIN_CO_GA,SM_CO_GA
8273,3,1,1,3,1,3,2,1,4,1,3,0,22,17,300,202,30,143,274,20,0,312,140,2,202,274


The prediction of the model: it predicts 2 = Swissmetro.

In [69]:
dec_tree.predict( observ.drop('CHOICE', axis=1))

array([[0, 1, 0]], dtype=uint8)

Create a 'new dataset' by creating synthetic observations that are 'close' to the observation we want to explain, but not exactly the same.

In [70]:
#sw_train.mean(axis=0)

In [71]:
stdcols = sw_train.std(axis=0)
stdcols

GROUP             0.481452
SURVEY            0.481452
SP                0.000000
PURPOSE           1.158699
FIRST             0.499305
TICKET            2.199147
WHO               0.709973
LUGGAGE           0.605438
AGE               1.030948
MALE              0.432337
INCOME            0.942083
GA                0.347465
ORIGIN           10.156168
DEST              9.735157
TRAIN_TT         76.908112
TRAIN_CO       1080.259882
TRAIN_HE         37.463203
SM_TT            52.761652
SM_CO          1437.426119
SM_HE             8.148195
SM_SEATS          0.323128
CAR_TT           86.710388
CAR_CO           55.099929
CHOICE            0.634084
TRAIN_CO_GA      68.550145
SM_CO_GA         84.500802
dtype: float64

In [72]:
import random
random.gauss(0, stdcols['SURVEY'])*0.1

0.01457504689270748

In [88]:
NEIGHBOURHOOD = 0.3


In [89]:

neighb_observ = pd.DataFrame(np.repeat(observ.values, 1000, axis=0))
neighb_observ.columns = observ.columns

neighb_observ

Unnamed: 0,GROUP,SURVEY,SP,PURPOSE,FIRST,TICKET,WHO,LUGGAGE,AGE,MALE,INCOME,GA,ORIGIN,DEST,TRAIN_TT,TRAIN_CO,TRAIN_HE,SM_TT,SM_CO,SM_HE,SM_SEATS,CAR_TT,CAR_CO,CHOICE,TRAIN_CO_GA,SM_CO_GA
0,3,1,1,3,1,3,2,1,4,1,3,0,22,17,300,202,30,143,274,20,0,312,140,2,202,274
1,3,1,1,3,1,3,2,1,4,1,3,0,22,17,300,202,30,143,274,20,0,312,140,2,202,274
2,3,1,1,3,1,3,2,1,4,1,3,0,22,17,300,202,30,143,274,20,0,312,140,2,202,274
3,3,1,1,3,1,3,2,1,4,1,3,0,22,17,300,202,30,143,274,20,0,312,140,2,202,274
4,3,1,1,3,1,3,2,1,4,1,3,0,22,17,300,202,30,143,274,20,0,312,140,2,202,274
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,3,1,1,3,1,3,2,1,4,1,3,0,22,17,300,202,30,143,274,20,0,312,140,2,202,274
996,3,1,1,3,1,3,2,1,4,1,3,0,22,17,300,202,30,143,274,20,0,312,140,2,202,274
997,3,1,1,3,1,3,2,1,4,1,3,0,22,17,300,202,30,143,274,20,0,312,140,2,202,274
998,3,1,1,3,1,3,2,1,4,1,3,0,22,17,300,202,30,143,274,20,0,312,140,2,202,274


In [90]:
import random

# Work with floats so that adding Gaussian noise does not clash with integer columns
neighb_observ = neighb_observ.astype(float)

# Perturb every column of every copy with Gaussian noise whose standard deviation
# is a fraction (NEIGHBOURHOOD) of that column's standard deviation in the train set
for i in neighb_observ.index:
    for column in neighb_observ.columns:
        neighb_observ.at[i, column] += random.gauss(0, stdcols[column] * NEIGHBOURHOOD)
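
The same kind of perturbation can also be written without explicit loops. The sketch below is one possible vectorized version, not the code used to produce the outputs in this tutorial. The names `toy_observ`, `toy_std` and `toy_neighbourhood` are hypothetical stand-ins for `observ`, `stdcols` and `NEIGHBOURHOOD`, with only three columns and rounded values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3840)

# Hypothetical stand-ins for `observ`, `stdcols` and NEIGHBOURHOOD.
toy_observ = pd.DataFrame({'TRAIN_TT': [300.0], 'SM_TT': [143.0], 'SP': [1.0]})
toy_std = pd.Series({'TRAIN_TT': 76.9, 'SM_TT': 52.8, 'SP': 0.0})
toy_neighbourhood = 0.3
n_copies = 1000

# Repeat the observation n_copies times.
base = np.repeat(toy_observ.to_numpy(), n_copies, axis=0)

# One noise scale per column, aligned with the column order of the observation.
scale = toy_std[toy_observ.columns].to_numpy() * toy_neighbourhood

# Standard normal draws, scaled column by column through broadcasting.
noise = rng.normal(size=base.shape) * scale
toy_neighb = pd.DataFrame(base + noise, columns=toy_observ.columns)

print(toy_neighb.describe().loc[['mean', 'std']])
```

A column with zero standard deviation, such as `SP`, is left unchanged, exactly as in the loop version.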

In [91]:
neighb_preds = dec_tree.predict( neighb_observ.drop('CHOICE', axis=1))

In [92]:
neighb_choices = pd.DataFrame(neighb_preds).idxmax(axis=1) + 1
neighb_observ['CHOICE'] = neighb_choices
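
One detail of this conversion is worth checking on a small example. Because the classifiers were trained on one-hot targets, a prediction row can in principle contain no 1 at all (the neural network output above ends with such a row, `[0, 0, 0]`). The sketch below uses a hypothetical `toy_preds` array to show what `idxmax` does in that case.

```python
import pandas as pd

# Hypothetical one-hot predictions; the middle row selects no alternative.
toy_preds = [[0, 1, 0],
             [0, 0, 0],
             [0, 0, 1]]

# idxmax returns the first column holding the row maximum, so an all-zero
# row maps to column 0, that is, alternative 1 after the +1 shift.
toy_choices = pd.DataFrame(toy_preds).idxmax(axis=1) + 1
print(toy_choices.tolist())  # [2, 1, 3]

# Flag the rows where no alternative was selected at all.
no_choice = pd.DataFrame(toy_preds).sum(axis=1) == 0
print(no_choice.tolist())  # [False, True, False]
```

If such rows appear among the neighbourhood predictions, they are silently counted as choices of alternative 1, so it can be useful to count them before fitting the local model.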

In [93]:
model_lime, results_lime = qbus_estimate_bgm(V_sw, neighb_observ, 'CHOICE', 'lime_mnl')

In [94]:
results_lime.getEstimatedParameters()

Unnamed: 0,Value,Std err,t-test,p-value,Rob. Std err,Rob. t-test,Rob. p-value
ASC_CAR,-1.816508,0.592694,-3.064834,0.002177905,0.602709,-3.013907,0.002579066
ASC_TRAIN,-3.640175,0.686685,-5.301085,1.151166e-07,0.706161,-5.15488,2.537928e-07
B_COST,-0.007796,0.002129,-3.662012,0.0002502423,0.002157,-3.614231,0.0003012407
B_INCOME_CAR,-0.529223,0.112977,-4.684329,2.808779e-06,0.113125,-4.678192,2.894152e-06
B_INCOME_SM,-0.283251,0.096162,-2.945561,0.003223691,0.096155,-2.945778,0.003221435
B_INCOME_TR,0.087489,0.144891,0.603826,0.5459593,0.145234,0.602399,0.5469083
B_MALE_CAR,0.034569,0.112699,0.306732,0.7590473,0.112779,0.306515,0.7592128
B_MALE_SM,-0.18508,0.095914,-1.929643,0.05365106,0.09581,-1.931739,0.05339167
B_MALE_TR,-0.236869,0.144347,-1.640965,0.1008048,0.144495,-1.63929,0.1011528
B_TIME,0.001672,0.002035,0.821383,0.411428,0.002055,0.813562,0.4158958


In [95]:
observ_sim = qbus_simulate_bgm(model_lime, results_lime.getBetaValues(), observ)
observ_sim

Unnamed: 0,1,2,3
8273,0.112322,0.650284,0.237394
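
To see how the local model arrives at these probabilities, we can recompute them by hand. The sketch below copies the rounded estimates from the `results_lime` table and the attribute values of observation 8273 into plain dictionaries (the names `b` and `x` are just illustrative), builds the three utilities as in `V_sw`, and applies the logit formula. Since the estimates are rounded, the result should only approximately match the simulated values.

```python
import numpy as np

# Rounded estimates copied from the results_lime table (ASC_SM is fixed to 0).
b = {'ASC_TRAIN': -3.640175, 'ASC_SM': 0.0, 'ASC_CAR': -1.816508,
     'B_TIME': 0.001672, 'B_COST': -0.007796,
     'B_MALE_TR': -0.236869, 'B_MALE_SM': -0.18508, 'B_MALE_CAR': 0.034569,
     'B_INCOME_TR': 0.087489, 'B_INCOME_SM': -0.283251, 'B_INCOME_CAR': -0.529223}

# Attribute values of observation 8273.
x = {'TRAIN_TT': 300, 'TRAIN_CO_GA': 202, 'SM_TT': 143, 'SM_CO_GA': 274,
     'CAR_TT': 312, 'CAR_CO': 140, 'MALE': 1, 'INCOME': 3}

# Utilities with the same specification as V1, V2 and V3.
V1 = (b['ASC_TRAIN'] + b['B_TIME'] * x['TRAIN_TT'] + b['B_COST'] * x['TRAIN_CO_GA']
      + b['B_MALE_TR'] * x['MALE'] + b['B_INCOME_TR'] * x['INCOME'])
V2 = (b['ASC_SM'] + b['B_TIME'] * x['SM_TT'] + b['B_COST'] * x['SM_CO_GA']
      + b['B_MALE_SM'] * x['MALE'] + b['B_INCOME_SM'] * x['INCOME'])
V3 = (b['ASC_CAR'] + b['B_TIME'] * x['CAR_TT'] + b['B_COST'] * x['CAR_CO']
      + b['B_MALE_CAR'] * x['MALE'] + b['B_INCOME_CAR'] * x['INCOME'])

# Logit probabilities: exponentiated utilities normalised to sum to one.
V = np.array([V1, V2, V3])
probs = np.exp(V) / np.exp(V).sum()
print(probs)
```

The highest probability goes to alternative 2 (Swissmetro), which agrees with the prediction of the decision tree for this observation.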
