# Benchmarking different CF explanation methods

In this notebook, we compare the runtimes of different model-agnostic counterfactual (CF) explanation methods. Currently, we support three model-agnostic explanation methods:
1. Random-Sampling
2. Genetic Algorithm
3. Querying a KD tree

In [1]:
import dice_ml
from dice_ml.utils import helpers # helper functions
from dice_ml import Dice

import numpy as np
import pandas as pd
import timeit
import random
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score

## Loading dataset

We use the "adult" income dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/adult). For demonstration purposes, we transform the data as described in the `dice_ml.utils.helpers` module.

In [2]:
dataset = helpers.load_adult_income_dataset()

In [3]:
dataset.head()

,age,workclass,education,marital_status,occupation,race,gender,hours_per_week,income
0,39,Government,Bachelors,Single,White-Collar,White,Male,40,0
1,50,Self-Employed,Bachelors,Married,White-Collar,White,Male,13,0
2,38,Private,HS-grad,Divorced,Blue-Collar,White,Male,40,0
3,53,Private,School,Married,Blue-Collar,Other,Male,40,0
4,28,Private,Bachelors,Married,Professional,Other,Female,40,0


In [4]:
d = dice_ml.Data(dataframe=dataset, continuous_features=['age', 'hours_per_week'], outcome_name='income')

## Training the ML model

Currently, the genetic algorithm and KD tree methods work with scikit-learn models. Support for TensorFlow 1 and 2 and for PyTorch will be implemented soon.

We train two scikit-learn MLPs here: one on one-hot-encoded categorical variables and another on label-encoded categorical variables. Two models are needed because the random-sampling and KD tree (DiceKD) explanation methods use one-hot encoding, while the genetic (DiceGenetic) explanation method uses label encoding. We plan to support other types of encoding in the near future.
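
To make the difference between the two encodings concrete, here is a minimal pure-Python sketch on a few `workclass` values. It is only an illustration and does not reproduce how DiCE encodes data internally; the helper names `label_encode` and `one_hot_encode` are hypothetical.

```python
def label_encode(values):
    """Map each category to an integer code (codes follow sorted category order)."""
    categories = sorted(set(values))
    code_of = {cat: idx for idx, cat in enumerate(categories)}
    return [code_of[v] for v in values], categories


def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector with one position per category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


workclass = ["Government", "Self-Employed", "Private", "Private"]

labels, label_categories = label_encode(workclass)
one_hot, one_hot_categories = one_hot_encode(workclass)

print(label_categories)  # ['Government', 'Private', 'Self-Employed']
print(labels)            # [0, 2, 1, 1]
print(one_hot)           # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 0]]
```

Label encoding keeps one column per feature but imposes an arbitrary ordering on the categories, whereas one-hot encoding adds one column per category and imposes no ordering.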

### One-hot encoding

In [5]:
train_ohe, test_ohe = d.split_data(d.normalize_data(d.one_hot_encoded_data))
X_train_ohe = train_ohe.loc[:, train_ohe.columns != 'income']
y_train_ohe = train_ohe.loc[:, train_ohe.columns == 'income']
X_test_ohe = test_ohe.loc[:, test_ohe.columns != 'income']
y_test_ohe = test_ohe.loc[:, test_ohe.columns == 'income']

In [6]:
mlp_ohe = MLPClassifier(hidden_layer_sizes=20, alpha=0.001, learning_rate_init=0.01, batch_size=32, random_state=17,
                        max_iter=20, verbose=False, validation_fraction=0.2)  # max_iter is the number of epochs (as in TF)
mlp_ohe.fit(X_train_ohe, y_train_ohe.values.ravel())



MLPClassifier(alpha=0.001, batch_size=32, hidden_layer_sizes=20,
              learning_rate_init=0.01, max_iter=20, random_state=17,
              validation_fraction=0.2)

In [7]:
backend = 'sklearn'

In [8]:
# provide the trained ML model to DiCE's model object
m_ohe = dice_ml.Model(model=mlp_ohe, backend=backend)

### Label encoding

In [9]:
train_lbl, test_lbl = d.split_data(d.normalize_data(d.label_encoded_data, encoding='label'))
X_train_lbl = train_lbl.loc[:, train_lbl.columns != 'income']
y_train_lbl = train_lbl.loc[:, train_lbl.columns == 'income']
X_test_lbl = test_lbl.loc[:, test_lbl.columns != 'income']
y_test_lbl = test_lbl.loc[:, test_lbl.columns == 'income']

In [10]:
mlp_lbl = MLPClassifier(hidden_layer_sizes=20, alpha=0.001, learning_rate_init=0.01, batch_size=32, random_state=17,
                        max_iter=20, verbose=False, validation_fraction=0.2)  # max_iter is the number of epochs (as in TF)
mlp_lbl.fit(X_train_lbl, y_train_lbl.values.ravel())



MLPClassifier(alpha=0.001, batch_size=32, hidden_layer_sizes=20,
              learning_rate_init=0.01, max_iter=20, random_state=17,
              validation_fraction=0.2)

In [11]:
# provide the trained ML model to DiCE's model object
m_lbl = dice_ml.Model(model=mlp_lbl, backend=backend)

## Initialize counterfactual generation methods

We now initialize all three counterfactual generation methods.

In [12]:
exp = Dice(d, m_ohe, method="random")

In [13]:
exp_genetic = Dice(d, m_lbl, method="genetic")

In [14]:
exp_KD = Dice(d, m_ohe, method="kdtree")

In [15]:
query_instance = {'age':22, 
                  'workclass':'Private', 
                  'education':'HS-grad', 
                  'marital_status':'Single', 
                  'occupation':'Service',
                  'race': 'White', 
                  'gender':'Female', 
                  'hours_per_week': 45}

## Generate Counterfactuals

We now generate counterfactuals using all three methods and measure the runtime of each. You can change the number of loops (```num_loops```) and the number of diverse counterfactuals to generate per query (```k```).
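
The measurement pattern used in the following cells (read a timer, make the call, add the difference to a running total) can be sketched in isolation. This is a minimal sketch using only the standard library; `time_call` and `slow_sum` are hypothetical names, and `slow_sum` merely stands in for a call such as `generate_counterfactuals`.

```python
import timeit


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = timeit.default_timer()
    result = func(*args, **kwargs)
    elapsed = timeit.default_timer() - start
    return result, elapsed


def slow_sum(n):
    # Hypothetical stand-in for an explainer call: adds n to itself n times.
    return sum(n for _ in range(n))


num_loops = 10
total_elapsed = 0.0
for _ in range(num_loops):
    result, elapsed = time_call(slow_sum, 1000)
    total_elapsed += elapsed  # accumulate only the time spent inside the call

print(f"total: {total_elapsed:.6f} s, mean per call: {total_elapsed / num_loops:.6f} s")
```

Timing only the call itself keeps setup work, such as drawing a new random query instance, out of the measurement.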

In [16]:
num_loops = 10
k = 3

In [17]:
elapsed_random = 0
elapsed_kd = 0
elapsed_genetic = 0

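for i in range(num_loops):
    # draw a fresh random query instance for every loop iteration
    for q in query_instance:
        if q in d.categorical_feature_names:
            # pick one of the category values observed in the dataset
            query_instance[q] = random.choice(dataset[q].unique())
        else:
            # sample uniformly between the observed minimum and maximum
            query_instance[q] = np.random.uniform(dataset[q].min(), dataset[q].max())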
    
    start_time = timeit.default_timer()
    dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=k, desired_class="opposite", verbose=False)
    elapsed_random += timeit.default_timer() - start_time    
    
    start_time = timeit.default_timer()
    dice_exp = exp_genetic.generate_counterfactuals(query_instance, total_CFs=k, desired_class="opposite", yloss_type="hinge_loss", verbose=False)
    elapsed_genetic += timeit.default_timer() - start_time  
    
    start_time = timeit.default_timer()
    dice_kd = exp_KD.generate_counterfactuals(query_instance, total_CFs=k, desired_class="opposite", verbose=False)
    elapsed_kd += timeit.default_timer() - start_time  
    
m_random, s_random = divmod(elapsed_random, 60)
print(f'For Independent random sampling of features: Total time taken to generate {num_loops} sets of {k} '
      f'counterfactuals each: {int(m_random):02d} min {int(s_random):02d} sec')

m_kd, s_kd = divmod(elapsed_kd, 60)
print(f'For querying from a KD tree: Total time taken to generate {num_loops} sets of {k} '
      f'counterfactuals each: {int(m_kd):02d} min {int(s_kd):02d} sec')

m_genetic, s_genetic = divmod(elapsed_genetic, 60)
print(f'For genetic algorithm: Total time taken to generate {num_loops} sets of {k} '
      f'counterfactuals each: {int(m_genetic):02d} min {int(s_genetic):02d} sec')

For Independent random sampling of features: Total time taken to generate 10 sets of 3 counterfactuals each: 00 min 01 sec
For querying from a KD tree: Total time taken to generate 10 sets of 3 counterfactuals each: 00 min 01 sec
For genetic algorithm: Total time taken to generate 10 sets of 3 counterfactuals each: 00 min 03 sec
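
The totals above are truncated to whole seconds, so methods that differ by less than a second look identical. A possible refinement, sketched below with a hypothetical helper `format_elapsed`, keeps two decimal places when formatting an elapsed time in seconds.

```python
def format_elapsed(total_seconds):
    """Format a duration in seconds as 'MM min SS.ss sec'."""
    # Round first so that the seconds part can never be displayed as 60.00.
    total_seconds = round(total_seconds, 2)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{int(minutes):02d} min {seconds:05.2f} sec"


print(format_elapsed(1.42))  # 00 min 01.42 sec
print(format_elapsed(63.5))  # 01 min 03.50 sec
```

The same helper could be applied to `elapsed_random`, `elapsed_kd`, and `elapsed_genetic` in place of the whole-second formatting.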
