# Estimating local and global feature importance scores using DiCE

Summaries of counterfactual examples can be used to estimate the importance of features. Intuitively, a feature that is changed more often when generating proximal counterfactuals is a more important feature. We use this intuition to build a feature importance score.

This score can be interpreted as a measure of the **necessity** of a feature for a particular model output. That is, if the feature's value changes, the model's output class is likely to change as well (or, for a regression model, the predicted value is likely to change significantly).
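
The intuition above can be sketched in a few lines of plain Python. This is only an illustration, assuming the score is the fraction of counterfactuals in which a feature's value differs from the query; the function name `local_importance` and the toy data are hypothetical and not part of DiCE, whose implementation may differ in details.

```python
from math import isclose


def local_importance(query, counterfactuals, continuous=()):
    """Fraction of counterfactuals in which each feature differs from the query.

    Hypothetical helper for illustration only; not a DiCE function.
    """
    if not counterfactuals:
        raise ValueError("need at least one counterfactual")
    scores = {}
    for feature, original in query.items():
        changed = 0
        for cf in counterfactuals:
            if feature in continuous:
                # Compare floats with a tolerance rather than exact equality.
                changed += not isclose(cf[feature], original)
            else:
                changed += cf[feature] != original
        scores[feature] = changed / len(counterfactuals)
    return scores


# Hypothetical toy data, not taken from the Adult dataset.
query = {"age": 22, "education": "HS-grad", "hours_per_week": 45}
counterfactuals = [
    {"age": 35, "education": "HS-grad", "hours_per_week": 60},
    {"age": 41, "education": "Bachelors", "hours_per_week": 45},
    {"age": 22, "education": "HS-grad", "hours_per_week": 70},
    {"age": 50, "education": "HS-grad", "hours_per_week": 65},
]
scores = local_importance(query, counterfactuals, continuous=("age", "hours_per_week"))
print(scores)  # {'age': 0.75, 'education': 0.25, 'hours_per_week': 0.75}
```

In this toy case `age` and `hours_per_week` change in three of the four counterfactuals, so they score higher than `education`, which changes in only one.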

Below we show how counterfactuals can be used to provide local feature importance scores for any input, and how those scores can be combined to yield a global importance score for each feature.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import tensorflow as tf
from sklearn.neural_network import MLPClassifier
import dice_ml
from dice_ml import Dice
from dice_ml.utils import helpers # helper functions


## Preliminaries: Loading the data and ML model

In [3]:
dataset = helpers.load_adult_income_dataset()
helpers.get_adult_data_info()

{'age': 'age',
 'workclass': 'type of industry (Government, Other/Unknown, Private, Self-Employed)',
 'education': 'education level (Assoc, Bachelors, Doctorate, HS-grad, Masters, Prof-school, School, Some-college)',
 'marital_status': 'marital status (Divorced, Married, Separated, Single, Widowed)',
 'occupation': 'occupation (Blue-Collar, Other/Unknown, Professional, Sales, Service, White-Collar)',
 'race': 'white or other race?',
 'gender': 'male or female?',
 'hours_per_week': 'total work hours per week',
 'income': '0 (<=50K) vs 1 (>50K)'}

In [4]:
d = dice_ml.Data(dataframe=dataset, continuous_features=['age', 'hours_per_week'], outcome_name='income')
train_lbl, test_lbl = d.split_data(d.normalize_data(d.label_encoded_data, encoding='label'))
X_train_lbl = train_lbl.loc[:, train_lbl.columns != 'income']
y_train_lbl = train_lbl.loc[:, train_lbl.columns == 'income']
X_test_lbl = test_lbl.loc[:, test_lbl.columns != 'income']
y_test_lbl = test_lbl.loc[:, test_lbl.columns == 'income']
mlp_lbl = MLPClassifier(hidden_layer_sizes=(20,), alpha=0.001, learning_rate_init=0.01, batch_size=32, random_state=17,
                        max_iter=10, verbose=False, validation_fraction=0.2)  # max_iter plays the role of epochs in TF
mlp_lbl.fit(X_train_lbl, y_train_lbl.values.ravel())
m = dice_ml.Model(model=mlp_lbl, backend="sklearn")



## Local feature importance

We first generate counterfactuals for a given input point. 

In [23]:
exp = Dice(d, m, method="genetic")
query_instance = {'age':22, 
                  'workclass':'Private', 
                  'education':'HS-grad', 
                  'marital_status':'Single', 
                  'occupation':'Service',
                  'race': 'White', 
                  'gender':'Female', 
                  'hours_per_week': 45}
e1 = exp.generate_counterfactuals(query_instance, total_CFs=50)
e1.visualize_as_dataframe()

Initializing initial parameters to the genetic algorithm...
Initialization complete! Generating counterfactuals...
Diverse Counterfactuals found! total time taken: 00 min 07 sec
Query instance (original outcome : 0)


| | age | workclass | education | marital_status | occupation | race | gender | hours_per_week | income |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | Other/Unknown | Assoc | Married | Other/Unknown | White | Female | 45.0 | 0.0 |



Diverse Counterfactual set (new outcome : 1)


| | age | workclass | education | marital_status | occupation | race | gender | hours_per_week | income |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 59.0 | Government | Bachelors | Divorced | Blue-Collar | Other | Female | 69.0 | 1 |
| 1 | 72.0 | Other/Unknown | Bachelors | Divorced | Other/Unknown | White | Female | 98.0 | 1 |
| 2 | 38.0 | Government | Assoc | Divorced | Other/Unknown | White | Female | 73.0 | 1 |
| 3 | 65.0 | Other/Unknown | Bachelors | Divorced | Blue-Collar | Other | Male | 57.0 | 1 |
| 4 | 70.0 | Government | Assoc | Married | Other/Unknown | White | Female | 98.0 | 1 |
| 5 | 62.0 | Government | Bachelors | Divorced | Blue-Collar | Other | Male | 58.0 | 1 |
| 6 | 54.0 | Government | Assoc | Married | Other/Unknown | Other | Male | 67.0 | 1 |
| 7 | 52.0 | Other/Unknown | Assoc | Divorced | Other/Unknown | Other | Female | 77.0 | 1 |
| 8 | 55.0 | Other/Unknown | Assoc | Divorced | Other/Unknown | White | Male | 50.0 | 1 |
| 9 | 78.0 | Other/Unknown | Assoc | Divorced | Other/Unknown | White | Female | 61.0 | 1 |


These can now be used to calculate the feature importance scores. 

In [29]:
imp = exp.feature_importance([query_instance], cf_examples_list=[e1])
print(imp.local_importance)

[{'workclass': 0.62, 'education': 0.38, 'marital_status': 0.86, 'occupation': 0.4, 'race': 0.34, 'gender': 0.52, 'age': 0.98, 'hours_per_week': 1.0}]
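
As the printed output shows, `local_importance` holds one dictionary per query instance, mapping each feature name to its score. A small sketch of how such a dictionary could be ranked for reading; the values are copied from the output above, and the variable names are illustrative only.

```python
# Scores copied from the printed local importance above (one query instance).
local_scores = {
    'workclass': 0.62, 'education': 0.38, 'marital_status': 0.86,
    'occupation': 0.4, 'race': 0.34, 'gender': 0.52,
    'age': 0.98, 'hours_per_week': 1.0,
}

# Sort features from most to least frequently changed.
ranked = sorted(local_scores.items(), key=lambda item: item[1], reverse=True)

for feature, score in ranked:
    print(f"{feature:<15} {score:.2f}")
```

For this query instance the continuous features `hours_per_week` and `age` rank highest, followed by `marital_status`, while `race` ranks lowest.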


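Feature importance can also be estimated directly by omitting the `cf_examples_list` argument. DiCE then generates the counterfactuals internally, so the scores can differ somewhat from those computed above.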

In [30]:
imp = exp.feature_importance([query_instance])
print(imp.local_importance)

Initializing initial parameters to the genetic algorithm...
Initialization complete! Generating counterfactuals...
Diverse Counterfactuals found! total time taken: 00 min 00 sec
[{'workclass': 0.4, 'education': 0.2, 'marital_status': 0.9, 'occupation': 0.5, 'race': 0.3, 'gender': 0.7, 'age': 1.0, 'hours_per_week': 1.0}]


## Global feature importance

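For global importance, we need to generate counterfactuals for a representative sample of the dataset. For illustration, the cell below simply uses the first 10 rows and generates 10 counterfactuals for each.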
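
How per-instance scores combine into a global score can be sketched as follows. This is a minimal illustration, assuming the global score pools feature changes over all counterfactuals, which reduces to the plain mean of the local scores when every query has the same number of counterfactuals. The helper `summary_importance` is hypothetical, the two dictionaries are copied from the local scores printed later in this notebook, and the count of 10 counterfactuals per query is an assumption.

```python
def summary_importance(local_scores, cfs_per_query):
    """Pool per-query scores, weighting each query by its number of counterfactuals.

    Hypothetical helper for illustration only; not a DiCE function.
    """
    total_cfs = sum(cfs_per_query)
    if total_cfs == 0:
        raise ValueError("need at least one counterfactual")
    features = local_scores[0].keys()
    return {
        f: sum(s[f] * n for s, n in zip(local_scores, cfs_per_query)) / total_cfs
        for f in features
    }


# Local scores for two query instances, copied from output shown later in this notebook.
local_scores = [
    {'workclass': 0.4, 'education': 0.3, 'marital_status': 0.8, 'occupation': 0.3,
     'race': 0.4, 'gender': 0.6, 'age': 1.0, 'hours_per_week': 1.0},
    {'workclass': 0.7, 'education': 0.5, 'marital_status': 1.0, 'occupation': 0.3,
     'race': 0.4, 'gender': 0.5, 'age': 1.0, 'hours_per_week': 1.0},
]
cfs_per_query = [10, 10]  # assumed: 10 counterfactuals per query

summary = summary_importance(local_scores, cfs_per_query)
print({f: round(v, 2) for f, v in summary.items()})
```

Under these assumptions the result agrees with the summary scores printed for the same two instances, for example 0.55 for `workclass` and 0.9 for `marital_status`.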

In [33]:
cobj = exp.feature_importance(dataset.iloc[0:10, :].to_dict('records'), total_CFs=10)
print(cobj.summary_importance)

Initializing initial parameters to the genetic algorithm...
Initialization complete! Generating counterfactuals...
Diverse Counterfactuals found! total time taken: 00 min 00 sec

In [7]:
cobj.to_json()

'{\n  "cf_examples_list": [\n    "{\\"age\\":{\\"0\\":41.0,\\"1\\":29.0},\\"workclass\\":{\\"0\\":\\"Other\\\\/Unknown\\",\\"1\\":\\"Other\\\\/Unknown\\"},\\"education\\":{\\"0\\":\\"Assoc\\",\\"1\\":\\"Assoc\\"},\\"marital_status\\":{\\"0\\":\\"Divorced\\",\\"1\\":\\"Divorced\\"},\\"occupation\\":{\\"0\\":\\"Blue-Collar\\",\\"1\\":\\"Other\\\\/Unknown\\"},\\"race\\":{\\"0\\":\\"Other\\",\\"1\\":\\"Other\\"},\\"gender\\":{\\"0\\":\\"Female\\",\\"1\\":\\"Male\\"},\\"hours_per_week\\":{\\"0\\":45.0,\\"1\\":74.0},\\"income\\":{\\"0\\":1,\\"1\\":1}}",\n    "{\\"age\\":{\\"0\\":43.0,\\"1\\":27.0},\\"workclass\\":{\\"0\\":\\"Other\\\\/Unknown\\",\\"1\\":\\"Other\\\\/Unknown\\"},\\"education\\":{\\"0\\":\\"Assoc\\",\\"1\\":\\"Assoc\\"},\\"marital_status\\":{\\"0\\":\\"Divorced\\",\\"1\\":\\"Divorced\\"},\\"occupation\\":{\\"0\\":\\"Blue-Collar\\",\\"1\\":\\"Other\\\\/Unknown\\"},\\"race\\":{\\"0\\":\\"White\\",\\"1\\":\\"White\\"},\\"gender\\":{\\"0\\":\\"Female\\",\\"1\\":\\"Female\\"},\\"ho

In [9]:
print(imp.cf_examples_list)
print(imp.local_importance)
print(imp.summary_importance)
json_str = imp.to_json()
print(json_str)

[<dice_ml.diverse_counterfactuals.CounterfactualExamples object at 0x7fc6703cfc70>, <dice_ml.diverse_counterfactuals.CounterfactualExamples object at 0x7fc5835ac1c0>]
[{'workclass': 0.4, 'education': 0.3, 'marital_status': 0.8, 'occupation': 0.3, 'race': 0.4, 'gender': 0.6, 'age': 1.0, 'hours_per_week': 1.0}, {'workclass': 0.7, 'education': 0.5, 'marital_status': 1.0, 'occupation': 0.3, 'race': 0.4, 'gender': 0.5, 'age': 1.0, 'hours_per_week': 1.0}]
{'workclass': 0.55, 'education': 0.4, 'marital_status': 0.9, 'occupation': 0.3, 'race': 0.4, 'gender': 0.55, 'age': 1.0, 'hours_per_week': 1.0}
{
  "cf_examples_list": [
    "{\"age\":{\"0\":42.0,\"1\":40.0,\"2\":78.0,\"3\":66.0,\"4\":31.0,\"5\":70.0,\"6\":89.0,\"7\":62.0,\"8\":44.0,\"9\":84.0},\"workclass\":{\"0\":\"Government\",\"1\":\"Government\",\"2\":\"Government\",\"3\":\"Other\\/Unknown\",\"4\":\"Other\\/Unknown\",\"5\":\"Other\\/Unknown\",\"6\":\"Other\\/Unknown\",\"7\":\"Other\\/Unknown\",\"8\":\"Government\",\"9\":\"Other\\/Unkno

In [20]:
imp_r = imp.from_json(json_str)
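
The JSON printed earlier stores each entry of `cf_examples_list` as a JSON string nested inside the outer JSON document, which is why its quotes appear escaped. The following sketch decodes such a payload with the standard library alone. The payload here is a hypothetical, shortened stand-in with only two columns and two rows; the real output contains more columns and possibly further top-level keys that are not shown in this notebook.

```python
import json

# Hypothetical, shortened payload mimicking the nested structure printed above.
inner = json.dumps({"age": {"0": 42.0, "1": 40.0}, "income": {"0": 1, "1": 1}})
payload = json.dumps({"cf_examples_list": [inner]})

# First decode: the outer document, whose list entries are still strings.
outer = json.loads(payload)
assert isinstance(outer["cf_examples_list"][0], str)

# Second decode: one counterfactual table, stored as column -> row index -> value.
table = json.loads(outer["cf_examples_list"][0])

# Rebuild row-wise records, ordering rows by their numeric index.
row_ids = sorted(table["age"], key=int)
rows = [{col: values[i] for col, values in table.items()} for i in row_ids]
print(rows)  # [{'age': 42.0, 'income': 1}, {'age': 40.0, 'income': 1}]
```

In practice `from_json` performs the restoration, as in the cell above; the manual decoding is only useful for inspecting the serialized form.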

In [21]:
print(imp_r.cf_examples_list)
print(imp_r.local_importance)
print(imp_r.summary_importance)

[   age      workclass  education marital_status     occupation   race  gender  \
0   42     Government  Bachelors       Divorced  Other/Unknown  Other    Male   
1   40     Government  Bachelors       Divorced  Other/Unknown  Other  Female   
2   78     Government      Assoc       Divorced  Other/Unknown  White    Male   
3   66  Other/Unknown      Assoc       Divorced    Blue-Collar  White    Male   
4   31  Other/Unknown  Bachelors       Divorced    Blue-Collar  White  Female   
5   70  Other/Unknown      Assoc       Divorced    Blue-Collar  White    Male   
6   89  Other/Unknown      Assoc        Married  Other/Unknown  White    Male   
7   62  Other/Unknown      Assoc       Divorced  Other/Unknown  Other  Female   
8   44     Government      Assoc       Divorced  Other/Unknown  White    Male   
9   84  Other/Unknown      Assoc        Married  Other/Unknown  Other  Female   

   hours_per_week  income  
0              86       1  
1              60       1  
2              60      