This notebook demonstrates how the DiCE library can be used for multiclass classification and regression with scikit-learn models.
Here, we first show all of these features with dice_genetic, DiCE's genetic-algorithm method.

In [2]:
import dice_ml
from dice_ml import Dice

from dice_ml.utils import helpers # helper functions
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from lightgbm import LGBMClassifier, LGBMRegressor
import pandas as pd

We will use sklearn's built-in datasets to demonstrate DiCE's features in this notebook.

In [3]:
# Convert one of sklearn's built-in datasets into a DataFrame with a 'target' column
def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df
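
To see what `sklearn_to_df` produces without loading a real dataset, the sketch below runs it on a small stand-in object. The `toy` object and its values are hypothetical; it only mimics the three attributes (`data`, `feature_names`, `target`) that the helper reads.

```python
from types import SimpleNamespace

import pandas as pd


def sklearn_to_df(sklearn_dataset):
    # Same helper as above, repeated so this sketch runs on its own
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df


# Hypothetical stand-in exposing the attributes the helper reads
toy = SimpleNamespace(
    data=[[5.1, 3.5], [4.9, 3.0], [6.3, 3.3]],
    feature_names=['sepal length (cm)', 'sepal width (cm)'],
    target=[0, 0, 2],
)

toy_df = sklearn_to_df(toy)
print(toy_df.columns.tolist())
print(toy_df.shape)
```

The result has one column per feature plus a final `target` column, which is the layout `dice_ml.Data` is given later in the notebook.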

In [4]:
outcome_name = 'target'

# Multiclass Classification

For multiclass classification, we will use sklearn's Iris dataset. It contains the petal and sepal lengths and widths of three types of iris (Setosa, Versicolour, and Virginica). More information at https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset

In [5]:
from sklearn.datasets import load_iris

In [6]:
df_iris = sklearn_to_df(load_iris())

In [7]:
df_iris.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [8]:
df_iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


In [9]:
continuous_features_iris = df_iris.drop(outcome_name, axis=1).columns.tolist()

In [10]:
target = df_iris[outcome_name]
# Split data into train and test
from sklearn.model_selection import train_test_split
datasetX = df_iris.drop(outcome_name, axis=1)
x_train, x_test, y_train, y_test = train_test_split(datasetX, 
                                                    target, 
                                                    test_size = 0.2,
                                                    random_state=0,
                                                    stratify=target)

categorical_features = x_train.columns.difference(continuous_features_iris) 
from sklearn.compose import ColumnTransformer

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

transformations = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, continuous_features_iris),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf_iris = Pipeline(steps=[('preprocessor', transformations),
                      ('classifier', LGBMClassifier())])
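
Because every Iris feature is listed as continuous, `categorical_features` comes out empty and the `'cat'` branch of the `ColumnTransformer` receives no columns. A small check of that set difference, on a hypothetical two-row frame that only borrows the Iris column names:

```python
import pandas as pd

# Hypothetical two-row frame with the Iris feature names
x = pd.DataFrame({
    'sepal length (cm)': [5.1, 4.9],
    'sepal width (cm)': [3.5, 3.0],
    'petal length (cm)': [1.4, 1.4],
    'petal width (cm)': [0.2, 0.2],
})
continuous = x.columns.tolist()

# Every column is continuous, so nothing is left over for the 'cat' branch
categorical = x.columns.difference(continuous)
print(len(categorical))
```

The categorical pipeline is kept in the notebook so the same code also works for datasets that do contain categorical columns.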

In [11]:
d_iris = dice_ml.Data(dataframe=df_iris, continuous_features=continuous_features_iris, outcome_name=outcome_name)
model_iris = clf_iris.fit(x_train, y_train)
# We provide the type of model as a parameter (model_type)
m_iris = dice_ml.Model(model=model_iris, backend="sklearn", model_type='classifier')

In [12]:
exp_genetic_iris = Dice(d_iris, m_iris, method="genetic")
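
The `method="genetic"` option selects a genetic search for counterfactuals. As rough intuition only, and not DiCE's actual implementation, the sketch below evolves candidate points toward a desired class while keeping them close to the query. The `predict` rule, the fitness function, and every parameter value are assumptions chosen for illustration.

```python
import random


def predict(x):
    # Toy stand-in for a trained classifier (an assumption, not a real model):
    # class 1 when the first feature exceeds 5, otherwise class 0.
    return 1 if x[0] > 5.0 else 0


def fitness(x, query, desired):
    # Lower is better: L1 distance to the query, plus a large
    # penalty when the candidate is not in the desired class.
    distance = sum(abs(a - b) for a, b in zip(x, query))
    penalty = 0.0 if predict(x) == desired else 1000.0
    return distance + penalty


def genetic_search(query, desired, bounds, pop_size=50, generations=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda x: fitness(x, query, desired))
        survivors = population[:pop_size // 2]  # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            # Uniform crossover: each feature comes from one parent
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: occasionally resample one feature within its bounds
            if rng.random() < 0.2:
                i = rng.randrange(len(child))
                child[i] = rng.uniform(*bounds[i])
            children.append(child)
        population = survivors + children
    return min(population, key=lambda x: fitness(x, query, desired))


query = [2.0, 3.0]  # the toy model predicts class 0 here
bounds = [(0.0, 10.0), (0.0, 10.0)]
best = genetic_search(query, desired=1, bounds=bounds)
print(predict(query), predict(best))
```

The penalty term plays the role of validity (the counterfactual must reach the desired class) and the distance term plays the role of proximity. A real method also has to balance diversity across the returned set, which this sketch ignores.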

As we can see below, all the target values lie in the desired class.

In [13]:
# Query instances are passed as a DataFrame; here the slice holds a single row
query_instances_iris = x_train[2:3]
genetic_iris = exp_genetic_iris.generate_counterfactuals(query_instances_iris, total_CFs=4, desired_class = 1)
genetic_iris.visualize_as_dataframe(show_only_changes=True)

Initializing initial parameters to the genetic algorithm...
Initialization complete! Generating counterfactuals...
Diverse Counterfactuals found! total time taken: 00 min 01 sec
Query instance (original outcome : 0)


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,4.4,3.0,1.3,0.2,0



Diverse Counterfactual set (new outcome: 1)


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,4.8,3.6,2.1,0.8,1.0
1,4.7,3.2,2.9,0.9,1.0
2,5.2,4.1,2.0,1.2,1.0
3,6.5,2.9,4.0,0.7,1.0


In [14]:
# Multiple queries can be given as input at once
query_instances_iris = x_train[17:19]
genetic_iris = exp_genetic_iris.generate_counterfactuals(query_instances_iris, total_CFs=7, desired_class = 2)
genetic_iris.visualize_as_dataframe(show_only_changes=True)

Initializing initial parameters to the genetic algorithm...
Initialization complete! Generating counterfactuals...
Diverse Counterfactuals found! total time taken: 00 min 03 sec
Initializing initial parameters to the genetic algorithm...
Initialization complete! Generating counterfactuals...
Diverse Counterfactuals found! total time taken: 00 min 03 sec
Query instance (original outcome : 1)


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.7,2.9,4.2,1.3,1



Diverse Counterfactual set (new outcome: 2)


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.2,3.0,5.3,0.7,2.0
1,5.9,3.1,5.2,1.4,2.0
2,4.6,2.0,5.3,1.7,2.0
3,6.3,3.7,6.6,-,2.0
4,5.2,3.3,6.5,2.4,2.0
5,5.1,2.4,4.3,2.2,2.0
6,6.1,2.6,3.8,1.8,2.0


Query instance (original outcome : 1)


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,6.3,3.3,4.7,1.6,1



Diverse Counterfactual set (new outcome: 2)


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,4.6,3.1,5.6,2.0,2.0
1,6.8,4.2,6.2,1.2,2.0
2,6.6,2.4,3.5,2.0,2.0
3,6.6,2.9,5.9,1.0,2.0
4,5.8,3.1,6.4,1.4,2.0
5,6.4,3.5,5.7,2.1,2.0
6,5.8,4.3,5.3,1.4,2.0
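
In the tables above, a `-` marks a feature whose value is the same as in the query instance, which is the effect of `show_only_changes=True`. The sketch below reproduces that masking for one row, using values from the first query above. The helper `mask_unchanged` is hypothetical and not part of DiCE, and DiCE's own comparison may differ in detail.

```python
def mask_unchanged(query, counterfactuals, outcome_name='target'):
    """Replace feature values equal to the query's with '-'."""
    rendered = []
    for cf in counterfactuals:
        row = {}
        for name, value in cf.items():
            unchanged = name != outcome_name and value == query[name]
            row[name] = '-' if unchanged else value
        rendered.append(row)
    return rendered


# Values taken from the first query in the output above
query = {'sepal length (cm)': 5.7, 'sepal width (cm)': 2.9,
         'petal length (cm)': 4.2, 'petal width (cm)': 1.3, 'target': 1}
# Counterfactual 3, with the masked petal width filled back in
cf = {'sepal length (cm)': 6.3, 'sepal width (cm)': 3.7,
      'petal length (cm)': 6.6, 'petal width (cm)': 1.3, 'target': 2.0}

rendered = mask_unchanged(query, [cf])
print(rendered[0])
```

Reading the table this way, counterfactual 3 changes the sepal length, sepal width, and petal length but leaves the petal width at 1.3.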


# Regression

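For regression, we will use sklearn's Boston dataset, which contains Boston house prices. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below require an older scikit-learn version. More information at https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-house-prices-dataset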

In [15]:
from sklearn.datasets import load_boston

In [16]:
df_boston = sklearn_to_df(load_boston())

In [17]:
df_boston.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [18]:
df_boston.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  target   506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB


In [19]:
continuous_features_boston = df_boston.drop(outcome_name, axis=1).columns.tolist()

In [20]:
target = df_boston[outcome_name]
# Split data into train and test
from sklearn.model_selection import train_test_split
datasetX = df_boston.drop(outcome_name, axis=1)
x_train, x_test, y_train, y_test = train_test_split(datasetX, 
                                                    target, 
                                                    test_size = 0.2,
                                                    random_state=0)

categorical_features = x_train.columns.difference(continuous_features_boston) 
from sklearn.compose import ColumnTransformer

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

transformations = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, continuous_features_boston),
        ('cat', categorical_transformer, categorical_features)])

# Append regressor to preprocessing pipeline.
# Now we have a full prediction pipeline.
regr_boston = Pipeline(steps=[('preprocessor', transformations),
                      ('regressor', LGBMRegressor())])
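
For a single numeric column, the two `numeric_transformer` steps amount to filling missing values with the column median and then standardizing. A minimal pure-Python sketch of that arithmetic on made-up values, assuming the population standard deviation (`ddof=0`) that `StandardScaler` uses:

```python
import statistics

column = [1.0, None, 3.0]  # made-up values; None marks a missing entry

# Median imputation: fill gaps with the median of the observed values
observed = [v for v in column if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in column]

# Standardization: subtract the mean, divide by the population std (ddof=0)
mean = statistics.mean(imputed)
std = statistics.pstdev(imputed)
scaled = [(v - mean) / std for v in imputed]

print(imputed)
print([round(v, 4) for v in scaled])
```

In the pipeline, the median, mean, and standard deviation are learned from the training split during `fit` and then reused unchanged on any query instance.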

In [21]:
d_boston = dice_ml.Data(dataframe=df_boston, continuous_features=continuous_features_boston, outcome_name=outcome_name)
model_boston = regr_boston.fit(x_train, y_train)
# We provide the type of model as a parameter (model_type)
m_boston = dice_ml.Model(model=model_boston, backend="sklearn", model_type='regressor')

In [22]:
exp_genetic_boston = Dice(d_boston, m_boston, method="genetic")

As we can see below, all the target values lie in the desired range.

In [23]:
# Query instances are passed as a DataFrame; here the slice holds a single row
query_instances_boston = x_train[2:3]
genetic_boston = exp_genetic_boston.generate_counterfactuals(query_instances_boston, total_CFs=2, desired_range=[30, 45])
genetic_boston.visualize_as_dataframe(show_only_changes=True)

Initializing initial parameters to the genetic algorithm...
Initialization complete! Generating counterfactuals...
Diverse Counterfactuals found! total time taken: 00 min 01 sec
Query instance (original outcome : 24)


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.11329,30.0,4.93,0.0,0.428,6.897,54.3,6.3361,6.0,300.0,16.6,391.25,11.38,23.909689



Diverse Counterfactual set (new outcome: [30, 45])


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,11.73254,38.0,12.7,-,-,7.521,31.6,5.4798,-,645.8,18.0,318.6,4.2,43.287
1,4.0561,39.4,14.2,0.6,0.48,5.55,3.2,3.1415,17.7,289.7,12.7,248.7,4.53,30.571
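
The `target` column in the output above can be checked against `desired_range` directly. A minimal sketch using the printed values, assuming both ends of the range are inclusive (the helper `in_desired_range` is hypothetical and not part of DiCE):

```python
def in_desired_range(value, desired_range):
    # Assumes both ends of the range are inclusive
    low, high = desired_range
    return low <= value <= high


desired_range = [30, 45]
original_outcome = 23.909689    # model prediction for the query instance
cf_outcomes = [43.287, 30.571]  # 'target' column of the two counterfactuals

print(in_desired_range(original_outcome, desired_range))
print([in_desired_range(v, desired_range) for v in cf_outcomes])
```

The query's predicted price falls below the range, while both counterfactual predictions fall inside it.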


In [24]:
# Multiple queries can be given as input at once
query_instances_boston = x_train[17:19]
genetic_boston = exp_genetic_boston.generate_counterfactuals(query_instances_boston, total_CFs=4, desired_range=[40, 50])
genetic_boston.visualize_as_dataframe(show_only_changes=True)

Initializing initial parameters to the genetic algorithm...
Initialization complete! Generating counterfactuals...
Diverse Counterfactuals found! total time taken: 00 min 05 sec
Initializing initial parameters to the genetic algorithm...
Initialization complete! Generating counterfactuals...
Diverse Counterfactuals found! total time taken: 00 min 03 sec
Query instance (original outcome : 46)


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.01501,90.0,1.21,1.0,0.401,7.923,24.8,5.885,1.0,198.0,13.6,395.52,3.16,46.350821



Diverse Counterfactual set (new outcome: [40, 50])


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,1.02696,21.4,10.5,0.3,0.726,8.558,48.2,5.884,12.0,515.7,-,331.0,3.15,47.684
1,5.75805,57.3,14.3,-,0.593,7.924,59.5,4.3271,3.6,406.0,16.1,357.5,3.15,43.213
2,19.14775,37.4,21.9,1.1,0.528,8.736,67.6,2.367,7.4,690.0,-,58.8,7.4,42.372
3,19.6895,68.0,6.4,0.3,0.502,8.698,66.7,4.7175,16.9,398.7,13.0,296.9,6.04,41.818


Query instance (original outcome : 31)


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.06911,45.0,3.44,0.0,0.437,6.739,30.8,6.4798,5.0,398.0,15.2,389.71,4.69,31.00116



Diverse Counterfactual set (new outcome: [40, 50])


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,36.37678,17.6,12.3,0.8,0.714,8.166,45.7,7.5411,19.5,245.2,19.2,241.2,4.63,40.121
1,17.83843,79.4,25.8,-0.1,0.412,8.289,67.2,11.2392,-,524.5,13.3,282.3,7.35,42.273
2,3.3943,2.7,4.2,-,0.591,7.892,31.3,1.8575,17.7,670.9,15.1,40.1,4.68,41.075
3,7.3876,56.4,22.1,-,0.392,8.001,41.5,5.4807,10.9,507.0,19.4,214.6,4.64,41.008
