# Altruist Quantitative Experiments

In this notebook we run a set of quantitative experiments with Altruist on three datasets, using four machine learning models and four interpretation techniques.

We first load the libraries we will need.

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
from sklearn.base import TransformerMixin
from sklearn.preprocessing import MinMaxScaler,StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from altruist import Altruist
from fi_techniques import FeatureImportance
import pandas as pd 
import numpy as np
import urllib
import networkx as nx
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

Using TensorFlow backend.


# Banknote Dataset

About Banknote: This is a binary classification problem: detecting real or fake banknotes. The data were extracted from images of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.

Attribute Information:

1. variance of Wavelet Transformed image (continuous)
2. skewness of Wavelet Transformed image (continuous)
3. curtosis of Wavelet Transformed image (continuous)
4. entropy of image (continuous) 

Source: https://archive.ics.uci.edu/ml/datasets/banknote+authentication

## Data Loading

First, we load the dataset and set the feature and class names.

In [46]:
banknote_datadset = pd.read_csv('https://raw.githubusercontent.com/Kuntal-G/Machine-Learning/master/R-machine-learning/data/banknote-authentication.csv')
feature_names = ['variance','skew','curtosis','entropy']
class_names=['fake banknote','real banknote'] #0: no, 1: yes #or ['not authenticated banknote','authenticated banknote']

We can display the first few instances to see the features and their values.

In [47]:
banknote_datadset.head()

index,variance,skew,curtosis,entropy,class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


Moreover, we can use DataFrame.describe() to see the range of each feature. For example, we observe that curtosis ranges from -5.286 to 17.927.

In [48]:
banknote_datadset.describe()

statistic,variance,skew,curtosis,entropy,class
count,1372.0,1372.0,1372.0,1372.0,1372.0
mean,0.433735,1.922353,1.397627,-1.191657,0.444606
std,2.842763,5.869047,4.31003,2.101013,0.497103
min,-7.0421,-13.7731,-5.2861,-8.5482,0.0
25%,-1.773,-1.7082,-1.574975,-2.41345,0.0
50%,0.49618,2.31965,0.61663,-0.58665,0.0
75%,2.821475,6.814625,3.17925,0.39481,1.0
max,6.8248,12.9516,17.9274,2.4495,1.0
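The ranges can also be computed directly instead of being read off the table. The sketch below re-enters by hand the five rows shown by head() above (so the ranges are those of this small sample, not of the full dataset); the names `sample` and `feature_ranges` are ours.

```python
import pandas as pd

# The five rows printed by head() above, re-entered by hand for illustration.
sample = pd.DataFrame(
    {
        "variance": [3.6216, 4.5459, 3.866, 3.4566, 0.32924],
        "skew": [8.6661, 8.1674, -2.6383, 9.5228, -4.4552],
        "curtosis": [-2.8073, -2.4586, 1.9242, -4.0112, 4.5718],
        "entropy": [-0.44699, -1.4621, 0.10645, -3.5944, -0.9888],
    }
)

# agg(["min", "max"]) gives one row per statistic; transpose to get one row per feature.
feature_ranges = sample.agg(["min", "max"]).T
feature_ranges["width"] = feature_ranges["max"] - feature_ranges["min"]
print(feature_ranges)
```

Features with very different widths are one reason the inputs are rescaled before training.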


Then we extract the features and the target from the dataframe.

In [49]:
X = banknote_datadset.iloc[:, 0:4].values 
y = banknote_datadset.iloc[:, 4].values 

In [50]:
len(X)

1372

We have 1372 instances. We are going to use scikit-learn's GridSearchCV to find and train the best classifier of each type for this dataset.

## Machine Learning models training step

We will use a MinMax scaler to normalize the input

In [51]:
scaler = MinMaxScaler(feature_range=(-1,1))
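To make the effect of feature_range=(-1,1) concrete, here is a minimal sketch of the linear mapping that min-max scaling applies to a single value. The helper `scale_to_range` is a hypothetical name of ours, not part of scikit-learn; the extremes are those of curtosis in the describe() output above.

```python
def scale_to_range(value, data_min, data_max, low=-1.0, high=1.0):
    """Map value linearly from [data_min, data_max] to [low, high]."""
    unit = (value - data_min) / (data_max - data_min)  # first to [0, 1]
    return unit * (high - low) + low                   # then to [low, high]

# Extremes and median of the curtosis column, taken from describe() above.
curtosis_min, curtosis_max, curtosis_median = -5.2861, 17.9274, 0.61663

scaled_min = scale_to_range(curtosis_min, curtosis_min, curtosis_max)
scaled_max = scale_to_range(curtosis_max, curtosis_min, curtosis_max)
scaled_median = scale_to_range(curtosis_median, curtosis_min, curtosis_max)
print(scaled_min, scaled_max, round(scaled_median, 4))
```

The smallest observed value lands on -1, the largest on 1, and everything else falls in between.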

We are going to use 4 different classifiers (Random Forests, SVMs, Logistic Regression, Neural Networks).

In [52]:
classifiers = {}
scalers = {}

In [53]:
pipe = Pipeline(steps=[('scaler', scaler), ('rf', RandomForestClassifier(random_state=77))])
parameters =[{
    'rf__max_depth': [10],#1, 5, 7, 10
    'rf__max_features': [0.75], #'sqrt', 'log2', 0.75, None
    'rf__bootstrap': [True], #True, False
    'rf__min_samples_leaf' : [1], #1, 2, 5, 10, 0.10
    'rf__n_estimators': [500] #10, 100, 500, 1000
}]
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1)
clf.fit(X, y)
scaler_rf = clf.best_estimator_.steps[0][1]
rf = clf.best_estimator_.steps[1][1]
classifiers[1] = [rf, str("Random Forests: "+ str(clf.best_score_))]
scalers[1] = scaler_rf
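The cell above relies on two scikit-learn conventions: grid keys are written as step name, double underscore, parameter name, and best_estimator_ is the whole pipeline, from which the fitted scaler and model are pulled out by position. A minimal sketch on synthetic data (the data, the logistic regression step and the tiny grid are our assumptions, chosen to keep it fast):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption, not one of the notebook's datasets).
X_toy, y_toy = make_classification(n_samples=120, n_features=4, random_state=0)

pipe = Pipeline(steps=[("scaler", MinMaxScaler(feature_range=(-1, 1))),
                       ("lr", LogisticRegression())])

# Grid keys follow the "<step name>__<parameter>" convention.
search = GridSearchCV(pipe, {"lr__C": [0.1, 1.0]}, scoring="f1", cv=3)
search.fit(X_toy, y_toy)

best = search.best_estimator_               # pipeline refit on all of X_toy
fitted_scaler = best.named_steps["scaler"]  # access by step name
fitted_model = best.named_steps["lr"]

# Positional access, as used in the cell above, returns the same objects.
print(best.steps[0][1] is fitted_scaler, best.steps[1][1] is fitted_model)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the fitted scaler next to the model matters later: the instances handed to the interpretation techniques have to be transformed in the same way as the training data.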

In [54]:
pipe = Pipeline(steps=[('scaler', scaler), ('svm', SVC(probability=True,random_state=77))])
#parameters = [
#  {'svm__C': [-3, 1, 3, 10, 100, 1000], 'svm__kernel': ['linear']},
#  {'svm__C': [-3, 1, 3, 10, 100, 1000], 'svm__gamma': [0.1, 0.01, 0.001, 0.0001], 'svm__kernel': ['rbf']},
#]
parameters = {'svm__C': [100], 'svm__gamma': [0.1], 'svm__kernel': ['rbf']} #best
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1)
clf.fit(X, y)
scaler_svm = clf.best_estimator_.steps[0][1]
svm = clf.best_estimator_.steps[1][1]
classifiers[2] = [svm, str("SVM: "+ str(clf.best_score_))]
scalers[2] = scaler_svm
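The SVC above is created with probability=True, which makes predict_proba available on the fitted model. Explanation techniques that work with class probabilities need this; whether FeatureImportance relies on it is our assumption. A minimal sketch on synthetic data (not one of the notebook's datasets):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data, for illustration only.
X_toy, y_toy = make_classification(n_samples=80, n_features=4, random_state=0)

svm_toy = SVC(C=100, gamma=0.1, kernel="rbf", probability=True, random_state=77)
svm_toy.fit(X_toy, y_toy)

# One row per instance, one column per class; each row sums to 1.
proba = svm_toy.predict_proba(X_toy[:5])
print(proba.shape)
print(np.allclose(proba.sum(axis=1), 1.0))
```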

In [55]:
pipe = Pipeline(steps=[('scaler', scaler), ('lr', LogisticRegression(random_state=77))])
#parameters = [
#  {'lr__C': [-3, 1, 3, 10, 100, 1000], 'lr__penalty': ['l1'], 'lr__solver': ['liblinear', 'saga']},
#  {'lr__C': [-3, 1, 3, 10, 100, 1000], 'lr__penalty': ['l2'], 'lr__solver': ['newton-cg', 'lbfgs', 'sag','saga']}
#]
parameters = {'lr__C': [3], 'lr__penalty': ['l1'], 'lr__solver': ['liblinear']}#best
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1)
clf.fit(X, y)
scaler_lr = clf.best_estimator_.steps[0][1]
lr = clf.best_estimator_.steps[1][1]
classifiers[3] = [lr, str("Logistic Regression: "+ str(clf.best_score_))]
scalers[3] = scaler_lr

In [56]:
pipe = Pipeline(steps=[('scaler', scaler), ('nn', MLPClassifier(early_stopping=True, random_state=77))])
#parameters = {
#    'nn__hidden_layer_sizes': [(2,10),(5,10),(10,100),(20,200),(50,500)], 
#    'nn__activation': ['logistic', 'tanh', 'relu'],
#    'nn__solver': ['sgd', 'adam'],
#    'nn__alpha': [0.000001,0.0001,0.001, 0.01, 0.1],
#    'nn__learning_rate': ['constant', 'invscaling', 'adaptive']}
parameters = {
    'nn__hidden_layer_sizes': [(100,1000)], 
    'nn__activation': ['relu'],
    'nn__solver': ['adam'],
    'nn__alpha': [0.0001],
    'nn__learning_rate': ['constant']}
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1)
clf.fit(X, y)
scaler_nn = clf.best_estimator_.steps[0][1]
nn = clf.best_estimator_.steps[1][1]
classifiers[4] = [nn, str("Neural Network: "+ str(clf.best_score_))]
scalers[4] = scaler_nn

We can see the best classifier of each algorithm and its mean cross-validated F1 score. On this dataset the SVM performs best, with a score of 1.0.

In [57]:
classifiers

{1: [RandomForestClassifier(max_depth=10, max_features=0.75, n_estimators=500,
                         random_state=77),
  'Random Forests: 0.9926289484206319'],
 2: [SVC(C=100, gamma=0.1, probability=True, random_state=77), 'SVM: 1.0'],
 3: [LogisticRegression(C=3, penalty='l1', random_state=77, solver='liblinear'),
  'Logistic Regression: 0.9886410119993929'],
 4: [MLPClassifier(early_stopping=True, hidden_layer_sizes=(100, 1000),
                random_state=77), 'Neural Network: 0.9943351691581432']}
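Because each score is stored inside a text label, comparing the models takes a small parsing step. The sketch below ranks them using the labels copied from the output above; `parse_label` is a hypothetical helper of ours.

```python
# Score labels copied from the `classifiers` output above.
labels = {
    1: "Random Forests: 0.9926289484206319",
    2: "SVM: 1.0",
    3: "Logistic Regression: 0.9886410119993929",
    4: "Neural Network: 0.9943351691581432",
}

def parse_label(label):
    """Split 'Name: score' into (name, float score)."""
    name, score = label.rsplit(": ", 1)
    return name, float(score)

# Highest score first.
ranked = sorted((parse_label(v) for v in labels.values()),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:<20} {score:.4f}")
```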

## Quantitative Tests

Finally, we run the quantitative experiments for these classifiers and 3 to 4 interpretation techniques. All the models and techniques are tested on a common subset of the original data, in order to check which technique produces fewer untruthful features for each model. This is a way to select a model and an interpretation technique, as well as a way to benchmark different interpretation techniques. For a qualitative and more informative example of Altruist and its explanations, please open Qualitative.ipynb.

In [58]:
@interact(eli_5=False, lime=True, shap=True, perm_importance=True, intristic=False, cl=(1,4))
def g(eli_5, lime, shap, perm_importance, intristic, cl=1):
    print(classifiers[cl][1],"*Please let it run, it will take time probably*")
    X_t = scalers[cl].transform(X)
    fi = FeatureImportance(X_t, y, feature_names, class_names)
    fi_names = {fi.fi_lime:'Lime',fi.fi_shap:'Shap',fi.fi_eli:'Eli5',fi.fi_perm_imp:'Permutation Importance',fi.fi_rf:'Pseudo-Intrinsic RFs', fi.fi_coef_lr:'Intrinsic LR'}
    fis = []
    if (eli_5 and not cl == 2 and not cl == 3):
        fis.append(fi.fi_eli)
    if lime:
        fis.append(fi.fi_lime)
    if shap:
        fis.append(fi.fi_shap)
    if perm_importance:
        fis.append(fi.fi_perm_imp)
    if intristic and cl == 1:
        fis.append(fi.fi_rf)
    if intristic and cl == 3:
        fis.append(fi.fi_coef_lr)
    fis_scores = []
    for i in fis:
        fis_scores.append([])
    # Build Altruist once and reuse it for every instance
    altruistino = Altruist(classifiers[cl][0], X_t, fis, feature_names, None)
    for count, instance in enumerate(X_t, start=1):
        if count % 100 == 0:
            print(count,"/",len(X_t),"..",end=", ")
        untruthful_features = altruistino.find_untruthful_features(instance)
        # One entry per technique: store how many untruthful features it produced
        for i in range(len(untruthful_features[0])):
            fis_scores[i].append(len(untruthful_features[0][i]))
    print(len(X_t),"/",len(X_t))
    # Mean number of untruthful features per technique
    for fi_technique, fis_score in zip(fis, fis_scores):
        print(fi_names[fi_technique],np.array(fis_score).mean())
    # For each instance keep the lowest count across the techniques
    fi_all = np.array(fis_scores).min(axis=0)
    print("Altogether:",np.array(fi_all).mean())
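The two kinds of numbers printed by the function above are easy to mix up, so here is a minimal sketch of the aggregation on invented counts (the values are hypothetical, not results of the experiment). Each technique gets the mean number of untruthful features over all instances, while "Altogether" first keeps, per instance, the lowest count across the techniques and then averages.

```python
import numpy as np

# Hypothetical counts of untruthful features: one row per technique,
# one column per instance (values invented for illustration).
technique_names = ["Lime", "Shap", "Permutation Importance"]
counts = np.array([
    [2, 0, 1, 3],   # Lime
    [1, 1, 0, 2],   # Shap
    [0, 2, 2, 2],   # Permutation Importance
])

per_technique = counts.mean(axis=1)      # average over instances, per technique
best_per_instance = counts.min(axis=0)   # lowest count per instance
altogether = best_per_instance.mean()    # the "Altogether" figure

for name, value in zip(technique_names, per_technique):
    print(name, value)
print("Altogether:", altogether)
```

"Altogether" can never exceed the best single technique's mean, since it picks the best technique separately for every instance.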


# Heart Statlog Dataset

About Heart Statlog: This is a binary classification dataset for predicting the absence or presence of heart disease. It is a heart disease database similar to one already present in the repository (Heart Disease databases), but in a slightly different form.

Attribute Information:
1. age
2. sex
3. chest pain type
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0,1,2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Source: https://archive.ics.uci.edu/ml/datasets/statlog+(heart)

## Data Loading

First, we load the dataset and set the feature and class names.

In [59]:
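import urllib.request  # make sure the request submodule is loaded

url="http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
raw_data = urllib.request.urlopen(url)
heart_data = np.genfromtxt(raw_data)
# the last column is the class, the other 13 columns are the features
X,y = heart_data[:,:-1], heart_data[:,-1].squeeze()
# map the original labels (1: absence, 2: presence) to 0 and 1
y = [int(i-1) for i in y]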
feature_names = ['age','sex','chest pain','resting blood pressure','serum cholestoral',
               'fasting blood sugar','resting ecg results','maximum heart rate achieved','exercise induced angina','oldpeak',
               'the slope of the peak exercise','number of major vessels','reversable defect']
class_names = ['absence','presence']

heart_statlog = pd.DataFrame(X,columns=feature_names)
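The loading step can be tried offline on a couple of rows. In the sketch below the feature values are those of the first two rows of the dataset, while the trailing class labels (2 and 1) are assumed for illustration; the file stores absence as 1 and presence as 2, which is why 1 is subtracted.

```python
import io
import numpy as np

# Two whitespace-separated rows in the layout of heart.dat: 13 features
# followed by the class. The trailing labels are assumed for illustration.
raw_text = (
    "70.0 1.0 4.0 130.0 322.0 0.0 2.0 109.0 0.0 2.4 2.0 3.0 3.0 2\n"
    "67.0 0.0 3.0 115.0 564.0 0.0 2.0 160.0 0.0 1.6 2.0 0.0 7.0 1\n"
)

rows = np.genfromtxt(io.StringIO(raw_text))
X_toy, y_raw = rows[:, :-1], rows[:, -1]
y_toy = (y_raw - 1).astype(int)   # 1 -> 0 (absence), 2 -> 1 (presence)
print(X_toy.shape, y_toy)
```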

We can display the first few instances to see the features and their values.

In [60]:
heart_statlog.head()

index,age,sex,chest pain,resting blood pressure,serum cholestoral,fasting blood sugar,resting ecg results,maximum heart rate achieved,exercise induced angina,oldpeak,the slope of the peak exercise,number of major vessels,reversable defect
0,70.0,1.0,4.0,130.0,322.0,0.0,2.0,109.0,0.0,2.4,2.0,3.0,3.0
1,67.0,0.0,3.0,115.0,564.0,0.0,2.0,160.0,0.0,1.6,2.0,0.0,7.0
2,57.0,1.0,2.0,124.0,261.0,0.0,0.0,141.0,0.0,0.3,1.0,0.0,7.0
3,64.0,1.0,4.0,128.0,263.0,0.0,0.0,105.0,1.0,0.2,2.0,1.0,7.0
4,74.0,0.0,2.0,120.0,269.0,0.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0


Moreover, we can use DataFrame.describe() to see the range of each feature. For example, we observe that age ranges from 29 to 77.

In [61]:
heart_statlog.describe()

statistic,age,sex,chest pain,resting blood pressure,serum cholestoral,fasting blood sugar,resting ecg results,maximum heart rate achieved,exercise induced angina,oldpeak,the slope of the peak exercise,number of major vessels,reversable defect
count,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0
mean,54.433333,0.677778,3.174074,131.344444,249.659259,0.148148,1.022222,149.677778,0.32963,1.05,1.585185,0.67037,4.696296
std,9.109067,0.468195,0.95009,17.861608,51.686237,0.355906,0.997891,23.165717,0.470952,1.14521,0.61439,0.943896,1.940659
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0
25%,48.0,0.0,3.0,120.0,213.0,0.0,0.0,133.0,0.0,0.0,1.0,0.0,3.0
50%,55.0,1.0,3.0,130.0,245.0,0.0,2.0,153.5,0.0,0.8,2.0,0.0,3.0
75%,61.0,1.0,4.0,140.0,280.0,0.0,2.0,166.0,1.0,1.6,2.0,1.0,7.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0


We have 270 instances. We are going to use scikit-learn's GridSearchCV to find and train the best classifier of each type for this dataset.

In [62]:
len(X)

270

## Machine Learning models training step

We will use a MinMax scaler to normalize the input

In [63]:
scaler = MinMaxScaler(feature_range=(-1,1))

We are going to use 4 different classifiers (Random Forests, SVMs, Logistic Regression, Neural Networks).

In [64]:
classifiers = {}
scalers = {}

In [65]:
pipe = Pipeline(steps=[('scaler', scaler), ('rf', RandomForestClassifier(random_state=0))])
parameters =[{
    'rf__max_depth': [5],#1, 5, 7, 10
    'rf__max_features': ['sqrt'], #'sqrt', 'log2', 0.75, None
    'rf__bootstrap': [False], #True, False
    'rf__min_samples_leaf' : [5], #1, 2, 5, 10, 0.10
    'rf__n_estimators': [500] #10, 100, 500, 1000
}]
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1)
clf.fit(X, y)
scaler_rf = clf.best_estimator_.steps[0][1]
rf = clf.best_estimator_.steps[1][1]
classifiers[1] = [rf, str("Random Forests: "+ str(clf.best_score_))]
scalers[1] = scaler_rf

In [66]:
pipe = Pipeline(steps=[('scaler', scaler), ('svm', SVC(probability=True,random_state=77))])
#parameters = [
#  {'svm__C': [-3, 1, 3, 10, 100, 1000], 'svm__kernel': ['linear']},
#  {'svm__C': [-3, 1, 3, 10, 100, 1000], 'svm__gamma': [0.1, 0.01, 0.001, 0.0001], 'svm__kernel': ['rbf']},
#]
parameters = {'svm__C': [100], 'svm__gamma': [0.001], 'svm__kernel': ['rbf']} #best
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1)
clf.fit(X, y)
scaler_svm = clf.best_estimator_.steps[0][1]
svm = clf.best_estimator_.steps[1][1]
classifiers[2] = [svm, str("SVM: "+ str(clf.best_score_))]
scalers[2] = scaler_svm

In [67]:
pipe = Pipeline(steps=[('scaler', scaler), ('lr', LogisticRegression(random_state=77))])
#parameters = [
#  {'lr__C': [-3, 1, 3, 10, 100, 1000], 'lr__penalty': ['l1'], 'lr__solver': ['liblinear', 'saga']},
#  {'lr__C': [-3, 1, 3, 10, 100, 1000], 'lr__penalty': ['l2'], 'lr__solver': ['newton-cg', 'lbfgs', 'sag','saga']}
#]
parameters = {'lr__C': [1], 'lr__penalty': ['l1'], 'lr__solver': ['saga']}#best
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1)
clf.fit(X, y)
scaler_lr = clf.best_estimator_.steps[0][1]
lr = clf.best_estimator_.steps[1][1]
classifiers[3] = [lr, str("Logistic Regression: "+ str(clf.best_score_))]
scalers[3] = scaler_lr

In [68]:
pipe = Pipeline(steps=[('scaler', scaler), ('nn', MLPClassifier(early_stopping=True, random_state=77))])
#parameters = {
#    'nn__hidden_layer_sizes': [(2,10),(5,10),(10,100),(20,200),(50,500), (100,1000)],
#    'nn__activation': ['logistic', 'tanh', 'relu'],
#    'nn__solver': ['sgd', 'adam'],
#    'nn__alpha': [0.000001,0.0001,0.001, 0.01, 0.1],
#    'nn__learning_rate': ['constant', 'invscaling', 'adaptive']}
parameters = {
    'nn__hidden_layer_sizes': [(100,1000)], 
    'nn__activation': ['tanh'],
    'nn__solver': ['adam'],
    'nn__alpha': [0.000001],
    'nn__learning_rate': ['constant']}
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1)
clf.fit(X, y)
scaler_nn = clf.best_estimator_.steps[0][1]
nn = clf.best_estimator_.steps[1][1]
classifiers[4] = [nn, str("Neural Network: "+ str(clf.best_score_))]
scalers[4] = scaler_nn

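We can see the best classifier of each algorithm and its mean cross-validated F1 score. On this dataset the SVM scores marginally higher than Random Forests (0.8195 vs. 0.8189), while the Neural Network scores lowest.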

In [69]:
classifiers

{1: [RandomForestClassifier(bootstrap=False, max_depth=5, max_features='sqrt',
                         min_samples_leaf=5, n_estimators=500, random_state=0),
  'Random Forests: 0.8188916011524707'],
 2: [SVC(C=100, gamma=0.001, probability=True, random_state=77),
  'SVM: 0.8195089355089354'],
 3: [LogisticRegression(C=1, penalty='l1', random_state=77, solver='saga'),
  'Logistic Regression: 0.8120600762065292'],
 4: [MLPClassifier(activation='tanh', alpha=1e-06, early_stopping=True,
                hidden_layer_sizes=(100, 1000), random_state=77),
  'Neural Network: 0.7701775669029673']}
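With scoring='f1' and cv=10, each best_score_ above is the F1 score of the best parameter setting averaged over the 10 validation folds. As a reminder of what a single F1 value measures, here is a minimal sketch on invented labels; `f1_score_binary` is a hypothetical helper of ours, not the scikit-learn function.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
f1 = f1_score_binary(y_true, y_pred)
print(f1)
```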

## Quantitative Tests

Finally, we run the quantitative experiments for these classifiers and 3 to 4 interpretation techniques. All the models and techniques are tested on a common subset of the original data, in order to check which technique produces fewer untruthful features for each model. This is a way to select a model and an interpretation technique, as well as a way to benchmark different interpretation techniques. For a qualitative and more informative example of Altruist and its explanations, please open Qualitative.ipynb.

In [70]:
@interact(eli_5=False, lime=True, shap=True, perm_importance=True, intristic=False, cl=(1,4))
def g(eli_5, lime, shap, perm_importance, intristic, cl=1):
    print(classifiers[cl][1])
    X_t = scalers[cl].transform(X)
    fi = FeatureImportance(X_t, y, feature_names, class_names)
    fi_names = {fi.fi_lime:'Lime',fi.fi_shap:'Shap',fi.fi_eli:'Eli5',fi.fi_perm_imp:'Permutation Importance',fi.fi_rf:'Pseudo-Intrinsic RFs', fi.fi_coef_lr:'Intrinsic LR'}
    fis = []
    if (eli_5 and not cl == 2 and not cl == 3):
        fis.append(fi.fi_eli)
    if lime:
        fis.append(fi.fi_lime)
    if shap:
        fis.append(fi.fi_shap)
    if perm_importance:
        fis.append(fi.fi_perm_imp)
    if intristic and cl == 1:
        fis.append(fi.fi_rf)
    if intristic and cl == 3:
        fis.append(fi.fi_coef_lr)
    fis_scores = []
    for i in fis:
        fis_scores.append([])
    # Build Altruist once and reuse it for every instance
    altruistino = Altruist(classifiers[cl][0], X_t, fis, feature_names, None)
    for count, instance in enumerate(X_t, start=1):
        if count % 50 == 0:
            print(count,"/",len(X_t),"..",end=", ")
        untruthful_features = altruistino.find_untruthful_features(instance)
        # One entry per technique: store how many untruthful features it produced
        for i in range(len(untruthful_features[0])):
            fis_scores[i].append(len(untruthful_features[0][i]))
    print(len(X_t),"/",len(X_t))
    # Mean number of untruthful features per technique
    for fi_technique, fis_score in zip(fis, fis_scores):
        print(fi_names[fi_technique],np.array(fis_score).mean())
    # For each instance keep the lowest count across the techniques
    fi_all = np.array(fis_scores).min(axis=0)
    print("Altogether:",np.array(fi_all).mean())


# Adult Census Dataset

About Adult Census: This dataset is for a binary classification task: predicting whether income exceeds $50K/yr based on census data. It is also known as the "Census Income" dataset.

Attribute Information:

1. age: continuous.
2. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
3. fnlwgt: continuous.
4. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5. education-num: continuous.
6. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
7. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
8. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10. sex: Female, Male.
11. capital-gain: continuous.
12. capital-loss: continuous.
13. hours-per-week: continuous.
14. native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Source: https://archive.ics.uci.edu/ml/datasets/adult

## Data Loading

First, we load the dataset and set the feature and class names.

In [28]:
feature_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country','salary']
class_names=['<=50K','>50K'] #0: <=50K and 1: >50K
# a multi-character delimiter is only supported by the python parsing engine
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=feature_names, delimiter=', ', engine='python')
data_test = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test', names=feature_names, delimiter=', ', engine='python')
data_test = data_test.drop(data_test.index[[0]])

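We apply the following preprocessing, influenced by a GitHub notebook: rows with missing values ('?') are removed, the labels of the test split are normalized, and the two splits are concatenated.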

In [29]:
data = data[(data != '?').all(axis=1)]
data_test = data_test[(data_test != '?').all(axis=1)]
data_test['salary'] = data_test['salary'].map({'<=50K.': '<=50K', '>50K.': '>50K'})
frames = [data, data_test]
data = pd.concat(frames)
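The two filtering idioms used above are compact, so here is a minimal sketch of them on a hand-made frame (the rows are invented in the style of the dataset). A row is kept only if none of its columns equals '?', and the test-split labels, which carry a trailing dot, are mapped to the training form.

```python
import pandas as pd

# Invented rows in the style of the Adult data, for illustration only.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "occupation": ["Adm-clerical", "Sales", "?", "Craft-repair"],
    "salary": ["<=50K.", ">50K.", "<=50K.", ">50K."],
})

# Keep only the rows in which no column equals '?'.
clean = toy[(toy != "?").all(axis=1)].copy()

# The test split writes labels with a trailing dot; map them to the training form.
clean["salary"] = clean["salary"].map({"<=50K.": "<=50K", ">50K.": ">50K"})
print(clean)
```

Note that map() turns any label missing from the dictionary into NaN, so the mapping has to cover every value that occurs.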

The feature engineering below is adapted from:
https://github.com/pooja2512/Adult-Census-Income/blob/master/Adult%20Census%20Income.ipynb
You can simply run the next code block and skip its details.

In [30]:
hs_grad = ['HS-grad','11th','10th','9th','12th']
elementary = ['1st-4th','5th-6th','7th-8th']
# replace elements in list.
for i in hs_grad:
    data['education'].replace(i , 'HS-grad', regex=True , inplace=True)
for e in elementary:
    data['education'].replace(e , 'elementary-school', regex=True, inplace = True)

married= ['Married-spouse-absent','Married-civ-spouse','Married-AF-spouse']
separated = ['Separated','Divorced']
#replace elements in list.
for m in married:
    data['marital-status'].replace(m ,'Married', regex=True, inplace = True)
for s in separated:
    data['marital-status'].replace(s ,'Separated', regex=True, inplace = True)

self_employed = ['Self-emp-not-inc','Self-emp-inc']
govt_employees = ['Local-gov','State-gov','Federal-gov']
for se in self_employed:
    data['workclass'].replace(se , 'Self_employed', regex=True, inplace = True)
for ge in govt_employees:
    data['workclass'].replace(ge , 'Govt_employees', regex=True, inplace = True)

del_cols = ['relationship','education-num']
data.drop(labels = del_cols, axis = 1, inplace = True)

index_age = data[data['age'] == 90].index
data.drop(labels = index_age, axis = 0, inplace =True)
num_col_new = ['age','capital-gain', 'capital-loss',
       'hours-per-week','fnlwgt']
cat_col_new = ['workclass', 'education', 'marital-status', 'occupation',
               'race', 'sex','salary','native-country']#add native-country label
scaler = MinMaxScaler()
#pd.DataFrame(scaler.fit_transform(data[num_col_new]),columns = num_col_new)
class DataFrameSelector(TransformerMixin):
    def __init__(self,attribute_names):
        self.attribute_names = attribute_names
    def fit(self,X,y = None):
        return self
    def transform(self,X):
        return X[self.attribute_names]
class num_trans(TransformerMixin):
    def __init__(self):
        pass
    def fit(self,X,y=None):
        return self
    def transform(self,X):
        df = pd.DataFrame(X)
        df.columns = num_col_new 
        return df
pipeline = Pipeline([('selector',DataFrameSelector(num_col_new)),  
                     ('scaler',MinMaxScaler()),('transform',num_trans())])#('scaler',MinMaxScaler()),        
num_df = pipeline.fit_transform(data)
num_df.shape
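# dummy columns dropped after one-hot encoding (one reference category per feature)
# note: 'workclass_Govt_employess' does not match the 'Govt_employees' label created above,
# so columns.difference() ignores it and that dummy column is kept
cols = ['workclass_Govt_employess','education_Some-college',
        'marital-status_Never-married','occupation_Other-service',
        'race_Black','sex_Male','salary_>50K']
class dummies(TransformerMixin):
    def __init__(self,cols):
        self.cols = cols
    
    def fit(self,X,y = None):
        return self
    
    def transform(self,X):
        df = pd.get_dummies(X)
        # use the columns passed to the constructor rather than the global list
        df_new = df[df.columns.difference(self.cols)] 
        return df_new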
pipeline_cat=Pipeline([('selector',DataFrameSelector(cat_col_new)),
                      ('dummies',dummies(cols))])
cat_df = pipeline_cat.fit_transform(data)
cat_df['id'] = pd.Series(range(cat_df.shape[0]))
num_df['id'] = pd.Series(range(num_df.shape[0]))
final_df = pd.merge(cat_df,num_df,how = 'inner', on = 'id')
print(f"Number of observations in final dataset: {final_df.shape}")

Number of observations in final dataset: (45167, 82)
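The core of the block above is grouping rare categories and one-hot encoding the result. The sketch below shows the same idea on a tiny invented frame. It uses dictionary-based replace (exact matches) instead of the regex-based replace of the notebook, and the reference columns dropped at the end are our own choice for this toy example.

```python
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame({
    "education": ["11th", "HS-grad", "5th-6th", "Bachelors"],
    "workclass": ["Local-gov", "Private", "Self-emp-inc", "State-gov"],
})

education_groups = {"11th": "HS-grad", "10th": "HS-grad", "9th": "HS-grad",
                    "12th": "HS-grad", "1st-4th": "elementary-school",
                    "5th-6th": "elementary-school", "7th-8th": "elementary-school"}
workclass_groups = {"Self-emp-not-inc": "Self_employed", "Self-emp-inc": "Self_employed",
                    "Local-gov": "Govt_employees", "State-gov": "Govt_employees",
                    "Federal-gov": "Govt_employees"}

# Merge fine-grained categories into coarser groups.
toy["education"] = toy["education"].replace(education_groups)
toy["workclass"] = toy["workclass"].replace(workclass_groups)

# One-hot encode, then drop one reference column per feature (our choice here).
one_hot = pd.get_dummies(toy)
reference = ["education_Bachelors", "workclass_Private"]
encoded = one_hot[one_hot.columns.difference(reference)]
print(list(encoded.columns))
```

Because columns.difference() is a set operation, a misspelled name in the reference list is silently ignored rather than raising an error.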


We extract the features and the target from the dataframe. The target is the salary_<=50K dummy column, so y = 1 means an income of at most 50K.

In [31]:
y = final_df['salary_<=50K'].values
final_df.drop(labels = ['id','salary_<=50K'],axis = 1,inplace = True)
X = final_df.values

In [32]:
feature_names = list(final_df.columns.values)
categorical_features = ['workclass', 'education', 'marital-status', 'occupation', 'race', 'sex','native-country']
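After one-hot encoding, feature_names holds the dummy column names, while categorical_features holds the original categorical names. A minimal sketch of how the two can be related by prefix is given below; the column list is a small hypothetical subset and `group_by_source` is a helper name of ours.

```python
categorical_features = ["workclass", "education", "marital-status"]

# Hypothetical subset of the one-hot column names, for illustration.
columns = ["age", "workclass_Private", "workclass_Self_employed",
           "education_HS-grad", "marital-status_Married", "fnlwgt"]

def group_by_source(columns, categorical_features):
    """Assign each dummy column to the categorical feature it was derived from."""
    groups = {name: [] for name in categorical_features}
    numeric = []
    for column in columns:
        for name in categorical_features:
            if column.startswith(name + "_"):
                groups[name].append(column)
                break
        else:
            # no categorical prefix matched: treat the column as numeric
            numeric.append(column)
    return groups, numeric

groups, numeric = group_by_source(columns, categorical_features)
print(groups)
print(numeric)
```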

In [33]:
len(X)

45167

In [34]:
from imblearn.under_sampling import TomekLinks, NeighbourhoodCleaningRule, NearMiss,RandomUnderSampler
from collections import Counter

print('Original dataset shape %s' % Counter(y))

tl = TomekLinks()
X_res, y_res = tl.fit_resample(X, y)
print('TomekLinks: Resampled dataset shape %s' % Counter(y_res))
ncr = NeighbourhoodCleaningRule()
X_res, y_res = ncr.fit_resample(X_res, y_res)
print('NC: Resampled dataset shape %s' % Counter(y_res))
nm = NearMiss(version=3)
X_res, y_res = nm.fit_resample(X_res, y_res)
print('NM: Resampled dataset shape %s' % Counter(y_res))
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_res, y_res)
print('Random: Resampled dataset shape %s' % Counter(y_res))

Original dataset shape Counter({1: 33970, 0: 11197})
TomekLinks: Resampled dataset shape Counter({1: 31155, 0: 11197})
NC: Resampled dataset shape Counter({1: 25878, 0: 11197})
NM: Resampled dataset shape Counter({0: 11197, 1: 7017})
Random: Resampled dataset shape Counter({0: 7017, 1: 7017})
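The last step of the chain, random undersampling, is the simplest to picture: rows are dropped at random until every class is as large as the rarest one. The sketch below is a plain NumPy illustration of that idea, not the imbalanced-learn implementation; `random_undersample` and the toy data are our own.

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Randomly keep as many rows of every class as the rarest class has."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=target, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data: 9 rows of class 1 and 3 rows of class 0.
y_toy = np.array([1] * 9 + [0] * 3)
X_toy = np.arange(len(y_toy) * 2).reshape(-1, 2)

X_bal, y_bal = random_undersample(X_toy, y_toy)
print(np.bincount(y_bal))
```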


In [35]:
rus = RandomUnderSampler(sampling_strategy='all', random_state=42)
X_res2, y_res2 = rus.fit_resample(X_res[:9000], y_res[:9000])
print('Random: Resampled dataset shape %s' % Counter(y_res2))

Random: Resampled dataset shape Counter({0: 1983, 1: 1983})


In [36]:
rus = RandomUnderSampler(sampling_strategy='all', random_state=42)
X_res2, y_res2 = rus.fit_resample(X_res[:7517], y_res[:7517])
print('Random: Resampled dataset shape %s' % Counter(y_res2))

Random: Resampled dataset shape Counter({0: 500, 1: 500})


After undersampling we have 1000 instances (500 per class). We are going to use scikit-learn's GridSearchCV to find and train the best classifier of each type for this dataset.

## Machine Learning models training step

We will use a MinMax scaler to normalize the input


In [37]:
scaler = MinMaxScaler(feature_range=(-1,1))

We are going to use 4 different classifiers (Random Forests, SVMs, Logistic Regression, Neural Networks).

In [38]:
classifiers = {}
scalers = {}

In [39]:
pipe = Pipeline(steps=[('scaler', scaler), ('rf', RandomForestClassifier(random_state=77))])
parameters =[{
    'rf__max_depth': [7], #1, 5, 7, 10
    'rf__max_features': ['sqrt'], #'sqrt', 'log2', 0.75, None
    'rf__bootstrap': [False], #True, False
    'rf__min_samples_leaf' : [2], #1, 2, 5, 10, 0.10
    'rf__n_estimators': [10] #10, 100, 500, 1000
}]
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1, verbose=1)
clf.fit(X_res2, y_res2)
scaler_rf = clf.best_estimator_.steps[0][1]
rf = clf.best_estimator_.steps[1][1]
classifiers[1] = [rf, str("Random Forests: "+ str(clf.best_score_))]
scalers[1] = scaler_rf

Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:    1.4s remaining:    0.9s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    1.4s finished


In [40]:
pipe = Pipeline(steps=[('scaler', scaler), ('svm', SVC(probability=True,random_state=77))])
#parameters = [
#  {'svm__C': [-3, 1, 3, 10, 100, 1000], 'svm__kernel': ['linear']},
#  {'svm__C': [-3, 1, 3, 10, 100, 1000], 'svm__gamma': [0.1, 0.01, 0.001, 0.0001], 'svm__kernel': ['rbf']},
#]
parameters = {'svm__C': [3], 'svm__gamma': [0.1], 'svm__kernel': ['rbf']} #best
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1, verbose=1)
clf.fit(X_res2, y_res2)
scaler_svm = clf.best_estimator_.steps[0][1]
svm = clf.best_estimator_.steps[1][1]
classifiers[2] = [svm, str("SVM: "+ str(clf.best_score_))]
scalers[2] = scaler_svm

Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:    0.3s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.5s finished


In [41]:
pipe = Pipeline(steps=[('scaler', scaler), ('lr', LogisticRegression(random_state=77))])
#parameters = [
#  {'lr__C': [-3, 1, 3, 10, 100, 1000], 'lr__penalty': ['l1'], 'lr__solver': ['liblinear', 'saga']},
#  {'lr__C': [-3, 1, 3, 10, 100, 1000], 'lr__penalty': ['l2'], 'lr__solver': ['newton-cg', 'lbfgs', 'sag','saga']}
#]
parameters = {'lr__C': [3], 'lr__penalty': ['l2'], 'lr__solver': ['sag']}#best
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1, verbose=1)
clf.fit(X_res2, y_res2)
scaler_lr = clf.best_estimator_.steps[0][1]
lr = clf.best_estimator_.steps[1][1]
classifiers[3] = [lr, str("Logistic Regression: "+ str(clf.best_score_))]
scalers[3] = scaler_lr

Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:    0.2s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.2s finished


In [42]:
pipe = Pipeline(steps=[('scaler', scaler), ('nn', MLPClassifier(early_stopping=True, random_state=77))])
#parameters = {
#    'nn__hidden_layer_sizes': [(2,10),(5,10),(10,100),(20,200),(50,500),(100,1000)], 
#    'nn__activation': ['logistic', 'tanh', 'relu'],
#    'nn__solver': ['sgd', 'adam'],
#    'nn__alpha': [0.000001,0.0001,0.001, 0.01, 0.1],
#    'nn__learning_rate': ['constant', 'invscaling', 'adaptive']}
parameters = {
    'nn__hidden_layer_sizes': [(50,500)], 
    'nn__activation': ['tanh'],
    'nn__solver': ['adam'],
    'nn__alpha': [0.000001],
    'nn__learning_rate': ['constant']}
clf = GridSearchCV(pipe, parameters, scoring='f1', cv=10, n_jobs=-1, verbose=1)
clf.fit(X_res2, y_res2)

scaler_nn = clf.best_estimator_.steps[0][1]
nn = clf.best_estimator_.steps[1][1]
classifiers[4] = [nn, str("Neural Network: "+ str(clf.best_score_))]
scalers[4] = scaler_nn

Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:    1.3s remaining:    0.8s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    1.9s finished


We can see the best classifier of each algorithm and its mean cross-validated F1 score. On this dataset the SVM performs best, with a score of about 0.959.

In [43]:
classifiers

{1: [RandomForestClassifier(bootstrap=False, max_depth=7, max_features='sqrt',
                         min_samples_leaf=2, n_estimators=10, random_state=77),
  'Random Forests: 0.9431755994216005'],
 2: [SVC(C=3, gamma=0.1, probability=True, random_state=77),
  'SVM: 0.9593078770419098'],
 3: [LogisticRegression(C=3, random_state=77, solver='sag'),
  'Logistic Regression: 0.9467018455060752'],
 4: [MLPClassifier(activation='tanh', alpha=1e-06, early_stopping=True,
                hidden_layer_sizes=(50, 500), random_state=77),
  'Neural Network: 0.9442092900222173']}

## Quantitative Tests

Finally, we run the quantitative experiments for these classifiers and 3 to 4 interpretation techniques. All the models and techniques are tested on a common subset of the original data, in order to check which technique produces fewer untruthful features for each model. This is a way to select a model and an interpretation technique, as well as a way to benchmark different interpretation techniques. For a qualitative and more informative example of Altruist and its explanations, please open Qualitative.ipynb.

In [45]:
@interact(eli_5=False, lime=True, shap=True, perm_importance=True, intristic=False, cl=(1,4))
def g(eli_5, lime, shap, perm_importance, intristic, cl=1):
    print(classifiers[cl][1])
    X_t = scalers[cl].transform(X_res2)
    fi = FeatureImportance(X_t[900:1000], y_res2[900:1000], feature_names, class_names)
    fi_names = {fi.fi_lime:'Lime',fi.fi_shap:'Shap',fi.fi_eli:'Eli5',fi.fi_perm_imp:'Permutation Importance',fi.fi_rf:'Pseudo-Intrinsic RFs', fi.fi_coef_lr:'Intrinsic LR'}
    fis = []
    if (eli_5 and not cl == 2 and not cl == 3):
        fis.append(fi.fi_eli)
    if lime:
        fis.append(fi.fi_lime)
    if shap:
        fis.append(fi.fi_shap)
    if perm_importance:
        fis.append(fi.fi_perm_imp)
    if intristic and cl == 1:
        fis.append(fi.fi_rf)
    if intristic and cl == 3:
        fis.append(fi.fi_coef_lr)
    fis_scores = []
    for i in fis:
        fis_scores.append([])
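    # Evaluate on a slice of 100 instances and report progress against that slice
    test_instances = X_t[900:1000]
    altruistino = Altruist(classifiers[cl][0], X_t, fis, feature_names,None)
    for count, instance in enumerate(test_instances, start=1):
        if count % 10 == 0:
            print(count,"/",len(test_instances),"..",end=", ")
        untruthful_features = altruistino.find_untruthful_features(instance)
        # One entry per technique: store how many untruthful features it produced
        for i in range(len(untruthful_features[0])):
            fis_scores[i].append(len(untruthful_features[0][i]))
    print(len(test_instances),"/",len(test_instances))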
    # Mean number of untruthful features per technique
    for fi_technique, fis_score in zip(fis, fis_scores):
        print(fi_names[fi_technique],np.array(fis_score).mean())
    # For each instance keep the lowest count across the techniques
    fi_all = np.array(fis_scores).min(axis=0)
    print("Altogether:",np.array(fi_all).mean())
