# This notebook contains the experiments on the Heart (Statlog) dataset with LionForests

In [1]:
from LionForests import LionForests
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd 
import numpy as np
import urllib.request  # urlopen lives in the urllib.request submodule, which must be imported explicitly

First, we load the dataset and set the feature and class names.

In [2]:
url="http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
raw_data = urllib.request.urlopen(url)
heart = np.genfromtxt(raw_data)  # whitespace-separated values; the last column is the class label
X, y = heart[:, :-1], heart[:, -1]
feature_names = ['age','sex','chest pain','resting blood pressure','serum cholestoral',
               'fasting blood sugar','resting electrocardiographic results','maximum heart rate achieved','exercise induced angina','oldpeak',
               'the slope of the peak exercise','number of major vessels','reversable defect']
class_names = ['absence','presence']

This dataset is small: it contains only 270 instances.

In [3]:
len(X)

270

We can explore summary statistics for each feature of this dataset:

In [4]:
pd.DataFrame(X,columns=feature_names).describe()

statistic,age,sex,chest pain,resting blood pressure,serum cholestoral,fasting blood sugar,resting electrocardiographic results,maximum heart rate achieved,exercise induced angina,oldpeak,the slope of the peak exercise,number of major vessels,reversable defect
count,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0
mean,54.433333,0.677778,3.174074,131.344444,249.659259,0.148148,1.022222,149.677778,0.32963,1.05,1.585185,0.67037,4.696296
std,9.109067,0.468195,0.95009,17.861608,51.686237,0.355906,0.997891,23.165717,0.470952,1.14521,0.61439,0.943896,1.940659
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0
25%,48.0,0.0,3.0,120.0,213.0,0.0,0.0,133.0,0.0,0.0,1.0,0.0,3.0
50%,55.0,1.0,3.0,130.0,245.0,0.0,2.0,153.5,0.0,0.8,2.0,0.0,3.0
75%,61.0,1.0,4.0,140.0,280.0,0.0,2.0,166.0,1.0,1.6,2.0,1.0,7.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0
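
As a small worked example of reading this table: for a feature that only takes the values 0 and 1, the mean is the fraction of rows coded 1, so multiplying it by the row count recovers how many instances carry that value. The sketch below assumes that `sex`, `fasting blood sugar` and `exercise induced angina` are strictly binary (their min, max and quartiles above are consistent with that); the means are copied from the table.

```python
# Means copied from the describe() output above; treating the features as
# strictly 0/1-coded is an assumption based on their min/max/quartiles.
n_rows = 270
binary_means = {
    "sex": 0.677778,
    "fasting blood sugar": 0.148148,
    "exercise induced angina": 0.32963,
}

# For a 0/1 feature, mean * n_rows is the number of rows equal to 1.
counts = {name: round(mean * n_rows) for name, mean in binary_means.items()}
print(counts)
# {'sex': 183, 'fasting blood sugar': 40, 'exercise induced angina': 89}
```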


Then, we can train our random forest model using LionForests.

In [5]:
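y = [int(i - 1) for i in y]  # shift the labels from {1, 2} to {0, 1} so they index class_names
lf = LionForests(class_names=class_names)
scaler = MinMaxScaler(feature_range=(-1, 1))  # rescale every feature to [-1, 1]
lf.train(X, y, scaler, feature_names)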

Fitting 10 folds for each of 1 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    4.0s finished
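
To make the effect of `MinMaxScaler(feature_range=(-1,1))` concrete, here is a minimal sketch of the min-max formula applied by hand to the `age` column, using the min (29) and max (77) from the summary table. It assumes the scaler is fitted on all 270 rows; how LionForests fits it internally is not shown in this notebook. The helper `min_max_scale` is a hypothetical name used only for illustration.

```python
def min_max_scale(value, data_min, data_max, feature_range=(-1.0, 1.0)):
    """Map value from [data_min, data_max] linearly onto feature_range."""
    low, high = feature_range
    return low + (value - data_min) * (high - low) / (data_max - data_min)

# Minimum and maximum of 'age' taken from the describe() output above.
age_min, age_max = 29.0, 77.0

for age in (29.0, 55.0, 77.0):
    print(age, round(min_max_scale(age, age_min, age_max), 4))
# 29.0 -1.0
# 55.0 0.0833
# 77.0 1.0
```

The youngest patient maps to -1, the oldest to 1, and the median age of 55 lands slightly above 0.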


We can now inspect the accuracy, the number of estimators, and the best set of parameters:

In [6]:
number_of_estimators = lf.model.n_estimators
print("Accuracy:", lf.accuracy, ", Number of estimators:", number_of_estimators)
print(lf.model)

Accuracy: 0.8188916011524707 , Number of estimators: 500
RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
                       max_depth=5, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)


Now we are ready to produce explanations using LionForests:

In [7]:
lf.following_breadcrumbs(X[81], False, True, False, complexity=4)

'if 6.5<=reversable defect<=7.0 & 3.5<=chest pain<=4.0 & 0.0<=number of major vessels<=0.5 & 1.55<=oldpeak<=1.7 & 0.5<=exercise induced angina<=1.0 & 127.507<=maximum heart rate achieved<=133.494 & 1.5<=the slope of the peak exercise<=2.5 & 184.999<=serum cholestoral<=199.496 & 119.0<=resting blood pressure<=121.491 then presence'

For comparison, setting the second flag to False returns the original, unreduced explanation:

In [8]:
lf.following_breadcrumbs(X[81], False, False, False, complexity=4)

'if 6.5<=reversable defect<=7.0 & 3.5<=chest pain<=4.0 & 0.0<=number of major vessels<=0.5 & 1.55<=oldpeak<=1.7 & 0.5<=exercise induced angina<=1.0 & 128.005<=maximum heart rate achieved<=130.998 & 1.5<=the slope of the peak exercise<=2.5 & 0.5<=sex<=1.0 & 184.999<=serum cholestoral<=199.496 & 29.002<=age<=41.497 & 0.0<=resting electrocardiographic results<=0.5 & 119.0<=resting blood pressure<=121.491 & 0.0<=fasting blood sugar<=0.5 then presence'
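
The difference between the two rules is easier to see programmatically. The sketch below parses both rule strings (copied verbatim from the outputs above) and reports which features the reduced rule drops and which intervals change. The `parse_rule` helper and its regular expression are hypothetical, written for this rule format only; they are not part of the LionForests API.

```python
import re

# Rule strings copied from the outputs of In [7] (reduced) and In [8] (original).
reduced_rule = (
    "if 6.5<=reversable defect<=7.0 & 3.5<=chest pain<=4.0 & "
    "0.0<=number of major vessels<=0.5 & 1.55<=oldpeak<=1.7 & "
    "0.5<=exercise induced angina<=1.0 & "
    "127.507<=maximum heart rate achieved<=133.494 & "
    "1.5<=the slope of the peak exercise<=2.5 & "
    "184.999<=serum cholestoral<=199.496 & "
    "119.0<=resting blood pressure<=121.491 then presence"
)
original_rule = (
    "if 6.5<=reversable defect<=7.0 & 3.5<=chest pain<=4.0 & "
    "0.0<=number of major vessels<=0.5 & 1.55<=oldpeak<=1.7 & "
    "0.5<=exercise induced angina<=1.0 & "
    "128.005<=maximum heart rate achieved<=130.998 & "
    "1.5<=the slope of the peak exercise<=2.5 & 0.5<=sex<=1.0 & "
    "184.999<=serum cholestoral<=199.496 & 29.002<=age<=41.497 & "
    "0.0<=resting electrocardiographic results<=0.5 & "
    "119.0<=resting blood pressure<=121.491 & "
    "0.0<=fasting blood sugar<=0.5 then presence"
)

# One condition has the form "<low><=<feature name><=<high>".
CONDITION = re.compile(r"(-?\d+(?:\.\d+)?)<=(.+?)<=(-?\d+(?:\.\d+)?)")

def parse_rule(rule):
    """Return ({feature: (low, high)}, predicted_label) for one rule string."""
    body, label = rule[len("if "):].rsplit(" then ", 1)
    intervals = {}
    for condition in body.split(" & "):
        match = CONDITION.fullmatch(condition.strip())
        if match is None:
            raise ValueError(f"unexpected condition: {condition!r}")
        low, feature, high = match.groups()
        intervals[feature] = (float(low), float(high))
    return intervals, label

reduced, reduced_label = parse_rule(reduced_rule)
original, original_label = parse_rule(original_rule)

# Features present only in the original rule, and shared features whose range differs.
dropped = sorted(set(original) - set(reduced))
changed = [f for f in reduced if reduced[f] != original[f]]

print(len(original), "->", len(reduced), "conditions")
print("dropped:", dropped)
print("changed:", changed)
# 13 -> 9 conditions
# dropped: ['age', 'fasting blood sugar', 'resting electrocardiographic results', 'sex']
# changed: ['maximum heart rate achieved']
```

The reduced rule keeps the same prediction with four fewer features, and its range for `maximum heart rate achieved` is wider than in the original rule.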