# Krzysztof Tomala - Explainable AI - Homework 5

# Report

# List of features with short descriptions

Age : Age of the patient

Sex : Sex of the patient

exang: exercise induced angina (1 = yes; 0 = no)

ca: number of major vessels visible in fluoroscopy (0-3)

cp : chest pain type
    Value 1: typical angina
    Value 2: atypical angina
    Value 3: non-anginal pain
    Value 4: asymptomatic

trtbps : resting blood pressure (in mm Hg)

chol : cholesterol in mg/dl fetched via BMI sensor

fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

rest_ecg : resting electrocardiographic results
    Value 0: normal
    Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

thalach : maximum heart rate achieved

thal : Thallium Stress Test result
    
slope : the slope of the peak exercise ST segment (2 = upsloping; 1 = flat; 0 = downsloping)

oldpeak : ST depression induced by exercise relative to rest

# Tasks

1. Calculate Permutation-based Variable Importance for the selected model.

![image](forest.png)

We can see that the most important variables for the random forest classifier, according to Permutation-based Variable Importance (PVI), are ca, cp_0 and thalach.
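
To make the measure concrete, here is a minimal sketch of how a permutation-based importance can be computed by hand. It is not the dalex implementation used in the Appendix: the toy data, the fixed predictor `toy_predict`, and the mean-squared-error loss are assumptions chosen only for illustration. The idea is to shuffle one column at a time and record how much the loss increases relative to the unshuffled data.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error, used here as a simple stand-in loss."""
    return float(np.mean((y_true - y_pred) ** 2))


def permutation_importance(predict, X, y, loss=mse, n_repeats=10, seed=0):
    """Return (baseline loss, mean loss increase per column after shuffling)."""
    rng = np.random.default_rng(seed)
    baseline = loss(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle only column j, breaking its link with the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(loss(y, predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return baseline, importances


# Hypothetical toy data: column 0 matters most, column 1 a little, column 2 is unused
data_rng = np.random.default_rng(42)
X_toy = data_rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.2 * X_toy[:, 1] > 0).astype(float)


def toy_predict(X):
    """Hypothetical fixed 'model': a logistic score that ignores column 2."""
    return 1.0 / (1.0 + np.exp(-(4.0 * X[:, 0] + 0.8 * X[:, 1])))


baseline, importances = permutation_importance(toy_predict, X_toy, y_toy)
print("baseline loss:", baseline)
print("importance per column:", importances)
```

A column the predictor never reads (column 2 here) gets an importance of exactly zero, because shuffling it cannot change the predictions.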

2. Train three more candidate models (different variable transformations, different model architectures, hyperparameters) and compare their rankings of important features using PVI. What are the differences? Why?

![image](forest_hp2.png)

![image](xgb.png)

![image](xgb_hp2.png)

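The numerical results vary somewhat, but the order of the variables is quite similar (for example, ca and cp_0 are always in the top 4).
The differences may come from differences between the models (architecture and hyperparameters) or simply from the randomness of the permutation procedure.
In the Appendix tables, variables whose dropout loss equals that of _full_model_ (for example thall_0 in every model) are ones whose permutation does not change the loss at all, so ties among them carry no ranking information.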
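
One way to separate the two explanations is to repeat the permutation step with different random seeds for a fixed model: whatever spread remains is due to the method alone. The sketch below does this on hypothetical toy data with a hypothetical fixed predictor (`toy_predict`) and a squared-error loss. None of these come from the report, and the 91 rows only mirror the size of the test set.

```python
import numpy as np


def loss_increase_per_column(predict, X, y, seed):
    """One permutation pass: loss increase after shuffling each column once."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((y - predict(X)) ** 2)
    increases = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        increases[j] = np.mean((y - predict(X_perm)) ** 2) - baseline
    return increases


# Hypothetical toy data; column 2 is not used by the predictor
data_rng = np.random.default_rng(0)
X_toy = data_rng.normal(size=(91, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(float)


def toy_predict(X):
    """Hypothetical fixed 'model' that ignores column 2."""
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] + 1.5 * X[:, 1])))


# Same model, same data, 20 different seeds
runs = np.array(
    [loss_increase_per_column(toy_predict, X_toy, y_toy, s) for s in range(20)]
)
mean_importance = runs.mean(axis=0)
seed_std = runs.std(axis=0)

for j in range(X_toy.shape[1]):
    print(f"column {j}: mean = {mean_importance[j]:.4f}, "
          f"std over seeds = {seed_std[j]:.4f}")
```

If the mean importances of two variables differ by less than this seed-to-seed spread, their relative order in a single run should not be over-interpreted.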

3. For the tree-based model from (1), compare PVI with:
   A) the traditional feature importance measures for trees (Gini impurity etc.), i.e. what is implemented in a given library: see e.g. the feature_importances_ attribute in xgboost and sklearn.
   B) [in Python] SHAP variable importance based on the TreeSHAP algorithm available in the shap package.


![image](feature_importance.png)

![image](shap.png)

We can see that these methods give results similar to the earlier ones. I think this is because all of them are used to find which variables have more or less impact on the model's predictions, so it would be a warning sign if different methods gave very different results.
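
The agreement can be checked directly from the two top-10 tables in the Appendix (the outputs of cells 10 and 12). The helper `top_k_overlap` below is a hypothetical name introduced for this sketch; the variable lists are copied from those tables, most important first.

```python
# Top-10 variables copied from the Appendix tables, most important first
gini_top10 = ["thall_2", "ca", "oldpeak", "cp_0", "thalach",
              "thall_3", "chol", "age", "trtbps", "slope"]
shap_top10 = ["ca", "thall_2", "cp_0", "thall_3", "oldpeak",
              "thalach", "sex", "slope", "exng", "age"]


def top_k_overlap(ranking_a, ranking_b, k):
    """Variables that appear in the first k places of both rankings."""
    return sorted(set(ranking_a[:k]) & set(ranking_b[:k]))


overlap_5 = top_k_overlap(gini_top10, shap_top10, 5)
overlap_10 = top_k_overlap(gini_top10, shap_top10, 10)
print(f"shared in top 5:  {len(overlap_5)} -> {overlap_5}")
print(f"shared in top 10: {len(overlap_10)} -> {overlap_10}")
```

Four of the top five and eight of the top ten variables are shared, although the exact order differs (for example, thall_2 is first by impurity-based importance and second by SHAP).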

# Appendix

In [1]:
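import pandas as pd
import sklearn
# Import the submodules used below explicitly instead of relying on side effects
import sklearn.metrics
import sklearn.model_selection
from sklearn import ensemble
import dalex as dx
import lime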

In [2]:
dataset = pd.read_csv('heart.csv')
dataset = pd.get_dummies(dataset)
dataset

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [3]:
features = dataset.drop(columns='output')

# Fix misspelled column names in the data (thalachh -> thalach, slp -> slope, caa -> ca)
features['thalach']=features['thalachh']
features = features.drop(columns='thalachh')

features['slope']=features['slp']
features = features.drop(columns='slp')

features['ca']=features['caa']
features = features.drop(columns='caa')

features = pd.get_dummies(features, columns=['cp', 'thall'])

features
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(features, dataset['output'], test_size=0.3, random_state=0)
X_train

Unnamed: 0,age,sex,trtbps,chol,fbs,restecg,exng,oldpeak,thalach,slope,ca,cp_0,cp_1,cp_2,cp_3,thall_0,thall_1,thall_2,thall_3
137,62,1,128,208,1,0,0,0.0,140,2,0,0,1,0,0,0,0,1,0
106,69,1,160,234,1,0,0,0.1,131,1,1,0,0,0,1,0,0,1,0
284,61,1,140,207,0,0,1,1.9,138,2,1,1,0,0,0,0,0,0,1
44,39,1,140,321,0,0,0,0.0,182,2,0,0,0,1,0,0,0,1,0
139,64,1,128,263,0,1,1,0.2,105,1,1,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,43,1,132,247,1,0,1,0.1,143,1,4,1,0,0,0,0,0,0,1
192,54,1,120,188,0,1,0,1.4,113,1,1,1,0,0,0,0,0,0,1
117,56,1,120,193,0,0,0,1.9,162,1,0,0,0,0,1,0,0,0,1
47,47,1,138,257,0,0,0,0.0,156,2,0,0,0,1,0,0,0,1,0


In [4]:
forest = sklearn.ensemble.RandomForestClassifier()
forest.fit(X=X_train, y=y_train)

# Predict once on the test set and reuse the predictions for every metric
test_pred = forest.predict(X_test)
forest_accuracy = sklearn.metrics.accuracy_score(y_test, test_pred)
forest_recall = sklearn.metrics.recall_score(y_test, test_pred)
forest_precision = sklearn.metrics.precision_score(y_test, test_pred)
print(f'Accuracy: {forest_accuracy}')
print(f'Recall: {forest_recall}')
print(f'Precision: {forest_precision}')

# Training-set metrics, to show how strongly the forest fits the training data
train_pred = forest.predict(X_train)
print('\nResults on train dataset:')
print(f'Accuracy: {sklearn.metrics.accuracy_score(y_train, train_pred)}')
print(f'Recall: {sklearn.metrics.recall_score(y_train, train_pred)}')
print(f'Precision: {sklearn.metrics.precision_score(y_train, train_pred)}')

Accuracy: 0.8571428571428571
Recall: 0.9148936170212766
Precision: 0.8269230769230769

Results on train dataset:
Accuracy: 1.0
Recall: 1.0
Precision: 1.0


In [5]:
obs = list(range(2))
import numpy as np
predict = lambda m, d: m.predict_proba(d)[:,1]

for i in obs:
    obs1 = X_test.iloc[obs[i]].to_numpy().reshape(1,-1)
    print(obs1)
    print(predict(forest, obs1))
    

X = X_test
y = y_test
explainer = dx.Explainer(forest, X_test, y_test, predict_function=predict, label="forest")

[[ 70.    1.  145.  174.    0.    1.    1.    2.6 125.    0.    0.    1.
    0.    0.    0.    0.    0.    0.    1. ]]
[0.19]
[[ 64.    1.  170.  227.    0.    0.    0.    0.6 155.    1.    0.    0.
    0.    0.    1.    0.    0.    0.    1. ]]
[0.65]
Preparation of a new explainer is initiated

  -> data              : 91 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 91 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : forest
  -> predict function  : <function <lambda> at 0x7f50744321f0> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.561, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.89, mean = -0.0447, max = 0.81
  -> model_info        :



In [6]:
pvi = explainer.model_parts(random_state=0)
pvi.result

Unnamed: 0,variable,dropout_loss,label
0,thall_1,0.089773,forest
1,cp_1,0.089966,forest
2,chol,0.090837,forest
3,_full_model_,0.091151,forest
4,thall_0,0.091151,forest
5,trtbps,0.091175,forest
6,fbs,0.091296,forest
7,thall_3,0.093013,forest
8,cp_3,0.09427,forest
9,cp_2,0.094995,forest
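
The rows printed above are sorted by increasing `dropout_loss`, so they show the least important variables; `_full_model_` is the loss without any permutation. A small sketch, using only the values printed above, of how to read them as a change relative to the full model:

```python
# (variable, dropout_loss) pairs copied from the table above
rows = [("thall_1", 0.089773), ("cp_1", 0.089966), ("chol", 0.090837),
        ("_full_model_", 0.091151), ("thall_0", 0.091151), ("trtbps", 0.091175),
        ("fbs", 0.091296), ("thall_3", 0.093013), ("cp_3", 0.09427),
        ("cp_2", 0.094995)]

full_model_loss = dict(rows)["_full_model_"]

# Positive: permuting the variable hurts the model.
# Zero or negative: no measurable effect in this run.
change = {name: round(loss - full_model_loss, 6)
          for name, loss in rows if name != "_full_model_"}

for name, delta in change.items():
    print(f"{name:8s} {delta:+.6f}")
```

Slightly negative values, such as the one for thall_1, are most plausibly noise from the permutation rather than evidence that the variable harms the model.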


In [7]:
forest = sklearn.ensemble.RandomForestClassifier(n_estimators=17, criterion='entropy', max_features='log2')
forest.fit(X=X_train, y=y_train)

# Predict once on the test set and reuse the predictions for every metric
test_pred = forest.predict(X_test)
forest_accuracy = sklearn.metrics.accuracy_score(y_test, test_pred)
forest_recall = sklearn.metrics.recall_score(y_test, test_pred)
forest_precision = sklearn.metrics.precision_score(y_test, test_pred)
print(f'Accuracy: {forest_accuracy}')
print(f'Recall: {forest_recall}')
print(f'Precision: {forest_precision}')

# Training-set metrics, to show how strongly the forest fits the training data
train_pred = forest.predict(X_train)
print('\nResults on train dataset:')
print(f'Accuracy: {sklearn.metrics.accuracy_score(y_train, train_pred)}')
print(f'Recall: {sklearn.metrics.recall_score(y_train, train_pred)}')
print(f'Precision: {sklearn.metrics.precision_score(y_train, train_pred)}')

obs = list(range(2))
import numpy as np
predict = lambda m, d: m.predict_proba(d)[:,1]

for i in obs:
    obs1 = X_test.iloc[obs[i]].to_numpy().reshape(1,-1)
    print(obs1)
    print(predict(forest, obs1))
    

X = X_test
y = y_test
explainer = dx.Explainer(forest, X_test, y_test, predict_function=predict, label="forest_hp2")

pvi = explainer.model_parts(random_state=0)
pvi.result

Accuracy: 0.8131868131868132
Recall: 0.8297872340425532
Precision: 0.8125

Results on train dataset:
Accuracy: 0.9952830188679245
Recall: 1.0
Precision: 0.9915966386554622
[[ 70.    1.  145.  174.    0.    1.    1.    2.6 125.    0.    0.    1.
    0.    0.    0.    0.    0.    0.    1. ]]
[0.17647059]
[[ 64.    1.  170.  227.    0.    0.    0.    0.6 155.    1.    0.    0.
    0.    0.    1.    0.    0.    0.    1. ]]
[0.58823529]
Preparation of a new explainer is initiated

  -> data              : 91 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 91 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : forest_hp2
  -> predict function  : <function <lambda> at 0x7f5074401430> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.54, max = 1.0
  -> model type        : class



Unnamed: 0,variable,dropout_loss,label
0,chol,0.117868,forest_hp2
1,restecg,0.118786,forest_hp2
2,fbs,0.118835,forest_hp2
3,cp_2,0.124154,forest_hp2
4,_full_model_,0.124758,forest_hp2
5,thall_0,0.124758,forest_hp2
6,cp_1,0.125,forest_hp2
7,thall_1,0.12558,forest_hp2
8,trtbps,0.125991,forest_hp2
9,age,0.126064,forest_hp2


In [8]:
import xgboost
model = xgboost.XGBClassifier(
    n_estimators=50, 
    max_depth=2, 
    use_label_encoder=False, 
    eval_metric="logloss",
    
    enable_categorical=True,
    tree_method="hist"
)

model.fit(X_train, y_train)

def pf_xgboost_classifier_categorical(model, df):
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]

explainer = dx.Explainer(model, X_test, y_test, predict_function=pf_xgboost_classifier_categorical, label="xgb")

pvi = explainer.model_parts(random_state=0)
pvi.result



Preparation of a new explainer is initiated

  -> data              : 91 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 91 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : xgb
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x7f50643944c0> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.00192, mean = 0.547, max = 0.998
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.965, mean = -0.0305, max = 0.967
  -> model_info        : package xgboost

A new explainer has been created!


Unnamed: 0,variable,dropout_loss,label
0,_full_model_,0.099613,xgb
1,thall_1,0.099613,xgb
2,thall_0,0.099613,xgb
3,cp_1,0.099613,xgb
4,cp_2,0.099613,xgb
5,cp_3,0.099613,xgb
6,thall_3,0.100145,xgb
7,fbs,0.101547,xgb
8,chol,0.103385,xgb
9,slope,0.103965,xgb


In [9]:
model = xgboost.XGBClassifier(
    n_estimators=33, 
    max_depth=5, 
    use_label_encoder=False, 
    eval_metric="logloss",
    enable_categorical=True,
    tree_method="approx",
    learning_rate=0.01,
)

model.fit(X_train, y_train)

def pf_xgboost_classifier_categorical(model, df):
    df.loc[:, df.dtypes == 'object'] =\
        df.select_dtypes(['object'])\
        .apply(lambda x: x.astype('category'))
    return model.predict_proba(df)[:, 1]

explainer = dx.Explainer(model, X_test, y_test, predict_function=pf_xgboost_classifier_categorical,  label="xgb_hp2")

pvi = explainer.model_parts(random_state=0)
pvi.result



Preparation of a new explainer is initiated

  -> data              : 91 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 91 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : xgb_hp2
  -> predict function  : <function pf_xgboost_classifier_categorical at 0x7f50643949d0> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.372, mean = 0.515, max = 0.63
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.623, mean = 0.00179, max = 0.622
  -> model_info        : package xgboost

A new explainer has been created!


Unnamed: 0,variable,dropout_loss,label
0,slope,0.13235,xgb_hp2
1,fbs,0.133221,xgb_hp2
2,_full_model_,0.133221,xgb_hp2
3,thall_1,0.133221,xgb_hp2
4,thall_0,0.133221,xgb_hp2
5,cp_1,0.133221,xgb_hp2
6,cp_2,0.133221,xgb_hp2
7,exng,0.133221,xgb_hp2
8,thall_3,0.133221,xgb_hp2
9,restecg,0.133221,xgb_hp2


In [10]:
forest = sklearn.ensemble.RandomForestClassifier()
forest.fit(X=X_train,y=y_train)

order = np.flip(np.argsort(forest.feature_importances_))
a = {'name': forest.feature_names_in_[order], 'importance': forest.feature_importances_[order]}
pd.DataFrame.from_dict(a)

Unnamed: 0,name,importance
0,thall_2,0.118656
1,ca,0.111996
2,oldpeak,0.107557
3,cp_0,0.09089
4,thalach,0.08819
5,thall_3,0.072221
6,chol,0.072025
7,age,0.069722
8,trtbps,0.065788
9,slope,0.045181


In [11]:
import shap
shap_explainer = shap.explainers.Tree(forest, data=X, model_output="probability")
shap_values = shap_explainer(X)[:,:,1]
shap_values

.values =
array([[ 3.35288241e-02, -1.27535756e-02, -2.38071254e-02, ...,
        -2.65567752e-04, -8.72872385e-02, -6.77144156e-02],
       [ 7.86944023e-03, -2.01429871e-02, -2.55855572e-02, ...,
        -1.46520112e-04, -5.26436405e-02, -5.71370994e-02],
       [-2.52248817e-02, -2.31964931e-02, -2.42187337e-02, ...,
        -4.02930226e-05, -3.48891929e-02, -4.78788581e-02],
       ...,
       [ 2.76423334e-02, -1.71681052e-02,  8.35020913e-03, ...,
         1.34746200e-04,  9.18965179e-02,  5.32609881e-02],
       [ 9.75876558e-03, -2.34831234e-02,  1.62524856e-02, ...,
         7.63736257e-04,  7.18219500e-02,  5.72851899e-02],
       [-1.82144156e-02, -2.56614332e-02,  1.19324961e-02, ...,
         7.52747266e-04,  6.54504173e-02,  3.73231285e-02]])

.base_values =
array([0.55043956, 0.55043956, 0.55043956, 0.55043956, 0.55043956,
       0.55043956, 0.55043956, 0.55043956, 0.55043956, 0.55043956,
       0.55043956, 0.55043956, 0.55043956, 0.55043956, 0.55043956,
       0.5504395

In [12]:
rf_resultX = pd.DataFrame(shap_values.values, columns = forest.feature_names_in_)

vals = np.abs(rf_resultX.values).mean(0)

shap_importance = pd.DataFrame(list(zip(forest.feature_names_in_, vals)),
                                  columns=['name','importance_shap'])
shap_importance.sort_values(by=['importance_shap'],
                               ascending=False, inplace=True)
shap_importance

Unnamed: 0,name,importance_shap
10,ca,0.084313
17,thall_2,0.07954
11,cp_0,0.068458
18,thall_3,0.055299
7,oldpeak,0.051604
8,thalach,0.038103
1,sex,0.031986
9,slope,0.025195
6,exng,0.022847
0,age,0.016674
