# Homework 4
### Become familiar with Ceteris Paribus (CP) and Partial Dependence profiles (PDP), and their variants such as Accumulated Local Effects (ALE).

The dataset used for the experiments with explanation libraries is the Heart Attack Dataset. Most of the experiments use `XGBClassifier`; the last one compares it with a `RandomForestClassifier`.

## Task 1

<img src="imgs/task1.jpg" alt="task1 solution" width="1000"/>


## Task 2.3

The patients chosen from the dataset for all of the following tasks are exactly the same as in the previous homeworks.

The plots differ between the two samples. Changing the "slp" feature affects the prediction differently for each patient: for one patient the prediction barely changes, while for the other the predicted score varies much more. Similar behaviour can be observed for "chol".
<br /> <br />

<img src="imgs/task2.3.png" alt="task2 solution" width="1200"/>
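
To make the idea behind these plots concrete, here is a minimal sketch of how a CP profile can be computed by hand: one observation is held fixed, a single feature is swept over a grid, and the prediction is recorded at each step. The `ceteris_paribus` helper and the `toy_predict` model are hypothetical and exist only for illustration; they are not the `XGBClassifier` or the dalex implementation used in this homework.

```python
def ceteris_paribus(predict, observation, feature, grid):
    """Return predictions as `feature` sweeps over `grid`, all else held fixed."""
    profile = []
    for value in grid:
        modified = dict(observation)  # copy, so the original observation is untouched
        modified[feature] = value
        profile.append(predict(modified))
    return profile


# Hypothetical toy model (an assumption, not the homework model):
# cholesterol only matters for patients with slp == 2.
def toy_predict(row):
    score = 0.3
    if row["slp"] == 2:
        score += 0.2 + 0.001 * (row["chol"] - 200)
    return score


patient_a = {"slp": 2, "chol": 200}
patient_b = {"slp": 0, "chol": 200}
grid = [150, 200, 250, 300]

cp_a = ceteris_paribus(toy_predict, patient_a, "chol", grid)  # rises with chol
cp_b = ceteris_paribus(toy_predict, patient_b, "chol", grid)  # stays flat
```

With a model that contains an interaction like this one, the same feature produces a sloped profile for one patient and a flat one for the other, which is the kind of difference visible in the plots above.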


## Task 2.4
Comparing PDP with CP, the profile for the "sex" feature barely differs between the two. For "age", however, the differences are clearly visible.
Globally (PDP), changes of age within the 50-60 and 60-70 ranges strongly influence the prediction. For the sampled patient (CP), these influences are weaker.

<br /> <br />
<img src="imgs/task2.4.2.png" alt="task2 solution" width="1200"/>
<img src="imgs/task2.4.1.png" alt="task2 solution" width="1200"/>
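
The gap between the global and the individual view follows from how a PDP is built: it is the average of the CP curves of all observations. The sketch below illustrates this under stated assumptions; `partial_dependence` and `toy_predict` are hypothetical names, and the four rows are made-up data rather than records from the Heart Attack Dataset.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average the ceteris-paribus curves of all `rows` at each grid value."""
    pdp = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        pdp.append(sum(preds) / len(preds))
    return pdp


# Hypothetical toy model: the score drops from age 55 onwards, but only for sex == 1.
def toy_predict(row):
    if row["sex"] == 1 and row["age"] >= 55:
        return 0.2
    return 0.8


# Made-up observations, for illustration only.
rows = [
    {"age": 40, "sex": 1},
    {"age": 60, "sex": 1},
    {"age": 50, "sex": 0},
    {"age": 65, "sex": 0},
]
grid = [45, 55, 65]

pdp_age = partial_dependence(toy_predict, rows, "age", grid)
# CP curve of a single patient (sex == 0), computed on the same grid.
cp_age_patient = [toy_predict({**rows[2], "age": value}) for value in grid]
```

In this toy setting the averaged curve drops at age 55 while the single patient's curve stays flat, so a strong global effect does not have to show up for every individual.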

## Task 2.5
The compared models are `XGBClassifier` and `RandomForestClassifier`. As in the previous task, the influence of "sex" does not differ between the models.
Features such as "age" and "chol" also behave similarly for both models: both show a decrease in the prediction score for "age" at around 57-60 and an increase for "chol" at around 200.

<br /> <br />
<img src="imgs/task2.5.1.png" alt="task2 solution" width="1200"/>
<img src="imgs/task2.5.2.png" alt="task2 solution" width="1200"/>
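
The comparison above is visual. One possible way to put a number on "behave similarly" is to measure the average gap between two profiles evaluated on the same grid. This is only a sketch: `mean_abs_gap` is a hypothetical helper, not part of dalex, and the values below are invented for illustration rather than read from the plots.

```python
def mean_abs_gap(curve_a, curve_b):
    """Mean absolute difference between two profiles evaluated on the same grid."""
    if len(curve_a) != len(curve_b):
        raise ValueError("profiles must be evaluated on the same grid")
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b)) / len(curve_a)


# Invented PD values for two models (assumption, not results from this homework).
pd_model_1 = [0.60, 0.58, 0.45, 0.44]
pd_model_2 = [0.62, 0.60, 0.50, 0.47]

gap = mean_abs_gap(pd_model_1, pd_model_2)
```

A small gap relative to the range of the curves would support the claim that both models respond to the feature in a similar way; the threshold for "small" is a judgment call.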

In [None]:
!pip install pandas
!pip install plotly
!pip install seaborn
!pip install scikit-learn
!pip install xgboost
!pip install imbalanced-learn
!pip install dalex
!pip install shap
!pip install lime

In [48]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
import dalex as dx
import shap
import sklearn

In [12]:
df = pd.read_csv('data/heart.csv')
categorical_cols = ['exng', 'caa', 'cp', 'restecg']

# Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`; `sparse` was removed in 1.4.
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False, drop='first')
df[categorical_cols] = df[categorical_cols].astype('category')
df_tr = df.copy()

ohe.fit(df_tr[categorical_cols])
df_tr[ohe.get_feature_names_out(categorical_cols)] = ohe.transform(df_tr[categorical_cols])
df_tr.drop(columns=categorical_cols, inplace=True)

X, y = df_tr.drop(columns=['output']), df_tr['output']

In [4]:
def run_training(model, run_cv: bool = False):
    if run_cv:
        print(f'CV mean accuracy: {cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()}')

    model.fit(X, y)
    y_pred = model.predict(X)
    conf_mat = confusion_matrix(y, y_pred)
    cmd = ConfusionMatrixDisplay(conf_mat)
    cmd.plot()
    print(classification_report(y, y_pred))

In [13]:
model1 = XGBClassifier()
model1.fit(X, y)

## Task 2
### 1.

In [14]:
patients = X.iloc[[14, 108]]  # fixed row positions 14 and 108 (instead of a random .sample(2))
print(patients.index)
model1.predict_proba(patients)

Int64Index([14, 108], dtype='int64')


array([[8.0226064e-03, 9.9197739e-01],
       [4.6610832e-04, 9.9953389e-01]], dtype=float32)

In [16]:
explainer = dx.Explainer(model1, X, y)

Preparation of a new explainer is initiated

  -> data              : 303 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 303 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x7faacbbb2440> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.000237, mean = 0.545, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.177, mean = 3.57e-05, max = 0.139
  -> model_info        : package xgboost

A new explainer has been created!


In [17]:
explainer.model_performance()

Unnamed: 0,recall,precision,f1,accuracy,auc
XGBClassifier,1.0,1.0,1.0,1.0,1.0


In [18]:
explainer.model_parts().result

Unnamed: 0,variable,dropout_loss,label
0,_full_model_,0.0,XGBClassifier
1,restecg_2,0.0,XGBClassifier
2,caa_3,0.0,XGBClassifier
3,caa_4,0.0,XGBClassifier
4,cp_1,0.0,XGBClassifier
5,fbs,0.0,XGBClassifier
6,restecg_1,4.4e-05,XGBClassifier
7,exng_1,0.000404,XGBClassifier
8,caa_2,0.000632,XGBClassifier
9,cp_2,0.000848,XGBClassifier


In [36]:
cp1 = explainer.predict_profile(new_observation=patients)
cp1.plot(variables=['slp', 'chol'])

Calculating ceteris paribus: 100%|██████████| 19/19 [00:01<00:00, 14.64it/s]


In [55]:
pdp = explainer.model_profile()
pdp.result

Calculating ceteris paribus: 100%|██████████| 19/19 [00:01<00:00, 13.58it/s]


Unnamed: 0,_vname_,_label_,_x_,_yhat_,_ids_
0,age,XGBClassifier,29.00,0.559517,0
1,age,XGBClassifier,29.48,0.559517,0
2,age,XGBClassifier,29.96,0.559517,0
3,age,XGBClassifier,30.44,0.559517,0
4,age,XGBClassifier,30.92,0.559517,0
...,...,...,...,...,...
1914,restecg_2,XGBClassifier,0.96,0.543292,0
1915,restecg_2,XGBClassifier,0.97,0.543292,0
1916,restecg_2,XGBClassifier,0.98,0.543292,0
1917,restecg_2,XGBClassifier,0.99,0.543292,0


In [57]:
pdp.plot(variables=['age', 'sex'])
cp = explainer.predict_profile(new_observation=patients.iloc[[0]])
cp.plot(variables=['age', 'sex'])

Calculating ceteris paribus: 100%|██████████| 19/19 [00:00<00:00, 25.24it/s]


In [52]:
model2 = RandomForestClassifier(max_depth=7, random_state=0)
model2.fit(X, y)

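# Note: model2 is a RandomForestClassifier; the "LogisticRegression" label is a misnomer
# that also appears in the recorded outputs below.
explainer2 = dx.Explainer(model2, X, y, label="LogisticRegression")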
explainer2.model_performance()

Preparation of a new explainer is initiated

  -> data              : 303 rows 19 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 303 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : LogisticRegression
  -> predict function  : <function yhat_proba_default at 0x7faacbbb2440> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.00362, mean = 0.547, max = 0.992
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.7, mean = -0.002, max = 0.549
  -> model_info        : package sklearn

A new explainer has been created!



X does not have valid feature names, but RandomForestClassifier was fitted with feature names



Unnamed: 0,recall,precision,f1,accuracy,auc
LogisticRegression,0.987879,0.981928,0.984894,0.983498,0.998507


In [50]:
pdp2 = explainer2.model_profile()
pdp2.result

Calculating ceteris paribus: 100%|██████████| 19/19 [00:04<00:00,  4.43it/s]


Unnamed: 0,_vname_,_label_,_x_,_yhat_,_ids_
0,age,LogisticRegression,29.00,0.574700,0
1,age,LogisticRegression,29.48,0.574700,0
2,age,LogisticRegression,29.96,0.574700,0
3,age,LogisticRegression,30.44,0.574700,0
4,age,LogisticRegression,30.92,0.574700,0
...,...,...,...,...,...
1914,restecg_2,LogisticRegression,0.96,0.545224,0
1915,restecg_2,LogisticRegression,0.97,0.545224,0
1916,restecg_2,LogisticRegression,0.98,0.545224,0
1917,restecg_2,LogisticRegression,0.99,0.545224,0


In [58]:
pdp2.plot(variables=['age', 'sex', 'chol'])
pdp.plot(variables=['age', 'sex', 'chol'])