# CP and PDP for Heart Attack Analysis models

## Task 1

Consider the following model:

f(x1, x2) = (x1 + x2)^2

Assume that x1, x2 ~ U[-1,1] and x1=x2 (full dependency)

Calculate PD profile for variable x1 in this model.

Extra task if you do not fear conditional expected values: Calculate ME and ALE profiles for variable x1 in this model.

### Solution

<!--
$$g^{1}_{PD}(z) = E_{X_2 \sim U[-1,1]}[(z + x_2)^2] = z^2 + 2*z*E_{X_2 \sim U[-1,1]}[x_2] + E_{X_2 \sim U[-1,1]}[x_2^2]$$
$$E_{X_2 \sim U[-1,1]}[x_2] = 0$$
$$E_{X_2 \sim U[-1,1]}[x_2^2] = \frac{1}{2} * \int_{-1}^{1}x_2^2dx_2 = \frac{1}{2}(\frac{1}{3} * 1 - \frac{1}{3} * (-1)) = \frac{1}{3}$$
$$g^{1}_{PD}(z) = z^2 + 2*z*0 + \frac{1}{3} = z^2 + \frac{1}{3}$$
-->

![task_1](imgs/task_1.png)
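
As a numerical cross-check, the sketch below estimates the PD profile by Monte Carlo and compares it with the closed form $z^2 + \frac{1}{3}$. It also evaluates the ME profile, which under full dependency ($x_2 = x_1 = z$) reduces to $f(z, z) = 4z^2$. The helper names `pd_profile` and `me_profile`, the sample size, and the grid are illustrative choices, not part of the task.

```python
import numpy as np

def f(x1, x2):
    """The model from the task: f(x1, x2) = (x1 + x2)^2."""
    return (x1 + x2) ** 2

rng = np.random.default_rng(0)
# Marginal sample of x2 ~ U[-1, 1]; PD ignores the dependency between x1 and x2.
x2_sample = rng.uniform(-1.0, 1.0, size=200_000)

def pd_profile(z):
    # Average the model over the marginal distribution of x2 with x1 fixed at z.
    return f(z, x2_sample).mean()

def me_profile(z):
    # Conditioning on x1 = z forces x2 = z, so the expectation is just f(z, z).
    return f(z, z)

grid = np.linspace(-1.0, 1.0, 5)
pd_values = np.array([pd_profile(z) for z in grid])
me_values = np.array([me_profile(z) for z in grid])

print(np.round(pd_values - (grid ** 2 + 1 / 3), 3))  # Monte Carlo error, close to 0
print(me_values - 4 * grid ** 2)                     # exactly 0
```

The gap between the two profiles ($z^2 + \frac{1}{3}$ versus $4z^2$) shows how much PD can differ from the conditional expectation when variables are strongly dependent.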

## Task 2

In this work I explain predictions obtained from a Random Forest model, using the CP and PDP
implementations from the `dalex` framework. The dataset used is the Heart Attack Analysis dataset
([source](https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset)).

Dataset attributes:

- age: age of the patient
- sex: sex of the patient
- exng: exercise-induced angina (1 = yes; 0 = no)
- caa: number of major vessels (0-3)
- cp: chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
- trtbps: resting blood pressure (in mm Hg)
- chol: cholesterol in mg/dl fetched via BMI sensor
- fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- thalachh: maximum heart rate achieved
- oldpeak: previous peak
- slp: slope
- output: 0 = less chance of heart attack, 1 = more chance of heart attack
- thall: thallium stress test result [0, 3]

I preprocessed the dataset by one-hot encoding the categorical features.

### Variables correlation
![correlation_matrix](imgs/correlation_matrix.png)

### CP explanations
![rf_cp](imgs/rf_cp.png)

I plotted the 4 variables with the highest importance (according to the dalex explainer) for 8 samples from the dataset.
We can see that changing any single one of these variables would not change the final prediction
(assuming a prediction threshold of 0.5). In some cases the changes are larger, but still not large enough to cross the
threshold. I think this might be because the variables are strongly correlated (as shown in the matrix above).
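
For reference, a CP profile can be computed by hand: copy one observation many times, overwrite a single feature with grid values, and predict. The sketch below is a minimal NumPy illustration of that idea and of the threshold check described above. The function `ceteris_paribus` and the stand-in model `toy_predict` are hypothetical, not the `dalex` implementation and not the Random Forest from the appendix.

```python
import numpy as np

def ceteris_paribus(predict, observation, feature_idx, grid):
    """Predictions for copies of `observation` in which only one feature varies."""
    rows = np.tile(observation, (len(grid), 1))  # one copy per grid value
    rows[:, feature_idx] = grid                  # all other features stay fixed
    return predict(rows)

def toy_predict(X):
    # Hypothetical stand-in for model.predict_proba(X)[:, 1].
    return 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))

observation = np.array([0.2, 1.0])
grid = np.linspace(-2.0, 2.0, 9)
profile = ceteris_paribus(toy_predict, observation, feature_idx=0, grid=grid)

# Which grid values would flip the decision at the 0.5 threshold?
baseline = toy_predict(observation[None, :])[0]
flips = (profile >= 0.5) != (baseline >= 0.5)
print(grid[flips])
```

For the Random Forest profiles discussed here, the same check would return an empty selection for each of the plotted variables, which is what "no single variable changes the prediction" means.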


### Different CP profiles.
![rf_cp_2](imgs/rf_cp_2.png)

I found 2 samples from the dataset that have different profiles for the variable `age`. I think that in this case
it is because the model (correctly) returns opposite predictions for them, and increasing `age` "confuses" the model
because of its strong correlations with other variables, so the model is simply less confident for both samples.
Additionally, from my knowledge of the dataset, I think that older age does not necessarily mean a higher risk of
heart attack, as opposed to the other selected variables, which are more strongly correlated with the output.


### PDP explanations
![rf_pdp](imgs/rf_pdp.png)

Apart from the `age` plot, we can see a very clear (almost linear) relationship between the variable value and the
prediction score. I think this further supports what I presented in the previous section, where I hypothesized that age
is not so clearly correlated with the chance of having a heart attack. In my opinion, PDP explanations are easier to
read in this case because of their lower variance. Of course, that opinion is influenced by the fact that I am
more inclined to explain the model as a whole and not just its prediction on a single sample, keeping in mind that
single-prediction explanations serve a very important purpose of their own.
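
The lower variance follows from how a PD profile is built: it is the average of the CP profiles of all observations. A minimal NumPy sketch of that averaging is shown below. The function `partial_dependence`, the toy model, and the toy data are assumptions made for illustration and do not reproduce `explainer.model_profile()`.

```python
import numpy as np

def partial_dependence(predict, X, feature_idx, grid):
    """Average prediction over the dataset with one feature fixed at each grid value."""
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value            # same value for every observation
        pd_values.append(predict(X_mod).mean())  # mean of the CP profiles at this point
    return np.array(pd_values)

def toy_predict(X):
    # Hypothetical model: 2 * x0 + x1.
    return 2.0 * X[:, 0] + X[:, 1]

X = np.array([[0.0, 1.0],
              [5.0, 2.0],
              [9.0, 3.0]])
grid = np.array([0.0, 1.0, 2.0])
pd_x0 = partial_dependence(toy_predict, X, feature_idx=0, grid=grid)
print(pd_x0)  # [2. 4. 6.], i.e. 2 * z + mean(x1)
```

Because individual differences are averaged out, two observations with opposite CP profiles can contribute to a nearly flat PD profile, which is one reason to read both kinds of plots together.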


### PDP explanations for Logistic Regression model
![lr_pdp](imgs/lr_pdp.png)

For Logistic Regression the partial dependence plots are linear because the model's functional form is linear. The
functional form of a Random Forest can be arbitrarily complex, resulting in more complex partial dependence plots. Aside
from that, all of the selected variables except `age` show a similar influence on the prediction as in the PD plots for
the Random Forest model. The `age` variable shows a rather opposite trend to the one from the Random Forest, where I
would be inclined to say that higher age statistically means a lower prediction score than lower age, but the visible
nonlinearity of that plot (`age` for the Random Forest) might explain the possible difficulties for the regression model.
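
The linearity can be made explicit with a short derivation, sketched here under the assumption that the model score is linear in the features, $f(x) = \beta_0 + \sum_k \beta_k x_k$. The symbols $\beta_k$, $\bar{x}_k$, $c_i$ and $\sigma$ are introduced only for this illustration. Averaging over the $n$ observations with variable $j$ fixed at $z$ gives

$$g^{j}_{PD}(z) = \frac{1}{n}\sum_{i=1}^{n}\Big(\beta_0 + \beta_j z + \sum_{k \neq j}\beta_k x^{(i)}_k\Big) = \beta_j z + \beta_0 + \sum_{k \neq j}\beta_k \bar{x}_k$$

which is a straight line in $z$ with slope $\beta_j$. In the appendix the decision function is additionally passed through a sigmoid $\sigma$, so strictly

$$g^{j}_{PD}(z) = \frac{1}{n}\sum_{i=1}^{n}\sigma(\beta_j z + c_i), \qquad c_i = \beta_0 + \sum_{k \neq j}\beta_k x^{(i)}_k$$

This is monotone in $z$ and stays close to linear as long as the scores remain in the middle part of the sigmoid, which seems consistent with the predicted values reported by the explainer (between 0.183 and 0.817).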

## Appendix

### Install required packages.

In [1]:
%%capture
%pip install dalex jinja2 kaleido numpy nbformat pandas plotly scikit-learn

### Imports and loading dataset

In [2]:
import dalex as dx
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

TARGET_COLUMN = "output"
df = pd.read_csv("heart.csv")
df.describe()

| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


Shuffling the data, extracting the target column, and one-hot encoding the categorical columns.

In [3]:
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

y = df[[TARGET_COLUMN]]

x = df.drop(TARGET_COLUMN, axis=1)

categorical_cols = ["sex", "cp", "fbs", "restecg", "exng", "slp", "caa", "thall"]
numerical_cols = list(set(x.columns) - set(categorical_cols))

x = pd.get_dummies(x, columns=categorical_cols, drop_first=True)
n_columns = len(x.columns)

categorical_cols, numerical_cols

(['sex', 'cp', 'fbs', 'restecg', 'exng', 'slp', 'caa', 'thall'],
 ['trtbps', 'thalachh', 'oldpeak', 'age', 'chol'])

## Random Forest model

In [4]:
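# Pass y as a 1-D Series; a single-column DataFrame triggers a DataConversionWarning.
model = RandomForestClassifier(random_state=0).fit(x, y.squeeze())

# Accuracy on the training data itself, so this is not an estimate of generalization.
accuracy_score(y, model.predict(x))
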
1.0

Selected samples.

In [5]:
sample_ids = [24, 46, 121, 133, 178, 203, 246, 285]
sample_2 = [203, 246]
df.iloc[sample_ids]

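| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 41 | 1 | 1 | 110 | 235 | 0 | 1 | 153 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 46 | 41 | 1 | 2 | 130 | 214 | 0 | 0 | 168 | 0 | 2.0 | 1 | 0 | 2 | 1 |
| 121 | 71 | 0 | 2 | 110 | 265 | 1 | 0 | 130 | 0 | 0.0 | 2 | 1 | 2 | 1 |
| 133 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| 178 | 58 | 1 | 0 | 128 | 216 | 0 | 0 | 131 | 1 | 2.2 | 1 | 3 | 3 | 0 |
| 203 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 246 | 35 | 0 | 0 | 138 | 183 | 0 | 1 | 182 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 285 | 60 | 1 | 0 | 130 | 206 | 0 | 0 | 132 | 1 | 2.4 | 1 | 2 | 3 | 0 |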


In [6]:
sample_preds = model.predict_proba(x.iloc[sample_ids])
sample_preds

array([[0.14, 0.86],
       [0.04, 0.96],
       [0.06, 0.94],
       [0.12, 0.88],
       [0.96, 0.04],
       [0.89, 0.11],
       [0.01, 0.99],
       [1.  , 0.  ]])

In [7]:
def dx_predict_func(m, d): 
    return m.predict_proba(d)[:, 1]

explainer = dx.Explainer(model, x, y, predict_function=dx_predict_func)

Preparation of a new explainer is initiated

  -> data              : 303 rows 22 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 303 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function dx_predict_func at 0x7f4fd035e940> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.547, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.39, mean = -0.00238, max = 0.34
  -> model_info        : package sklearn

A new explainer has been created!




In [8]:
explainer.model_parts().result

| | variable | dropout_loss | label |
|---|---|---|---|
| 0 | _full_model_ | 0.0 | RandomForestClassifier |
| 1 | thall_1 | 0.0 | RandomForestClassifier |
| 2 | slp_1 | 0.0 | RandomForestClassifier |
| 3 | cp_1 | 0.0 | RandomForestClassifier |
| 4 | fbs_1 | 0.0 | RandomForestClassifier |
| 5 | restecg_2 | 0.0 | RandomForestClassifier |
| 6 | caa_4 | 1.1102230000000002e-17 | RandomForestClassifier |
| 7 | caa_3 | 2.195872e-06 | RandomForestClassifier |
| 8 | slp_2 | 4.391744e-06 | RandomForestClassifier |
| 9 | thall_3 | 3.952569e-05 | RandomForestClassifier |


## Ceteris Paribus

In [9]:
cp = explainer.predict_profile(new_observation=x.iloc[sample_ids])
fig = cp.plot(variables=["age", "thalachh", "oldpeak", "thall_2"], show=False)
fig.write_image("imgs/rf_cp.png")
fig.show()

Calculating ceteris paribus: 100%|██████████| 22/22 [00:00<00:00, 75.12it/s]


## Smaller sample

In [10]:
cp = explainer.predict_profile(new_observation=x.iloc[sample_2])
fig = cp.plot(variables=["age", "thalachh", "oldpeak", "thall_2"], show=False)
fig.write_image("imgs/rf_cp_2.png")
fig.show()

Calculating ceteris paribus: 100%|██████████| 22/22 [00:00<00:00, 80.72it/s]


## Partial Dependence Plots

In [11]:
pdp = explainer.model_profile()
fig = pdp.plot(variables=["age", "thalachh", "oldpeak", "thall_2"], show=False)
fig.write_image("imgs/rf_pdp.png")
fig.show()

Calculating ceteris paribus: 100%|██████████| 22/22 [00:03<00:00,  6.08it/s]


## Logistic Regression

In [12]:
lr_clf = RidgeClassifier(random_state=0).fit(x, y.squeeze())

accuracy_score(y, lr_clf.predict(x))

0.8778877887788779

In [13]:
def lr_predict_func(m, d):
    pred = m.decision_function(d)
    return 1 / (1 + np.exp(-pred))

lr_explainer = dx.Explainer(lr_clf, x, y, predict_function=lr_predict_func)

Preparation of a new explainer is initiated

  -> data              : 303 rows 22 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 303 values
  -> model_class       : sklearn.linear_model._ridge.RidgeClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function lr_predict_func at 0x7f5013cc4a60> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.183, mean = 0.522, max = 0.817
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.735, mean = 0.0228, max = 0.739
  -> model_info        : package sklearn

A new explainer has been created!



X does not have valid feature names, but RidgeClassifier was fitted with feature names



In [14]:
lr_pdp = lr_explainer.model_profile()
fig = lr_pdp.plot(variables=["age", "thalachh", "oldpeak", "thall_2"], show=False)
fig.write_image("imgs/lr_pdp.png")
fig.show()

Calculating ceteris paribus: 100%|██████████| 22/22 [00:00<00:00, 55.82it/s]
