# Task 1

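$f(x_1, x_2) = (x_1+x_2)^2$, where $x_1, x_2 \sim U[-1, 1]$ and $x_2 = x_1$ (the features are perfectly dependent, which is what the conditional expectations in ME and ALE rely on) <br><br>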
**PD** <br>
$g_{PD}^1(z) = \mathbb{E}_{x_2} (z + x_2)^2 = z^2 + 2z\mathbb{E}_{x_2}x_2 + \mathbb{E}_{x_2} x_2^2 = z^2 + 0 + \int_{-1}^1 \frac{1}{2} v^2 dv = z^2 + 1/3$ <br><br>
**ME** <br>
$g_{ME}^1(z) = \mathbb{E}_{x_2 | x_1=z} (z + x_2)^2 = \mathbb{E}_{x_2 | x_1=z} 4z^2 = 4z^2$ <br><br>
**ALE** <br>
$g_{AL}^1(z) = \int_{-1}^z \mathbb{E}_{x_2 | x_1=v} \frac{\partial (x_1+x_2)^2}{\partial x_1}\,dv = \int_{-1}^z \mathbb{E}_{x_2 | x_1=v} (2x_1 + 2x_2)\,dv = \int_{-1}^z \mathbb{E}_{x_2 | x_1=v} 4v\,dv = 2z^2 - 2$
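
The three closed forms can be checked numerically. The sketch below is only a sanity check, not part of the original solution. It assumes $x_2 = x_1$ for the conditional expectations (as the ME and ALE derivations do) and replaces the expectation over $U[-1, 1]$ with an average over a midpoint grid. The helper names `pd_profile`, `me_profile` and `ale_profile` are hypothetical.

```python
import numpy as np

def f(x1, x2):
    return (x1 + x2) ** 2

# Midpoint grid on [-1, 1]; averaging over it approximates E over U[-1, 1].
N = 20_000
edges = np.linspace(-1.0, 1.0, N + 1)
grid = (edges[:-1] + edges[1:]) / 2

def pd_profile(z):
    # Partial dependence: average over the marginal distribution of x2.
    return f(z, grid).mean()

def me_profile(z):
    # Marginal effect: x2 | x1 = z is a point mass at z (assumes x2 = x1).
    return f(z, z)

def ale_profile(z, z0=-1.0, steps=20_000):
    # Accumulated local effect: integrate E[df/dx1 | x1 = v] from z0 to z.
    # Under x2 = x1 the local derivative 2 * (x1 + x2) equals 4 * v.
    e = np.linspace(z0, z, steps + 1)
    v = (e[:-1] + e[1:]) / 2
    return np.sum(4 * v) * (z - z0) / steps

for z in (-0.5, 0.0, 0.5):
    print(z, pd_profile(z), me_profile(z), ale_profile(z))
```

The printed values can be compared with $z^2 + 1/3$, $4z^2$ and $2z^2 - 2$ respectively.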

# Task 2

# Report

I'm working on the Heart Attack dataset.
Short description of variables:
* `age`: The person's age in years
* `sex`: The person's sex
  * 1: male
  * 0: female
* `cp`: chest pain type
  * 0: asymptomatic
  * 1: atypical angina
  * 2: non-anginal pain
  * 3: typical angina
* `trtbps`: The person's resting blood pressure (mm Hg on admission to the hospital)
* `chol`: The person's cholesterol measurement in mg/dl
* `fbs`: The person's fasting blood sugar (> 120 mg/dl)
  * 1: true
  * 0: false
* `restecg`: Resting electrocardiographic measurement
  * 0: showing probable or definite left ventricular hypertrophy by Estes' criteria
  * 1: normal
  * 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
* `thalachh`: The person's maximum heart rate achieved
* `exng`: Exercise induced angina
  * 1: true
  * 0: false
* `oldpeak`: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot.)
* `slp`: the slope of the peak exercise ST segment
  * 0: downsloping
  * 1: flat
  * 2: upsloping
* `caa`: The number of major vessels (0-3) colored by fluoroscopy
* `thall`: Results of the blood flow observed via the radioactive dye
  * 0: NULL (dropped from the dataset previously)
  * 1: fixed defect (no blood flow in some part of the heart)
  * 2: normal blood flow
  * 3: reversible defect (a blood flow is observed but it is not normal)
* `output`: Heart disease (target)
  * 0: disease
  * 1: no disease

## 2.

![](https://drive.google.com/uc?export=view&id=1r0_NnA4PI8fRNQ88ISd38tJyNMQPuOR1)
![](https://drive.google.com/uc?export=view&id=1iErK2_DsYNi7QvndnDbE1XHPTabJNUiJ)
![](https://drive.google.com/uc?export=view&id=1y3Gpvnp5T8ksfja5BT6E97N4cGpLHCre)
![](https://drive.google.com/uc?export=view&id=16kRTAUSSgsFs7w-VBFJ-o3jnVtpzid7x)

The plots above show that the CP profiles for the `age` variable depend on the observation. For instance, the prediction for observation `150` is close to 0.5, so changing `age` alone can flip the model's binary prediction. For the other observations, changing only `age` does not change the binary prediction.
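
As a rough illustration of what a CP profile computes, the sketch below copies one observation, replaces a single feature with each value from a grid, and collects the predictions. It is not the `dalex` implementation: `ceteris_paribus`, `toy_predict` and the toy observation are hypothetical stand-ins, and only the age range (29 to 77) comes from the dataset summary in the appendix.

```python
import numpy as np

def ceteris_paribus(predict, x, feature_idx, grid):
    """Predictions for copies of x with one feature replaced by each grid value."""
    x = np.asarray(x, dtype=float)
    X_rep = np.tile(x, (len(grid), 1))   # one copy of the observation per grid point
    X_rep[:, feature_idx] = grid         # change only the chosen feature
    return predict(X_rep)

# Hypothetical stand-in for a fitted classifier's predict_proba(...)[:, 1].
def toy_predict(X):
    return 1.0 / (1.0 + np.exp(-(0.1 * (X[:, 0] - 55.0) - X[:, 1])))

x_obs = np.array([66.0, 1.0])            # toy observation: [age, some other feature]
age_grid = np.linspace(29, 77, 25)       # age range of the dataset
profile = ceteris_paribus(toy_predict, x_obs, feature_idx=0, grid=age_grid)

# The binary prediction flips wherever the profile crosses 0.5.
crosses = bool(np.any(profile < 0.5) and np.any(profile >= 0.5))
print(crosses)
```

An observation whose prediction sits near 0.5 is the case where such a crossing is most likely.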

## 3.
We can also plot the CP profiles of several observations at the same time. The plot below shows them for the 4 observations from the previous point.

![](https://drive.google.com/uc?export=view&id=1llWGoVyJuYRzTw5DW_nmSP5oQigz_LOh)

For example, the profile of observation `0` (second from the top) is mostly decreasing, whereas that of observation `225` (the bottom one) is increasing. Tree-based methods can generally express interactions, and this is an example: the prediction can either fall or rise with `age` depending on the observation, which implies that `age` interacts with other variables.
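
The link between non-parallel CP profiles and interactions can be sketched on toy functions. For an additive model the profiles of two observations differ by a constant, while with an interaction the gap between them changes along the grid. Everything below (`additive`, `interacting`, `gap_spread`, the two observations) is a hypothetical example, not the model from this report.

```python
import numpy as np

def cp_profile(predict, x, feature_idx, grid):
    X_rep = np.tile(np.asarray(x, dtype=float), (len(grid), 1))
    X_rep[:, feature_idx] = grid
    return predict(X_rep)

def additive(X):        # no interaction between the two features
    return X[:, 0] + 2.0 * X[:, 1]

def interacting(X):     # the effect of feature 0 depends on feature 1
    return X[:, 0] * X[:, 1]

grid = np.linspace(-1.0, 1.0, 11)
obs_a, obs_b = [0.0, 1.0], [0.0, -1.0]

def gap_spread(predict):
    # How much the vertical gap between the two profiles varies along the grid.
    gap = cp_profile(predict, obs_a, 0, grid) - cp_profile(predict, obs_b, 0, grid)
    return gap.max() - gap.min()

print(gap_spread(additive), gap_spread(interacting))
```

A spread of zero means parallel profiles. In the interacting case one profile rises while the other falls, as in the plot above.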

## 4.


The PDP can be quite confusing. For younger patients the prediction is higher (1 means no disease), and for patients in their 60s it falls, which is consistent with intuition. For even older patients, however, the prediction increases again, which is counterintuitive. Still, this is fairly consistent with what we saw in the CP profiles. There are a few possible explanations. One is correlation between features: for instance, older patients may be more likely to have regular check-ups, so alarming symptoms are detected sooner and the disease can be prevented.

![](https://drive.google.com/uc?export=view&id=1TyoTg3pwKGYFs2GyzuU2SKeWQ6vhv52C)
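
The PDP agrees with the CP plots because a partial dependence profile is the average of the CP profiles over the dataset. A minimal sketch of that averaging, assuming a generic `predict` function; `partial_dependence`, `toy_predict` and `X_toy` are hypothetical names used only for illustration:

```python
import numpy as np

def partial_dependence(predict, X, feature_idx, grid):
    X = np.asarray(X, dtype=float)
    pdp = np.empty(len(grid))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = value    # same feature value for every row
        pdp[k] = predict(X_mod).mean()   # mean of all CP profiles at this value
    return pdp

def toy_predict(X):     # hypothetical model with an interaction
    return X[:, 0] * X[:, 1]

X_toy = np.array([[0.0, 1.0], [0.0, -1.0], [0.0, 3.0]])
grid = np.linspace(-1.0, 1.0, 5)
pdp = partial_dependence(toy_predict, X_toy, 0, grid)
print(pdp)
```

In this toy data one observation has a falling profile and two have rising ones, so the average hides the opposing trend. This is one reason to read a PDP together with the individual CP profiles.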

## 5. 
Interestingly, the PDP is much smoother for the Random Forest (below). Nevertheless, both plots follow a similar trend: from high values to lower ones and then higher again. Here, however, the rather unlikely phenomenon spotted in the previous point is much less pronounced.

![](https://drive.google.com/uc?export=view&id=1H_A7Vn5XplSbkSdRK5ctHS0Guitm75kb)

# Appendix

## Import packages

In [None]:
!pip install dalex &>/dev/null
!pip install lime &>/dev/null

In [55]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix,\
  ConfusionMatrixDisplay, f1_score, recall_score, precision_score
from sklearn.preprocessing import StandardScaler

import xgboost as xgb
import dalex as dx
import lime

SEED = 42

## Load data

In [None]:
!gdown 14RnHkHVRmZHdXF7_THt7arQzlzplKbF1

Downloading...
From: https://drive.google.com/uc?id=14RnHkHVRmZHdXF7_THt7arQzlzplKbF1
To: /content/heart.csv
  0% 0.00/11.3k [00:00<?, ?B/s]100% 11.3k/11.3k [00:00<00:00, 14.7MB/s]


In [None]:
df_raw = pd.read_csv('heart.csv')
df_raw.head()

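| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |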


## Data preprocessing

In [None]:
df_raw.info() # only int and float

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64  
 11  caa       303 non-null    int64  
 12  thall     303 non-null    int64  
 13  output    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [None]:
print(df_raw.shape) # 303 observations, 14 variables (including one output class)
df_raw.describe()

(303, 14)


| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


Based on the data and the documentation from Kaggle, we should one-hot encode the following categorical variables: `cp`, `restecg`, `slp` and `thall`.

In [None]:
df = pd.get_dummies(df_raw, columns=['cp', 'restecg', 'slp', 'thall'])
df.head()

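| | age | sex | trtbps | chol | fbs | thalachh | exng | oldpeak | caa | output | ... | restecg_0 | restecg_1 | restecg_2 | slp_0 | slp_1 | slp_2 | thall_0 | thall_1 | thall_2 | thall_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 145 | 233 | 1 | 150 | 0 | 2.3 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 37 | 1 | 130 | 250 | 0 | 187 | 0 | 3.5 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 41 | 0 | 130 | 204 | 0 | 172 | 0 | 1.4 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 56 | 1 | 120 | 236 | 0 | 178 | 0 | 0.8 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 57 | 0 | 120 | 354 | 0 | 163 | 1 | 0.6 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |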


In [None]:
X = df.drop('output', axis=1)
y = df.output

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED, stratify=y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

With the `stratify` argument we ensure that the class proportions in the train and test sets are the same as in `y`.

Also, I decided to standardize the features by removing the mean and scaling to unit variance. We don't want a feature whose variance is orders of magnitude larger than that of the others.
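
What the standardization step does can be sketched with plain NumPy. This is an illustration on made-up toy arrays (`X_train_toy`, `X_test_toy` are hypothetical), not a replacement for `StandardScaler`. The statistics are estimated on the training part only and then reused for the test part, mirroring `fit_transform` followed by `transform`.

```python
import numpy as np

# Toy stand-ins for X_train / X_test with two features on very different scales.
X_train_toy = np.array([[29.0, 126.0], [55.0, 240.0], [77.0, 564.0]])
X_test_toy = np.array([[61.0, 274.0]])

mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)               # population std (ddof=0), as StandardScaler uses

X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std   # reuse the train statistics, do not refit on test

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

After scaling, both training columns have mean 0 and unit variance, so neither dominates purely because of its units.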

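For now I use just a train/test split; in the future I can extend it to a train/val/test split if needed. Note that the models below are fitted and evaluated on the full, unscaled `X`, so the split is not used yet.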

## Train an XGBoost model

In [None]:
BST_model = xgb.XGBClassifier(random_state=SEED, max_depth=2).fit(X, y)

## Evaluate model on some examples

In [None]:
pf_xgboost_classifier_default = lambda m, d: m.predict_proba(d)[:, 1]
explainer = dx.Explainer(BST_model, X, y, predict_function=pf_xgboost_classifier_default, label="GBM")

Preparation of a new explainer is initiated

  -> data              : 303 rows 23 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 303 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : GBM
  -> predict function  : <function <lambda> at 0x7f95549dadd0> will be used
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.00365, mean = 0.544, max = 0.995
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.929, mean = 0.000168, max = 0.836
  -> model_info        : package xgboost

A new explainer has been created!


Check performance

In [None]:
explainer.model_performance()

| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| GBM | 0.933333 | 0.905882 | 0.919403 | 0.910891 | 0.978788 |


In [None]:
observations = X.iloc[:300:75] # choosing 4 observations
observations

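| | age | sex | trtbps | chol | fbs | thalachh | exng | oldpeak | caa | cp_0 | ... | restecg_0 | restecg_1 | restecg_2 | slp_0 | slp_1 | slp_2 | thall_0 | thall_1 | thall_2 | thall_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 145 | 233 | 1 | 150 | 0 | 2.3 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 75 | 55 | 0 | 135 | 250 | 0 | 161 | 0 | 1.4 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 150 | 66 | 1 | 160 | 228 | 0 | 138 | 0 | 2.3 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 225 | 70 | 1 | 145 | 174 | 0 | 125 | 1 | 2.6 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |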


In [None]:
explainer.predict(observations)

array([0.6567136 , 0.94133586, 0.4969589 , 0.09892399], dtype=float32)

In [None]:
observations.index[1]

75

## Calculate what-if explanations of some predictions

In [None]:
for i in range(len(observations)):
    cp = explainer.predict_profile(observations.iloc[i])
    cp.plot(variables=['age'], title=f'Ceteris Paribus Profiles for obs {observations.index[i]}')

Calculating ceteris paribus: 100%|██████████| 23/23 [00:00<00:00, 217.20it/s]


Calculating ceteris paribus: 100%|██████████| 23/23 [00:00<00:00, 131.10it/s]


Calculating ceteris paribus: 100%|██████████| 23/23 [00:00<00:00, 115.62it/s]


Calculating ceteris paribus: 100%|██████████| 23/23 [00:00<00:00, 128.55it/s]


In [None]:
cp_many = explainer.predict_profile(observations)
cp_many.plot(variables=['age'])

Calculating ceteris paribus: 100%|██████████| 23/23 [00:00<00:00, 149.38it/s]


## PDP

In [50]:
pdp = explainer.model_profile()
pdp.plot(variables=['age'])

Calculating ceteris paribus: 100%|██████████| 23/23 [00:01<00:00, 11.69it/s]


## Random Forest (another model)

In [57]:
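RF_model = RandomForestClassifier(class_weight='balanced', random_state=SEED).fit(X, y)
# The predict function only calls predict_proba, so it is reused for this model.
# Note: the label "LR" is a leftover name; this explainer wraps the Random Forest.
explainer_rf = dx.Explainer(RF_model, X, y, predict_function=pf_xgboost_classifier_default, label="LR")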

Preparation of a new explainer is initiated

  -> data              : 303 rows 23 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 303 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : LR
  -> predict function  : <function <lambda> at 0x7f95549dadd0> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.547, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.44, mean = -0.00241, max = 0.34
  -> model_info        : package sklearn

A new explainer has been created!



X does not have valid feature names, but RandomForestClassifier was fitted with feature names



In [58]:
explainer_rf.model_performance()

| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| LR | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [59]:
explainer_rf.predict(observations)

array([0.85, 0.97, 0.74, 0.04])

In [61]:
pdp_rf = explainer_rf.model_profile()
pdp_rf.plot(variables=['age'])

Calculating ceteris paribus: 100%|██████████| 23/23 [00:06<00:00,  3.44it/s]
