# Task 1

$$
g_{PD}^1(z) =
E_{X^{-j}}f(X^{j|=z}) =
E_{x_2} (z^2 + 2zx_2 + x_2^2) =
z^2 + 2zE_{x_2} x_2 + E_{x_2} x_2^2 =
\\ =
z^2 + 2z \cdot 0 + E_{x_2} x_2^2 =
z^2 + E_{x_2 \in [-1, 1]} x_2^2 =
z^2 + \frac{1}{3}
$$

$$
g_{MP}^1(z) =
E_{X^{-j}|X^j=z}f(X^{j|=z}) =
E_{x_2|x_1=z} (z+x_2)^2 =
(z + z)^2 =
4z^2
$$

$$
g_{AL}^1(z) =
\int_{z_0}^z \left[E_{X^{-j}|X^j=v} \frac{\partial f(x)}{\partial x_j}\right]dv + c =
\int_{-1}^z E_{x_2|x_1=v} \frac{\partial(x_1^2 + 2x_1x_2 + x_2^2)}{\partial x_1} dv + c =
\\ =
\int_{-1}^z E_{x_2|x_1=v} (2x_1 + 2x_2) dv + c =
\int_{-1}^z (2v + 2v)dv + c =
4 \int_{-1}^z v dv + c =
2z^2 - 2 + c
$$
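The three closed forms above can be checked numerically. The sketch below assumes the setup that the derivation appears to use but does not state explicitly: $x_1 \sim U[-1, 1]$, $x_2 = x_1$, $f(x_1, x_2) = (x_1 + x_2)^2$, $z_0 = -1$ and $c = 0$. The helper names (`midpoints`, `g_pd`, `g_mp`, `g_al`) are introduced here for illustration only.

```python
# Assumed setup (inferred from the derivation, not stated in the task text):
# x1 ~ U[-1, 1], x2 = x1, f(x1, x2) = (x1 + x2)^2, z0 = -1, c = 0.

def f(x1, x2):
    return (x1 + x2) ** 2

def df_dx1(x1, x2):
    # Partial derivative of f with respect to x1.
    return 2 * x1 + 2 * x2

def midpoints(a, b, n):
    """Midpoints of n equal-width cells covering [a, b]."""
    h = (b - a) / n
    return [a + (k + 0.5) * h for k in range(n)]

def g_pd(z, n=2000):
    # Average f(z, x2) over the marginal distribution of x2, here U[-1, 1].
    xs = midpoints(-1.0, 1.0, n)
    return sum(f(z, x2) for x2 in xs) / n

def g_mp(z):
    # Conditional on x1 = z, x2 equals z with probability 1.
    return f(z, z)

def g_al(z, n=2000):
    # Integrate E[df/dx1 | x1 = v] from -1 to z with the midpoint rule.
    h = (z + 1.0) / n
    return sum(df_dx1(v, v) for v in midpoints(-1.0, z, n)) * h

z = 0.5
print("PD:", g_pd(z), "closed form:", z**2 + 1 / 3)
print("MP:", g_mp(z), "closed form:", 4 * z**2)
print("AL:", g_al(z), "closed form:", 2 * z**2 - 2)
```

Each line prints the numerical value next to the corresponding closed form, so any mismatch in the derivation would show up directly.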

# Task 2

## 2.

Here are two Ceteris Paribus profiles (my own implementation). I sample 50 evenly spaced values of age between its minimum and maximum in the test set and compute the model's prediction for each. As we can see, the prediction changes abruptly as age varies, so the profiles are not smooth.

| CP - Obs. 1.      | CP - Obs. 3.      |
|-------------------|-------------------|
| ![](imgs/2/0.png) | ![](imgs/2/2.png) |
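One likely reason for the jagged shape is that tree-based models are piecewise constant in each feature. The sketch below is not the model from this report: `toy_forest_predict` is a hypothetical stand-in built from three hand-written stumps with made-up thresholds, and the age range 29 to 77 and the feature `other_feature` are assumed purely for illustration.

```python
# Hypothetical stand-in for a tree ensemble: three stumps that split on age.
# Thresholds and leaf values are made up for illustration only.
STUMPS = [(45.0, 0.8, 0.6), (55.0, 0.7, 0.3), (62.0, 0.5, 0.4)]  # (threshold, left, right)

def toy_forest_predict(obs):
    """Average of the stump outputs for one observation (a dict of features)."""
    votes = [left if obs["age"] <= thr else right for thr, left, right in STUMPS]
    return sum(votes) / len(votes)

def cp_profile(predict, obs, feature, lo, hi, n=50):
    """Vary one feature over n evenly spaced values, keeping the others fixed."""
    grid = [lo + (hi - lo) * k / (n - 1) for k in range(n)]
    return [(v, predict({**obs, feature: v})) for v in grid]

profile = cp_profile(toy_forest_predict, {"age": 57, "other_feature": 1.0}, "age", 29, 77)
levels = sorted({round(p, 6) for _, p in profile})
print(len(profile), "grid points,", len(levels), "distinct prediction levels")
```

With only a handful of split points, the profile jumps between a few plateaus; a real forest has many more splits, which gives the staircase-like curves seen in the plots.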


## 3.

As above, here are two Ceteris Paribus profiles (my own implementation). The two differ considerably:
- the first has a lower prediction score for age in the interval [55, 65]
- the second has a lower prediction score for age above 60 and a higher score in the interval [50, 60]

This is quite common for a Random Forest model: it captures interactions between features, so different observations respond differently to a change in a single variable.

| CP - Obs. 1.      | CP - Obs. 2.      |
|-------------------|-------------------|
| ![](imgs/3/0.png) | ![](imgs/3/1.png) |


## 4.

Here is a PDP plot (my own implementation), computed as the mean of the Ceteris Paribus profiles over the whole test dataset.
In the background we can see all the individual CP profiles.
This gives a much better understanding of the model as a whole: the individual CP profiles vary a lot, and averaging them reduces that variance.
As we can see, the model in general tends to give a lower score when age is close to 60.

| PDP                 | CP - Obs. 1.      |
|---------------------|-------------------|
| ![](imgs/4/pdp.png) | ![](imgs/4/0.png) |
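The relation between the PDP and the CP profiles can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation from the appendix: `toy_predict`, its coefficients, and the `sex` feature are hypothetical and chosen only so that the individual profiles differ.

```python
def toy_predict(obs):
    """Hypothetical model with an age/sex interaction, for illustration only."""
    slope = -0.01 if obs["sex"] == 1 else 0.005
    return 0.5 + slope * (obs["age"] - 50)

def cp_profile(predict, obs, feature, grid):
    """Predictions for one observation as `feature` runs over `grid`."""
    return [predict({**obs, feature: v}) for v in grid]

def pdp(predict, data, feature, grid):
    """Pointwise mean of the CP profiles of every observation in `data`."""
    profiles = [cp_profile(predict, obs, feature, grid) for obs in data]
    return [sum(col) / len(col) for col in zip(*profiles)]

data = [{"age": 40, "sex": 1}, {"age": 60, "sex": 0}, {"age": 55, "sex": 1}]
grid = [30, 50, 70]
for obs in data:
    print("CP: ", cp_profile(toy_predict, obs, "age", grid))
print("PDP:", pdp(toy_predict, data, "age", grid))
```

Here two profiles fall with age and one rises, and the PDP is their pointwise average, which is why it looks smoother than any single profile.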



## 5.

Here are two PDP plots, one for the logistic regression and one for the random forest model.
The difference is quite clear:
- Logistic regression is linear in the features on the log-odds scale, so every CP profile is a monotone (sigmoid-shaped) curve, and the PDP (their mean) is monotone as well. This model gives lower predictions to young patients.
- Random Forest has more complicated CP profiles. On the PDP plot we can see that this model gives a lower score for ages close to 60 and a slightly higher one for ages close to 55.


| PDP - Linear Model                 | PDP - Random Forest                    |
|------------------------------------|----------------------------------------|
| ![](imgs/5/LogisticRegression.png) | ![](imgs/5/RandomForestClassifier.png) |
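The appendix fits `LogisticRegression`, whose predicted probability is a sigmoid of a linear score. The sketch below illustrates why its CP profiles in age, and therefore their mean, can only move in one direction. The age weight `W_AGE` and the per-observation `offsets` (standing in for the contribution of all other, fixed features) are hypothetical values, not the fitted coefficients.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical coefficients: one shared age weight and one offset per observation.
W_AGE = 0.04
offsets = [-3.0, -2.0, -1.5]
ages = list(range(30, 78, 4))

# One CP profile per observation, then their pointwise mean (the PDP).
profiles = [[sigmoid(W_AGE * a + b) for a in ages] for b in offsets]
pdp = [sum(col) / len(col) for col in zip(*profiles)]

def is_increasing(ys):
    return all(y1 < y2 for y1, y2 in zip(ys, ys[1:]))

print("all CP profiles increasing:", all(is_increasing(p) for p in profiles))
print("PDP increasing:", is_increasing(pdp))
```

Because every observation shares the same sign of the age weight, all profiles rise or fall together, so the averaged curve cannot have the dips and bumps seen for the random forest.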



# Appendix

## 0.
Here the data is loaded (same as in the previous homework) and a simple model is trained and evaluated. It is a Random Forest Classifier from sklearn with default parameters.

Loading and preparing the data consists of:
- one-hot encoding of the categorical columns (models like logistic regression require this)
- splitting the data into the target (y) and the features (x)

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split
import plotly.express as px
import plotly.graph_objects as go


np.random.seed(42)


def load_data():
    df = pd.read_csv('heart.csv')

    # One hot encoding (for linear classifier)
    df = pd.get_dummies(df, columns=['caa', 'cp', 'restecg'])

    # Get targets
    y_all = df['output'].to_numpy()

    x_all = df.drop(columns=['output'])

    # Split data to train and test
    return train_test_split(x_all, y_all, test_size=0.2, random_state=42)


x_train, x_test, y_train, y_test = load_data()
print(f"{x_train.shape=}", f"{x_test.shape=}", f"{y_train.shape=}", f"{y_test.shape=}")

x_train.shape=(242, 22) x_test.shape=(61, 22) y_train.shape=(242,) y_test.shape=(61,)


Training the models

In [2]:
def get_model(verbose=False, model=None):
    if model is None:
        model = RandomForestClassifier()
    metrics = {
        "auc": roc_auc_score,
        "accuracy": accuracy_score
    }
    model.fit(x_train, y_train)
    # Hard class labels; note that the AUC below is therefore computed from labels, not probabilities
    pred_test = model.predict(x_test)
    if verbose:
        print({metric_name: metric_fun(y_test, pred_test) for metric_name, metric_fun in metrics.items()})

    return model

model = get_model(True)
model_linear = get_model(True, LogisticRegression())

{'auc': 0.8841594827586207, 'accuracy': 0.8852459016393442}
{'auc': 0.8685344827586206, 'accuracy': 0.8688524590163934}


## 1.
Here are the model predictions on the first two observations from the test set

In [3]:
print(model.predict_proba(x_test.iloc[0:2])[:, 1])

[0.06 0.61]


## 2.

In [4]:
col = 'age'

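def get_plot_data(n=50, m=model):
    minmax = x_test[col].min(), x_test[col].max()

    # DataFrame.append was removed in pandas 2.0, so rows are collected in a list
    rows = []
    for i in range(n):
        # Set `col` to the same value for every test observation, keep the rest unchanged
        data = x_test.copy()
        # Grid runs from the max (i = 0) down towards the min in steps of (max - min) / n
        val = (minmax[0] * i + minmax[1] * (n - i)) / n
        data[col] = val
        preds = m.predict_proba(data)[:, 1]
        rows.append({
            col: val,
            **{f"prediction (obs. {j+1}.)": preds[j] for j in range(len(preds))},
            "prediction mean": preds.mean(axis=0),
        })
    # Column order: feature value, one column per observation, then the mean
    return pd.DataFrame(rows)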


plot = get_plot_data()
for j in [0, 2]:
    px.line(plot, x=col, y=plot.columns[1+j]).write_image(f"imgs/2/{j}.png")


## 3.

In [5]:
for j in [0, 1]:
    px.line(plot, x=col, y=plot.columns[1+j]).write_image(f"imgs/3/{j}.png")


## 4.

In [6]:
for j in [0, 1, 2]:
    px.line(plot, x=col, y=plot.columns[1+j]).write_image(f"imgs/4/{j}.png")

fig = go.Figure()
fig.add_trace(go.Scatter(x=plot[col], y=plot["prediction mean"], mode='lines', name="prediction mean"))
for c in plot.columns[1:-1]:
    fig.add_trace(go.Scatter(x=plot[col], y=plot[c], mode='lines', name='lines', showlegend=False, opacity=0.1))

fig.write_image("imgs/4/pdp.png")

## 5.

In [7]:
for m in [model, model_linear]:
    plot = get_plot_data(m=m)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=plot[col], y=plot["prediction mean"], mode='lines', name="prediction mean"))
    for c in plot.columns[1:-1]:
        fig.add_trace(go.Scatter(x=plot[col], y=plot[c], mode='lines', name='lines', showlegend=False, opacity=0.1))

    fig.write_image(f"imgs/5/{m.__class__.__name__}.png")