<img src="https://www.th-koeln.de/img/logo.svg" style="float:right;" width="200">

# 9th exercise: <font color="#C70039">Interpretable Machine Learning by means of Partial Dependence (PDP) and Individual Conditional Expectation (ICE) Plots</font>
* Course: AML
* Lecturer: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Author of notebook: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Date:   04.09.2024

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_partial_dependence_003.png" style="float: center;" width="800">

---------------------------------
**GENERAL NOTE 1**:
Please make sure you read the entire notebook, since it contains a lot of information on your tasks (e.g. regarding the setting of certain parameters or a specific computational trick). The markdown cells and the code comments also explain how things work together as a whole.

**GENERAL NOTE 2**:
* Please use English only when commenting source code.
* Please use English when describing an observation, too.
* This applies to all exercises throughout this course.

---------------------------------

### <font color="ce33ff">DESCRIPTION</font>:

Partial dependence plots (PDP) and individual conditional expectation (ICE) plots can be used to visualize and analyze the interaction between the target response and a set of input features of interest.
Both PDPs [H2009] and ICEs [G2015] assume that the input features of interest are independent of the complement features, an assumption that is often violated in practice. In the case of correlated features, computing the PDP/ICE therefore evaluates the model on absurd data points, i.e. feature combinations that are unlikely to occur in reality.

[H2009]
T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Second Edition, Section 10.13.2, Springer, 2009.

[G2015]
A. Goldstein, A. Kapelner, J. Bleich, and E. Pitkin, “Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation” Journal of Computational and Graphical Statistics, 24(1): 44-65, Springer, 2015.
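
The averaging behind a PDP, and the reason why correlated features lead to implausible data points, can be sketched by hand. The following is a minimal sketch on synthetic data and not the PDPbox implementation; the helper `manual_pdp`, the data set and the model are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption, not the data set of this exercise)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

def manual_pdp(model, X, feature_idx, grid):
    """Average predicted probability of class 1 per grid value (hypothetical helper)."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        # Overwrite the feature in ALL rows, regardless of the other feature values.
        # With correlated features this step creates unrealistic rows.
        X_mod[:, feature_idx] = value
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)

# Evaluate the partial dependence of feature 0 on an evenly spaced grid
grid = np.linspace(X_demo[:, 0].min(), X_demo[:, 0].max(), 10)
pdp_values = manual_pdp(demo_model, X_demo, 0, grid)
print(np.round(pdp_values, 3))
```

Each entry of `pdp_values` is one point of the PDP curve for feature 0.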

---------------------------------

### <font color="FFC300">TASKS</font>:
The tasks that you need to work on within this notebook are always indicated below as bullet points.
If a task is more challenging and consists of several steps, this is indicated as well.
Make sure you have worked through the task list and documented what you did.
Use markdown cells for this.<br>
<font color=red>Make sure you don't forget to specify your name and your matriculation number in the notebook.</font>

**YOUR TASKS in this exercise are as follows**:
1. import the notebook to Google Colab or use your local machine.
2. make sure you specified your name and your matriculation number in the header below my name and date.
    * set the date too and remove mine.
3. read the entire notebook carefully
    * add comments wherever you feel it is necessary for better understanding
    * run the notebook for the first time.
    * try to follow the interpretations by printing out the decision tree and look for the feature patterns that the PDPs indicate.

**PART I**<br>
4. download an interesting data set from Kaggle and do the preprocessing.<br>
5. change the classifier according to the data set. The more blackbox the better.<br>
6. use PDP to identify the most relevant features explaining the target response of the data set.<br>
7. comment your entire code and your findings.<br>  

**PART II**<br>
8. use the data set and the classifier from steps 4 and 5<br>
9. plot ICE curves with parameter (kind='both')<br>
10. comment your entire code and your findings.<br>  

---------------------------------

# <font color="ce33ff">PART I (Partial Dependence Plots)</font>

## Imports
Import all necessary python utilities.

In [2]:
!pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:16
🔁 Restarting kernel...


In [1]:
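# Note: each `!` command runs in its own subshell, so `conda activate`
# does not carry over to later cells; the kernel keeps its own Python.
!conda create -n py38 python=3.8 -y
!conda activate py38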

Channels:
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json):

In [2]:
!pip install PDPbox

Collecting PDPbox
  Downloading PDPbox-0.3.0-py3-none-any.whl.metadata (4.6 kB)
Collecting joblib>=1.1.0 (from PDPbox)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting matplotlib>=3.6.2 (from PDPbox)
  Downloading matplotlib-3.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting numpy>=1.21.5 (from PDPbox)
  Downloading numpy-2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting pandas>=1.4.4 (from PDPbox)
  Downloading pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting plotly>=5.9.0 (from PDPbox)
  Downloading plotly-5.24.1-py3-none-any.whl.metadata (7.3 kB)
Collecting pqdm>=0.2.0 (from PD

In [None]:
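# Optional: pin the package versions (in Colab too).
# Note: these pins belong to the older PDPbox 0.2.x API, whereas the cells below use the
# class-based API of PDPbox 0.3.0 (pdp.PDPIsolate, pdp.PDPInteract).
#!pip install matplotlib==3.1.1 scikit-learn==0.23.1 PDPbox==0.2.1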

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from matplotlib import pyplot as plt
from pdpbox import pdp, get_example, info_plots

#import warnings
#warnings.filterwarnings('ignore')

## Load data set

In [4]:
# Append ?raw=true so that GitHub serves the raw CSV file. The plain "blob" URL returns
# an HTML page, which makes pd.read_csv fail with a ParserError.
url = "https://github.com/len-rtz/AML/blob/a87416cff712f53d5cb8c19d8f79a27fef9cba90/data/FIFA/FIFA.Statistics.2018.csv?raw=true"
data = pd.read_csv(url)
data.head(3)


In [None]:
# all features are:
print(data.columns.tolist())

## Preprocessing


In [None]:
# Convert the target from the strings "Yes"/"No" to a boolean
y = (data['Man of the Match'] == 'Yes')

# Use only the integer-valued (int64) columns as features
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]

x = data[feature_names]
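
The effect of this preprocessing step can be checked on a tiny hand-made table. This is only a sketch: the rows and the column `Ball Possession (float)` are invented for illustration and are not taken from the FIFA data set.

```python
import numpy as np
import pandas as pd

# Toy table with hypothetical values (assumption, not the real data)
toy = pd.DataFrame({
    "Team": ["A", "B", "C"],
    "Goal Scored": pd.Series([0, 2, 1], dtype="int64"),
    "Ball Possession (float)": [40.5, 59.5, 50.0],
    "Man of the Match": ["No", "Yes", "No"],
})

# Same logic as in the cell above: boolean target, int64 columns as features
toy_y = (toy["Man of the Match"] == "Yes")
toy_features = [c for c in toy.columns if toy[c].dtype in [np.int64]]

print(toy_features)
print(toy_y.tolist())
```

Note that string and float columns are dropped silently by the dtype filter, so it is worth printing the selected feature names when you switch to your own data set.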

## Train the classifier

Start with a simple decision tree model.
<font color=red>Note:</font> Obviously, a partial dependence can only be calculated after a model has been trained.

In [None]:
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=1)
tree_model = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_split=5).fit(train_x, train_y)
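
Task 3 asks you to print the decision tree. One possible way is scikit-learn's `export_text`, sketched here on synthetic data; the data set, the feature names `f0` to `f3` and the variable names are assumptions made for illustration. For the notebook's own model you would pass `tree_model` and `feature_names` instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_names = ["f0", "f1", "f2", "f3"]

# Small tree so that the printed rules stay readable
demo_tree = DecisionTreeClassifier(random_state=0, max_depth=3).fit(X_demo, y_demo)

# Text representation of the split rules, one line per node
tree_rules = export_text(demo_tree, feature_names=demo_names)
print(tree_rules)
```

Features that never show up in the printed rules cannot influence the predictions of the tree, so their PDP has to be flat.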

## Partial Dependence Plots (PDP)
#### read the API reference guide for further possibilities
https://pdpbox.readthedocs.io/en/latest/PDPIsolate.html#pdpbox.pdp.PDPIsolate

In [None]:
# Create the pdp data to be plotted
pdp_goals = pdp.PDPIsolate(model=tree_model, df=val_x, model_features=feature_names, feature='Goal Scored', feature_name='Number of Goals Scored')

# plot the PDP for feature 'Goal Scored'
fig, axes = pdp_goals.plot(
    center=False,
    plot_lines=True,
    frac_to_plot=100,
    cluster=False,
    n_cluster_centers=None,
    cluster_method='accurate',
    plot_pts_dist=True,
    to_bins=False,
    show_percentile=False,
    which_classes=None,
    figsize=None,
    dpi=300,
    ncols=2,
    plot_params={"pdp_hl": True},
    engine='plotly',
    template='plotly_white')

fig.show()

A few things are worth pointing out when interpreting this plot.

In a centered PDP (`center=True`), the y-axis is interpreted as the change in the prediction relative to the prediction at the baseline, i.e. the leftmost value. With `center=False`, as used above, the y-axis shows the average prediction itself.

From this particular graph you can see that scoring one goal substantially increases the chances of winning "Man of the Match".
But extra goals beyond that show little to no impact on predictions.
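
The difference between an uncentered and a centered curve can be made explicit with a few lines. The sketch below uses scikit-learn's `partial_dependence` function instead of PDPbox, and synthetic data instead of the FIFA data; both are assumptions chosen to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import partial_dependence
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and model (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_tree = DecisionTreeClassifier(random_state=0, max_depth=3).fit(X_demo, y_demo)

# Average prediction per grid value of feature 0 (uncentered curve)
result = partial_dependence(demo_tree, X_demo, features=[0], kind="average", grid_resolution=10)
average = result["average"][0]

# Centering: subtract the value at the leftmost grid point,
# so the curve shows the change relative to that baseline
centered = average - average[0]

print(average.round(3))
print(centered.round(3))
```

Both curves have the same shape; only the reference level of the y-axis differs.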

In [None]:
# Create the pdp data to be plotted
pdp_dist = pdp.PDPIsolate(model=tree_model, df=val_x, model_features=feature_names, feature='Distance Covered (Kms)', feature_name='Distance covered in km')

# plot the PDP for feature 'Distance Covered (Kms)'
fig, axes = pdp_dist.plot(
    center=False,
    plot_lines=True,
    frac_to_plot=100,
    cluster=False,
    n_cluster_centers=None,
    cluster_method='accurate',
    plot_pts_dist=True,
    to_bins=False,
    show_percentile=False,
    which_classes=None,
    figsize=None,
    dpi=300,
    ncols=2,
    plot_params={"pdp_hl": True},
    engine='plotly',
    template='plotly_white')

fig.show()

The ticks on the x-axis of this PDP mark the real data samples.

This PDP seems too simple to represent reality.
Maybe that is because the model is so simple. Print the decision tree and compare this finding with the tree structure.
Let's check this theory by creating the same plot with a Random Forest model.

In [None]:
# Build a new model: Random Forest classifier
rf_model = RandomForestClassifier(random_state=0).fit(train_x, train_y)

In [None]:
# Create the pdp data to be plotted
pdp_dist = pdp.PDPIsolate(model=rf_model, df=val_x, model_features=feature_names, feature='Distance Covered (Kms)', feature_name='Distance covered in km')

# plot the PDP for feature 'Distance Covered (Kms)'
fig, axes = pdp_dist.plot(
    center=False,
    plot_lines=True,
    frac_to_plot=100,
    cluster=False,
    n_cluster_centers=None,
    cluster_method='accurate',
    plot_pts_dist=True,
    to_bins=False,
    show_percentile=False,
    which_classes=None,
    figsize=None,
    dpi=300,
    ncols=2,
    plot_params={"pdp_hl": True},
    engine='plotly',
    template='plotly_white')

fig.show()

<font color=red>Interpretation:</font>
This model states that winning "Man of the Match" is most likely if the players run a total of about 100 km during the match. More running leads to lower predictions.
In general, the smoother shape of this curve seems more plausible than the step function of the decision tree model.
However, this data set is much too small, so one should be very careful when interpreting the model.

## 2D Partial Dependence Plot

Now, the PDP for two features can be plotted by using the **PDPInteract** class and its **plot** method.

First, switch back to the simple decision tree model.

In [None]:
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=1)
tree_model = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_split=5).fit(train_x, train_y)

In [None]:
# Similar to the previous PDP plot.
# However, use PDPInteract instead of PDPIsolate.

# plot PDP for the two features
features_to_plot = ['Goal Scored', 'Distance Covered (Kms)']

pdp_goal_distance = pdp.PDPInteract(model=tree_model, df=val_x, model_features=feature_names, features=features_to_plot, feature_names=["Number of Goals Scored", "Distance covered in km"])

fig, axes = pdp_goal_distance.plot(
    plot_type="grid",
    plot_pdp=True,
    to_bins=True,
    show_percentile=True,
    which_classes=None,
    figsize=None,
    dpi=300,
    ncols=2,
    plot_params=None,
    engine='plotly',
    template='plotly_white'
)
fig.show()

<font color=red>Interpretation:</font>
This **2D PDP** shows predictions for any combination of **Goal Scored** and **Distance Covered (Kms)**.

For example, it seems to yield the highest predictions when a team scores at least one (1) goal and the players run a total distance close to 100 km.
If the players score 0 goals, the covered distance does not matter.

Try to see this by tracing through the decision tree with 0 goals!

But distance can impact predictions if the players score goals.
Make sure you can see this from the 2D PDP.
Can you find this pattern in the decision tree too?
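
The numbers behind such a 2D PDP form a grid of average predictions, one cell per combination of the two feature values. A minimal sketch of this idea, using scikit-learn's `partial_dependence` function rather than PDPbox and synthetic data rather than the FIFA data (both are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import partial_dependence
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and model (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_tree = DecisionTreeClassifier(random_state=0, max_depth=3).fit(X_demo, y_demo)

# Two-way partial dependence of features 0 and 1
result = partial_dependence(demo_tree, X_demo, features=[0, 1], kind="average", grid_resolution=10)
surface = result["average"][0]  # rows: grid of feature 0, columns: grid of feature 1

# Grid cell with the highest average prediction
best_cell = np.unravel_index(surface.argmax(), surface.shape)
print(surface.shape, best_cell)
```

If one row of `surface` is constant, the second feature has no effect for that value of the first feature, which is the pattern described above for 0 goals.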

If you feel motivated to tweak the chart, this tutorial resource might be of value:
https://github.com/SauceCat/PDPbox/blob/master/tutorials/pdpbox_binary_classification.ipynb

# <font color="ce33ff">PART II (Individual Conditional Expectation)</font>

ICE is also a model-agnostic method that can be applied to any model.
In fact, it is basically the same concept as PDP, but it displays the marginal effect of the feature(s)
for each instance instead of the average effect over the whole data set, as the PDP does.
Thus, it can be understood as the equivalent of a PDP for individual data instances.
Visually, an ICE plot displays the dependence of the prediction on a feature for each instance separately,
resulting in one line per instance.
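
The relation between ICE curves and the PDP can be sketched by hand: each row of the matrix below is one ICE curve, and the column-wise mean is the PDP. The helper `manual_ice`, the synthetic data and the model are assumptions made for illustration; this is not how any particular library implements it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data and model (assumption)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
demo_gb = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

def manual_ice(model, X, feature_idx, grid):
    """Return a matrix of shape (n_instances, n_grid_points) (hypothetical helper)."""
    ice = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # vary one feature, keep all others fixed
        ice[:, j] = model.predict_proba(X_mod)[:, 1]
    return ice

grid = np.linspace(X_demo[:, 0].min(), X_demo[:, 0].max(), 8)
ice_curves = manual_ice(demo_gb, X_demo, 0, grid)

# The PDP is the average over all ICE curves
pdp_from_ice = ice_curves.mean(axis=0)
print(ice_curves.shape, pdp_from_ice.round(3))
```

ICE curves that differ strongly in shape hint at interactions that the averaged PDP hides.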

There are multiple packages and libraries that can be used to compute ICE plots.

The PartialDependenceDisplay class in the sklearn.inspection module, the PyCEBox package and the ice_plot function of the H2O package are available.

Let’s take a look at an example in Sklearn’s documentation (https://scikit-learn.org/stable/modules/partial_dependence.html).

## Imports
Import all necessary python utilities.

In [None]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

# make sure you have installed scikit-learn version >= 1.0,
# since the method from_estimator() is not available in previous versions
from sklearn.inspection import PartialDependenceDisplay

In [None]:
# Generate a built-in synthetic data set offered by scikit-learn
# For more information on the data set please refer to
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_hastie_10_2.html

x, y = make_hastie_10_2(random_state=0) # set a seed with random_state

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(x, y)
features = [0, 1] # indices of the first two input features (columns of x)

PartialDependenceDisplay.from_estimator(clf, x, features, kind='individual')

It is evident that, similar to PDPs, ICE curves can be computed only after a model has been trained.

If you specify the parameter kind='both', then the PDP and the ICE curves are plotted on one canvas at the same time.
This is useful for looking at both the average marginal effect and the individual marginal effects at once!

In [None]:
PartialDependenceDisplay.from_estimator(clf, x, features, kind='both')