In [1]:
# Optional: change Jupyter Notebook theme to GDD theme
from IPython.core.display import HTML
HTML(url='https://gdd.li/jupyter-theme')

![footer_logo](https://marysia.nl/assets/GDD/css/logo.png)

# Partial Dependence Plots

![footer_logo](https://pbs.twimg.com/profile_images/726045460331401216/izaV7jmy_400x400.jpg)

The last explainability method we will cover in this workshop is _Partial Dependence Plots_. Together with the other methods (_Ceteris Paribus_, _Prediction Break-Down_ and _Permutation Feature Importance_), this completes the grid of local and global methods for feature importance and feature sensitivity.

|        | **Feature Importance**             | **Sensitivity**             |
|--------|--------------------------------|--------------------------|
| **Local**  | Prediction Break-Down          | Ceteris Paribus          |
| **Global** | Permutation Feature Importance | _Partial Dependence Plots_ |

Partial Dependence Plots are therefore a _global_ method for determining feature _sensitivity_. As a reminder of how feature sensitivity helps explain a model, we will first revisit the Ceteris Paribus plots for the Penguins body mass regression model. Then, we will extend this to Partial Dependence Plots. 

### Outline
1. [Revisit: Ceteris Paribus plots](#cetpar)
1. [Partial Dependence Plots](#partplot)
1. [Exercise](#ex2)


<a id = 'cetpar'></a>

# 1. Revisit: Ceteris Paribus plots

Let us revisit the Ceteris Paribus plots from the earlier notebook. To do that, we once again start by loading the data and creating the model.

In [None]:
import pandas as pd 
from sklearn.model_selection import train_test_split

# Read in the data again - do *not* drop Adélie! 
penguins = (
    pd.read_csv('data/penguins.csv')
    .dropna()
)

# Set features & target.
feature_columns = ['flipper_length_mm', 'bill_length_mm', 'bill_depth_mm', 'sex', 'island', 'species']
target = 'body_mass_g'

# Set X and y
X = penguins.loc[:, feature_columns]
y = penguins.loc[:, target]

# Split the dataset. 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor 
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

ct = ColumnTransformer([
    ('onehot', OneHotEncoder(), ['sex', 'species', 'island']),
], remainder='passthrough')

pipeline = Pipeline([
    ('ct', ct),
    ('model', RandomForestRegressor(random_state=42))
])

pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

In order to create the Ceteris Paribus plots, the model needs to be wrapped in an Explainer object.

In [None]:
import dalex 

explainer = dalex.Explainer(pipeline, X_train, y_train)

Once we have an explainer, we can pick a row to explain.

In [None]:
# Index label of the test-set row we want to explain.
n = 230
row = X_test.loc[[n]]
row

In [None]:
profile = explainer.predict_profile(row)
profile.plot()

This is the plot we've seen before. It was created by passing _specifically_ the row in question to the `.predict_profile` method. However, what would happen if we passed more than one row as an argument? Maybe even the whole test set?

In [None]:
profile = explainer.predict_profile(X_test)
profile.plot()

As you can see, this creates a Ceteris Paribus plot for every single datapoint in the dataset! The plots themselves become barely readable. 
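
Each of those lines is built the same way: take one observation, vary a single feature over a grid of values, keep every other feature fixed, and record the model's prediction at each grid value. The block below is a minimal sketch of that idea, not dalex's implementation. It uses synthetic data from scikit-learn rather than the penguins, and the names `X_demo`, `model_demo` and `ceteris_paribus_profile` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: this is NOT the penguins dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
model_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)


def ceteris_paribus_profile(model, observation, feature_idx, grid):
    """Predict for copies of one observation in which only one feature varies."""
    # One copy of the observation per grid value.
    copies = np.tile(observation, (len(grid), 1))
    # Overwrite the chosen feature; all other features stay as they were.
    copies[:, feature_idx] = grid
    return model.predict(copies)


# Grid spanning the observed range of feature 0.
grid = np.linspace(X_demo[:, 0].min(), X_demo[:, 0].max(), 20)

# One Ceteris Paribus line: the first observation, varying feature 0 only.
cp_line = ceteris_paribus_profile(model_demo, X_demo[0], feature_idx=0, grid=grid)
```

Calling this once per row of a dataset yields one line per row, which is why the plot above gets crowded so quickly.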

<a id = 'partplot'></a>
# 2. Partial Dependence Plots
Let us try a different method: `.model_profile`.

In [None]:
profile = explainer.model_profile()
profile.plot()

When we plot the model profile, it shows only a single line per feature. But which data point does that line describe? The explainer holds the entire training set, after all...

Let's add a `groups` argument to the `model_profile` method and a `geom` argument to the plot method.

In [None]:
profile = explainer.model_profile(groups='island')
profile.plot(geom='profiles')
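
One way to read these plots: a partial dependence curve can be computed as the average of many Ceteris Paribus curves, and a grouped curve as the same average taken within each group. The block below is a hand-rolled sketch of that calculation under stated assumptions. It uses synthetic data rather than the penguins, the names `partial_dependence_by_hand` and `group_labels` are hypothetical, and the grouping is an arbitrary split invented for the example. dalex may subsample rows and pick its own grid, so its numbers need not match a version like this one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: this is NOT the penguins dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
model_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)


def partial_dependence_by_hand(model, X, feature_idx, grid):
    """For each grid value, set the feature to that value in every row and average the predictions."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # same value for all rows, other features untouched
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)


grid = np.linspace(X_demo[:, 0].min(), X_demo[:, 0].max(), 20)

# One curve for the whole dataset.
pd_all = partial_dependence_by_hand(model_demo, X_demo, 0, grid)

# Hypothetical grouping: split rows on the sign of feature 1.
group_labels = np.where(X_demo[:, 1] > 0, "A", "B")
pd_by_group = {
    g: partial_dependence_by_hand(model_demo, X_demo[group_labels == g], 0, grid)
    for g in ("A", "B")
}
```

Because every row belongs to exactly one group here, the overall curve is the size-weighted average of the group curves. Group curves that differ in shape, rather than only in height, hint that the model has learned an interaction between the grouping variable and the plotted feature.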

<a id = 'ex2'></a>
# 3. Exercise
<div class="exercise" markdown="1">

### Exercise 
#### Partial Dependence Plots

Analyse the Partial Dependence Plots. 

1. Add an argument `groups='species'` to the `model_profile` method. Also try `'sex'` and `'island'`. What happens? Notice anything interesting?
1. Can you explain in your own words how these plots are calculated?  
1. Can you think of a way these plots can help you with feature engineering? 
1. Can you think of the advantages and disadvantages of this approach? 

</div>


In [None]:
# Code block for question 1. 