In [1]:
# Optional: change Jupyter Notebook theme to GDD theme
from IPython.display import HTML
HTML(url='https://gdd.li/jupyter-theme')

![footer_logo](https://marysia.nl/assets/GDD/css/logo.png)

# Break-Down Plots

In the previous notebook, we investigated how sensitive the model is to its features: what happens to a prediction if we change the value of one feature while keeping all others the same? In this notebook, we dive further into how a single prediction comes about, this time answering the question: _which variables contribute to this result the most?_

We will consider the same model -- body mass prediction of a penguin, based on its bodily measurements and a few categorical features.

### Outline
1. [Create the model](#model)
1. [Prediction Break-Down Plots](#breakdown)
1. [Exercise](#ex1)

![](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/logo.png)

<a id = 'model'></a>

# 1. Create the Model

We start by loading the data again. This is the same as in the previous notebook -- we load the penguins dataset, set our feature matrix `X` and target `y`, and split the data into a train and a test set. Notice that by setting the random state, we ensure we have the same train/test split in every notebook!

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read in the data again - do *not* drop Adélie! 
penguins = (
    pd.read_csv('data/penguins.csv')
    .dropna()
)

# Set features & target.
feature_columns = ['flipper_length_mm', 'bill_length_mm', 'bill_depth_mm', 'sex', 'island', 'species']
target = 'body_mass_g'

# Set X and y
X = penguins.loc[:, feature_columns]
y = penguins.loc[:, target]

# Split the dataset. 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
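
As a small aside on the random state: the sketch below uses a made-up list of integers instead of the penguins data (the `values` list and all variable names are illustrative assumptions). It suggests that two calls with the same `random_state` return identical splits, which is what keeps the notebooks in sync.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; any indexable collection would do.
values = list(range(10))

# Two independent calls with the same seed.
first_train, first_test = train_test_split(values, test_size=0.3, random_state=42)
second_train, second_test = train_test_split(values, test_size=0.3, random_state=42)

# The same seed yields the same partition, in the same order.
print(first_test == second_test)
print(first_train == second_train)
```

Leaving `random_state` unset would instead draw a new split on every run, so results could differ between notebooks.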

We once again create our random forest regressor model. 

In [None]:
from sklearn.ensemble import RandomForestRegressor 
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

ct = ColumnTransformer([
    ('onehot', OneHotEncoder(), ['sex', 'species', 'island']),
], remainder='passthrough')

pipeline_rf = Pipeline([
    ('ct', ct),
    ('model', RandomForestRegressor())
])

pipeline_rf.fit(X_train, y_train)
pipeline_rf.score(X_test, y_test)

<a id = 'breakdown'></a>
# 2. Prediction Break-Down Plots

Let us pick a single row from the test set to explain. Again, feel free to choose a different data point from the test set!

In [None]:
n = 230  # index label of the row to explain; it must be present in X_test
row = X_test.loc[[n]]
row

We once again create our explainer. This time, we are a little more explicit in our variable naming -- both our pipeline and our explainer have `_rf` as a suffix to indicate that they refer to our random forest model.

In [None]:
import dalex 

explainer_rf = dalex.Explainer(pipeline_rf, X_train, y_train)

This is where things change -- whereas previously we used `.predict_profile` to compute the ceteris paribus profile, in this case we use `.predict_parts`. It takes a few arguments:
* `new_observation`: the observation whose prediction we would like to understand. It was not part of the original train set.
* `type`: the options are 'break_down_interactions', 'break_down', 'shap' and 'shap_wrapper'. The default is 'break_down_interactions', though we will set it to 'break_down'.
* `label`: this is simply what will appear at the top of the plot.

In [None]:
row_explanation = explainer_rf.predict_parts(
    row,
    type="break_down",
    label=f"row {n}",
)


row_explanation.plot()
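
To build some intuition for the numbers in such a plot, here is a minimal sketch of a break-down style calculation on a toy problem. It is a simplified illustration, not dalex's implementation: `toy_model`, `break_down`, the `background` rows and the `observation` are all hypothetical names and values. The idea is to start from the mean prediction over the background data, then fix one feature at a time to the observation's value and record how the mean prediction shifts.

```python
from statistics import mean


def toy_model(row):
    """Hypothetical stand-in for a fitted model's predict function."""
    return 10.0 + 2.0 * row["a"] + 3.0 * row["b"]


def break_down(model, background, observation, order):
    """Fix features one by one (in `order`) and track the mean prediction."""
    rows = [dict(r) for r in background]  # copy, so the input stays untouched
    baseline = mean(model(r) for r in rows)
    previous = baseline
    contributions = {}
    for feature in order:
        # Set this feature to the observation's value in every background row.
        for r in rows:
            r[feature] = observation[feature]
        current = mean(model(r) for r in rows)
        contributions[feature] = current - previous
        previous = current
    return baseline, contributions


# Made-up background data and observation to explain.
background = [{"a": 0.0, "b": 0.0}, {"a": 2.0, "b": 4.0}, {"a": 4.0, "b": 2.0}]
observation = {"a": 3.0, "b": 1.0}

baseline, contributions = break_down(toy_model, background, observation, order=["a", "b"])
prediction = toy_model(observation)

print("baseline:", baseline)
print("contributions:", contributions)
print("baseline + contributions:", baseline + sum(contributions.values()))
print("prediction:", prediction)
```

In this sketch the baseline plus all contributions adds up to the model's prediction for the observation, which mirrors how the bars in a break-down plot lead from the average prediction to the final one.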

<a id = 'ex1'></a>
# 3. Exercise 

<div class="exercise" markdown="1">

### Exercise 
#### Analyse Prediction Break Down

Try to answer the following questions: 

1. **Plot analysis**: What does the color green in the plot indicate? What does the color red in the plot indicate? What does the color blue in the plot indicate?
1. **Order dependence**: Copy the cell that creates the row explanation. In the copied cell, add `order=X_train.columns.tolist()` as an argument. What changes about the plot? Why? Are the values the same for each feature, or have the values changed? 
1. **Linear regression**: Create a break-down plot for a linear regression model for the same `n`. How does it differ from the random forest explanation?
1. **Order dependence for linear regression**: Copy the cell that creates the linear regression explanation. In the copied cell, add `order=X_train.columns.tolist()` as an argument. Does the plot change in the way you expect? Are the values the same for each feature, or have the values changed?
1. **Advantages & Disadvantages**: How does the break-down plot compare to the ceteris paribus plot seen before? What questions can you answer with this plot that you could not answer with the ceteris paribus plot? What can you explain with the ceteris paribus plot that you cannot explain with the break-down plot? In summary, what are the respective advantages and disadvantages?

**Bonus**: Why do you think order dependence matters for a random forest, but not for linear regression?

**Bonus**: Can you explain how the break-down is calculated? Can you reproduce the first three numbers of the explanation? (4192.021, +741.452, +210.558 for n = 230, but you may choose a different value for n)

</div>

In [None]:
# The Linear Regression pipeline
from sklearn.linear_model import LinearRegression 

ct = ColumnTransformer([
    ('onehot', OneHotEncoder(), ['sex', 'species', 'island']),
], remainder='passthrough')

pipeline_lr = Pipeline([
    ('ct', ct),
    ('model', LinearRegression())
])

pipeline_lr.fit(X_train, y_train)
pipeline_lr.score(X_test, y_test)

In [None]:
# Create your break-down plot for linear regression here.


In [None]:
# Create your break-down plot for linear regression with adjusted order variable here.


------