In [1]:
# Optional: change Jupyter Notebook theme to GDD theme
from IPython.display import HTML
HTML(url='https://gdd.li/jupyter-theme')

![footer_logo](https://marysia.nl/assets/GDD/css/logo.png)

# Feature Importance Permutation

In the previous notebook, we investigated the sensitivity of the model to its features. In other words, we asked what would happen if we kept all other features the same but changed the value of one. In this notebook, we'll dive further into understanding what happens when a prediction is created, this time answering the question: _which variables contribute to this result the most?_

We will consider the same model -- body mass prediction of a penguin, based on its bodily measurements and a few categorical features.


### Outline
1. [Built-in Random Forest Feature Importance](#ranfor)
1. [Scikit-Learn Permutation Feature Importance](#scikit)
1. [Permutation Feature Importance with Dalex](#dalex)
1. [Exercise](#ex1)

![](images/gentoo.jpg)



<a id = 'ranfor'></a>

# 1. Built-in Random Forest Feature Importance


First, let's load in the data! We start out with a slightly simpler model than we've seen before -- one that only takes numeric features as its input. 

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

penguins = (
    pd.read_csv('data/penguins.csv')
    .dropna()
)

# Set features & target.
feature_columns = ['flipper_length_mm', 'bill_length_mm', 'bill_depth_mm']
target = 'body_mass_g'

# Set X and y
X = penguins.loc[:, feature_columns]
y = penguins.loc[:, target]

# Split the dataset. 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

We want to investigate the built-in scikit-learn feature importance functionality. This means we need to choose a model that has `.feature_importances_` as an attribute after it has been fitted, such as a Random Forest regressor.

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
model.score(X_test, y_test)

Not bad. Slightly worse than the models we've seen before, but that is to be expected, as the model has fewer features to learn from.

A fitted Random Forest model in scikit-learn exposes the attribute `feature_importances_`, which we can check out. The trailing underscore `_` is a scikit-learn convention indicating that the attribute is only set once the model has been fitted. Why? Well, it would be impossible to know which features the model deems most important before it has seen any data!
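
To make the fitted-attribute convention concrete, here is a minimal sketch on a tiny toy dataset. The data and the variable names (`X_toy`, `y_toy`, `unfitted`) are invented for illustration and are not part of the penguin example; it assumes that accessing the attribute on an unfitted forest raises scikit-learn's `NotFittedError`.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.exceptions import NotFittedError

# Toy data, invented purely for this illustration.
X_toy = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y_toy = [0.0, 1.0, 2.0, 3.0]

unfitted = RandomForestRegressor(n_estimators=5, random_state=0)

# Before fitting, the attribute cannot be computed.
try:
    unfitted.feature_importances_
    available_before_fit = True
except NotFittedError:
    available_before_fit = False

# After fitting, there is one importance per input feature.
fitted = unfitted.fit(X_toy, y_toy)
available_after_fit = fitted.feature_importances_.shape == (2,)

print(available_before_fit, available_after_fit)
```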

In [None]:
model.feature_importances_

As mentioned previously, these values are the average of the feature importances of the individual trees in the forest.

In [None]:
[tree.feature_importances_ for tree in model]
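
As a quick sanity check of the averaging claim, the sketch below fits a small forest on synthetic data and compares the mean of the per-tree importances with the forest-level attribute. The synthetic dataset and the names `X_demo`, `y_demo`, and `forest` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only for this illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# One row of importances per tree, one column per feature.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(per_tree.shape)
print(np.allclose(averaged, forest.feature_importances_))
```

Each tree's importances sum to one, so their mean does too, which is why the averaged values line up with the forest-level ones.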

Let's wrap the forest's feature importances in a nice pandas DataFrame and plot them, so we can see which importance belongs to which feature.


In [None]:
importance_df = pd.DataFrame(model.feature_importances_,   # the importances
                            columns=['importance'],   # give the df something to sort by later
                            index=X_train.columns.tolist()   # set the index to the feature names
                            )
importance_df

In [None]:
importance_df.sort_values('importance').plot(kind='barh');

<a id = 'scikit'></a>

# 2. Scikit-Learn Permutation Feature Importance

The way of calculating feature importances we used in the previous section is specific to tree-based models such as the Decision Tree and Random Forest, as it is based on the decrease in impurity. It is not a model-agnostic method.

Scikit-learn, however, also provides a model-agnostic way of assessing feature importance called _permutation feature importance_. We will explain how this method works later, but for now let's just show the feature importances it computes.

In [None]:
from sklearn.inspection import permutation_importance

result = permutation_importance(model, 
                               X_test,
                               y_test,
                               n_repeats=10)
result
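
The object returned by `permutation_importance` bundles several arrays. The sketch below, on synthetic data with a linear model (both chosen only for illustration, as are the names `linear_model` and `demo_result`), shows how the fields relate to each other. Using a non-tree model here also hints at the model-agnostic nature of the method.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data and a non-tree model, used only for this illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
linear_model = LinearRegression().fit(X_demo, y_demo)

demo_result = permutation_importance(
    linear_model, X_demo, y_demo, n_repeats=10, random_state=0
)

# Raw importances: one row per feature, one column per repeat.
print(demo_result.importances.shape)
# Per-feature summary over the repeats.
print(demo_result.importances_mean)
print(demo_result.importances_std)
```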

In [None]:
importances_df = pd.DataFrame(result.importances_mean, 
                             columns=['importance'],
                             index=X_test.columns.tolist())
importances_df

In [None]:
importances_df.sort_values('importance').plot(kind='barh');

We have now seen two techniques for computing feature importance, but we have only computed the importance of numerical features.

What if we extend the model with some categorical features? 

In [None]:
# Set features & target.
feature_columns = ['flipper_length_mm', 'bill_length_mm', 'bill_depth_mm', 'sex', 'island', 'species']
target = 'body_mass_g'

# Set X and y
X = penguins.loc[:, feature_columns]
y = penguins.loc[:, target]

# Split the dataset. 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor 
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

ct = ColumnTransformer([
    ('onehot', OneHotEncoder(), ['sex', 'species', 'island']),
], remainder='passthrough')

pipeline_rf = Pipeline([
    ('ct', ct),
    ('model', RandomForestRegressor(random_state=42))
])

pipeline_rf.fit(X_train, y_train)
pipeline_rf.score(X_test, y_test)

As expected, a better model! But let's see how we can grab those feature importances.

In [None]:
feature_importances = pipeline_rf['model'].feature_importances_
feature_importances

Alright, there we have our feature importances... but wait, there are a few more than we expect. Let's check how many features we had originally, and how many importances we have now.


In [None]:
print(f'Number of features: {len(X_train.columns)}')
print(f'Number of feature importances: {len(feature_importances)}')

It seems like we have a few more importances than features. Why is that? 

Well, because the `OneHotEncoder` one-hot encodes all our categorical features, we end up with a separate column for every category of every categorical feature. The model only ever sees these encoded columns, so it reports one importance per encoded column.

In [None]:
feature_names = pipeline_rf['ct'].get_feature_names_out().tolist()
feature_names

In [None]:
importance_df = pd.DataFrame(feature_importances, 
                             columns=['feature importances'], 
                             index=feature_names)
importance_df

In [None]:
importance_df.sort_values('feature importances').plot(kind='barh');

We can also do this in a model-agnostic way with permutation feature importance.

In [None]:
from sklearn.inspection import permutation_importance

result = permutation_importance(pipeline_rf, 
                               X_test,
                               y_test,
                               n_repeats=10)
result

In [None]:
importance_df = pd.DataFrame(result.importances_mean, 
                             columns=['importance'],
                             index=X_test.columns.tolist())
importance_df

In [None]:
importance_df.sort_values('importance').plot(kind='barh');

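Notice that with this technique we do not experience the previous issue: importances are only computed for the original features! This is because the columns are shuffled in the input data, before they enter the pipeline and get one-hot encoded.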

<a id = 'dalex'></a>

# 3. Permutation Feature Importance with Dalex

Permutation Feature Importance is a model-agnostic feature importance calculation method built into scikit-learn itself. However, we can also calculate permutation feature importance with the dalex package.

This requires us to create an Explainer object first.

In [None]:
import dalex 

explainer_rf = dalex.Explainer(pipeline_rf, X_test, y_test)

Now we can call `model_parts` on the explainer to compute the permutation feature importance.

In [None]:
(
    explainer_rf
    .model_parts()
    .plot()
)


<a id = 'ex1'></a>
# 4. Exercise 



<div class="exercise" markdown="1">

### Exercise 1
#### Random Forest feature importances

1. Consider the Random Forest `.feature_importances_` attribute. Can you think of a way to combine the various categories per feature (e.g. Sex=Male and Sex=Female) into a single importance measure?
2. Examine how the impurity-based technique ranks the features in terms of importance. How does this compare to the scikit-learn permutation-based technique? (hint: consider flipper length vs. sex)

</div>


<div class="exercise" markdown="1">

### Exercise 2
#### Permutation Feature Importance

Consider the scikit-learn permutation feature importance & dalex permutation feature importance.
1. Is the ranking of the features, from most to least important, the same for both methods?
1. Are the feature importance values on the same scale?

Permutation Feature Importance calculates the importance of an individual feature. It does this by scoring the model on the original data with some metric (R2, RMSE, etc.) and comparing that score to the score on an altered version of the dataset, created by shuffling all the values in one column. The intuition here is that if the overall performance of the model decreases when the feature values are shuffled, the feature must be important. Likewise, if the overall performance of the model remains roughly constant while the feature values are shuffled, that feature must not be very important.


1. Why do you imagine the values differ per feature between the scikit-learn built-in permutation feature importance and the implementation in dalex? 
1. Can you think of a way to calculate the uncertainty of each feature importance? 
1. Can you think of any reason why this method may not reflect the true importance? 
1. Here we show the feature importances based on the _test_ data. Can you make an argument for and against calculating this on the test data (compared to the train data)?
1. An alternative is _drop feature importance_. Can you imagine what this method does? Why would we prefer permutation feature importance? 


**Bonus**: Implement your own version of permutation feature importance.

**Bonus**: Extend it with a measure of uncertainty!

</div>

In [None]:
# Your (bonus) code here!

-----