[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mahmouddraz/xai/blob/main/notebooks/permutation_importance/XAI_Permutation_Importance_exercise.ipynb)

# Permutation Importance

## 1. Basic Idea

The main idea behind permutation importance is to shuffle the values of one feature column and then measure the performance of a pretrained model on the permuted dataset. The performance will usually decrease, and the size of the drop relative to the baseline indicates how important the feature is.

The method only needs the model's predictions, not its internals. Hence, it is model agnostic.

Further information can be found at https://arxiv.org/pdf/1801.01489.pdf
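To make the effect of shuffling concrete, here is a minimal sketch on synthetic data (the arrays `feature` and `target` are made up for illustration and are not part of the stroke dataset). It shows that a permuted column keeps exactly the same values, while its relationship to the target is destroyed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature that strongly determines a hypothetical target
feature = rng.normal(size=1000)
target = 2.0 * feature + rng.normal(scale=0.1, size=1000)

# Shuffling reorders the values but does not change them
shuffled = rng.permutation(feature)

same_values = np.array_equal(np.sort(feature), np.sort(shuffled))
corr_before = np.corrcoef(feature, target)[0, 1]
corr_after = np.corrcoef(shuffled, target)[0, 1]

print(f"Same set of values after shuffling: {same_values}")
print(f"Correlation with target before shuffling: {corr_before:.3f}")
print(f"Correlation with target after shuffling:  {corr_after:.3f}")
```

A model that relied on this feature would therefore see realistic values that no longer carry information about the target, which is why its performance drops.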

## 2. Algorithm

1. Take a model that was fit to the training set.
2. Evaluate the model on a validation dataset and take that score as the baseline performance.
3. For each feature j:

  a. Shuffle all the values of column j in the original dataset (the other columns and the labels stay fixed).

  b. Record the performance of the original model on the shuffled dataset.

  c. Compute the feature importance as the absolute difference between the baseline performance and the performance on the shuffled dataset.

Repeat a - c a large number of times and average over all trials.
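The steps above can be sketched with NumPy alone. This is one possible reading of the algorithm, not a reference implementation: the helper names `permutation_importance_sketch`, `accuracy`, and `toy_predict` are hypothetical, and the toy "model" is a fixed rule that only looks at the first column, so the expected outcome is easy to reason about.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, metric, n_repeats=50, seed=0):
    """Return one averaged importance score per column of X."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))                 # step 2: baseline performance
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):                      # step 3: loop over features
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()                        # keep the original data intact
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # 3a: shuffle column j only
            score = metric(y, predict(X_perm))       # 3b: performance on shuffled data
            drops.append(abs(baseline - score))      # 3c: absolute difference
        importances[j] = np.mean(drops)              # average over all trials
    return importances

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# Toy data: the label depends only on the first column
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

def toy_predict(X):
    # Stand-in for a fitted model: uses column 0 and ignores the rest
    return (X[:, 0] > 0).astype(int)

importances = permutation_importance_sketch(toy_predict, X, y, accuracy, n_repeats=20)
print(importances)
```

Because `toy_predict` ignores columns 1 and 2, shuffling them leaves the predictions unchanged and their importance is exactly zero, while shuffling column 0 reduces accuracy to roughly chance level.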

## 3. Permutation Importance applied to a stroke dataset using a Random Forest

In [None]:
#need newer matplotlib version for nice bar plots
#!pip install matplotlib==3.5.2

In [None]:
! git clone https://gist.github.com/aishwarya8615/d2107f828d3f904839cbcb7eaa85bd04 'stroke'
! git clone https://github.com/mahmouddraz/xai 'xai_workshop'

Import the required libraries, then load the dataset and the pretrained Random Forest classifier.

In [None]:
# Imports
import sys 
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tabulate import tabulate
import pickle
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [None]:
#Read data
data = pd.read_csv('/content/stroke/healthcare-dataset-stroke-data.csv')

#Print the data
print(f'We have {len(data)} datapoints. The first ten data points are:')
print(data.iloc[0:10,:].to_markdown()) 

# Load pretrained model and data
with open('/content/xai_workshop/pretrained_models/rf_stroke/rf_stroke.pickle', "rb") as f:
  rf = pickle.load(f)
with open('/content/xai_workshop/pretrained_models/rf_stroke/rf_stroke_X_train.pickle', "rb") as f:
  X_train = pickle.load(f)
with open('/content/xai_workshop/pretrained_models/rf_stroke/rf_stroke_X_test.pickle', "rb") as f:
  X_test = pickle.load(f)
with open('/content/xai_workshop/pretrained_models/rf_stroke/rf_stroke_y_test.pickle', "rb") as f:
  y_test = pickle.load(f)

Use the pretrained random forest to generate predictions and baseline accuracy.

In [None]:
y_pred = rf.predict(X_test)

# Accuracy on the unshuffled test set serves as the baseline
baseline_performance = accuracy_score(y_test, y_pred)
print(f"The trained classifier has an accuracy of {baseline_performance}.")

Now we apply permutation importance.

In [None]:
import numpy as np

runs_per_feature = 50

feature_and_importance = {}

# Iterate over all features
for fx, feature in enumerate(X_test):

  ################## IMPLEMENTATION OF PERMUTATION IMPORTANCE ##################
  pass  # placeholder so the cell parses; replace with your implementation

Finally, we plot the results.

In [None]:
fig, ax = plt.subplots(figsize=[12, 8])
bars = ax.barh(np.arange(len(feature_and_importance)), feature_and_importance.values())

ax.bar_label(bars, labels=feature_and_importance.keys())
ax.set_yticklabels([])
plt.show()
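As a cross-check for a hand-written implementation, scikit-learn provides `sklearn.inspection.permutation_importance`. The sketch below runs it on synthetic data from `make_classification` rather than on the stroke dataset; the dataset size, `n_repeats=10`, and all variable names are illustrative assumptions. Note that, as far as I know, scikit-learn reports the signed drop (baseline score minus permuted score) rather than the absolute difference used in the algorithm above, so slightly negative values can appear for unimportant features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular dataset (not the stroke data)
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each column n_repeats times and score with accuracy
result = permutation_importance(
    model, X_test, y_test, scoring="accuracy", n_repeats=10, random_state=0
)

for j, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {j}: {mean:.3f} +/- {std:.3f}")
```

The `importances_std` values give a rough sense of how much the estimate varies between shuffles.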

## 4. Exercises

Answer the following questions:

1. What happens if we increase `runs_per_feature`? What happens if we decrease it? Explain your reasoning.
2. What changes if we just set one feature to zero instead of shuffling the values?
3. What happens if we use the same method for a neural network? Please check the code below:

In [None]:
from tensorflow import keras 
import tensorflow as tf 
from tensorflow.keras import layers, models

In [None]:
with open('/content/xai_workshop/pretrained_models/nn_stroke/nn_stroke_X_train.pickle', "rb") as f:
  X_train = pickle.load(f)
with open('/content/xai_workshop/pretrained_models/nn_stroke/nn_stroke_X_test.pickle', "rb") as f:
  X_test = pickle.load(f)
with open('/content/xai_workshop/pretrained_models/nn_stroke/nn_stroke_y_test.pickle', "rb") as f:
  y_test = pickle.load(f)

nn = keras.models.load_model('/content/xai_workshop/pretrained_models/nn_stroke/nn_stroke')

In [None]:
#Compute the baseline performance
y_pred = nn.predict(X_test)

baseline_performance = accuracy_score(y_test, np.round(abs(y_pred)), normalize=True)
print(f"The trained classifier has an accuracy of {baseline_performance}.")
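Unlike the random forest, a network typically returns continuous scores, so they have to be converted to class labels before computing accuracy. The following sketch assumes the network emits a single sigmoid probability per row with shape `(n, 1)`, which the notebook does not confirm; the arrays `y_prob` and `y_true` are made-up values for illustration.

```python
import numpy as np

# Hypothetical network output: one probability per sample, shape (n, 1)
y_prob = np.array([[0.08], [0.91], [0.47], [0.50]])
y_true = np.array([0, 1, 1, 0])

# Flatten to shape (n,) and apply an explicit 0.5 decision threshold
y_label = (y_prob.ravel() > 0.5).astype(int)

accuracy = float(np.mean(y_label == y_true))
print(y_label, accuracy)
```

An explicit threshold makes the decision rule visible; the same conversion has to be applied to the predictions on every shuffled dataset so that the scores stay comparable to the baseline.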

Again, we implement permutation importance, this time for the trained network.

In [None]:
runs_per_feature = 50

################## IMPLEMENTATION OF PERMUTATION IMPORTANCE ##################

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=[12, 8])
bars = ax.barh(np.arange(len(feature_and_importance)), feature_and_importance.values())

ax.bar_label(bars, labels=feature_and_importance.keys())
ax.set_yticklabels([])
plt.show()