# Introduction


**What?** Introduction to permutation importance



# Permutation importance


- Permutation importance answers the question: **Which features have the biggest impact on predictions?**
- It does NOT answer the question: **How would the prediction change if I changed a feature's value by some amount x?**
- It is one of several ways to measure **feature importance**.
- The idea: if we randomly shuffle a single column of the validation data, leaving the target and all other columns in place, how much does that reduce the accuracy of predictions on the now-shuffled data?

![Shuffle](https://i.imgur.com/h17tMUU.png)

- Randomly re-ordering a single column should make predictions less accurate, since the resulting data no longer corresponds to anything observed in the real world. Accuracy suffers most when we shuffle a column that the model relied on heavily. In the example pictured above, shuffling `height at age 10` would cause terrible predictions. If we shuffled `socks owned` instead, the predictions wouldn't suffer nearly as much.
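The shuffle-and-rescore idea can be sketched by hand. This is a minimal illustration, not the notebook's method: the data is synthetic, the column values are made up, and `predict` is a hypothetical fixed rule standing in for a trained model that relies only on the first column.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the two columns named above (values are made up).
height_at_10 = rng.normal(140, 7, n)                 # cm
socks_owned = rng.integers(0, 20, n).astype(float)
X = np.column_stack([height_at_10, socks_owned])

# Target: "tall at age 20", driven by height at age 10 plus a little noise.
y = (height_at_10 + rng.normal(0, 2, n)) > 140


def predict(X):
    """Hypothetical fixed model: it looks at column 0 only."""
    return X[:, 0] > 140


def accuracy(X, y):
    return float(np.mean(predict(X) == y))


def accuracy_after_shuffle(X, y, col, rng):
    """Shuffle one column, leave everything else in place, and re-score."""
    X_shuffled = X.copy()
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    return accuracy(X_shuffled, y)


baseline = accuracy(X, y)
drop_height = baseline - accuracy_after_shuffle(X, y, 0, rng)
drop_socks = baseline - accuracy_after_shuffle(X, y, 1, rng)

print(f"baseline accuracy: {baseline:.3f}")
print(f"drop when shuffling height at age 10: {drop_height:.3f}")
print(f"drop when shuffling socks owned: {drop_socks:.3f}")
```

Because the stand-in model never reads the socks column, shuffling it leaves every prediction unchanged, while shuffling the height column pushes accuracy towards chance.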



# Import modules

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import eli5
from eli5.sklearn import PermutationImportance

# Import the dataset


- Our example uses a model that predicts whether a soccer/football team will have the "Man of the Game" winner, based on the team's statistics.  
- The "Man of the Game" award goes to the best player in the game; the dataset records it in the `Man of the Match` column.  



In [3]:
data = pd.read_csv('../DATASETS/FIFA_2018_Statistics.csv')
y = (data['Man of the Match'] == "Yes")  # Convert from string "Yes"/"No" to binary
# Keep only the integer-valued columns as features (drops text and float columns)
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
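To see what the target conversion and the integer-only filter do, here is a small sketch on a toy frame. Only `Goal Scored` and `Man of the Match` are names taken from the cell above; the other columns and all values are hypothetical. `select_dtypes` is shown as an equivalent way to express the same filter.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, mimicking a mix of column types.
toy = pd.DataFrame({
    "Goal Scored": np.array([2, 0, 1], dtype=np.int64),
    "Some Ratio": [0.5, np.nan, 0.25],      # float column (hypothetical)
    "Team": ["A", "B", "C"],                # text column (hypothetical)
    "Man of the Match": ["Yes", "No", "No"],
})

# "Yes"/"No" strings become a boolean target.
y = toy["Man of the Match"] == "Yes"

# Same filter as in the notebook cell: keep int64 columns only.
feature_names = [c for c in toy.columns if toy[c].dtype in [np.int64]]

# Equivalent selection using pandas' built-in dtype filter.
same_via_select = toy.select_dtypes(include="int64").columns.tolist()

print(feature_names)
print(y.tolist())
```

Note that the filter silently drops float columns, so any statistic stored as a float never reaches the model.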

# Build the model


- Model-building isn't our current focus, so the cell below simply builds a rudimentary model on the training split.



In [4]:
my_model = RandomForestClassifier(n_estimators=100,
                                  random_state=0).fit(train_X, train_y)

# Computing permutation importance


- We'll use a library called **eli5**.
- In the output below, features are sorted by importance: those at the top matter most, and those towards the bottom matter least.
    - The first number in each row shows how much model performance decreased after a random shuffle (in this case, using "accuracy" as the performance metric). Since shuffling always involves a bit of randomness, the number after the **±** measures how much performance varied from one reshuffling to the next.
- Treat values **lower than zero** as noise rather than as evidence that a feature hurts the model.
    - You'll occasionally see negative values for permutation importances. In those cases, the predictions on the shuffled (or noisy) data happened to be more accurate than those on the real data. This happens when the feature didn't matter (its importance should have been close to 0), but random chance made the predictions on shuffled data slightly more accurate. This is more common with small datasets, like the one in this example, **because** there is more room for luck/chance.
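The "weight ± spread" reading can be reproduced by repeating the shuffle several times and summarising the accuracy drops. The sketch below is an illustration under assumptions, not eli5's implementation: the data is synthetic, `predict` is a hypothetical fixed rule, `n_iter` is a name chosen here for the number of reshuffles, and the spread is reported as a plain standard deviation, which may differ from the convention eli5 uses for its ± figure.

```python
import numpy as np


def permutation_importance_manual(predict, X, y, n_iter=5, seed=0):
    """Return (mean, std) of the accuracy drop per column over n_iter shuffles."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    drops = np.empty((n_iter, X.shape[1]))
    for i in range(n_iter):
        for col in range(X.shape[1]):
            X_shuffled = X.copy()
            X_shuffled[:, col] = rng.permutation(X[:, col])
            drops[i, col] = baseline - np.mean(predict(X_shuffled) == y)
    return drops.mean(axis=0), drops.std(axis=0)


# Small synthetic validation set: one informative column, one irrelevant one.
rng = np.random.default_rng(1)
n = 40
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([signal, noise])
y = signal + 0.3 * rng.normal(size=n) > 0


def predict(X):
    """Hypothetical model that leans on the signal and slightly on the noise."""
    return X[:, 0] + 0.1 * X[:, 1] > 0


mean_drop, std_drop = permutation_importance_manual(predict, X, y, n_iter=5)
for name, m, s in zip(["signal", "noise"], mean_drop, std_drop):
    print(f"{name}: {m:.4f} ± {s:.4f}")
```

With only 40 rows, the importance of the irrelevant column hovers around zero and may come out slightly negative on some seeds, which is the small-dataset effect described above.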



In [5]:
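# Shuffle each validation column in turn and measure the drop in accuracy
perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
# Display the importances as a table, labelled with the column names
eli5.show_weights(perm, feature_names=val_X.columns.tolist())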

| Weight | Feature |
|---|---|
| 0.1750 ± 0.0848 | Goal Scored |
| 0.0500 ± 0.0637 | Distance Covered (Kms) |
| 0.0437 ± 0.0637 | Yellow Card |
| 0.0187 ± 0.0500 | Off-Target |
| 0.0187 ± 0.0637 | Free Kicks |
| 0.0187 ± 0.0637 | Fouls Committed |
| 0.0125 ± 0.0637 | Pass Accuracy % |
| 0.0125 ± 0.0306 | Blocked |
| 0.0063 ± 0.0612 | Saves |
| 0.0063 ± 0.0250 | Ball Possession % |
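If eli5 is not available, scikit-learn ships its own `permutation_importance` function in `sklearn.inspection`. The sketch below swaps that function in for eli5 and uses synthetic stand-in data from `make_classification` rather than the FIFA dataset, so its numbers will not match the output above; the `feature_i` labels are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: with shuffle=False the 2 informative columns
# come first, followed by 3 pure-noise columns.
X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

model = RandomForestClassifier(n_estimators=100,
                               random_state=0).fit(train_X, train_y)

# Shuffle each validation column n_repeats times and record the accuracy drop.
result = permutation_importance(model, val_X, val_y, scoring="accuracy",
                                n_repeats=5, random_state=1)

# Print features from most to least important.
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.4f}"
          f" ± {result.importances_std[i]:.4f}")
```

The returned object also exposes the raw per-repeat drops in `result.importances`, which helps when judging whether a small or negative mean is just noise.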


# References


- [Permutation importance (Kaggle notebook)](https://www.kaggle.com/dansbecker/permutation-importance)
- [Dataset](https://www.kaggle.com/mathan/fifa-2018-match-statistics)

