<a href="https://colab.research.google.com/github/meetnaren/permutation-importance/blob/master/Permutation_importance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Permutation Importance
This notebook explains how to determine feature importance through permutation and how the results can sometimes differ from the feature importances given by tree-based algorithms such as random forest.

The data used here is the credit card default dataset from the UCI Machine Learning Repository, hosted on Kaggle: https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset

## 1. Setting up the environment
Importing all the required packages.

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import random
import progressbar
from google.colab import files

## 2. Kaggle setup
The two cells below use the Kaggle API to download the data and unzip it for our analysis. Please make sure you have your 'kaggle.json' file ready to upload when the first cell runs.

In [0]:
!pip install kaggle --upgrade -q
!mkdir -p ~/.kaggle

files.upload()

In [0]:
!mv kaggle.json ~/.kaggle/kaggle.json
!kaggle datasets download -d uciml/default-of-credit-card-clients-dataset
!unzip default-of-credit-card-clients-dataset.zip

credit=pd.read_csv('UCI_Credit_Card.csv').drop(columns=['ID'])

In [5]:
credit.head()

,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
0,20000.0,2,2,1,24,2,2,-1,-1,-2,-2,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1
1,120000.0,2,2,2,26,-1,2,0,0,0,2,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1
2,90000.0,2,2,2,34,0,0,0,0,0,0,29239.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0
3,50000.0,2,2,1,37,0,0,0,0,0,0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0
4,50000.0,1,2,1,57,-1,0,-1,0,0,0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0


## 3. Fitting a Random Forest classifier
Let us one-hot encode the categorical variables, split the dataset and fit a random forest classifier.

In [0]:
cats=['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'] #categorical columns
for c in cats:
    credit[c]=credit[c].astype('category')

credit_ohe=pd.get_dummies(credit, drop_first=True) #One-hot encoding the categorical variables

y=credit_ohe['default.payment.next.month']
X=credit_ohe.drop(columns=['default.payment.next.month'])
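
To illustrate what `drop_first=True` does, here is a minimal sketch on a toy frame. The values are made up for illustration and `toy` / `toy_ohe` are hypothetical names, not part of the notebook. A categorical column with k levels becomes k-1 indicator columns, with the first level acting as the implicit reference.

```python
import pandas as pd

# Toy frame (made-up values) with one numeric and one categorical column.
toy = pd.DataFrame({
    'LIMIT_BAL': [20000.0, 120000.0, 90000.0],
    'SEX': [2, 1, 2],
})
toy['SEX'] = toy['SEX'].astype('category')

# drop_first=True drops the indicator for the first level (SEX=1),
# so only SEX_2 remains; numeric columns pass through unchanged.
toy_ohe = pd.get_dummies(toy, drop_first=True)
print(toy_ohe.columns.tolist())
```

The same pattern applies to the `PAY_*` columns above, where each distinct repayment status becomes its own indicator column.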

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

In [69]:
rf=RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Let us see how our model performs against the training and test data sets.

In [70]:
print('Training set metrics:')
print('Accuracy:', accuracy_score(y_train, rf.predict(X_train)))
print('Precision:', precision_score(y_train, rf.predict(X_train)))
print('Recall:', recall_score(y_train, rf.predict(X_train)))
print('F1:', f1_score(y_train, rf.predict(X_train)))
print('---------------')
print('Test set metrics:')
print('Accuracy:', accuracy_score(y_test, rf.predict(X_test)))
print('Precision:', precision_score(y_test, rf.predict(X_test)))
print('Recall:', recall_score(y_test, rf.predict(X_test)))
print('F1:', f1_score(y_test, rf.predict(X_test)))

Training set metrics:
Accuracy: 0.9992888888888889
Precision: 0.9993960136903564
Recall: 0.9973879847297569
F1: 0.998390989541432
---------------
Test set metrics:
Accuracy: 0.8133333333333334
Precision: 0.6361724500525763
Recall: 0.36467751657625075
F1: 0.46360153256704983
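
As a quick sanity check on how these numbers fit together, F1 is the harmonic mean of precision and recall. The sketch below recomputes the test-set F1 from the precision and recall printed above, using plain arithmetic rather than scikit-learn.

```python
# Test-set precision and recall, copied from the output above.
precision = 0.6361724500525763
recall = 0.36467751657625075

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```

Because the harmonic mean is pulled toward the smaller value, the low recall keeps F1 well below the precision.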


The model is clearly overfitting: it is almost perfect on the training set (recall of about 0.997), but recall drops to about 0.36 on the test set.

## 4. Random Forest feature importances
Let us see which features were deemed important by the model by plotting the feature importances.

In [0]:
col_sorted_by_importance=rf.feature_importances_.argsort()
feat_imp=pd.DataFrame({
    'cols':X.columns[col_sorted_by_importance],
    'imps':rf.feature_importances_[col_sorted_by_importance]
})

In [0]:
!pip install plotly_express --upgrade -q

In [75]:
import plotly_express as px
import plotly.offline as po
px.bar(feat_imp.sort_values(['imps'], ascending=False)[:25], x='cols', y='imps', labels={'cols':'column', 'imps':'feature importance'})

Surprisingly, the model ranks 'AGE' as the most important feature for predicting credit card default. Let us check this ranking by computing the feature importances through permutation.
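
One plausible reason for a high ranking like this is that impurity-based importances (what `feature_importances_` reports) tend to favor features with many distinct values, since such features offer more candidate split points. The sketch below is an illustration on synthetic data, not on the credit data: it builds one informative binary feature and two pure-noise features, one binary and one continuous. All names and the 20% label-noise rate are assumptions made for this demo.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# One informative binary feature plus two pure-noise features.
signal = rng.integers(0, 2, size=n)
noise_binary = rng.integers(0, 2, size=n)
noise_continuous = rng.normal(size=n)

# The label copies the signal, with about 20% of labels flipped at random.
flip = rng.random(n) < 0.2
y_demo = np.where(flip, 1 - signal, signal)

X_demo = np.column_stack([signal, noise_binary, noise_continuous])
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one per column, summing to 1.
names = ['signal', 'noise_binary', 'noise_continuous']
for name, imp in zip(names, forest.feature_importances_):
    print(f'{name}: {imp:.3f}')
```

In a setup like this, the continuous noise column usually receives a far larger share of the importance than the binary noise column, even though neither carries any information about the label.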

## 5. Permutation importance
Permutation importance is a technique where we shuffle the values of a single column and re-score the already trained model (without retraining it) to see how the score is affected. Shuffling breaks the relationship between that feature and the target, so if the score drops sharply, the model relies heavily on the feature; if the score barely changes, the feature adds little value to the model.
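
To make the idea concrete, here is a minimal, self-contained sketch using only the standard library. The "model" is a hypothetical hard-coded rule (`toy_model`) that uses the first feature and ignores the second, so permuting the first feature should hurt accuracy while permuting the second should change nothing.

```python
import random

def toy_model(row):
    # Hypothetical "trained" model: predicts 1 when the first feature
    # exceeds 0.5 and ignores the second feature entirely.
    return 1 if row[0] > 0.5 else 0

def accuracy(rows, labels):
    # Fraction of rows where the toy model's prediction matches the label.
    return sum(toy_model(r) == y for r, y in zip(rows, labels)) / len(labels)

rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]

baseline = accuracy(rows, labels)
drops = {}
for col in range(2):
    # Shuffle one column while leaving the other column untouched.
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    # Importance = how much the score falls when this column is scrambled.
    drops[col] = baseline - accuracy(permuted, labels)

print(baseline, drops)
```

The drop for column 1 is exactly zero because the model never reads it, which is the behavior permutation importance is designed to expose.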

Let us see the feature importances for recall score on the test dataset.

In [0]:
def PermImportance(X, y, clf, metric, num_iterations=100):
    '''
    Calculates the permutation importance of features in a dataset.
    Inputs:
    X: dataframe with all the features
    y: array-like sequence of labels
    clf: sklearn classifier, already trained on training data
    metric: sklearn metric, such as accuracy_score, precision_score or recall_score
    num_iterations: no. of repeated permutation runs per feature
    Outputs:
    baseline: the baseline metric without any of the columns permuted
    scores: drops in the metric (baseline minus permuted score) caused by permuting each feature, dict in the format {feature:[diffs]}
    '''
    bar=progressbar.ProgressBar(max_value=len(X.columns))
    baseline_metric=metric(y, clf.predict(X))
    scores={c:[] for c in X.columns}
    for i, c in enumerate(X.columns):
        X1=X.copy(deep=True) #fresh copy, so that only column c is permuted
        for _ in range(num_iterations):
            temp=X1[c].tolist()
            random.shuffle(temp)
            X1[c]=temp
            score=metric(y, clf.predict(X1)) #the model is not retrained
            scores[c].append(baseline_metric-score) #positive diff = metric got worse
        bar.update(i)
    return baseline_metric, scores

In [78]:
baseline, scores=PermImportance(X_test, y_test, rf, recall_score, num_iterations=10)

 98% (81 of 82) |####################### | Elapsed Time: 0:03:08 ETA:   0:00:02

Let us now plot the 25 largest average percent drops in recall that resulted from permuting each feature.

In [0]:
percent_changes={c:[] for c in X.columns}
for c in scores:
    for i in range(len(scores[c])):
        percent_changes[c].append(scores[c][i]/baseline*100)

In [81]:
px.bar(
    pd.DataFrame.from_dict(percent_changes).melt().groupby(['variable']).mean().reset_index().sort_values(['value'], ascending=False)[:25], 
    x='variable', 
    y='value', 
    labels={
        'variable':'column', 
        'value':'% change in recall'
        }
       )

We can see that 'AGE' is not among the most important features when we use the permutation technique to determine feature importances.
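
As a possible cross-check, scikit-learn ships its own `permutation_importance` function in `sklearn.inspection` (assuming scikit-learn 0.22 or later, which is newer than the version this notebook appears to have been run with). The sketch below runs it on synthetic stand-in data so that it is self-contained; `X_demo`, `y_demo` and `model` are hypothetical names. In this notebook the corresponding arguments would be `rf`, `X_test` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: used here only to keep the sketch runnable).
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     n_informative=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# scoring='recall' mirrors the recall_score metric used above;
# n_repeats plays the same role as num_iterations in PermImportance.
result = permutation_importance(model, X_te, y_te, scoring='recall',
                                n_repeats=10, random_state=0)

# importances_mean holds the mean drop in recall for each feature.
print(result.importances_mean)
```

The returned object also exposes `importances_std` and the raw per-repeat `importances`, which can help judge whether a feature's apparent importance is larger than the shuffle-to-shuffle noise.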