## Extracting Feature Importance via Permutation
In this exercise, we will compute and extract feature importance through permutation for a Random Forest regression model trained to predict the customer drop-out ratio.
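
Before using the library function, it may help to see the idea in isolation. The block below is a minimal sketch of permutation importance, not mlxtend's actual implementation: it shuffles one column at a time and records how much the R² score drops. The names `r2_score`, `permutation_importance_sketch`, `toy_predict`, `X_toy`, and `y_toy` are hypothetical and introduced only for illustration, and the toy data is synthetic.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def permutation_importance_sketch(predict, X, y, n_rounds=1, seed=0):
    """Average drop in R^2 when each column of X is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, predict(X))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_rounds):
            X_perm = X.copy()
            # Shuffling breaks the link between this feature and the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - r2_score(y, predict(X_perm)))
        importances[col] = np.mean(drops)
    return importances


# Synthetic toy data: the target depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
toy_rng = np.random.default_rng(42)
X_toy = toy_rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1]


def toy_predict(X):
    """Stand-in for a fitted model's predict method (assumption for this sketch)."""
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]


toy_imp = permutation_importance_sketch(toy_predict, X_toy, y_toy, n_rounds=5, seed=1)
print(toy_imp)
```

A feature the model ignores, such as column 2 here, gets an importance of zero because shuffling it leaves the predictions unchanged, while shuffling a feature the model relies on lowers the score.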

In [2]:
pip install mlxtend

Collecting mlxtend
  Using cached mlxtend-0.19.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.19.0
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from mlxtend.evaluate import feature_importance_permutation
import altair as alt

In [4]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter09/Dataset/phpYYZ4Qc.csv'

In [5]:
df = pd.read_csv(file_url)
y = df.pop('rej')

# split the data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=1)

In [6]:
# instantiate and train a RandomForestRegressor
rf_model = RandomForestRegressor(random_state=1, 
                                n_estimators=50, 
                                max_depth=6, 
                                min_samples_leaf=60)
rf_model.fit(X_train, y_train)

RandomForestRegressor(max_depth=6, min_samples_leaf=60, n_estimators=50,
                      random_state=1)

In [10]:
# extract feature importance via permutation
# (the second return value, the per-round importance values, is discarded)
imp_vals, _ = feature_importance_permutation(
    predict_method=rf_model.predict,
    X=X_test.values, y=y_test.values, 
    metric='r2', num_rounds=1, seed=2)
imp_vals

array([ 0.00000000e+00, -3.34728428e-05, -2.83476215e-05,  1.03738033e-04,
        4.61246775e-06,  1.96879681e-01,  8.71635991e-05, -7.16980150e-05,
        3.28788126e-04,  1.05860288e-03,  0.00000000e+00,  5.56589408e-01,
       -4.31208212e-05,  1.13215046e-04,  2.22409533e-05,  5.96895938e-05,
        5.35704113e-05,  1.76990072e-01,  2.81084956e-03,  6.79193119e-05,
        0.00000000e+00,  0.00000000e+00,  1.16553234e-02,  2.77582324e-05,
        1.40812233e-04,  1.96362926e-06,  3.66606090e-04, -1.82522826e-04,
        1.14460108e-05,  3.72080724e-05,  0.00000000e+00,  5.54878998e-04])

The raw values are quite hard to interpret. Let's plot the permutation-based variable importance on a bar chart.

In [11]:
varimp_df = pd.DataFrame({'feature': df.columns, 
                         'importance': imp_vals})
# plot a bar chart with altair using varimp_df, with importance on the x-axis and feature on the y-axis
alt.Chart(varimp_df).mark_bar().encode(x='importance', 
                                      y='feature')

From this output, we can see that the variables with the greatest impact on the predictions of this Random Forest model are a2pop, a1pop, a3pop, b1eff, and temp, in decreasing order of importance. This is very similar to the results of Exercise 9.02, Extracting RandomForest Feature Importance.
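
The ranking can also be read off numerically instead of from the chart. The following sketch copies the importance values printed earlier and uses NumPy's `argsort` to list the five largest; it reports column positions only, since mapping them back to names requires the loaded DataFrame.

```python
import numpy as np

# Importance values copied from the feature_importance_permutation output above
imp_vals = np.array([
    0.00000000e+00, -3.34728428e-05, -2.83476215e-05, 1.03738033e-04,
    4.61246775e-06, 1.96879681e-01, 8.71635991e-05, -7.16980150e-05,
    3.28788126e-04, 1.05860288e-03, 0.00000000e+00, 5.56589408e-01,
    -4.31208212e-05, 1.13215046e-04, 2.22409533e-05, 5.96895938e-05,
    5.35704113e-05, 1.76990072e-01, 2.81084956e-03, 6.79193119e-05,
    0.00000000e+00, 0.00000000e+00, 1.16553234e-02, 2.77582324e-05,
    1.40812233e-04, 1.96362926e-06, 3.66606090e-04, -1.82522826e-04,
    1.14460108e-05, 3.72080724e-05, 0.00000000e+00, 5.54878998e-04,
])

# argsort returns ascending order, so reverse it and keep the first five positions
top_idx = np.argsort(imp_vals)[::-1][:5]

for idx in top_idx:
    print(idx, round(float(imp_vals[idx]), 4))
```

In the notebook itself, `df.columns[top_idx]` should translate these positions into the corresponding feature names. Note that only three features have an importance above 0.1; the remaining values are close to zero.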