# Feature Importances

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

dataset = load_diabetes()
X = dataset['data']
y = dataset['target']



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

### Feature Importances

Sklearn's random forest estimator allows us to view the importances of our features directly.

In [36]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators = 12, max_features='log2', min_samples_leaf=8).fit(X_train, y_train)

In [71]:
rfr.score(X_val, y_val)

0.4572085759988026

In [37]:
importance_numbers = rfr.feature_importances_

In [38]:
importance_list = importance_numbers.tolist()

In [24]:
import numpy as np
feature_names = dataset['feature_names']
importances = np.vstack((feature_names, importance_list))

In [26]:
import pandas as pd

In [29]:
importances = pd.Series(importance_list, index = dataset['feature_names'])

In [32]:
importances.sort_values(ascending=False)

bmi    0.293283
s5     0.205167
bp     0.131995
s3     0.127470
s6     0.089923
age    0.045828
s1     0.038752
s4     0.037920
s2     0.020621
sex    0.009042
dtype: float64

> This measure is sometimes called "gini importance" or "mean decrease impurity" and is defined as the total decrease in node impurity (weighted by the probability of reaching that node, which is approximated by the proportion of samples reaching that node) averaged over all trees of the ensemble.
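
The "averaged over all trees" part of this definition can be checked directly. The sketch below is an illustration, not part of the original notebook: it refits a small forest on the full diabetes data (the `random_state=0` and the names `forest`, `per_tree` and `manual` are assumptions added for reproducibility) and compares the mean of the per-tree importances with the forest-level `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes()
X_all, y_all = data["data"], data["target"]

# Same hyperparameters as the notebook's model; random_state is an added assumption.
forest = RandomForestRegressor(
    n_estimators=12, max_features="log2", min_samples_leaf=8, random_state=0
).fit(X_all, y_all)

# Each fitted tree exposes its own normalized impurity-based importances.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees, then renormalize so the importances sum to 1.
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

# Should print True if the forest value is the average of the per-tree values.
print(np.allclose(manual, forest.feature_importances_))
```

Because every tree reports importances that already sum to 1, the forest-level numbers are relative shares rather than absolute effect sizes.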

2. Working with feature importances

In [59]:
importances = ['bmi', 's5','bp', 's3', 's6', 'age', 's1']

In [67]:
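# Wrap the arrays in DataFrames so that columns can be selected by name
df_X_train = pd.DataFrame(X_train, columns=dataset['feature_names'])
X_importances_train = df_X_train[importances]

In [62]:
df_X_val = pd.DataFrame(X_val, columns=dataset['feature_names'])
X_val_importances = df_X_val[importances]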

In [64]:
X_importances_train.shape

(282, 7)

In [65]:
X_val_importances.shape

(71, 7)

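In [70]:
# The cell that fit rfr_importances is missing; this assumes the same hyperparameters as rfr
rfr_importances = RandomForestRegressor(n_estimators = 12, max_features='log2', min_samples_leaf=8).fit(X_importances_train, y_train)
rfr_importances.score(X_val_importances, y_val)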

0.3900393366721254

Here we can see that the feature importances did not give us a clear cutoff, and that dropping the three lowest-ranked features lowered our validation score (R²) from 0.457 to 0.390.
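
One way to look for a cutoff is to score a model on the top `k` features for every `k`. The sketch below is a hedged illustration and not the notebook's code: the helper `fit_forest`, the fixed `random_state=0`, and the use of a single hold-out split in place of the notebook's validation set are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = load_diabetes()
# A single hold-out split stands in for the notebook's validation set (assumption).
X_tr, X_va, y_tr, y_va = train_test_split(
    data["data"], data["target"], test_size=0.2, random_state=1
)

def fit_forest(X, y):
    # Notebook hyperparameters; random_state=0 is an added assumption.
    return RandomForestRegressor(
        n_estimators=12, max_features="log2", min_samples_leaf=8, random_state=0
    ).fit(X, y)

# Column indices ordered from most to least important.
ranking = np.argsort(fit_forest(X_tr, y_tr).feature_importances_)[::-1]

# Refit on the top-k columns and record the hold-out R^2 for each k.
scores = {}
for k in range(1, len(ranking) + 1):
    cols = ranking[:k]
    scores[k] = fit_forest(X_tr[:, cols], y_tr).score(X_va[:, cols], y_va)

for k, score in scores.items():
    print(k, round(score, 3))
```

With a hold-out set this small, neighbouring values of `k` can differ by noise alone, so the curve is a rough guide rather than a definitive cutoff.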

### An alternative mechanism

In [33]:
import eli5
from eli5.sklearn import PermutationImportance


perm = PermutationImportance(rfr).fit(X_val, y_val)
eli5.show_weights(perm, feature_names=dataset['feature_names'])

| Weight | Feature |
| --- | --- |
| 0.2601 ± 0.1338 | bmi |
| 0.0964 ± 0.0560 | s5 |
| 0.0421 ± 0.0249 | s3 |
| 0.0375 ± 0.0654 | bp |
| 0.0023 ± 0.0149 | s4 |
| -0.0012 ± 0.0068 | sex |
| -0.0026 ± 0.0074 | s2 |
| -0.0063 ± 0.0197 | age |
| -0.0088 ± 0.0209 | s6 |
| -0.0096 ± 0.0210 | s1 |
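
Permutation importance measures how much the validation score drops when the values of a single feature are shuffled. The sketch below shows that idea by hand. It is not eli5's implementation: the function `permutation_drops`, its `n_repeats` and `seed` parameters, the single hold-out split, and the fixed `random_state` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data = load_diabetes()
X_tr, X_va, y_tr, y_va = train_test_split(
    data["data"], data["target"], test_size=0.2, random_state=1
)

# Notebook hyperparameters; random_state=0 is an added assumption.
model = RandomForestRegressor(
    n_estimators=12, max_features="log2", min_samples_leaf=8, random_state=0
).fit(X_tr, y_tr)

def permutation_drops(model, X, y, n_repeats=5, seed=0):
    """Mean drop in model.score (R^2) when each column of X is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffling one column breaks its link to y but keeps its distribution.
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops[col] += baseline - model.score(X_shuffled, y)
    return drops / n_repeats

drops = permutation_drops(model, X_va, y_va)
for name, drop in sorted(zip(data["feature_names"], drops), key=lambda p: -p[1]):
    print(f"{name:>4} {drop: .4f}")
```

A drop near zero, or slightly negative, suggests the model barely relies on that feature for held-out predictions, which is how the negative weights in the table above can be read.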


1. Trying the permutation

In [34]:
selected_permutation = ['bmi', 's5', 's3', 'bp']

In [42]:
df_X_train = pd.DataFrame(X_train, columns=dataset['feature_names'])
selected_X_train = df_X_train[selected_permutation]

In [46]:
from sklearn.ensemble import RandomForestRegressor
rfr_permutation = RandomForestRegressor(n_estimators = 12, max_features='log2', min_samples_leaf=8).fit(selected_X_train, y_train)

In [48]:
X_val = pd.DataFrame(X_val, columns=dataset['feature_names'])
X_val_permute = X_val[selected_permutation]

rfr_permutation.score(X_val_permute, y_val)

0.4743308981213368

In [49]:
eli5.show_weights(rfr_permutation, feature_names=selected_permutation)

| Weight | Feature |
| --- | --- |
| 0.4560 ± 0.3954 | s5 |
| 0.2814 ± 0.2923 | bmi |
| 0.1358 ± 0.1524 | bp |
| 0.1268 ± 0.2031 | s3 |


Here we were able to reduce our model to just four features (compared with the seven we kept when selecting by gini importance), and it scored 0.474 on the validation set, better than the 0.457 of our original random forest trained on all ten features.

### Why the difference?

The short answer is that the first measurement (gini importance) is biased by the number of candidate split points that a feature offers.

> For a variable with many levels (in the most extreme case, a continuous variable will generally have as many levels as there are rows of data) this means testing many more split points. Testing more split points means there's a higher probability of finding a split that, purely by chance, happens to predict the dependent variable well. Therefore, variables where more splits are tried, will appear more often in the tree. This leads to the bias in the gini importance approach that we found.
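
The effect described in the quote can be reproduced with a small simulation. The sketch below uses only NumPy and is not taken from the original notebook: `best_split_gain`, the sample size, and the number of trials are illustrative assumptions. Both features are pure noise, unrelated to the target, yet the continuous one has many more thresholds to try.

```python
import numpy as np

def best_split_gain(x, y):
    """Largest variance reduction achievable by one threshold split on x."""
    order = np.argsort(x, kind="stable")
    x_sorted, y_sorted = x[order], y[order]
    n = len(y)
    total_sse = np.sum((y - y.mean()) ** 2)
    csum = np.cumsum(y_sorted)
    csum_sq = np.cumsum(y_sorted ** 2)
    best = 0.0
    for i in range(1, n):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold fits between two equal values
        left_sse = csum_sq[i - 1] - csum[i - 1] ** 2 / i
        right_sum = csum[-1] - csum[i - 1]
        right_sse = (csum_sq[-1] - csum_sq[i - 1]) - right_sum ** 2 / (n - i)
        best = max(best, (total_sse - left_sse - right_sse) / n)
    return best

rng = np.random.default_rng(0)
n_rows, n_trials = 200, 300
gains_binary, gains_continuous = [], []
for _ in range(n_trials):
    y = rng.normal(size=n_rows)                               # target: pure noise
    x_binary = rng.integers(0, 2, size=n_rows).astype(float)  # 2 levels, 1 candidate split
    x_continuous = rng.normal(size=n_rows)                    # about n_rows - 1 candidate splits
    gains_binary.append(best_split_gain(x_binary, y))
    gains_continuous.append(best_split_gain(x_continuous, y))

gains_binary = np.array(gains_binary)
gains_continuous = np.array(gains_continuous)
mean_gain_binary = gains_binary.mean()
mean_gain_continuous = gains_continuous.mean()
win_rate = np.mean(gains_continuous > gains_binary)

print("mean gain, binary feature:    ", round(mean_gain_binary, 4))
print("mean gain, continuous feature:", round(mean_gain_continuous, 4))
print("continuous wins in", round(100 * win_rate, 1), "% of trials")
```

If the bias is present, the mean gain for the continuous feature should come out larger and `win_rate` well above 0.5, even though neither feature carries any signal.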