# Feature Importances

## Introduction

In previous lessons, we learned how to tune our random forest model.  While our tuned models may be good at predicting a specific outcome, we do not yet know how the random forest uses our features to make these predictions.  This matters.  We want our models not only to predict the future, but also to tell us which factors are significant so that we can potentially change the future.

There are two components to this.  The first is determining which of our features are most important.  The second is determining the expected change in our target value as we change the value of a feature.

In this lesson, we'll focus on finding our most influential features, and removing less important features from our model.  

### Loading our data

Let's begin by loading our data and training our `RandomForestRegressor`.  We'll work with our preprocessed Airbnb data.

> First we load up our training data.

In [2]:
import pandas as pd
df_train = pd.read_feather('./bnb_train.feather')
df_X_train = df_train.drop(columns = ['price'])
y_train = df_train.price

> And now our validation data.

In [3]:
df_val = pd.read_feather('./bnb_val.feather')
df_X_val = df_val.drop(columns = ['price'])
y_val = df_val.price

> Next we train our model using the hyperparameters that we previously discovered.

In [4]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=40, max_features='log2')
rfr.fit(df_X_train, y_train)
rfr.score(df_X_val, y_val)

0.7572817041357578

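We achieve a score of about `.76` on our validation set.  For a regressor, `score` returns the coefficient of determination (R²) rather than a classification accuracy.  It's time to see what factors influence that model.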
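To make the meaning of that score concrete, here is a minimal sketch showing that a regressor's `score` matches `sklearn.metrics.r2_score`.  It uses synthetic data from `make_regression` as a stand-in for the Airbnb data, so the variable names and the numbers it prints are illustrative assumptions, not results from this lesson.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lesson's data (assumption: any regression data works here).
X, y = make_regression(n_samples=500, n_features=8, n_informative=3,
                       noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=40, max_features='log2', random_state=0)
model.fit(X_tr, y_tr)

# For regressors, .score() is R^2 computed on the data we pass in.
score_from_model = model.score(X_va, y_va)
score_from_metric = r2_score(y_va, model.predict(X_va))
print(score_from_model, score_from_metric)
```

The two printed values should agree, which is why the rest of the lesson talks about a drop in score rather than a drop in accuracy.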

### Feature Importances

To determine the feature importances, we could use the `feature_importances_` attribute of our fitted `rfr`, which `sklearn` computes from the impurity reduction at each split.  However, a technique that is often more reliable comes from the `eli5` library's `PermutationImportance` class.  Let's see `PermutationImportance` in action, and then we'll see how it works.

If we have not already, we should run `pip install eli5` in the terminal.  Then we use the `PermutationImportance` class.
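For reference before moving on, the built-in attribute can be read as follows.  This is a minimal sketch on synthetic data, and the column names `feature_0`, `feature_1`, and so on are hypothetical placeholders rather than columns from our Airbnb data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical column names (not the lesson's data).
X, y = make_regression(n_samples=500, n_features=6, n_informative=2,
                       noise=5.0, random_state=0)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df_X = pd.DataFrame(X, columns=feature_names)

model = RandomForestRegressor(n_estimators=40, random_state=0).fit(df_X, y)

# feature_importances_ is an array attribute (not a method) with one value
# per column; sklearn normalizes the values so that they sum to 1.
builtin_importances = (
    pd.Series(model.feature_importances_, index=df_X.columns)
    .sort_values(ascending=False)
)
print(builtin_importances)
```

Note that these values come from the training process itself, whereas permutation importance is measured by scoring the model on data we choose, such as our validation set.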

In [6]:
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(rfr).fit(df_X_val, y_val)

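Notice that `eli5` uses an interface similar to sklearn's.  We first initialize `PermutationImportance` with our already-trained `RandomForestRegressor`, `rfr`, and then call `fit` on our validation data.  With the default settings, this does not retrain `rfr`; it computes the importances by scoring `rfr` on the validation data.

We can view the features in order of importance with the following: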

In [62]:
import eli5
eli5.explain_weights_df(perm, feature_names = df_X_train.columns.to_list(), top=10)

| | feature | weight | std |
|---|---|---|---|
| 0 | calculated_host_listings_count | 0.076291 | 0.000196 |
| 1 | license_is_na | 0.073342 | 0.010245 |
| 2 | property_type_other | 0.069124 | 0.006424 |
| 3 | host_verifications_['email', 'phone'] | 0.068592 | 0.009262 |
| 4 | host_total_listings_count | 0.066563 | 0.009125 |
| 5 | host_neighbourhood_is_na | 0.064419 | 0.007286 |
| 6 | neighbourhood_Wilmersdorf | 0.051356 | 7.1e-05 |
| 7 | host_listings_count | 0.049906 | 0.002391 |
| 8 | index | 0.03797 | 0.002161 |
| 9 | availability_60 | 0.033209 | 0.00494 |


Above we can see that, according to `PermutationImportance`, the most important features are `calculated_host_listings_count` and `license_is_na`.  Our features are ordered by their `weight`.  Now it's time to learn what `weight` means, and how permutation importance works.

### Explaining Permutation Importance

To understand permutation importance, let's start with what is perhaps the ideal way to determine the most important features.  We first train our model with all of the features and record its score.  Then, to see if a feature is important, we remove that feature, retrain our model, and see how much worse the model performs.  If the model performs a lot worse, our feature was important.

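```python
from sklearn.ensemble import RandomForestRegressor

# Baseline: train on every feature and score on the validation set.
model_all_features = RandomForestRegressor()
baseline_score = model_all_features.fit(df_X_train, y_train).score(df_X_val, y_val)

# Drop one feature at a time, retrain, and record the new validation score.
scores = []
for column in df_X_train.columns:
    X_train_minus_one = df_X_train.drop(columns=[column])
    X_val_minus_one = df_X_val.drop(columns=[column])
    model_minus_one = RandomForestRegressor()
    model_minus_one.fit(X_train_minus_one, y_train)  # fit on training targets
    score = model_minus_one.score(X_val_minus_one, y_val)
    scores.append(score)
```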

Now the only issue with this ideal approach is that retraining the model for each feature can take a lot of time.  

So to save time, we use permutation importance to do the next best thing.  With permutation importance, instead of retraining a model once for each feature, we only train our model one time.

In [10]:
model_all_features = RandomForestRegressor(n_estimators=40, max_features='log2')
model_all_features.fit(df_X_train, y_train).score(df_X_val, y_val)

0.7387637461679417

Then, once our model is trained, we see how important a feature is by replacing the feature's real data with random data.  The random data we use for that feature is simply the original data shuffled.  For example, let's say that we want to determine the feature importance of `calculated_host_listings_count`.

Here is the original feature data and related target data.

In [63]:
df_X_val['calculated_host_listings_count'].head(), y_val.head()

(0    1
 1    1
 2    4
 3    1
 4    1
 Name: calculated_host_listings_count, dtype: int64, 0    30.0
 1    45.0
 2    39.0
 3    60.0
 4    35.0
 Name: price, dtype: float64)

Now instead of scoring the model with the original `calculated_host_listings_count`, we shuffle that feature data.

In [64]:
shuffled_listings_count = df_X_val['calculated_host_listings_count'].sample(frac=1, random_state = 1)
shuffled_listings_count.head(), y_val.head()

(4370    1
 881     1
 3214    1
 3782    1
 4442    1
 Name: calculated_host_listings_count, dtype: int64, 0    30.0
 1    45.0
 2    39.0
 3    60.0
 4    35.0
 Name: price, dtype: float64)

We then score the model with that one feature's data shuffled, which approximates removing the feature entirely, and see how the model performs.

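In [67]:
permuted_X_val = df_X_val.copy()
# drop=True discards the old row labels; without it, reset_index() carries the
# labels along as an extra `index` column instead of returning only the values.
host_listings_permute = permuted_X_val['calculated_host_listings_count'].sample(frac=1, random_state=1).reset_index(drop=True)
permuted_X_val['calculated_host_listings_count'] = host_listings_permute
permuted_X_val['calculated_host_listings_count'].head()

0    1
1    1
2    1
3    1
4    1
Name: calculated_host_listings_count, dtype: int64

In [68]:
model_all_features.score(permuted_X_val, y_val)

0.26978065177587496

In the recorded run, the score fell from `.74` to `.27`.  That recorded score comes from a run in which the column held the old row labels (`4370`, `881`, ...) rather than the shuffled values, so it likely overstates the drop; with the shuffled values in place, expect a different and probably smaller decrease.  `PermutationImportance` performs this shuffle several times for each feature (`n_iter`, 5 by default) and reports the average decrease in score as the `weight`, along with the standard deviation as `std`.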
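The shuffle-and-score procedure described above can be sketched from scratch as follows.  This is a minimal illustration, not `eli5`'s actual implementation: `manual_permutation_importance` is a hypothetical helper, the data is synthetic, and the column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, df_X, y, n_iter=5, random_state=0):
    """Hypothetical helper: mean and std of the score drop per shuffled column."""
    rng = np.random.default_rng(random_state)
    baseline = model.score(df_X, y)  # score with every feature intact
    rows = []
    for column in df_X.columns:
        drops = []
        for _ in range(n_iter):
            permuted = df_X.copy()
            # Shuffle the values only; .to_numpy() discards the row labels so
            # pandas cannot re-align them back into their original order.
            permuted[column] = rng.permutation(permuted[column].to_numpy())
            drops.append(baseline - model.score(permuted, y))
        rows.append({'feature': column,
                     'weight': np.mean(drops),
                     'std': np.std(drops)})
    return (pd.DataFrame(rows)
            .sort_values('weight', ascending=False)
            .reset_index(drop=True))


# Synthetic data; coef is returned only so we can see which columns truly matter.
X, y, coef = make_regression(n_samples=600, n_features=5, n_informative=2,
                             noise=5.0, coef=True, random_state=0)
df_X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(5)])
X_tr, X_va, y_tr, y_va = train_test_split(df_X, y, random_state=0)

model = RandomForestRegressor(n_estimators=40, random_state=0).fit(X_tr, y_tr)
importances = manual_permutation_importance(model, X_va, y_va)
print(importances)
```

The model is trained once, and every later step only calls `score`, which is what makes this approach cheaper than retraining once per feature.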

So now we can better understand the table that `PermutationImportance` produced.

In [69]:
import eli5
feat_imp_df = eli5.explain_weights_df(perm, feature_names=df_X_train.columns.to_list())
feat_imp_df.head(5)

| | feature | weight | std |
|---|---|---|---|
| 0 | calculated_host_listings_count | 0.076291 | 0.000196 |
| 1 | license_is_na | 0.073342 | 0.010245 |
| 2 | property_type_other | 0.069124 | 0.006424 |
| 3 | host_verifications_['email', 'phone'] | 0.068592 | 0.009262 |
| 4 | host_total_listings_count | 0.066563 | 0.009125 |
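The introduction mentioned removing less important features.  One possible way to act on a table like this is sketched below, under several assumptions: it uses synthetic data, it swaps in scikit-learn's own `sklearn.inspection.permutation_importance` instead of `eli5` to build a similar `feature`/`weight`/`std` table, and the `threshold` value is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical column names (not the lesson's data).
X, y = make_regression(n_samples=600, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
df_X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(10)])
X_tr, X_va, y_tr, y_va = train_test_split(df_X, y, random_state=0)

model = RandomForestRegressor(n_estimators=40, random_state=0).fit(X_tr, y_tr)
full_score = model.score(X_va, y_va)

# Build a table shaped like the one above: mean score drop and its std.
result = permutation_importance(model, X_va, y_va, n_repeats=5, random_state=0)
feat_imp = pd.DataFrame({
    'feature': X_va.columns,
    'weight': result.importances_mean,
    'std': result.importances_std,
}).sort_values('weight', ascending=False)

threshold = 0.005  # arbitrary cutoff, chosen only for this sketch
keep = feat_imp.loc[feat_imp['weight'] > threshold, 'feature'].to_list()

# Retrain on the kept columns only and compare validation scores.
pruned = RandomForestRegressor(n_estimators=40, random_state=0).fit(X_tr[keep], y_tr)
pruned_score = pruned.score(X_va[keep], y_va)
print(len(keep), full_score, pruned_score)
```

Because the features are selected using the validation set, the pruned score measured on that same set may be optimistic; a separate held-out set would give a fairer comparison.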


### Summary

In this lesson, we learned how to view the feature importances of a random forest with the `PermutationImportance` class from `eli5`.  We saw that the library approximates the decrease in the model's score if a feature were removed, by shuffling each feature and calculating the decrease in score.  The library does this several times per feature to calculate the mean and standard deviation.