# Feature Importances

## Introduction

### Loading our data

In [2]:
import pandas as pd
df_train = pd.read_feather('./bnb_train.feather')
df_X_train = df_train.drop(columns = ['price'])
y_train = df_train.price

In [3]:
df_val = pd.read_feather('./bnb_val.feather')
df_X_val = df_val.drop(columns = ['price'])
y_val = df_val.price

> Next we train our model using the hyperparameters that we previously discovered.

In [4]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=40, max_features='log2')
rfr.fit(df_X_train, y_train)
rfr.score(df_X_val, y_val)

0.7572817041357578

### Feature Importances

In [6]:
from eli5.sklearn import PermutationImportance


In [1]:
# explain weights

### Explaining Permutation Importance

* Goal: recursive feature elimination

```python 
model_all_features = RandomForestRegressor()
model_all_features.fit(df_X_train, y_train).score(df_X_val, y_val)

scores = []
for column in X_train.columns:
    X_train_minus_one = df_X_train.drop(columns = [column])
    X_valid_minus_one = df_X_valid.drop(columns = [column])
    model_all_features.fit(X_train_minus_one, y_valid)
    score = model_all_features.score(X_valid_minus_one, y_valid)
    scores.append(score)
```

* Problem: time

### Solution: just scramble the features

In [10]:
model_all_features = RandomForestRegressor(n_estimators=40, max_features='log2')
model_all_features.fit(df_X_train, y_train).score(df_X_val, y_val)

0.7387637461679417

* original data

In [63]:
df_X_val['calculated_host_listings_count'].head(), y_val.head()

(0    1
 1    1
 2    4
 3    1
 4    1
 Name: calculated_host_listings_count, dtype: int64, 0    30.0
 1    45.0
 2    39.0
 3    60.0
 4    35.0
 Name: price, dtype: float64)

* scrambled data

In [64]:
shuffled_host_since = df_X_val['calculated_host_listings_count'].sample(frac=1, random_state = 1)
shuffled_host_since.head(), y_val_top

(4370    1
 881     1
 3214    1
 3782    1
 4442    1
 Name: calculated_host_listings_count, dtype: int64, 0    30.0
 1    45.0
 2    39.0
 3    60.0
 4    35.0
 Name: price, dtype: float64)

And then score the model with that one feature's data shuffled.  Which is close to if we removed that data entirely, and see how the model performs.

In [67]:
permuted_X_val = df_X_val.copy()
host_listings_permute = permuted_X_val['calculated_host_listings_count'].sample(frac=1, random_state=1).reset_index()
permuted_X_val['calculated_host_listings_count'] = host_listings_permute
permuted_X_val['calculated_host_listings_count'].head()

0    4370
1     881
2    3214
3    3782
4    4442
Name: calculated_host_listings_count, dtype: int64

In [68]:
model_all_features.score(permuted_X_val, y_val)

0.26978065177587496

### back to the show

In [69]:
import eli5
feat_imp_df = eli5.explain_weights_df(perm, feature_names=df_X_train.columns.to_list())
feat_imp_df.head(5)

Unnamed: 0,feature,weight,std
0,calculated_host_listings_count,0.076291,0.000196
1,license_is_na,0.073342,0.010245
2,property_type_other,0.069124,0.006424
3,"host_verifications_['email', 'phone']",0.068592,0.009262
4,host_total_listings_count,0.066563,0.009125
