# Feature Importances

## Introduction

### Loading our data

In [1]:
import pandas as pd
df_train = pd.read_feather('./bnb_train.feather')
df_X_train = df_train.drop(columns = ['price'])
y_train = df_train.price

In [10]:
df_val = pd.read_feather('./bnb_val.feather')
df_X_val = df_val.drop(columns = ['price'])
y_val = df_val.price

> Next we train our model using the hyperparameters that we previously discovered.

In [3]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=40, max_features='log2')
rfr.fit(df_X_train, y_train)
rfr.score(df_X_val, y_val)

0.7703711201342761

### Feature Importances

In [11]:
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(rfr).fit(df_X_val, y_val)

In [12]:
import eli5

In [17]:
eli5.explain_weights_df(perm, feature_names = df_X_val.columns.tolist(), top = 10)

| | feature | weight | std |
|---|---|---|---|
| 0 | property_type_other | 0.070201 | 0.005039 |
| 1 | longitude | 0.069717 | 0.00683 |
| 2 | calculated_host_listings_count | 0.065217 | 0.00144 |
| 3 | summary_is_na | 0.044757 | 0.000118 |
| 4 | host_listings_count | 0.042729 | 0.000629 |
| 5 | host_sinceDay | 0.042483 | 0.002693 |
| 6 | host_id | 0.042095 | 0.002303 |
| 7 | host_response_rate | 0.042094 | 0.007032 |
| 8 | availability_90 | 0.038881 | 0.00228 |
| 9 | host_total_listings_count | 0.033085 | 0.003424 |


### Explaining Permutation Importance

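* Goal: measure how much each feature matters by dropping it, retraining, and comparing validation scores (the building block of recursive feature elimination)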

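```python
# Baseline: validation score with every feature present
model_all_features = RandomForestRegressor()
baseline_score = model_all_features.fit(df_X_train, y_train).score(df_X_val, y_val)

# Drop one column at a time, retrain, and record the validation score
scores = []
for column in df_X_train.columns:
    X_train_minus_one = df_X_train.drop(columns=[column])
    X_val_minus_one = df_X_val.drop(columns=[column])
    model_minus_one = RandomForestRegressor()
    # Fit on the training targets, then score on the validation targets
    model_minus_one.fit(X_train_minus_one, y_train)
    score = model_minus_one.score(X_val_minus_one, y_val)
    scores.append(score)
```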

* Problem: time. This approach retrains the model once for every feature.

### Solution: just scramble the features

In [6]:
model_all_features = RandomForestRegressor(n_estimators=40, max_features='log2')
model_all_features.fit(df_X_train, y_train).score(df_X_val, y_val)

0.7557566778439913

* original data

In [7]:
df_X_val['calculated_host_listings_count'].head(), y_val.head()

(0    1
 1    1
 2    4
 3    1
 4    1
 Name: calculated_host_listings_count, dtype: int64, 0    30.0
 1    45.0
 2    39.0
 3    60.0
 4    35.0
 Name: price, dtype: float64)

* scrambled data

In [19]:
shuffled_host_listings = df_X_val['calculated_host_listings_count'].sample(frac=1, random_state = 1)
shuffled_host_listings.head()

4370    1
881     1
3214    1
3782    1
4442    1
Name: calculated_host_listings_count, dtype: int64
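One pandas detail matters when a shuffled column is written back into a DataFrame: `sample(frac=1)` reorders the rows but keeps each value attached to its original index label, and assigning a Series to a column aligns on that index. The sketch below uses a small hypothetical `toy_feature` Series (not the Airbnb data) to show that assigning the shuffled Series directly puts every value back where it started, while assigning the raw values keeps the shuffled order.

```python
import pandas as pd

# Hypothetical toy column standing in for a real feature
values = pd.Series([10, 20, 30, 40, 50], name="toy_feature")
shuffled = values.sample(frac=1, random_state=1)

frame = pd.DataFrame({"toy_feature": values})

# Assigning a Series aligns on the index, so the shuffle is undone
frame["aligned"] = shuffled

# Assigning the underlying array is positional, so the shuffle survives
frame["positional"] = shuffled.to_numpy()

print(frame)
```

In both cases the set of values is unchanged, which is the point of permuting: the column keeps its distribution but loses its relationship to the target.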

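We then score the already-trained model on validation data with that one feature shuffled. Shuffling breaks the link between the feature and the target, which approximates removing the feature entirely without retraining, so the drop in score shows how much the model relies on it.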

In [None]:
permuted_X_val = df_X_val.copy()
# Take the raw shuffled values; assigning a Series would realign on the index and undo the shuffle
host_listings_permute = permuted_X_val['calculated_host_listings_count'].sample(frac=1, random_state=1).to_numpy()
permuted_X_val['calculated_host_listings_count'] = host_listings_permute
permuted_X_val['calculated_host_listings_count'].head()

In [None]:
model_all_features.score(permuted_X_val, y_val)
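The single-column scramble above generalizes to every feature. The following is a minimal sketch of that loop, not eli5's implementation: it runs on synthetic data with two hypothetical columns (`signal` and `noise`), and the helper name `permutation_importances` and its `n_repeats` and `seed` parameters are assumptions made for illustration. It repeats each shuffle a few times and reports the mean and standard deviation of the score drop, which is roughly what the `weight` and `std` columns are understood to summarize.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real data (hypothetical columns)
data_rng = np.random.default_rng(0)
n_rows = 600
X = pd.DataFrame({
    "signal": data_rng.normal(size=n_rows),  # drives the target
    "noise": data_rng.normal(size=n_rows),   # unrelated to the target
})
y = 3 * X["signal"] + data_rng.normal(scale=0.1, size=n_rows)

X_train, X_val = X.iloc[:400], X.iloc[400:]
y_train, y_val = y.iloc[:400], y.iloc[400:]

model = RandomForestRegressor(n_estimators=40, random_state=0).fit(X_train, y_train)


def permutation_importances(model, X_val, y_val, n_repeats=5, seed=0):
    """Mean and std of the validation-score drop when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X_val, y_val)
    rows = []
    for column in X_val.columns:
        drops = []
        for _ in range(n_repeats):
            permuted = X_val.copy()
            # Shuffle one column positionally; all other columns stay intact
            permuted[column] = rng.permutation(permuted[column].to_numpy())
            drops.append(baseline - model.score(permuted, y_val))
        rows.append({"feature": column, "weight": np.mean(drops), "std": np.std(drops)})
    ranked = pd.DataFrame(rows).sort_values("weight", ascending=False)
    return ranked.reset_index(drop=True)


importances = permutation_importances(model, X_val, y_val)
print(importances)
```

The model is fitted once and only scored inside the loop, which is where the time saving over the drop-and-retrain approach comes from.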

### Back to the show

In [20]:
import eli5
feat_imp_df = eli5.explain_weights_df(perm, feature_names=df_X_train.columns.to_list())
feat_imp_df.head(5)

| | feature | weight | std |
|---|---|---|---|
| 0 | property_type_other | 0.070201 | 0.005039 |
| 1 | longitude | 0.069717 | 0.00683 |
| 2 | calculated_host_listings_count | 0.065217 | 0.00144 |
| 3 | summary_is_na | 0.044757 | 0.000118 |
| 4 | host_listings_count | 0.042729 | 0.000629 |
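As a cross-check on a table like this, scikit-learn ships its own `permutation_importance` function in `sklearn.inspection`. This is a different library from the eli5 wrapper used in this notebook, so treat the sketch below as an alternative route rather than a reproduction of the numbers above. It uses synthetic data with hypothetical `signal` and `noise` columns, and the `summary` frame is laid out to mirror the `feature`, `weight`, and `std` columns by assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the real data (hypothetical columns)
rng = np.random.default_rng(1)
n_rows = 600
X = pd.DataFrame({
    "signal": rng.normal(size=n_rows),  # drives the target
    "noise": rng.normal(size=n_rows),   # unrelated to the target
})
y = 3 * X["signal"] + rng.normal(scale=0.1, size=n_rows)

X_train, X_val = X.iloc[:400], X.iloc[400:]
y_train, y_val = y.iloc[:400], y.iloc[400:]

model = RandomForestRegressor(n_estimators=40, random_state=1).fit(X_train, y_train)

# Shuffle each column n_repeats times on the validation set and record the score drop
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=1)

summary = pd.DataFrame({
    "feature": X_val.columns,
    "weight": result.importances_mean,
    "std": result.importances_std,
}).sort_values("weight", ascending=False).reset_index(drop=True)

print(summary)
```

Because the shuffles are random, two runs (or two libraries) will not agree to the last digit; the ranking of clearly important features is the part to compare.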
