# Feature Selection

### Introduction

One benefit of viewing feature importances is that we can begin to remove the less important features.  Reducing the number of features in our model brings several benefits:

* Our models become easier to understand
* We can drop collinear features, which carry redundant information
* We can prioritize our feature engineering and interpretation on the selected features

### Loading our data

> First we load up our training data and validation data.

In [70]:
import pandas as pd
df_train = pd.read_feather('./bnb_train.feather')
df_X_train = df_train.drop(columns = ['price'])
y_train = df_train.price

df_val = pd.read_feather('./bnb_val.feather')
df_X_val = df_val.drop(columns = ['price'])
y_val = df_val.price

> Then we train our model.

In [4]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators=40, max_features='log2')
rfr.fit(df_X_train, y_train)
rfr.score(df_X_val, y_val)

0.7572817041357578

### Feature Selection

Now that we have trained our model, it's time to select some features from the model.  We start by using our `PermutationImportance` transformer.

In [72]:
import eli5  # needed below for eli5.explain_weights_df
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(rfr).fit(df_X_val, y_val)

> And then we view the importances.

In [76]:
eli5.explain_weights_df(perm, feature_names = df_X_val.columns.to_list()).head(5)

| | feature | weight | std |
|---|---|---|---|
| 0 | license_is_na | 0.076021 | 0.002148 |
| 1 | calculated_host_listings_count | 0.074967 | 0.001796 |
| 2 | host_verifications_['email', 'phone'] | 0.069605 | 0.008095 |
| 3 | property_type_other | 0.0683 | 0.00766 |
| 4 | host_neighbourhood_is_na | 0.064198 | 0.009964 |
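
To make the `weight` and `std` columns more concrete, here is a minimal sketch of the idea behind permutation importance: shuffle one column of the validation data, rescore the model, and record how far the score drops. This is not `eli5`'s implementation. The helper names (`r2` and `permutation_importance_sketch`), the synthetic data, and the least-squares model standing in for our random forest are all assumptions made for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, the same kind of score that .score() reports."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Hypothetical helper: mean and std of the score drop after shuffling each column."""
    rng = np.random.default_rng(seed)
    baseline = r2(y, predict(X))
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between this one column and the target.
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops[col, rep] = baseline - r2(y, predict(X_shuffled))
    return drops.mean(axis=1), drops.std(axis=1)

# Synthetic data (an assumption): column 0 matters most, column 2 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 3))
y = 4.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=600)

# Fit a plain least-squares model on the first 400 rows, evaluate on the rest.
coef, *_ = np.linalg.lstsq(X[:400], y[:400], rcond=None)
predict = lambda data: data @ coef

weights, stds = permutation_importance_sketch(predict, X[400:], y[400:])
print(weights)  # largest for column 0, close to zero for column 2
```

A feature the model leans on heavily produces a large drop when shuffled, while a feature the model ignores produces a drop near zero, which is why a small weight is a reasonable signal that a feature can be removed.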


Now, to select the features, we could simply pick the corresponding columns from the dataframe.  Another mechanism is to use the `SelectFromModel` transformer from sklearn.

In [90]:
from sklearn.feature_selection import SelectFromModel
first_selection = SelectFromModel(perm, threshold=0.01, prefit=True)
X_val_first_select = first_selection.transform(df_X_val)
X_train_first_select = first_selection.transform(df_X_train)

So here we keep every feature whose permutation importance is at least 0.01, which, as we will see below, leaves us with thirty-eight features.  Then we retrain the model with just these features and see how it performs.
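
As a rough check on what `threshold` does, the sketch below compares the mask returned by `SelectFromModel` with a mask built by hand. It uses synthetic data and, to stay within scikit-learn, reads the forest's own impurity-based `feature_importances_` rather than the `eli5` permutation weights used in this lesson; both of those choices are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data (an assumption): only the first two columns drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=40, random_state=0).fit(X, y)

threshold = 0.05
selector = SelectFromModel(forest, threshold=threshold, prefit=True)

# SelectFromModel keeps the features whose importance is at least the threshold.
mask_from_selector = selector.get_support()
mask_by_hand = forest.feature_importances_ >= threshold

X_selected = selector.transform(X)
print(mask_from_selector, X_selected.shape)
```

Because the cutoff is on the importance value, the number of features that survive is an outcome rather than something we set in advance. If a fixed count is wanted, `SelectFromModel` also accepts a `max_features` argument, which can be combined with `threshold=-np.inf` to keep the top N features.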

In [91]:
rfr_first_select = RandomForestRegressor(n_estimators=40, max_features='log2')
rfr_first_select.fit(X_train_first_select, y_train)
rfr_first_select.score(X_val_first_select, y_val)

0.8280295274769363

Here, the validation score rises from about 0.757 to about 0.828.  We can determine which features were selected with the `get_support` method.

In [82]:
first_selection.get_support()[:20]

array([ True,  True,  True,  True,  True,  True,  True,  True, False,
       False, False,  True, False, False, False,  True, False, False,
        True, False])

In [94]:
first_select_cols = df_X_train.columns[first_selection.get_support()]
first_select_cols[:5]

Index(['index', 'id', 'host_id', 'host_response_rate', 'host_listings_count'], dtype='object')

In [95]:
first_select_cols.shape

(38,)

### Round two

Now that we are down to thirty-eight columns, and some of the removed features were collinear with the ones we kept, the feature importances are likely to change.  Let's run `PermutationImportance` again on just these features.

In [98]:
from eli5.sklearn import PermutationImportance
second_pmi = PermutationImportance(rfr_first_select).fit(X_val_first_select, y_val)

In [102]:
eli5.explain_weights_df(second_pmi, top=5, feature_names = first_select_cols.to_list())

| | feature | weight | std |
|---|---|---|---|
| 0 | host_total_listings_count | 0.117983 | 0.003129 |
| 1 | host_listings_count | 0.092155 | 0.008688 |
| 2 | host_verifications_['email', 'phone'] | 0.081905 | 0.011873 |
| 3 | id | 0.068955 | 0.001236 |
| 4 | property_type_other | 0.06428 | 0.000368 |


So we can see that `host_total_listings_count` now has a weight near .12, up from its previous score of .07.  On the other side, we also have features that no longer make our 0.01 weight cutoff.

In [103]:
eli5.explain_weights_df(second_pmi, feature_names = first_select_cols.to_list()).tail(5)

| | feature | weight | std |
|---|---|---|---|
| 33 | last_reviewDayofyear | 0.002364 | 0.001527 |
| 34 | host_about_is_na | 0.001856 | 0.00032 |
| 35 | host_response_time_other | 0.001167 | 0.000646 |
| 36 | host_response_rate_is_na | 0.000849 | 0.001007 |
| 37 | instant_bookable | 3.4e-05 | 3.7e-05 |
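
The shift in weights between rounds is what we would expect if collinear features share importance: when a near-copy of a feature is present, shuffling one of them hurts the model less because the other still carries the signal. The sketch below illustrates this on synthetic data with an exact duplicate column. It uses `sklearn.inspection.permutation_importance` in place of `eli5`, and the data and the helper name `importance_of_first_column` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data (an assumption): one informative column and one noise column.
rng = np.random.default_rng(1)
n = 600
signal = rng.normal(size=n)
noise = rng.normal(size=n)
y = 3.0 * signal + rng.normal(scale=0.1, size=n)

X_single = np.column_stack([signal, noise])
X_duplicated = np.column_stack([signal, signal, noise])  # column 1 copies column 0

train, val = slice(0, 400), slice(400, n)

def importance_of_first_column(X):
    """Hypothetical helper: permutation importance of column 0 on the validation rows."""
    forest = RandomForestRegressor(n_estimators=40, random_state=0)
    forest.fit(X[train], y[train])
    result = permutation_importance(
        forest, X[val], y[val], n_repeats=5, random_state=0
    )
    return result.importances_mean[0]

alone = importance_of_first_column(X_single)
with_copy = importance_of_first_column(X_duplicated)
print(alone, with_copy)  # the duplicate dilutes the importance of column 0
```

This is also why a single round of selection can understate a feature: its weight may rise once its collinear partners are gone, so it is worth recomputing importances after each round.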


So let's select from our model again.

In [106]:
second_select = SelectFromModel(second_pmi, threshold=0.01, prefit=True)
X_train_second_select = second_select.transform(X_train_first_select)
X_val_second_select = second_select.transform(X_val_first_select)

rfr_second_select = RandomForestRegressor(n_estimators=40, max_features='log2')
rfr_second_select.fit(X_train_second_select, y_train)
rfr_second_select.score(X_val_second_select, y_val)

0.8139796316812621

And again.

In [112]:
third_pmi = PermutationImportance(rfr_second_select).fit(X_val_second_select, y_val)
third_select = SelectFromModel(third_pmi, threshold=0.01, prefit=True)

X_train_third_select = third_select.transform(X_train_second_select)
X_val_third_select = third_select.transform(X_val_second_select)

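# Use a new name so we do not overwrite the original model stored in rfr.
rfr_third_select = RandomForestRegressor(n_estimators=40, max_features='log2')
rfr_third_select.fit(X_train_third_select, y_train)
rfr_third_select.score(X_val_third_select, y_val)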

0.8407392336302191
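
The three rounds above repeat the same steps, so they could be wrapped in a loop. The following is a sketch only: `select_in_rounds` is a hypothetical helper, it uses `sklearn.inspection.permutation_importance` rather than `eli5`, and it runs on synthetic NumPy arrays rather than our Airbnb dataframes.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

def select_in_rounds(X_train, y_train, X_val, y_val, threshold=0.01, max_rounds=5):
    """Hypothetical helper: retrain, recompute importances, and drop weak features."""
    kept = np.arange(X_train.shape[1])   # column indices still in play
    history = []                         # (n_features, validation R^2) per trained model
    for _ in range(max_rounds):
        forest = RandomForestRegressor(
            n_estimators=40, max_features="log2", random_state=0
        )
        forest.fit(X_train[:, kept], y_train)
        history.append((len(kept), forest.score(X_val[:, kept], y_val)))

        result = permutation_importance(
            forest, X_val[:, kept], y_val, n_repeats=5, random_state=0
        )
        mask = result.importances_mean >= threshold
        # Stop when nothing would be dropped, or when everything would be.
        if mask.all() or not mask.any():
            break
        # If max_rounds runs out here, the final subset has not been rescored.
        kept = kept[mask]
    return kept, history

# Synthetic data (an assumption): columns 0-2 are informative, the rest are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = 5.0 * X[:, 0] + 4.0 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(scale=0.1, size=600)

kept, history = select_in_rounds(X[:400], y[:400], X[400:], y[400:])
print(kept)
print(history)
```

Keeping the score of every round in `history` makes it easy to see whether a round of pruning helped or hurt, as it did between our second and third selections.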

In [116]:
second_select_cols = first_select_cols[second_select.get_support()]
third_select_cols = second_select_cols[third_select.get_support()]
third_select_cols.shape

(20,)
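
Each `get_support` mask is relative to the columns that went into that round, not to the original dataframe, which is why the column names are filtered one round at a time. A small sketch with made-up column names and masks (all assumptions) shows the chaining, and how to recover a single mask against the original columns:

```python
import numpy as np
import pandas as pd

# Made-up column names and masks, for illustration only.
all_cols = pd.Index(["a", "b", "c", "d", "e"])

first_mask = np.array([True, False, True, True, False])  # one entry per original column
first_cols = all_cols[first_mask]                         # a, c, d

second_mask = np.array([True, True, False])               # one entry per first_cols entry
second_cols = first_cols[second_mask]                     # a, c

# A single mask expressed against the original columns.
final_mask = all_cols.isin(second_cols)
dropped_cols = all_cols[~final_mask]

print(list(second_cols), list(dropped_cols))
```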

In [122]:
eli5.explain_weights_df(third_pmi, feature_names = second_select_cols.to_list())[:10]

| | feature | weight | std |
|---|---|---|---|
| 0 | availability_365 | 0.136897 | 0.005379 |
| 1 | host_listings_count | 0.119702 | 0.007686 |
| 2 | host_total_listings_count | 0.091553 | 0.003439 |
| 3 | calculated_host_listings_count | 0.08483 | 0.007077 |
| 4 | host_verifications_['email', 'phone'] | 0.082864 | 0.004521 |
| 5 | host_sinceElapsed | 0.082185 | 0.007138 |
| 6 | property_type_other | 0.07018 | 0.001945 |
| 7 | summary_is_na | 0.068375 | 0.004069 |
| 8 | availability_90 | 0.059145 | 0.002999 |
| 9 | host_id | 0.04976 | 0.004265 |


So from here, we can begin to identify the most important features.

In [124]:
selected_X_train = pd.DataFrame(X_train_third_select, columns=third_select_cols)
selected_X_train.loc[:, 'price'] = y_train

selected_X_val = pd.DataFrame(X_val_third_select, columns=third_select_cols)
selected_X_val.loc[:, 'price'] = y_val
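
One thing to keep in mind with the cell above: assigning a pandas Series as a new column aligns on index labels. A dataframe built from a NumPy array gets a fresh 0..n-1 index, so the assignment only lines up if the target Series has that same default index, as it should after loading from a feather file. The toy example below (made-up values, an assumption) shows what happens when the indexes differ and one way to sidestep it:

```python
import numpy as np
import pandas as pd

# Made-up values: the target has a non-default index (7, 8, 9).
features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
target = pd.Series([10.0, 20.0, 30.0], index=[7, 8, 9], name="price")

# Label alignment: rows 0, 1, 2 find no matching labels in the target.
misaligned = pd.DataFrame(features, columns=["a", "b"])
misaligned.loc[:, "price"] = target

# Positional assignment: convert to a NumPy array first.
aligned = pd.DataFrame(features, columns=["a", "b"])
aligned["price"] = target.to_numpy()

print(misaligned["price"].isna().all(), aligned["price"].tolist())
```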

In [125]:
selected_X_train.to_feather('./selected_train.feather')

In [126]:
selected_X_val.to_feather('./selected_val.feather')

### Summary 

In this lesson, we went from feature importance to feature selection.  We used `sklearn`'s `SelectFromModel` transformer combined with the `PermutationImportance` transformer from `eli5` to select the features.  We saw that with each round of selection the feature importances change, as collinearity between features is reduced.  We also saw that we can drastically reduce the number of features, here to twenty, and maintain, if not increase, the validation score of the model.