## Feature Selection
```In this exercise you will experience a very underrated aspect of the data scientist's job: feature selection.
We will look at the most common algorithms, examine them, and develop new feature selection algorithms.
The data you will be working with is pretty hard: you will soon find out why :)```

```Note: When a question is asked (you can identify a question by its '?'), answer it in a separate cell and be prepared to discuss it with your tutor.```

```~Ittai Haran```

In [1]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
%matplotlib inline

```Load data.csv. The data consists of soil measurements taken in Africa and comes from Kaggle. SOC is the target.
```

In [None]:
df = pd.read_csv('resources/data.csv')

```Train a simple Random Forest model on the data. Use your knowledge to make it as good a model as you can. Don't forget to split the data to train and test segments.```

In [None]:
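from sklearn.model_selection import train_test_split

# Separate the target from the features, then hold out a test segment.
# Assumed split (the original cell did not show it): 80/20, fixed seed.
target = df[['SOC']]
df = df.drop(columns='SOC')
df_train, df_test, target_train, target_test = train_test_split(
    df, target, test_size=0.2, random_state=0)

# Sweep max_depth and record the train/test error for each value
depths = range(2, 12)
error_train = []
error_test = []
for depth in depths:
    clf = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_depth=depth)
    clf.fit(df_train, target_train['SOC'])

    error_train.append(mean_squared_error(target_train['SOC'], clf.predict(df_train)))
    error_test.append(mean_squared_error(target_test['SOC'], clf.predict(df_test)))

plt.plot(depths, error_train, label='train')
plt.plot(depths, error_test, label='test')
plt.xlabel('max_depth')
plt.ylabel('MSE')
plt.legend()
plt.show()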

In [None]:
# Refit at the depth chosen from the sweep above
clf = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_depth=7)
clf.fit(df_train, target_train['SOC'])
print(mean_squared_error(target_train['SOC'], clf.predict(df_train)))
print(mean_squared_error(target_test['SOC'], clf.predict(df_test)))

```Use model.feature_importances_ to select the 100 most important features. Use these features to train your model. Did you get better results on the train segment? What are the results on the test segment?
How does model.feature_importances_ work? Some features have 0 importance. Why?```

In [122]:
# Rank features by importance (highest first) and keep the top 100
ranked = sorted(zip(df.columns, clf.feature_importances_), key=lambda x: -x[1])
features = [name for name, _ in ranked[:100]]
clf_selected = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_depth=7)
clf_selected.fit(df_train[features], target_train['SOC'])
print(mean_squared_error(target_train['SOC'], clf_selected.predict(df_train[features])))
print(mean_squared_error(target_test['SOC'], clf_selected.predict(df_test[features])))
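
As a hint for the zero-importance question: in scikit-learn, a forest's feature_importances_ is impurity-based and normalized to sum to 1, so a feature that no tree ever splits on gets exactly 0. The sketch below shows this on synthetic data, not on the soil data; the column names and the target formula are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic data (hypothetical, not the soil data set):
# column 0 drives the target, column 1 is pure noise, column 2 is constant.
informative = rng.normal(size=n_samples)
noise = rng.normal(size=n_samples)
constant = np.zeros(n_samples)
X = np.column_stack([informative, noise, constant])
y = 3 * informative + 0.1 * rng.normal(size=n_samples)

forest = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X, y)

# A constant column offers no split, so its importance stays at zero
importances = forest.feature_importances_
for name, value in zip(["informative", "noise", "constant"], importances):
    print(f"{name:12s} {value:.4f}")
```

A constant column is only the most extreme case; a feature that is merely redundant with a stronger one can also end up unused, especially in shallow trees.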

```Well, that didn't work so well. Let's try a different approach. Now select the 100 features that are most correlated with the target. What results did you get now? So far you have selected two groups of features. Are they similar to one another? Why or why not?```

In [148]:
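# Correlation matrix of all features plus the target; the last row is each column's correlation with SOC
corrs = np.corrcoef(df_train.transpose(), target_train['SOC'])
# Sort by signed correlation, highest first, and drop the first entry (SOC with itself);
# rank by abs(x[1]) instead if strongly negative correlations should count too
corr_features = sorted(zip(list(df.columns) + ['SOC'], corrs[-1]), key=lambda x: -x[1])[1:]
corr_features = [x[0] for x in corr_features[:100]]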

In [149]:
clf_selected = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_depth=7)
clf_selected.fit(df_train[corr_features], target_train['SOC'])
print(mean_squared_error(target_train['SOC'], clf_selected.predict(df_train[corr_features])))
print(mean_squared_error(target_test['SOC'], clf_selected.predict(df_test[corr_features])))

0.200874887434
0.401442933818
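
To compare the two groups of selected features, one simple option is to count the names they share and compute their Jaccard similarity. This is only a sketch: the helper name `overlap` and the feature names in the example are hypothetical.

```python
def overlap(features_a, features_b):
    """Return (shared features, Jaccard similarity) for two feature lists."""
    set_a, set_b = set(features_a), set(features_b)
    shared = set_a & set_b
    union = set_a | set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard


# Hypothetical feature names, for illustration only
by_importance = ["f1", "f2", "f3", "f4"]
by_correlation = ["f3", "f4", "f5", "f6"]
shared, jaccard = overlap(by_importance, by_correlation)
print(shared, round(jaccard, 3))
```

In the notebook you would pass the two lists you built, for example `overlap(features, corr_features)`.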


```Now cluster the features by their correlation with one another. Make sure you end up with at least ~30 clusters. From each cluster, pick the feature most correlated with the target. What are your results now? What clustering algorithm did you use?```

In [178]:
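from sklearn.cluster import DBSCAN
# Distance = 1 - |correlation|, clipped at 0 to absorb floating-point error
distances = np.clip(1 - np.abs(corrs), 0, None)
clusters = DBSCAN(eps=0.001, metric="precomputed").fit_predict(distances)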

In [239]:
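def choose(pairs):
    # pairs holds (correlation with SOC, feature name); keep the name with the largest |correlation|
    return max(pairs, key=lambda pair: abs(pair[0]))[1]

# One row per column: (cluster label, (correlation with SOC, name)); .iloc[:-1] drops the SOC row.
# Note that DBSCAN's noise label (-1) is treated here as one more group.
choose_features = pd.DataFrame(list(zip(clusters, zip(corrs[-1], list(df.columns) + ['SOC'])))).iloc[:-1]
features_selected_corr = choose_features.groupby(0)[1].agg(choose).tolist()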

In [248]:
clf_selected = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_depth=7)
clf_selected.fit(df_train[features_selected_corr], target_train['SOC'])
print(mean_squared_error(target_train['SOC'], clf_selected.predict(df_train[features_selected_corr])))
print(mean_squared_error(target_test['SOC'], clf_selected.predict(df_test[features_selected_corr])))

0.0594274307251
0.174055082353
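
DBSCAN does not take a target number of clusters, and it labels noise points as -1, so it is worth checking how many clusters you actually got before trusting the "at least ~30" requirement. A minimal sketch, with a hypothetical helper name and made-up labels:

```python
import numpy as np


def summarize_labels(labels):
    """Count clusters and noise points in a DBSCAN-style label array."""
    labels = np.asarray(labels)
    n_noise = int(np.sum(labels == -1))
    # Noise (-1) is not a real cluster, so exclude it from the count
    n_clusters = len(set(labels.tolist()) - {-1})
    return n_clusters, n_noise


# Hypothetical labels, for illustration only
example_labels = [0, 0, 1, -1, 2, 2, -1, 1]
n_clusters, n_noise = summarize_labels(example_labels)
print(n_clusters, n_noise)
```

In the notebook you would call it on the array returned by `fit_predict`; if the count is too low, `eps` and `min_samples` are the knobs to adjust.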


```Before we continue, read the documentation of sklearn.feature_selection. What other feature selection algorithms did you find there? Try one of them on your data.```
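
As one example of what the module offers, here is a sketch using SelectKBest with the univariate f_regression score. It runs on synthetic data (the column layout and the target formula are assumptions made for illustration), so the correct answer is known in advance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n_samples = 300

# Synthetic data (hypothetical): only columns 0 and 3 influence the target
X = rng.normal(size=(n_samples, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + 0.1 * rng.normal(size=n_samples)

# Keep the 2 features with the highest univariate F-score
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

selected_columns = selector.get_support(indices=True)
print(selected_columns, X_selected.shape)
```

On the soil data you would fit the selector on the train segment only and then use `get_support` to recover the chosen column names.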

```You are about to be very surprised. Select 100 features at random and evaluate a model trained on them. Repeat this 100 times. What is the best set of features you got? How does it compare to the other subsets you have tried so far? Can you explain it?```

In [261]:
features_list = []
for i in range(100):
    # Draw 30 distinct features (note: the exercise text asks for 100)
    features_random = list(np.random.choice(df.columns, size=30, replace=False))
    clf_selected = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_depth=7)
    clf_selected.fit(df_train[features_random], target_train['SOC'])
    # Compute each error once and reuse it for printing and bookkeeping
    mse_train = mean_squared_error(target_train['SOC'], clf_selected.predict(df_train[features_random]))
    mse_test = mean_squared_error(target_test['SOC'], clf_selected.predict(df_test[features_random]))
    print(mse_train)
    print(mse_test)
    print('-----------------------------------')
    features_list.append((features_random, mse_train, mse_test))

0.0589292326065
0.220988208427
-----------------------------------
0.130027641885
0.363043766131
-----------------------------------
0.0626344792349
0.205456919831
-----------------------------------
0.0677479410859
0.21747579557
-----------------------------------
0.100113620463
0.322853339086
-----------------------------------
0.0469965640281
0.193533815239
-----------------------------------
0.0649124212461
0.322598669285
-----------------------------------
0.0618093761526
0.279370533775
-----------------------------------
0.0796151484123
0.185439996149
-----------------------------------
0.0964645818662
0.283625330334
-----------------------------------
0.0748653445205
0.16183771335
-----------------------------------
0.111090943036
0.307932037551
-----------------------------------
0.0710708284713
0.281532717236
-----------------------------------
0.0682780111316
0.174660986238
-----------------------------------
0.0511028452941
0.233219539664
-----------------------------------


In [273]:
# Pick the random subset with the lowest test error, then refit on it
features_rand_selected = min(features_list, key=lambda x: x[2])[0]
clf_selected = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_depth=7)
clf_selected.fit(df_train[features_rand_selected], target_train['SOC'])
print(mean_squared_error(target_train['SOC'], clf_selected.predict(df_train[features_rand_selected])))
print(mean_squared_error(target_test['SOC'], clf_selected.predict(df_test[features_rand_selected])))

0.0552761666787
0.151489479898
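
One thing to keep in mind when explaining this result: the best subset was picked by its test error, and the minimum of many noisy measurements is optimistic even when nothing is really better. The sketch below simulates that effect with made-up numbers (the true error of 0.25 and the noise scale of 0.05 are assumptions, not values from the soil data); it is one possible ingredient of an explanation, not the whole answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 100 feature subsets that are all equally good,
# each with a true test error of 0.25, measured with some noise
# because the test segment is finite.
true_error = 0.25
measured = true_error + rng.normal(scale=0.05, size=100)

# Picking the winner by measured error reports the luckiest draw
best_measured = measured.min()
print(f"mean measured error: {measured.mean():.3f}")
print(f"best measured error: {best_measured:.3f}")
```

A common way to guard against this is to choose the subset on a separate validation segment and report the test error only once, for the final choice.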


```Now it's your turn to enter the office's feature selection competition. Invent your own feature selection algorithm. Make it a good one (because if it isn't, your tutor might ask you to implement one of the other algorithms proposed by the office's members, and that might be awful ;) ). Before you act, talk about your idea with your tutor.```