# Feature Selection Using Models Learned Thus Far

## Option 1: Feature selection using SelectFromModel

`SelectFromModel` is a meta-transformer that can be used along with any estimator that has a `coef_` or `feature_importances_`
attribute after fitting.

If a feature's importance (the absolute value of its `coef_` entry, or its `feature_importances_` value) is below the provided `threshold` parameter, the feature is considered unimportant and is removed.

Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument.

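The available heuristics are `"mean"`, `"median"`, and float multiples of these such as `"0.1*mean"`. If no threshold is given, `"mean"` is used by default, except for estimators with an L1 penalty (such as `Lasso`), where the default is `1e-5`.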
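
To make the string thresholds concrete, here is a minimal sketch that fits the same forest under three different heuristics and reports how many features survive each one. The synthetic data from `make_regression` and the names `X_demo`, `y_demo`, `selector_demo`, and `kept_counts` are assumptions for illustration only; they are not part of the housing example used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (an assumption): 10 features, only 3 carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

kept_counts = {}
for thr in ["mean", "median", "0.1*mean"]:
    # A fixed random_state makes every forest identical, so only the threshold changes
    selector_demo = SelectFromModel(
        RandomForestRegressor(n_estimators=50, random_state=0), threshold=thr
    )
    selector_demo.fit(X_demo, y_demo)
    kept_counts[thr] = int(selector_demo.get_support().sum())
    # threshold_ holds the numeric cutoff that the string resolved to
    print(f"{thr!r}: cutoff={selector_demo.threshold_:.4f}, kept={kept_counts[thr]}")
```

A lower cutoff such as `"0.1*mean"` can only keep the same features or more than `"mean"` does, which is a quick sanity check when tuning the threshold.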

#### *Example: Fit a Random Forest model and use SelectFromModel to keep important features*

In [2]:
# Prepare the data
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

data_url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
data = pd.read_csv(data_url)
target=data["medv"]
data=data.drop(['medv'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=0)

In [4]:
# Train a Random Forest model
forest = RandomForestRegressor(n_estimators=200)
forest.fit(X_train, y_train)
print(forest.feature_importances_)

[0.03703163 0.00086692 0.00744187 0.00084376 0.01888859 0.40245067
 0.01258706 0.04033267 0.00499771 0.01792048 0.02161569 0.0104349
 0.42458805]


In [5]:
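# Use SelectFromModel with a minimum importance threshold of 0.25
# Note: fit() refits a clone of `forest`; with no random_state set, the refit
# importances may differ slightly from the values printed above
sfm = SelectFromModel(forest, threshold=0.25)
sfm.fit(X_train, y_train)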

# Transform data to select the features
X_train_new = sfm.transform(X_train)

print(X_train_new[0:5, :])  # only two variables in X now
print(X_train_new.shape)    # compare to original data with 13 variables
print(X_train.shape)

[[ 5.605 18.46 ]
 [ 5.927  9.22 ]
 [ 7.267  6.05 ]
 [ 6.471 17.12 ]
 [ 6.782 25.79 ]]
(379, 2)
(379, 13)


#### *Example: Fit a Lasso model and use SelectFromModel to keep important features*

In [13]:
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

las = Lasso(alpha=10).fit(X_train, y_train)
sfm = SelectFromModel(las)
sfm.fit(X_train, y_train)

# Transform data to select the features
X_train_new = sfm.transform(X_train)

print(las.coef_)
print(X_train_new.shape) # down from 13 variables to 4

[-0.          0.03268741 -0.          0.          0.          0.
  0.         -0.          0.         -0.01155885 -0.          0.00679306
 -0.54971245]
(379, 4)
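
Lasso works for feature selection because its L1 penalty pushes some coefficients to exactly zero, and the strength of that effect is controlled by `alpha`. The sketch below, on synthetic data (an assumption, as are the names `X_demo`, `y_demo`, `lasso_demo`, and `nonzero_counts`), counts the nonzero coefficients for a few `alpha` values chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data (an assumption): 10 features, only 3 carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

nonzero_counts = {}
for alpha in [0.1, 10, 1e6]:
    lasso_demo = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # Features whose coefficient is exactly zero are the ones SelectFromModel drops
    nonzero_counts[alpha] = int(np.count_nonzero(lasso_demo.coef_))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} nonzero coefficients")
```

With an extremely large `alpha` every coefficient is shrunk to zero, so in practice `alpha` is usually tuned (for example by cross-validation) rather than fixed by hand.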


---
## Option 2: Use Recursive Feature Elimination
Given an external estimator that assigns weights to features (e.g. the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features.

First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a `coef_` attribute or through a `feature_importances_` attribute. Then the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is reached.

Basic algorithm:
1. Fit the estimator on the full set of features.
2. Rank the features by importance, using the fitted `coef_` or `feature_importances_` values.
3. Drop the least important feature (or the `step` least important features).
4. Refit the estimator on the remaining features.
5. Repeat steps 2-4 until you reach the desired stopping criterion (e.g. target number of features).
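
The procedure that scikit-learn's `RFE` follows (fit, rank features by `coef_`, prune the weakest, refit) can be sketched by hand in a few lines. The helper `manual_rfe` below is hypothetical and written only for illustration, and the synthetic data is an assumption. It removes one feature per round, mirroring `step=1`, and the result is compared against `RFE` itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption): 8 features, 4 carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=4, noise=10.0, random_state=0
)

def manual_rfe(X, y, n_keep):
    """Hypothetical helper: drop the smallest-|coef_| feature until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LinearRegression().fit(X[:, remaining], y)
        # Position (within `remaining`) of the feature with the weakest coefficient
        weakest = int(np.argmin(np.abs(model.coef_)))
        del remaining[weakest]
    return remaining

manual_selected = manual_rfe(X_demo, y_demo, n_keep=3)

rfe = RFE(LinearRegression(), n_features_to_select=3, step=1).fit(X_demo, y_demo)
rfe_selected = [int(i) for i in np.flatnonzero(rfe.support_)]

print("manual:", sorted(manual_selected))
print("RFE:   ", sorted(rfe_selected))
```

Because the ranking relies on coefficient magnitudes, it is sensitive to feature scale; standardizing the features first is a common precaution with linear models.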

#### *Example: Use RFE to select the 5 top-ranked features*

In [24]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# step tells RFE how many features to remove each time the model's features are evaluated
selector = RFE(LinearRegression(), n_features_to_select=5, step=1)
selector.fit(X_train, y_train) # fit RFE estimator

print("Num Features: "+str(selector.n_features_))
print("Feature Ranking: "+str(selector.ranking_))  # ranking for features
print("Selected Features: "+str(list(data.columns[selector.support_ ]))) # five most important features

Num Features: 5
Feature Ranking: [3 5 9 1 1 1 8 1 4 6 1 7 2]
Selected Features: ['chas', 'nox', 'rm', 'dis', 'ptratio']


In [25]:
# Transform X data for other use in this model or other models

X_train_new = selector.transform(X_train) # reduces X to subset identified above
display(X_train_new)

array([[ 0.    ,  0.431 ,  5.605 ,  7.9549, 19.1   ],
       [ 0.    ,  0.453 ,  5.927 ,  6.932 , 19.7   ],
       [ 1.    ,  0.447 ,  7.267 ,  4.7872, 17.6   ],
       ...,
       [ 0.    ,  0.547 ,  6.021 ,  2.7474, 17.8   ],
       [ 0.    ,  0.448 ,  6.03  ,  5.6894, 17.9   ],
       [ 0.    ,  0.51  ,  5.572 ,  2.5961, 16.6   ]])
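
When the reduced features feed another model, the selector fitted on the training data should also be the one applied to the test data. One way to keep the two steps together is a `Pipeline`, sketched below under assumptions: the data is synthetic, and the step names `"select"` and `"model"` as well as the variable names are chosen only for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption): 13 features, 5 carry signal
X_demo, y_demo = make_regression(
    n_samples=300, n_features=13, n_informative=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

pipe = Pipeline([
    ("select", RFE(LinearRegression(), n_features_to_select=5, step=1)),
    ("model", LinearRegression()),
])
# Feature selection is learned from the training split only
pipe.fit(X_tr, y_tr)

# The fitted selector applies the same column subset to unseen data
X_te_new = pipe.named_steps["select"].transform(X_te)
print(X_te_new.shape)

# score() runs selection and prediction in one call (R^2 for regressors)
print(pipe.score(X_te, y_te))
```

Bundling the selector with the model also means cross-validation reruns the selection inside each fold, which avoids leaking information from held-out data.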

---
## Extra Practice: Can you transform the following dataset using different feature selection techniques?

In [26]:
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()

X = bc.data
y = bc.target