# Feature Selection Using Models Learned Thus Far...

## Feature Selection Using SelectFromModel

SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ 
attribute after fitting. 

A feature is considered unimportant and removed if the absolute value of its coef_ or feature_importances_ entry is below 
the provided threshold parameter. 

Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. 

Available heuristics are "mean", "median", and float multiples of these, such as "0.1*mean".
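To make the threshold options concrete, here is a minimal sketch that tries each kind of threshold on synthetic data. The data set, the names `X_demo`, `y_demo`, and `kept_counts`, and the forest settings are assumptions made for illustration; they are not part of the examples below.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (an assumption, not the housing data used below)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 n_informative=3, random_state=0)

kept_counts = {}
for thresh in ["mean", "median", "0.1*mean", 0.05]:
    sfm_demo = SelectFromModel(
        RandomForestRegressor(n_estimators=50, random_state=0),
        threshold=thresh)
    sfm_demo.fit(X_demo, y_demo)
    # threshold_ is the numeric cutoff the string was resolved to
    kept_counts[thresh] = int(sfm_demo.get_support().sum())
    print(thresh, "-> cutoff", round(float(sfm_demo.threshold_), 4),
          "keeps", kept_counts[thresh], "features")
```

A lower cutoff such as "0.1*mean" can never keep fewer features than "mean", because every feature that clears the higher cutoff also clears the lower one.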

### Example 1: Fit a Random Forest model and use SelectFromModel to keep important features

In [1]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np


data_url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
data = pd.read_csv(data_url)
target=data["medv"]
data=data.drop(['medv'], axis=1)

Xtrain, Xtest, ytrain, ytest = train_test_split(data, target,
                                                random_state=0)

forest = RandomForestRegressor(n_estimators=200)
formodel = forest.fit(Xtrain, ytrain)


print(formodel.feature_importances_)


[0.03786884 0.00150715 0.00784629 0.00143109 0.01752525 0.39519278
 0.01279564 0.0435756  0.00358423 0.01728389 0.02214276 0.01009055
 0.42915594]


In [None]:
# Keep only features whose importance is at least 0.25
sfm = SelectFromModel(formodel, threshold=.25)
sfm.fit(Xtrain, ytrain) # refits a clone of the forest, then applies the threshold
Xtrain_new = sfm.transform(Xtrain) # reduced data to pass to a new model

print(Xtrain_new[0:5,:]) # only two variables in X now

print(Xtrain_new.shape) # compare to the original data with 13 variables

[[ 5.605 18.46 ]
 [ 5.927  9.22 ]
 [ 7.267  6.05 ]
 [ 6.471 17.12 ]
 [ 6.782 25.79 ]]
(379, 2)
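The transformed array no longer carries column names, so it can help to recover which features survived. The sketch below uses get_support() for that on a small synthetic DataFrame; the column names `feat_0` to `feat_5` and the variables `X_toy`, `y_toy`, and `selected_names` are hypothetical and used only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Hypothetical toy data with named columns (not the housing data above)
X_arr, y_toy = make_regression(n_samples=150, n_features=6,
                               n_informative=2, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

sfm_toy = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0),
    threshold="mean").fit(X_toy, y_toy)

# get_support() returns a boolean mask over the original columns
mask = sfm_toy.get_support()
selected_names = list(X_toy.columns[mask])

# Rebuild a labeled DataFrame from the reduced data
X_toy_new = pd.DataFrame(sfm_toy.transform(X_toy),
                         columns=selected_names, index=X_toy.index)
print(selected_names)
print(X_toy_new.shape)
```

The same pattern should apply to the housing data above, for example data.columns[sfm.get_support()].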


### Example 2: Fit a Lasso model and use SelectFromModel to keep important features

In [None]:
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

lassomodel = Lasso(alpha=10).fit(Xtrain, ytrain)
model = SelectFromModel(lassomodel, prefit=True) # prefit=True reuses the already fitted lasso instead of refitting it;
                                                 # for lasso the default threshold (1e-5) keeps the features with nonzero coefficients
    
X_new = model.transform(Xtrain) # transform data to insert into new model

print(lassomodel.coef_)
print(X_new.shape) #down to four variables from 13



[-0.          0.03268741 -0.          0.          0.          0.
  0.         -0.          0.         -0.01155885 -0.          0.00679306
 -0.54971245]
(379, 4)
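How many features the lasso keeps depends on the penalty strength alpha. The following sketch, run on synthetic data, fits SelectFromModel with a few alpha values and checks the selection against the fitted coefficients. The data and the names `X_l1`, `y_l1`, and `results` are assumptions for illustration, and the alpha values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic stand-in data (an assumption, not the housing data above)
X_l1, y_l1 = make_regression(n_samples=200, n_features=10,
                             n_informative=4, noise=5.0, random_state=0)

results = {}
for alpha in [0.1, 1.0, 10.0]:
    # Without prefit, SelectFromModel fits its own copy of the lasso
    selector_l1 = SelectFromModel(Lasso(alpha=alpha)).fit(X_l1, y_l1)
    coefs = selector_l1.estimator_.coef_
    n_kept = int(selector_l1.get_support().sum())
    # Number of coefficients at or above the default lasso cutoff of 1e-5
    n_nonzero = int(np.sum(np.abs(coefs) >= 1e-5))
    results[alpha] = (n_kept, n_nonzero)
    print("alpha =", alpha, "keeps", n_kept, "of", X_l1.shape[1], "features")
```

Because the lasso penalty acts on coefficient size, the selection is sensitive to the scale of each feature; standardizing the inputs first is a common precaution.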




# Using Recursive Feature Elimination to Choose Model

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. 

First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. 

Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is reached.

Basic algorithm:
Start with the full model.  Rank the features by the magnitude of their coef_ or feature_importances_ values.  Drop the feature that ranks lowest (or the step lowest-ranked features).  Refit the model on the remaining n-1 features and repeat until only n_features_to_select features remain.  Note that RFE ranks features by their fitted weights; it does not re-evaluate prediction error on ytrain after each drop.
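The elimination loop can be sketched by hand to show what RFE does at each step. This is a simplified illustration on synthetic data, not scikit-learn's actual implementation; the function `manual_rfe` and the variables `X_syn`, `y_syn`, `kept`, and `sk_kept` are hypothetical names introduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the housing data)
X_syn, y_syn = make_regression(n_samples=200, n_features=8,
                               n_informative=3, noise=1.0, random_state=0)

def manual_rfe(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LinearRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_)))  # position within remaining
        del remaining[weakest]
    return remaining

kept = manual_rfe(X_syn, y_syn, n_keep=3)

# Compare with scikit-learn's RFE using the same estimator and step=1
rfe_check = RFE(LinearRegression(), n_features_to_select=3, step=1)
sk_kept = [int(i) for i in np.flatnonzero(rfe_check.fit(X_syn, y_syn).support_)]

print("manual loop keeps:", kept)
print("sklearn RFE keeps:", sk_kept)
```

With step=1 and a linear model, the hand-written loop and RFE should agree on which column indices are kept.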

In [None]:
#EXAMPLE:  RFE to find the 3 features that help the model predict best:

from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

estimator = LinearRegression().fit(Xtrain, ytrain) #model with all X variables


selector = RFE(estimator, n_features_to_select=3, step=1) # step tells RFE how many features to remove each time model features are evaluated

selector = selector.fit(Xtrain, ytrain) # fit RFE estimator.

print("Num Features: "+str(selector.n_features_))
print("Selected Features: "+str(selector.support_)) # True for the three selected features, False otherwise
print("Feature Ranking: "+str(selector.ranking_))  # selected features have rank 1; higher ranks were eliminated earlier

Num Features: 3
Selected Features: [False False False  True  True  True False False False False False False
 False]
Feature Ranking: [ 5  7 11  1  1  1 10  3  6  8  2  9  4]


In [None]:
# Transform X data for other use in this model or other models:

Xnew = selector.transform(Xtrain) # reduces X to the subset identified above
data.columns[selector.support_] # names of the three selected features

Index(['chas', 'nox', 'rm'], dtype='object')

## Can you transform the following dataset using different feature selection techniques?


In [None]:
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()

In [None]:
X=bc.data
y = bc.target
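One possible starting point, offered as a sketch rather than a model answer: since the target here is a class label, the example swaps in classifiers (RandomForestClassifier and LogisticRegression) for the regressors used earlier. The choices of threshold="median", five features for RFE, standardized inputs, and the names `sfm_bc`, `rfe_bc`, `X_sfm`, and `X_rfe` are all assumptions to experiment with.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()
X, y = bc.data, bc.target

# Technique 1: SelectFromModel with forest importances and a median cutoff
sfm_bc = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median").fit(X, y)
X_sfm = sfm_bc.transform(X)

# Technique 2: RFE with logistic regression on standardized features
X_scaled = StandardScaler().fit_transform(X)
rfe_bc = RFE(LogisticRegression(max_iter=1000),
             n_features_to_select=5).fit(X_scaled, y)
X_rfe = rfe_bc.transform(X_scaled)

print(X_sfm.shape, X_rfe.shape)
print(bc.feature_names[rfe_bc.support_])  # names of the features RFE kept
```

Comparing which features the two techniques keep, and how a downstream model performs on each reduced set, is a natural next step.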