# Machine Learning Pipeline - Feature Selection (Henry's Comments)

In this notebook, we pick up the transformed datasets that we saved in the previous notebook.

### Check out this link for different sklearn feature selection methods
- Removing features with low variance
- Recursive feature elimination

https://scikit-learn.org/stable/modules/feature_selection.html#select-from-model
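
As a rough illustration of the two methods listed above, the sketch below applies `VarianceThreshold` and `RFE` to a small synthetic dataset. The toy data, the constant column, and the choice of `LinearRegression` as the base estimator are assumptions made for this example only; they are not part of the pipeline in this notebook.

```python
import numpy as np
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy data (assumption): three random columns plus one constant column.
# Only the first two columns drive the target.
X_toy = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=50)

# 1) Remove features with zero variance (the constant column)
vt = VarianceThreshold(threshold=0.0).fit(X_toy)
vt_mask = vt.get_support()

# 2) Recursive feature elimination: repeatedly drop the weakest feature
#    until only n_features_to_select remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_toy, y_toy)
rfe_mask = rfe.get_support()

print("Kept by VarianceThreshold:", vt_mask)
print("Kept by RFE:", rfe_mask)
```

Both selectors expose the same `get_support()` mask used later in this notebook, so they can be swapped in with little change to the surrounding code.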

## Reproducibility: Setting the seed

To ensure reproducibility, both between runs of the same notebook and between the research and production environments, it is extremely important that we **set the seed** for every step that involves some element of randomness.
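
A minimal sketch of what setting the seed buys us, using `train_test_split` on made-up arrays (the arrays and the split size are assumptions for illustration). Two calls with the same `random_state` should return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data, only for demonstrating the seed
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```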

In [51]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# to build the models
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

## Import the intermediate data (as opposed to the raw data)

In [52]:
# load the train and test set with the engineered variables

# We built and saved these datasets in the previous lecture.
# If you haven't done so, go ahead and check the previous notebook
# to find out how to create these datasets

X_train = pd.read_csv('../data/xtrain_processed.csv')
X_test = pd.read_csv('../data/xtest_processed.csv')

y_train = pd.read_csv('../data/ytrain_processed.csv')
y_test = pd.read_csv('../data/ytest_processed.csv')

In [53]:
X_train.loc[:5,:]

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,SaleType,SaleCondition,LotFrontage_na,MasVnrArea_na,GarageYrBlt_na
0,0.75,0.75,0.461171,0.0,1.0,1.0,0.333333,1.0,1.0,0.0,0.0,0.863636,0.4,1.0,0.75,0.6,0.777778,0.5,0.014706,0.04918,0.0,0.0,1.0,1.0,0.333333,0.0,0.666667,0.5,1.0,0.666667,0.666667,0.666667,1.0,0.002835,0.0,0.0,0.673479,0.239935,1.0,1.0,1.0,1.0,0.55976,0.0,0.0,0.52325,0.0,0.0,0.666667,0.0,0.375,0.333333,0.666667,0.416667,1.0,0.0,0.0,0.75,0.018692,1.0,0.75,0.430183,0.5,0.5,1.0,0.116686,0.032907,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.545455,0.666667,0.75,0.0,0.0,0.0
1,0.75,0.75,0.456066,0.0,1.0,1.0,0.333333,0.333333,1.0,0.0,0.0,0.363636,0.4,1.0,0.75,0.6,0.444444,0.75,0.360294,0.04918,0.0,0.0,0.6,0.6,0.666667,0.03375,0.666667,0.5,0.5,0.333333,0.666667,0.0,0.8,0.142807,0.0,0.0,0.114724,0.17234,1.0,1.0,1.0,1.0,0.434539,0.0,0.0,0.406196,0.333333,0.0,0.333333,0.5,0.375,0.333333,0.666667,0.25,1.0,0.0,0.0,0.75,0.457944,0.5,0.25,0.220028,0.5,0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.75,1.0,0.0,0.636364,0.666667,0.75,0.0,0.0,0.0
2,0.916667,0.75,0.394699,0.0,1.0,1.0,0.0,0.333333,1.0,0.0,0.0,0.954545,0.4,1.0,1.0,0.6,0.888889,0.5,0.036765,0.098361,1.0,0.0,0.3,0.2,0.666667,0.2575,1.0,0.5,1.0,1.0,0.666667,0.0,1.0,0.080794,0.0,0.0,0.601951,0.286743,1.0,1.0,1.0,1.0,0.627205,0.0,0.0,0.586296,0.333333,0.0,0.666667,0.0,0.25,0.333333,1.0,0.333333,1.0,0.333333,0.8,0.75,0.046729,0.5,0.5,0.406206,0.5,0.5,1.0,0.228705,0.149909,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.090909,0.666667,0.75,0.0,0.0,0.0
3,0.75,0.75,0.445002,0.0,1.0,1.0,0.666667,0.666667,1.0,0.0,0.0,0.454545,0.4,1.0,0.75,0.6,0.666667,0.5,0.066176,0.163934,0.0,0.0,1.0,1.0,0.333333,0.0,0.666667,0.5,1.0,0.666667,0.666667,1.0,1.0,0.25567,0.0,0.0,0.018114,0.242553,1.0,1.0,1.0,1.0,0.56692,0.0,0.0,0.529943,0.333333,0.0,0.666667,0.0,0.375,0.333333,0.666667,0.25,1.0,0.333333,0.4,0.75,0.084112,0.5,0.5,0.362482,0.5,0.5,1.0,0.469078,0.045704,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.636364,0.666667,0.75,1.0,0.0,0.0
4,0.75,0.75,0.577658,0.0,1.0,1.0,0.333333,0.333333,1.0,0.0,0.0,0.363636,0.4,1.0,0.75,0.6,0.555556,0.5,0.323529,0.737705,0.0,0.0,0.6,0.7,0.666667,0.17,0.333333,0.5,0.5,0.333333,0.666667,0.0,0.6,0.086818,0.0,0.0,0.434278,0.233224,1.0,0.75,1.0,1.0,0.549026,0.0,0.0,0.513216,0.0,0.0,0.666667,0.0,0.375,0.333333,0.333333,0.416667,1.0,0.333333,0.8,0.75,0.411215,0.5,0.5,0.406206,0.5,0.5,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.545455,0.666667,0.75,0.0,0.0,0.0
5,0.916667,0.75,0.418208,0.0,1.0,1.0,0.0,0.333333,1.0,0.5,0.0,0.954545,0.4,1.0,1.0,0.6,0.888889,0.5,0.0,0.016393,1.0,0.0,0.3,0.2,0.666667,0.47875,1.0,0.5,1.0,1.0,0.666667,0.0,1.0,0.272856,0.0,0.0,0.075244,0.27856,1.0,1.0,1.0,1.0,0.616248,0.0,0.0,0.576053,0.333333,0.0,0.333333,0.5,0.125,0.333333,1.0,0.416667,1.0,0.333333,0.8,0.75,0.0,1.0,0.75,0.74189,0.5,0.5,1.0,0.0,0.131627,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.818182,1.0,1.0,0.0,0.0,0.0


In [54]:
y_train.head()

Unnamed: 0,SalePrice
0,12.21106
1,11.887931
2,12.675764
3,12.278393
4,12.103486


## Feature Selection

- Select a subset of the most predictive features. 
- We will use Lasso regression to do that.
- Feature selection is performed using an "embedded method" (Lasso regression), which performs feature selection and model training at the same time.

- Use SelectFromModel(): SelectFromModel is a meta-transformer that can be used alongside any estimator that, after fitting, assigns an importance to each feature through a specific attribute (such as coef_ or feature_importances_) or via an importance_getter callable.
- Features are considered unimportant and removed (set to False in the output of selector.get_support()) if their importance values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”. In combination with the threshold criterion, one can use the max_features parameter to set a limit on the number of features to select.
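
The threshold heuristics and `max_features` described above can be sketched as follows. The synthetic data from `make_regression` and the use of `Ridge` as the estimator are assumptions for this example; the notebook itself uses Lasso on the house price data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge

# Synthetic regression problem (assumption): 10 features, 4 of them informative
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=5.0, random_state=0
)

# String heuristic: keep features whose abs(coef) is at least the median
sel_median = SelectFromModel(Ridge(alpha=1.0), threshold="median").fit(X_syn, y_syn)

# Select on max_features only: threshold=-np.inf disables the threshold,
# so the 3 features with the largest abs(coef) are kept
sel_top3 = SelectFromModel(
    Ridge(alpha=1.0), threshold=-np.inf, max_features=3
).fit(X_syn, y_syn)

n_median = int(sel_median.get_support().sum())
n_top3 = int(sel_top3.get_support().sum())
print(f"threshold='median' keeps {n_median} features")
print(f"max_features=3 keeps {n_top3} features")
```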

In [55]:
# We do the model fitting and the feature selection in one step.
# First, we specify the Lasso regression model and choose a suitable alpha (aka lambda).
# RECALL: By increasing alpha (the regularisation hyperparameter), we reduce the flexibility
# of the regression model, and thus reduce its variance at the cost of higher bias.

# We use SelectFromModel (a feature selection meta-transformer) from sklearn.
# We set random_state to keep the pipeline reproducible. Note that Lasso only uses it
# when selection='random'; with the default selection='cyclic' the fit is deterministic.


# SelectFromModel:

# https://stackoverflow.com/questions/64581307/how-to-properly-do-feature-selection-with-selectfrommodel-from-scikit-learn
# https://scikit-learn.org/stable/modules/feature_selection.html#select-from-model

# Estimator (model) that is used for feature selection
est = Lasso(alpha=0.001,random_state=0)

# SelectFromModel: select "important" variables based on the magnitude of the estimator's coefficients.
# Important variables are those with abs(coef) >= threshold


# Initialise the selector
selector = SelectFromModel(estimator=est)  # 'threshold' defaults to None
# Fit the selector on the train set
selector = selector.fit(X_train, y_train)

In [56]:
# Look at the features that were selected (those that were assigned True)
selector.get_support()

array([ True,  True,  True, False, False, False,  True,  True, False,
        True, False,  True, False, False, False, False,  True,  True,
       False,  True,  True, False,  True, False, False, False,  True,
       False,  True,  True, False,  True,  True, False, False, False,
       False, False, False,  True,  True, False,  True,  True, False,
        True,  True, False, False,  True, False, False,  True,  True,
        True,  True,  True, False, False,  True,  True,  True, False,
       False,  True,  True, False, False, False,  True, False, False,
       False, False, False, False, False,  True, False, False, False])

In [57]:
print(f"In total, there are {selector.get_support().sum()} features that are selected.")

In total, there are 36 features that are selected.


In [58]:
# Print the list of selected features
selected_feats = X_train.columns[selector.get_support()]
selected_feats

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotShape', 'LandContour',
       'LotConfig', 'Neighborhood', 'OverallQual', 'OverallCond',
       'YearRemodAdd', 'RoofStyle', 'Exterior1st', 'ExterQual', 'Foundation',
       'BsmtQual', 'BsmtExposure', 'BsmtFinType1', 'HeatingQC', 'CentralAir',
       '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'HalfBath',
       'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces',
       'FireplaceQu', 'GarageFinish', 'GarageCars', 'GarageArea', 'PavedDrive',
       'WoodDeckSF', 'ScreenPorch', 'SaleCondition'],
      dtype='object')

In [59]:
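# The threshold on feature importance, measured by abs(coefficient), that the selector actually used.
# We left threshold=None, and for a Lasso estimator sklearn then falls back to 1e-5.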
selector.threshold_

1e-05
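
Where the `1e-05` comes from: when `threshold=None`, scikit-learn picks the default based on the estimator. A small sketch on synthetic data (an assumption for illustration) compares a Lasso-based selector with a Ridge-based one, for which the default falls back to the mean of the absolute coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption), unrelated to the house price dataset
X_syn, y_syn = make_regression(
    n_samples=100, n_features=8, n_informative=3, noise=1.0, random_state=0
)

# threshold is left at its default (None) in both selectors
sel_lasso = SelectFromModel(Lasso(alpha=0.1)).fit(X_syn, y_syn)
sel_ridge = SelectFromModel(Ridge(alpha=1.0)).fit(X_syn, y_syn)

# For Ridge, the default threshold is the mean of abs(coef_)
ridge_mean = np.abs(sel_ridge.estimator_.coef_).mean()

print("Lasso default threshold:", sel_lasso.threshold_)
print("Ridge default threshold equals mean(abs(coef)):",
      np.isclose(sel_ridge.threshold_, ridge_mean))
```

With an L1-penalised model the tiny threshold effectively means "keep every feature whose coefficient was not shrunk to zero", which matches the counts reported later in this notebook.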

In [60]:
# The coefficients of the Lasso regression model (the estimator/underlying model of the selector)
selector.estimator_.coef_

array([ 0.03962656,  0.10524603,  0.00491007,  0.        ,  0.        ,
        0.        ,  0.02031931,  0.00886523,  0.        ,  0.02443736,
        0.        ,  0.24923169,  0.        ,  0.        , -0.        ,
       -0.        ,  0.46945041,  0.28749155, -0.        , -0.01991692,
        0.014472  , -0.        ,  0.02216444,  0.        ,  0.        ,
        0.        ,  0.0190256 ,  0.        ,  0.03124918,  0.07560136,
        0.        ,  0.05338541,  0.0586981 ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.04266624,
        0.06646204, -0.        ,  0.38326678,  0.02840298, -0.        ,
        0.82389888,  0.09678501,  0.        ,  0.        ,  0.03540975,
        0.        , -0.        ,  0.08241905,  0.0466849 ,  0.06404286,
        0.01865078,  0.05455398,  0.        ,  0.        ,  0.0128951 ,
        0.18728738,  0.00628161,  0.        ,  0.        ,  0.01645242,
        0.04327071,  0.        ,  0.        ,  0.        ,  0.03

In [61]:
# let's print some stats
print(f'The total number of features: {X_train.shape[1]}')
print(f'The number of selected features: {len(selected_feats)}')
print(f'The number of features with coefficients shrunk to zero: {np.sum(selector.estimator_.coef_==0)}')

The total number of features: 81
The number of selected features: 36
The number of features with coefficients shrunk to zero: 45
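
The effect of alpha recalled in the comments of the Lasso cell can be checked with a short experiment. This is a sketch on synthetic data with arbitrarily chosen alpha values (both are assumptions); the exact counts will differ on the house price data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 20 features, 5 of them informative
X_syn, y_syn = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Count the non-zero coefficients for a weak, a moderate and an extreme penalty
nonzero_counts = {}
for alpha in [0.01, 1.0, 1e6]:
    model = Lasso(alpha=alpha, random_state=0).fit(X_syn, y_syn)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

A very large alpha shrinks every coefficient to zero, so no feature would be selected, while a small alpha keeps most of them. The alpha used in this notebook (0.001) sits between these extremes for the scaled house price features.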


In [62]:
# Preview the selected features as a Series before writing them to a csv file
pd.Series(selected_feats)

0        MSSubClass
1          MSZoning
2       LotFrontage
3          LotShape
4       LandContour
5         LotConfig
6      Neighborhood
7       OverallQual
8       OverallCond
9      YearRemodAdd
10        RoofStyle
11      Exterior1st
12        ExterQual
13       Foundation
14         BsmtQual
15     BsmtExposure
16     BsmtFinType1
17        HeatingQC
18       CentralAir
19         1stFlrSF
20         2ndFlrSF
21        GrLivArea
22     BsmtFullBath
23         HalfBath
24      KitchenQual
25     TotRmsAbvGrd
26       Functional
27       Fireplaces
28      FireplaceQu
29     GarageFinish
30       GarageCars
31       GarageArea
32       PavedDrive
33       WoodDeckSF
34      ScreenPorch
35    SaleCondition
dtype: object

In [63]:
# Save the selected features to a csv file, without the index
pd.Series(selected_feats).to_csv('../outputs/selected_features.csv', index=False)
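
A later notebook would typically read this file back and use it to subset the datasets. The sketch below shows one way to do that round trip; the temporary directory, the three feature names, and the tiny DataFrame are placeholders chosen for this example, not the notebook's real outputs.

```python
import os
import tempfile

import pandas as pd

# Placeholder list standing in for selected_feats (assumption)
selected = ["OverallQual", "GrLivArea", "GarageCars"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "selected_features.csv")

    # Same call pattern as above: an unnamed Series written without the index
    pd.Series(selected).to_csv(path, index=False)

    # The file has a single column with header "0"; take it positionally
    reloaded = pd.read_csv(path).iloc[:, 0].tolist()

# Placeholder dataset with one extra, unselected column
df = pd.DataFrame({
    "OverallQual": [0.5, 0.7],
    "LotArea": [0.1, 0.2],
    "GrLivArea": [0.3, 0.4],
    "GarageCars": [0.5, 0.75],
})

# Keep only the selected features, in the saved order
df_reduced = df[reloaded]
print(df_reduced.columns.tolist())
```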

# Additional Resources

- [Feature Selection for Machine Learning](https://www.udemy.com/course/feature-selection-for-machine-learning/?referralCode=186501DF5D93F48C4F71) - Online Course
- [Feature Selection for Machine Learning: A comprehensive Overview](https://trainindata.medium.com/feature-selection-for-machine-learning-a-comprehensive-overview-bd571db5dd2d) - Article
- https://scikit-learn.org/stable/modules/feature_selection.html#select-from-model

## A small example
- https://stackoverflow.com/questions/64581307/how-to-properly-do-feature-selection-with-selectfrommodel-from-scikit-learn

In [64]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]

In [65]:
selector = SelectFromModel(estimator=LogisticRegression(), threshold="1.25*mean").fit(X, y)

In [66]:
print(selector.get_support())

[False  True False]


In [67]:
print(selector.estimator_.coef_)

[[-0.3252302   0.83462377  0.49750423]]


In [68]:
print(selector.threshold_) # 0.6905659148858644

0.6905659148858644


In [69]:
print(abs(selector.estimator_.coef_).mean()*1.25) # 0.6905659148858644

0.6905659148858644


In [70]:
# Use .transform() to get a version of X that contains ONLY the selected features (a data subset)
X_reduced = selector.transform(X)

In [71]:
X_reduced

array([[-1.34],
       [-0.02],
       [-0.48],
       [ 1.48]])

The reduced dataset contains only the second column, since only the second feature was selected.
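
When the input is a DataFrame, `transform()` returns a plain NumPy array by default, so the column names are lost. One possible way to keep them is to index the DataFrame with the boolean mask from `get_support()`. The column names `f0`, `f1`, `f2` below are hypothetical labels added for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Same toy data as above, wrapped in a DataFrame with hypothetical column names
X_df = pd.DataFrame(
    [[0.87, -1.34, 0.31],
     [-2.79, -0.02, -0.85],
     [-1.34, -0.48, -2.55],
     [1.92, 1.48, 0.65]],
    columns=["f0", "f1", "f2"],
)
y = [0, 1, 0, 1]

selector = SelectFromModel(
    estimator=LogisticRegression(), threshold="1.25*mean"
).fit(X_df, y)

# Boolean mask over the columns: True means the feature is kept
mask = selector.get_support()

# Subset the DataFrame directly so the column names are preserved
X_reduced_df = X_df.loc[:, mask]

# transform() yields the same values, but without column labels
X_reduced_arr = np.asarray(selector.transform(X_df))

print(X_reduced_df.columns.tolist())
```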