# Machine Learning Pipeline - Feature Selection

In this notebook, we pick up the transformed datasets that we saved in the previous notebook.

## Reproducibility: Setting the seed

To ensure reproducibility between runs of the same notebook, and between the research and production environments, it is extremely important that we **set the seed** for every step that involves an element of randomness.
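
As a minimal illustration of what setting the seed buys us, the sketch below draws random numbers twice with the same seed and once with a different one. The helper `noisy_sample` is hypothetical and not part of this pipeline; it only stands in for any step that involves randomness.

```python
import numpy as np


def noisy_sample(seed):
    # Hypothetical helper: draw 5 values from a freshly seeded generator.
    rng = np.random.default_rng(seed)
    return rng.normal(size=5)


first_run = noisy_sample(seed=0)
second_run = noisy_sample(seed=0)
other_seed = noisy_sample(seed=1)

# Same seed: the two runs produce identical values.
print(np.array_equal(first_run, second_run))
# Different seed: the values differ.
print(np.array_equal(first_run, other_seed))
```

The same idea applies to scikit-learn estimators, where the seed is passed through the `random_state` parameter.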

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt

# to build the models
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [2]:
# load the train and test set with the engineered variables

# we built and saved these datasets in the previous notebook.
# If you haven't done so, go ahead and check the previous notebook
# to find out how to create these datasets

X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')

X_train.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,population,median_income,ocean_proximity,income_bracket,rooms_per_household,bedrooms_per_household,population_per_household,median_income_per_household
0,-1.372821,0.894943,1.859559,0.211547,-0.434088,1.887371,2.032736,1.692556,1.080911,-0.106259,0.086694,-0.026318
1,0.657329,-0.780212,0.987199,-1.494151,-0.672195,-1.391316,-0.897604,-0.508495,-0.943396,-0.106259,0.086694,-0.026318
2,0.772055,-0.719382,1.225115,-0.125347,-0.366916,1.398167,0.079176,1.692556,1.080911,-0.106259,0.086694,-0.026318
3,-1.302988,0.815397,0.114838,0.2482,-0.282584,2.308039,2.032736,3.893607,1.51566,-0.106259,0.086694,-0.026318
4,1.166114,-1.065644,-0.916133,-0.418663,-0.212792,-1.089247,-0.897604,-0.508495,-0.943396,-0.106259,0.086694,-0.026318


In [4]:
# load the target (remember that the target is transformed with Yeo-Johnson Transformation)
y_train = pd.read_csv('ytrain.csv')
y_test = pd.read_csv('ytest.csv')

y_train.head()

Unnamed: 0,median_house_value
0,30.324818
1,23.846476
2,30.324818
3,30.324818
4,24.894438


### Feature Selection

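Let's go ahead and select a subset of the most predictive features. Lasso's `random_state` only takes effect when `selection='random'`; with the default `selection='cyclic'` the fit is deterministic. We set the seed anyway, so that results stay reproducible if that setting is ever changed.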
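
To build some intuition for how the penalty drives the selection, here is a small sketch on synthetic data (not the housing dataset used in this notebook). The data, the coefficients 3.0 and -2.0, and the grid of alphas are all assumptions made up for illustration; only the first two columns actually influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Synthetic data (an assumption for illustration): 6 features,
# of which only the first two drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Fit one selector per alpha and count the features it keeps.
n_selected = {}
for alpha in (0.001, 0.1, 1.0, 10.0):
    selector = SelectFromModel(Lasso(alpha=alpha, random_state=0))
    selector.fit(X, y)
    n_selected[alpha] = int(selector.get_support().sum())

print(n_selected)
```

As alpha grows, more coefficients are shrunk to exactly zero and fewer features survive; with a large enough alpha, none do. A very small alpha, like the one used below, tends to keep most or all features.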

In [5]:
# We will do the model fitting and feature selection
# together in a few lines of code

# first, we specify the Lasso regression model, and we
# select a suitable alpha (the penalty strength).
# The bigger the alpha, the fewer features will be selected.

# Then we use the SelectFromModel object from sklearn, which
# automatically selects the features whose coefficients are non-zero

# remember to set the seed (the random_state parameter of Lasso)
sel_ = SelectFromModel(Lasso(alpha=0.001, random_state=0))

# train Lasso model and select features
sel_.fit(X_train, y_train)

SelectFromModel(estimator=Lasso(alpha=0.001, random_state=0))

In [6]:
sel_.get_support().sum()

12

In [7]:
# let's visualise those features that were selected.
# (selected features marked with True)

sel_.get_support()

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

In [8]:
# let's print the number of total and selected features

# this is how we can make a list of the selected features
selected_feats = X_train.columns[sel_.get_support()]

# let's print some stats
print('total features: {}'.format(X_train.shape[1]))
print('selected features: {}'.format(len(selected_feats)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(sel_.estimator_.coef_ == 0)))

total features: 12
selected features: 12
features with coefficients shrunk to zero: 0


In [9]:
# print the selected features
selected_feats

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'population', 'median_income', 'ocean_proximity', 'income_bracket',
       'rooms_per_household', 'bedrooms_per_household',
       'population_per_household', 'median_income_per_household'],
      dtype='object')

In [10]:
pd.Series(selected_feats).to_csv('selected_features.csv', index=False)
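
A later step can read this file back and use it to subset the datasets. The sketch below shows one way to do that, under some assumptions: the small `X_train` frame and the two-element `selected_feats` are toy stand-ins for the notebook's objects, and the file is written to a temporary directory so the example is self-contained.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins (assumptions) for the notebook's X_train and selected_feats.
X_train = pd.DataFrame({
    'longitude': [0.1, 0.2],
    'latitude': [1.0, 2.0],
    'population': [5.0, 6.0],
})
selected_feats = pd.Index(['longitude', 'population'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'selected_features.csv')

    # Save the feature names exactly as in the cell above.
    pd.Series(selected_feats).to_csv(path, index=False)

    # The single column gets an auto-generated header,
    # so we read it back by position rather than by name.
    features = pd.read_csv(path).iloc[:, 0].tolist()

# Keep only the selected columns.
X_train_reduced = X_train[features]
print(features)
```

Reading the column by position avoids depending on the header that pandas generates for an unnamed Series.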