# Machine Learning Pipeline - Feature Selection

In this notebook, we pick up the transformed datasets that we saved in the previous notebook.

## Reproducibility: Setting the seed

To make results reproducible between runs of the same notebook, and between the research and production environments, it is extremely important that we **set the seed** in every step that involves some element of randomness.
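As a minimal sketch of what setting the seed buys us, the snippet below splits a small toy array twice with the same `random_state` and checks that both splits are identical. The `SEED` constant and the toy arrays are assumptions made for illustration only; they are not part of the pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 0  # hypothetical constant; any fixed integer works

# toy data, only for illustration
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# two independent calls with the same seed
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=SEED)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=SEED)

# every returned array (X_train, X_test, y_train, y_test) should match
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Keeping the seed in a single named constant makes it easier to reuse the same value in the research and production code.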

In [1]:
# to handle datasets
import pandas as pd
import numpy as np
from datetime import datetime
from collections import Counter
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from IPython.display import display

# for the yeo-johnson transformation
import scipy.stats as stats

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import MinMaxScaler

# to save the trained scaler class
import joblib

# to build the models
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

In [2]:
# load the train and test set with the engineered variables which we built and saved in the previous notebook

X_train = pd.read_csv('../Data/xtrain.csv')
X_test = pd.read_csv('../Data/xtest.csv')

In [3]:
#X_train.head()

In [4]:
#X_test.head()

In [7]:
# load the target (remember that the target is log-transformed)
y_train = pd.read_csv('../Data/ytrain.csv')
y_test = pd.read_csv('../Data/ytest.csv')

In [8]:
#y_train.head()

In [9]:
#y_test.head()

### Feature Selection

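Let's go ahead and select a subset of the most predictive features using Lasso regression. In scikit-learn, the `random_state` of `Lasso` only has an effect when `selection='random'`, but we set the seed anyway so that the result stays reproducible if that option is ever changed.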

In [10]:
# We fit the model and select the features together in a few lines of code.

# First, we specify the Lasso regression model and choose a suitable alpha (the strength of the penalty).
# The bigger the alpha, the fewer features will be selected.

# Then we use the SelectFromModel object from sklearn, which automatically selects
# the features whose coefficients are non-zero.

# remember to set the seed (the random_state parameter of Lasso)
sel_ = SelectFromModel(Lasso(alpha = 0.001, random_state = 0))

# train Lasso model and select features
sel_.fit(X_train, y_train)
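To see how alpha controls the number of selected features, here is a small sketch on synthetic data. The dataset from `make_regression`, the list of alpha values and the names `X_demo`, `y_demo` and `n_selected` are all assumptions for illustration; the counts will differ on the house price data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# synthetic data: 20 features, only 5 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# count the selected features for increasingly strong penalties
# (1e6 is deliberately extreme, to shrink every coefficient to zero)
n_selected = {}
for alpha in [0.01, 1.0, 100.0, 1e6]:
    selector = SelectFromModel(Lasso(alpha=alpha, random_state=0))
    selector.fit(X_demo, y_demo)
    n_selected[alpha] = int(selector.get_support().sum())

for alpha, count in n_selected.items():
    print('alpha = {}: {} features selected'.format(alpha, count))
```

In practice, alpha is usually tuned against a validation metric rather than picked by hand; the loop above is only meant to show the direction of the effect.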

In [11]:
sel_.get_support().sum()

36

In [12]:
# let's visualise those features that were selected.
# (selected features marked with True)

sel_.get_support()

array([ True,  True,  True,  True,  True,  True, False,  True, False,
        True,  True, False,  True, False, False,  True,  True,  True,
        True, False,  True,  True,  True,  True, False,  True, False,
        True,  True,  True, False, False,  True, False,  True, False,
        True,  True, False, False, False, False, False, False, False,
        True,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True])

In [13]:
# let's print the number of total and selected features

# this is how we can make a list of the selected features
selected_feats = X_train.columns[(sel_.get_support())]

# let's print some stats
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feats)))
print('features with coefficients shrunk to zero: {}'.format(np.sum(sel_.estimator_.coef_ == 0)))

total features: 59
selected features: 36
features with coefficients shrunk to zero: 23


In [14]:
# print the selected features
print(selected_feats)

Index(['Location', 'Type', 'Bedrooms', 'Bathrooms', 'log_SqYds',
       'Yrs_SinceBlt', 'Floors', 'Lobby', 'Double_Glazed_Windows',
       'Central_Heating', 'Service_Elevators', 'Flooring',
       'Electricity_Backup', 'Servant_Quarters', 'Prayer_Room', 'Powder_Room',
       'Gym', 'Lounge_or_Sitting_Room', 'Business_Center_or_Media_Room',
       'Internet', 'Intercom', 'Conference_Room', 'Community_Gym',
       'First_Aid_or_Medical_Centre', 'Kids_Play_Area', 'Mosque',
       'Nearby_Shopping_Malls', 'Nearby_Restaurants', 'Other_Nearby_Places',
       'Security_Staff', 'Bedrooms_na', 'Bathrooms_na', 'SqYds_na',
       'Floors_na', 'Elevators_na', 'Year_na'],
      dtype='object')
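Beyond the list of names, it can help to see how large each surviving coefficient is. The sketch below pairs the Lasso coefficients with the column names and ranks the selected features by absolute value. It runs on a synthetic dataframe with hypothetical column names (`feat_0`, `feat_1`, ...), so the numbers say nothing about the house price data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# synthetic dataframe with hypothetical column names
X_arr, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=['feat_{}'.format(i) for i in range(8)])

selector = SelectFromModel(Lasso(alpha=1.0, random_state=0)).fit(X_demo, y_demo)

# pair each coefficient of the fitted Lasso with its column name
coefs = pd.Series(selector.estimator_.coef_, index=X_demo.columns)

# keep the selected features and rank them by absolute coefficient
kept = coefs[selector.get_support()]
ranked = kept.abs().sort_values(ascending=False)
print(ranked)
```

The coefficients are only comparable across features when the features share a common scale, as they do here after the scaling step of the previous notebook.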


In [15]:
# print the features that are dropped
not_selected = X_train.columns.difference(selected_feats)
print(not_selected)

Index(['Barbeque_Area', 'Central_AC', 'Community_Center',
       'Community_Lawn_or_Garden', 'Community_Swimming_Pool',
       'Day_Care_center', 'Elevators', 'Facilities_for_Disabled', 'Furnished',
       'Jacuzzi', 'Laundry_Room', 'Laundry_or_Dry_Cleaning_Facility',
       'Lawn_or_Garden', 'Maintainance_Staff', 'Nearby_Hospital',
       'Nearby_Public_Transport_Service', 'Nearby_Schools', 'Parking_Spaces',
       'Satellite_or_Cable_TV_Ready', 'Sauna', 'Study_Room', 'Swimming_Pool',
       'Waste_Disposal'],
      dtype='object')


In [16]:
pd.Series(selected_feats).to_csv('../Data/selected_features.csv', index = False)
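A later step can read this file back and use it to subset the datasets. The following sketch shows the round trip, assuming a temporary directory and a tiny toy dataframe instead of the real `../Data` files; the names `selected_demo`, `X_demo` and `X_reduced` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# toy stand-ins for selected_feats and X_train
selected_demo = pd.Index(['Location', 'Type', 'Bedrooms'])
X_demo = pd.DataFrame({
    'Location': [1, 2],
    'Type': [0, 1],
    'Bedrooms': [3, 4],
    'Sauna': [0, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'selected_features.csv')

    # save the feature names in the same way as above
    pd.Series(selected_demo).to_csv(path, index=False)

    # read them back: the file has a single column, so take it as a list
    features = pd.read_csv(path).iloc[:, 0].tolist()

# keep only the selected columns
X_reduced = X_demo[features]
print(X_reduced.columns.tolist())
```

Taking the first column by position avoids depending on the header that `to_csv` writes for an unnamed Series.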