<img src="https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/agods/nyp_ago_logo.png" width='400'/>

# Feature Selection

In this lab, you will learn:
- different methods for feature selection
- the differences between the statistical approach and the machine learning approach

## Import required libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Create the Data set

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

boston = pd.read_csv('data/boston.csv', index_col=0)
X = boston.drop('MEDV', axis=1)
y = boston['MEDV']

# Split the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Reconstruct dataframe
X_train = pd.DataFrame(X_train, columns=scaler.get_feature_names_out())
X_test = pd.DataFrame(X_test, columns=scaler.get_feature_names_out())

# Create degree 2 polynomial features 
poly = PolynomialFeatures(degree=2, include_bias=False)

X_train = poly.fit_transform(X_train)
X_test = poly.transform(X_test)

X_train = pd.DataFrame(X_train, columns=poly.get_feature_names_out())
X_test = pd.DataFrame(X_test, columns=poly.get_feature_names_out())
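
The degree-2 expansion above greatly increases the number of columns. As a rough check, and assuming the data has the 13 standard Boston predictors (the source does not state the count), the expansion keeps the 13 original columns, adds 13 squares, and adds one product per pair of columns. The sketch below counts these on dummy data; `n_inputs`, `dummy`, and `poly_demo` are illustrative names only.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 13  # assumption: the 13 standard Boston predictors

# originals + squares + pairwise products
expected = n_inputs + n_inputs + comb(n_inputs, 2)

# Dummy data only: the values do not matter when counting output columns.
dummy = np.zeros((2, n_inputs))
poly_demo = PolynomialFeatures(degree=2, include_bias=False).fit(dummy)

n_outputs = poly_demo.n_output_features_
print(expected, n_outputs)
```

With so many columns relative to the number of training rows, a plain linear model has plenty of room to overfit, which is what the rest of the lab investigates.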

Here we compute the $R^2$ score of a linear regression on both the training and test sets to see if there is overfitting. Using the same metric throughout makes later comparisons easier.

In [3]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lr.coef_ != 0)))

Training set score: 0.94
Test set score: 0.78
Number of features used: 104
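
For reference, the score reported by `LinearRegression.score` is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The following is a minimal sketch on synthetic data (not the Boston data) showing that the manual formula agrees with `score`; all names ending in `_demo` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem, used only for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

model_demo = LinearRegression().fit(X_demo, y_demo)
pred = model_demo.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = model_demo.score(X_demo, y_demo)

print(r2_manual, r2_sklearn)
```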


We can see that the model is overfitting: the training score (0.94) is well above the test score (0.78). One way to fight overfitting is to use a regularized model. Lasso adds an L1 penalty on the coefficients, which drives some of them to exactly 0 and so simplifies the model. This works like automatic feature selection.
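
To see this shrinking effect in isolation, here is a small sketch on synthetic data where only the first three of ten features influence the target. The data, the alpha values, and the names `X_demo`, `y_demo`, and `lasso_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))

# Only the first three features carry signal; the rest are pure noise.
true_coef = np.array([3.0, -2.0, 1.0, 0, 0, 0, 0, 0, 0, 0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

nonzero_counts = []
for alpha in [0.001, 0.1, 1.0, 10.0]:
    lasso_demo = Lasso(alpha=alpha, max_iter=50000).fit(X_demo, y_demo)
    # Count the coefficients that survived the penalty.
    nonzero_counts.append(int(np.sum(lasso_demo.coef_ != 0)))

print(nonzero_counts)
```

A larger `alpha` means a stronger penalty, so fewer coefficients remain non-zero; with a large enough `alpha` every coefficient is zero and the model predicts a constant.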

Let's try using Lasso and find the best alpha by doing a grid search.

In [10]:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso 

alphas = np.arange(0.001, 0.01, 0.001)
print('the alphas:', alphas)

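param_grid = [{'alpha': alphas}]
# selection='random' updates a randomly chosen coefficient at each iteration;
# no random_state is set, so results may vary slightly between runs
lasso = Lasso(selection='random', max_iter=50000)
# Choose alpha by 5-fold cross-validated RMSE
# (scikit-learn negates the RMSE so that higher is better)
grid_cv = GridSearchCV(lasso,
                       param_grid,
                       cv=5,
                       scoring='neg_root_mean_squared_error',
                       return_train_score=True)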

grid_cv.fit(X_train, y_train)

best_estimator = grid_cv.best_estimator_
print("Training set score: {:.2f}".format(best_estimator.score(X_train, y_train)))
print("Test set score: {:.2f}".format(best_estimator.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(best_estimator.coef_ != 0)))


the alphas: [0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009]
Training set score: 0.94
Test set score: 0.80
Number of features used: 81


Notice that the number of non-zero coefficients has been reduced from 104 to 81; in other words, Lasso performs automatic feature selection.
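
A fitted grid search also exposes which alpha won and which columns survived. The sketch below uses synthetic data with hypothetical column names `f0` to `f5` and a coarser alpha grid than the lab; it shows how one might read `best_params_`, turn the negated score back into an RMSE, and list the kept features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(120, 6)),
                      columns=[f"f{i}" for i in range(6)])
# Only f0 and f1 influence the target in this toy problem.
y_demo = 2.0 * X_demo["f0"] - 1.0 * X_demo["f1"] + rng.normal(scale=0.5, size=120)

search = GridSearchCV(Lasso(max_iter=50000),
                      {"alpha": [0.01, 0.1, 1.0]},
                      cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_  # flip the sign to recover the RMSE
# Columns whose coefficient in the refitted best model is non-zero.
kept = X_demo.columns[search.best_estimator_.coef_ != 0].tolist()

print(best_alpha, best_rmse, kept)
```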

Let's try one of scikit-learn's feature selection algorithms, Recursive Feature Elimination (RFE). For comparison, we also restrict the number of features to 81.

In [5]:
from sklearn.svm import SVR 
from sklearn.feature_selection import RFE

estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=81, step=1)
X_train_rfe = selector.fit_transform(X_train, y_train)
X_train_rfe = pd.DataFrame(X_train_rfe, columns=selector.get_feature_names_out())
X_test_rfe = selector.transform(X_test)
X_test_rfe = pd.DataFrame(X_test_rfe, columns=selector.get_feature_names_out())
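
RFE repeatedly fits the estimator, drops the weakest feature (here one per round, because `step=1`), and refits until the requested number of features remains. The sketch below illustrates this on synthetic data. It swaps in `LinearRegression` for the linear `SVR` used above, purely to keep the example fast; the data and the `_demo` names are assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
# Only columns 0 and 2 influence the target.
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

rfe_demo = RFE(LinearRegression(), n_features_to_select=2, step=1)
rfe_demo.fit(X_demo, y_demo)

support = rfe_demo.support_   # boolean mask of the selected columns
ranking = rfe_demo.ranking_   # 1 = selected; larger = eliminated earlier

print(support, ranking)
```

Because RFE judges features by the size of the fitted coefficients, it is most meaningful when the inputs share a common scale, as they do here and in the standardized lab data.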

Using the selected features, we fit an ordinary linear regression.

In [6]:
lr_rfe = LinearRegression()
lr_rfe.fit(X_train_rfe, y_train)

print("Training set score: {:.2f}".format(lr_rfe.score(X_train_rfe, y_train)))
print("Test set score: {:.2f}".format(lr_rfe.score(X_test_rfe, y_test)))

Training set score: 0.94
Test set score: 0.82


Let's now instead use a statistical method and select features based on the p-values of their fitted coefficients.

In [7]:
len(y_train)

379

In [8]:
import statsmodels.api as sm

model = sm.OLS(y_train.values, X_train)

# Fit your model to your training set
result = model.fit()

# Print summary statistics of the model's performance
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.945 |
| Model: | OLS | Adj. R-squared: | 0.924 |
| Method: | Least Squares | F-statistic: | 45.73 |
| Date: | Mon, 31 Jul 2023 | Prob (F-statistic): | 1.18e-128 |
| Time: | 12:32:27 | Log-Likelihood: | -838.65 |
| No. Observations: | 379 | AIC: | 1885.0 |
| Df Residuals: | 275 | BIC: | 2295.0 |
| Df Model: | 103 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| CRIM | 4.9134 | 18.941 | 0.259 | 0.796 | -32.375 | 42.201 |
| ZN | 3.5147 | 16.166 | 0.217 | 0.828 | -28.310 | 35.339 |
| INDUS | 12.6087 | 11.648 | 1.083 | 0.280 | -10.321 | 35.539 |
| CHAS | -72.6762 | 24.388 | -2.980 | 0.003 | -120.687 | -24.665 |
| NOX | -1.2997 | 0.846 | -1.536 | 0.126 | -2.965 | 0.366 |
| RM | 3.7077 | 0.314 | 11.794 | 0.000 | 3.089 | 4.327 |
| AGE | -1.7736 | 0.502 | -3.536 | 0.000 | -2.761 | -0.786 |
| DIS | -0.2696 | 0.995 | -0.271 | 0.787 | -2.229 | 1.690 |
| RAD | 8.6106 | 18.262 | 0.471 | 0.638 | -27.341 | 44.562 |

| | | | |
|---|---|---|---|
| Omnibus: | 6.419 | Durbin-Watson: | 1.883 |
| Prob(Omnibus): | 0.04 | Jarque-Bera (JB): | 6.982 |
| Skew: | 0.214 | Prob(JB): | 0.0305 |
| Kurtosis: | 3.508 | Cond. No. | 5380.0 |


We will now keep only the features whose coefficients have a p-value <= 0.05.

In [9]:
columns = result.pvalues[result.pvalues <= 0.05].index

X_train_stat = X_train[columns]
X_test_stat = X_test[columns]

lr_stat = LinearRegression()
lr_stat.fit(X_train_stat, y_train)

print(lr_stat.score(X_train_stat, y_train))
print(lr_stat.score(X_test_stat, y_test))

0.876791428576468
0.7880828990656044
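
To make the p-value filter less of a black box, the sketch below computes ordinary least squares p-values by hand with NumPy and SciPy: each coefficient is divided by its standard error to get a t-statistic, which is then converted to a two-sided p-value. This is the textbook calculation under the usual (non-robust) OLS assumptions, shown on synthetic data with an explicit intercept column; it is an illustration, not a reproduction of the lab's `statsmodels` fit, and every name in it is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 4
X_demo = rng.normal(size=(n, p))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=n)  # only column 0 matters

# Explicit intercept column (an assumption of this sketch).
X_design = np.column_stack([np.ones(n), X_demo])

# Least-squares coefficients and residual variance.
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
resid = y_demo - X_design @ beta
dof = n - X_design.shape[1]
sigma2 = resid @ resid / dof

# Standard errors from the diagonal of the coefficient covariance matrix.
cov = sigma2 * np.linalg.inv(X_design.T @ X_design)
se = np.sqrt(np.diag(cov))

t_stats = beta / se
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)  # two-sided test
significant = p_values <= 0.05

print(p_values.round(4), significant)
```

Keep in mind that with many candidate features, some will fall below the 0.05 threshold by chance alone, so a p-value filter is best treated as a rough screen.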
