## Lasso regularisation

**Regularisation** consists of **adding a penalty on the parameters of a machine learning model** to reduce the freedom of the model and **avoid overfitting**. In **linear model regularisation**, the penalty is applied to the **coefficients that multiply each of the predictors**. **Lasso, or L1, regularisation** has the property that it can **shrink some of the coefficients to exactly zero**. Therefore, **those features can be removed from the model.**
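
As a rough formal sketch, the objective minimised by sklearn's `Lasso` can be written as below. The notation is an assumption introduced here for illustration and is not part of the original notebook: $X$ is the predictor matrix, $y$ the target, $w$ the coefficient vector, $n$ the number of samples, and $\alpha$ the penalty strength.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The first term is the usual squared error and the second is the L1 penalty. Ridge replaces $\lVert w \rVert_1$ with the squared L2 norm $\lVert w \rVert_2^2$, which shrinks coefficients towards zero but typically not exactly to zero.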

We will see how to select features using **Lasso regularisation on a classification and a regression problem.**

In [15]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler

## Classification

In [16]:
data = pd.read_csv('dataset_2.csv')
data.shape

(50000, 109)

In [17]:
data.head()

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_100,var_101,var_102,var_103,var_104,var_105,var_106,var_107,var_108,var_109
0,4.53271,3.280834,17.982476,4.404259,2.34991,0.603264,2.784655,0.323146,12.009691,0.139346,...,2.079066,6.748819,2.941445,18.360496,17.726613,7.774031,1.473441,1.973832,0.976806,2.541417
1,5.821374,12.098722,13.309151,4.125599,1.045386,1.832035,1.833494,0.70909,8.652883,0.102757,...,2.479789,7.79529,3.55789,17.383378,15.193423,8.263673,1.878108,0.567939,1.018818,1.416433
2,1.938776,7.952752,0.972671,3.459267,1.935782,0.621463,2.338139,0.344948,9.93785,11.691283,...,1.861487,6.130886,3.401064,15.850471,14.620599,6.849776,1.09821,1.959183,1.575493,1.857893
3,6.02069,9.900544,17.869637,4.366715,1.973693,2.026012,2.853025,0.674847,11.816859,0.011151,...,1.340944,7.240058,2.417235,15.194609,13.553772,7.229971,0.835158,2.234482,0.94617,2.700606
4,3.909506,10.576516,0.934191,3.419572,1.871438,3.340811,1.868282,0.439865,13.58562,1.153366,...,2.738095,6.565509,4.341414,15.893832,11.929787,6.954033,1.853364,0.511027,2.599562,0.811364


**Important**

Select the features by **examining only the training set** to **avoid overfitting.**

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((35000, 108), (15000, 108))

**Linear models benefit from feature scaling!**

In [19]:
scaler = StandardScaler()
scaler.fit(X_train)

StandardScaler()

### Select features with Lasso

Do the **model fitting and feature selection together in one line of code!** First **specify the logistic regression model** with the **Lasso (L1) penalty.** Then wrap it in the **SelectFromModel** class from sklearn, which will select the **features whose coefficients are non-zero!**

In [20]:
sel_ = SelectFromModel(
    LogisticRegression(C=0.5, penalty='l1', solver='liblinear', random_state=10))
sel_.fit(scaler.transform(X_train), y_train)

SelectFromModel(estimator=LogisticRegression(C=0.5, penalty='l1',
                                             random_state=10,
                                             solver='liblinear'))
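
The scaling and selection steps above can also be chained so that both are fitted on the training set only. The following is a minimal sketch, not the notebook's own method: it uses sklearn's `Pipeline` and a synthetic dataset from `make_classification` as a stand-in for `dataset_2.csv`, and the step names `scaler` and `selector` are arbitrary labels chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's dataset (an assumption for illustration)
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Scale first, then select features with an L1-penalised logistic regression
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectFromModel(
        LogisticRegression(C=0.5, penalty="l1", solver="liblinear",
                           random_state=10))),
])

# Both steps learn their parameters from the training set only
pipe.fit(X_train, y_train)

# The test set is transformed with what was learned on the training set
X_train_sel = pipe.transform(X_train)
X_test_sel = pipe.transform(X_test)
print(X_train_sel.shape, X_test_sel.shape)
```

Note that, unlike calling the selector directly on the raw data, the pipeline returns the selected features already scaled.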

**Visualise the boolean mask of the features that were selected!**

In [21]:
sel_.get_support()

array([ True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True, False,  True, False,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True, False,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True, False])

**The list with the selected features**

In [22]:
selected_feat = X_train.columns[(sel_.get_support())]
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrank to zero: {}'.format(
    np.sum(sel_.estimator_.coef_ == 0)))

total features: 108
selected features: 93
features with coefficients shrank to zero: 15


### Examine coefficients that shrank to zero

The number of features **whose coefficients were shrunk to zero:**

In [23]:
np.sum(sel_.estimator_.coef_ == 0)

15

**Identify the removed features like this:**

In [24]:
removed_feats = X_train.columns[(sel_.estimator_.coef_ == 0).ravel().tolist()]
removed_feats

Index(['var_8', 'var_19', 'var_42', 'var_47', 'var_53', 'var_59', 'var_62',
       'var_64', 'var_73', 'var_75', 'var_85', 'var_87', 'var_91', 'var_105',
       'var_109'],
      dtype='object')

**Remove the features from the training and testing set like this:**

In [25]:
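# The selector only masks columns, so passing the unscaled data
# returns the selected features on their original scale
X_train_selected = sel_.transform(X_train)
X_test_selected = sel_.transform(X_test)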
X_train_selected.shape, X_test_selected.shape

((35000, 93), (15000, 93))

Remember that sklearn's **SelectFromModel** returns a **NumPy array** by default, so if you need a dataframe, capture the feature names first and then **convert the array to a dataframe**.
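
The conversion back to a dataframe could look like the sketch below. It assumes a small synthetic dataframe in place of the notebook's data, with column names invented to mimic the `var_N` pattern; the selector itself is set up as in the cells above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in dataframe (hypothetical column names)
X_arr, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, random_state=0)
X_train = pd.DataFrame(X_arr, columns=[f"var_{i + 1}" for i in range(10)])

scaler = StandardScaler().fit(X_train)
sel_ = SelectFromModel(
    LogisticRegression(C=0.5, penalty="l1", solver="liblinear",
                       random_state=10))
sel_.fit(scaler.transform(X_train), y)

# Capture the names of the selected features before transforming
selected_feat = X_train.columns[sel_.get_support()]

# transform() returns a NumPy array; rebuild the dataframe around it
X_train_selected = pd.DataFrame(
    sel_.transform(X_train.to_numpy()),
    columns=selected_feat,
    index=X_train.index,
)
print(X_train_selected.head())
```

Since the selector only masks columns, `X_train[selected_feat]` gives the same result without going through a NumPy array.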

### Ridge regularisation does not shrink coefficients to zero

Inspect whether **Ridge, or L2, regularisation shrinks coefficients to zero.**

For comparison, fit a **logistic regression with Ridge regularisation** and **evaluate the coefficients**! Count the number of coefficients with zero values!

In [26]:
# Same model as before, but with the L2 (Ridge) penalty
l2_logit = LogisticRegression(C=0.5, penalty='l2', max_iter=300, random_state=10)
l2_logit.fit(scaler.transform(X_train), y_train)
np.sum(l2_logit.coef_ == 0)  # Zero, as expected

0

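Go ahead and play around with the regularisation strength to see if the result changes. Note that in sklearn's LogisticRegression, **C is the inverse of the regularisation strength**: smaller values of C mean a stronger penalty.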
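
One way to run that experiment is sketched below. It is only an illustration under stated assumptions: the data is synthetic (from `make_classification`) rather than the notebook's dataset, the grid of C values is arbitrary, and both penalties use the `liblinear` solver so that only the penalty type differs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Count coefficients that are exactly zero for each penalty and each C
zero_counts = {}
for C in [0.001, 0.01, 0.1, 1.0]:
    for penalty in ["l1", "l2"]:
        model = LogisticRegression(
            C=C, penalty=penalty, solver="liblinear", random_state=10)
        model.fit(X_scaled, y)
        zero_counts[(penalty, C)] = int(np.sum(model.coef_ == 0))
        print(f"penalty={penalty}, C={C}: "
              f"{zero_counts[(penalty, C)]} zero coefficients")
```

Comparing the `l1` and `l2` rows for each C shows how the two penalties behave as the regularisation gets stronger.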

## Regression

In [27]:
data = pd.read_csv('HousingPrices_train.csv')
data.shape

(1460, 81)

In practice, feature selection should be done **after data pre-processing**, so ideally all the **categorical variables are encoded into numbers**, and then you can assess how predictive they are of the target! For simplicity we use **only numerical variables**! **Select numerical columns:**

In [28]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape

(1460, 38)

In [29]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((1022, 37), (438, 37))

In [30]:
# Replace missing values with 0; reassigning avoids modifying a slice in place
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

**The features in the house dataset are on very different scales, so it helps the regression to scale them!**

In [31]:
scaler = StandardScaler()
scaler.fit(X_train)

StandardScaler()

### Select features with Lasso

Train a **Lasso linear regression** and select the **features with non-zero coefficients in one line!** The **LinearRegression object from sklearn does not allow for regularisation**, so **import Lasso specifically!** **Alpha is the penalisation strength**, so **set it high** to force the algorithm to **shrink some coefficients to zero!**

In [32]:
sel_ = SelectFromModel(Lasso(alpha=100, random_state=10))
sel_.fit(scaler.transform(X_train), y_train)

SelectFromModel(estimator=Lasso(alpha=100, random_state=10))

In [33]:
sel_.get_support()

array([False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True])

**Make a list with the selected features and print the outputs!**

In [34]:
selected_feat = X_train.columns[(sel_.get_support())]
print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrank to zero: {}'.format(
    np.sum(sel_.estimator_.coef_ == 0)))

total features: 37
selected features: 33
features with coefficients shrank to zero: 4


As we can see, **for both linear and logistic regression** we used **Lasso regularisation** to remove non-important features from the dataset. Keep in mind that **increasing the penalisation** (a larger alpha for Lasso, a smaller C for LogisticRegression) **will increase the number of features removed.** Therefore, **don't set the penalty too high**, or it will remove useful features, **or too low**, or useless features will be retained.
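
The effect of the penalty on the number of removed features can be checked with a small sweep. This is a sketch under assumptions, not a result from the notebook: the data comes from `make_regression` instead of the house prices dataset, and the grid of alpha values is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only 5 of them informative
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0,
    random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Count the coefficients shrunk to exactly zero for each alpha
removed = {}
for alpha in [0.1, 1, 10, 100, 1000]:
    lasso = Lasso(alpha=alpha, random_state=10).fit(X_scaled, y)
    removed[alpha] = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={alpha}: {removed[alpha]} of {X.shape[1]} features removed")
```

At the largest alpha in this grid the penalty dominates and every coefficient is zero, which is the "too high" case described above.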