# SPH6004 Hands-On 2
## Logistic Regression with Regularization, Feature Selection

Contents:
  * Logistic regression with regularization
    * Adam optimizer
    * AUROC and AP metrics
    * Comparison of $L_1$ and $L_2$ regularization effects
  * Feature selection methods
    * Forward and backward feature selection using `sklearn`
    * Visualize model performance using ROC curve and precision-recall curve


In [154]:
# We will be using the following packages

# PyTorch package and submodules
import torch
import torch.nn as nn
from torch.optim import SGD # plain gradient descent optimizer (the cells below use torch.optim.Adam)

# NumPy for math operations, and Pandas for processing tabular data.
import numpy as np
import pandas as pd

# Plotly plotting package
import plotly.graph_objects as go
import plotly.express as px

# We use toy datasets in scikit-learn package
from sklearn.datasets import load_breast_cancer

# We use AUROC and average precision (AP) scores from sklearn
from sklearn.metrics import roc_auc_score, average_precision_score

We use the breast cancer dataset from `sklearn` for demonstration. Recall that the independent variables have $30$ dimensions, and many of these dimensions are correlated.
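As a quick check of the claim that many dimensions are correlated, the following sketch counts feature pairs with a large absolute Pearson correlation. The cutoff of 0.9 is an arbitrary choice for illustration, not a value from this tutorial.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X_raw, _ = load_breast_cancer(return_X_y=True, as_frame=True)
corr = X_raw.corr()  # 30 x 30 Pearson correlation matrix

# Keep each unordered pair once by masking to the strict upper triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

threshold = 0.9  # illustrative cutoff (an assumption)
strong = pairs[pairs.abs() > threshold].sort_values(ascending=False)
print('{} of {} feature pairs have |r| > {}'.format(len(strong), len(pairs), threshold))
print(strong.head())
```

Size-related features such as `mean radius` and `mean perimeter` are almost perfectly correlated, which is one reason regularization and feature selection are useful on this dataset.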

In [155]:
X_raw, y_df = load_breast_cancer(return_X_y=True, as_frame=True)
X_df = (X_raw-X_raw.mean())/X_raw.std()
X_df.describe().transpose()

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| mean radius | 569.0 | -1.311195e-16 | 1.0 | -2.027864 | -0.688779 | -0.214893 | 0.46898 | 3.967796 |
| mean texture | 569.0 | 6.243785e-17 | 1.0 | -2.227289 | -0.725325 | -0.104544 | 0.583662 | 4.647799 |
| mean perimeter | 569.0 | -1.248757e-16 | 1.0 | -1.982759 | -0.691347 | -0.235773 | 0.499238 | 3.972634 |
| mean area | 569.0 | -2.185325e-16 | 1.0 | -1.453164 | -0.666609 | -0.294927 | 0.363188 | 5.245913 |
| mean smoothness | 569.0 | -8.366672e-16 | 1.0 | -3.109349 | -0.710338 | -0.03486 | 0.63564 | 4.766717 |
| mean compactness | 569.0 | 1.998011e-16 | 1.0 | -1.608721 | -0.746429 | -0.221745 | 0.493423 | 4.564409 |
| mean concavity | 569.0 | 3.746271e-17 | 1.0 | -1.113893 | -0.743094 | -0.341939 | 0.525599 | 4.239858 |
| mean concave points | 569.0 | -4.995028e-17 | 1.0 | -1.26071 | -0.737295 | -0.397372 | 0.646366 | 3.924477 |
| mean symmetry | 569.0 | 1.74826e-16 | 1.0 | -2.741705 | -0.702621 | -0.071564 | 0.530313 | 4.480808 |
| mean fractal dimension | 569.0 | 4.838933e-16 | 1.0 | -1.818265 | -0.722004 | -0.178123 | 0.470569 | 4.906602 |


In [156]:
# We convert dataframe to PyTorch tensor datatype,
# and then split it into training and testing parts.
X = torch.tensor(X_df.to_numpy(),dtype=torch.float32)
m,n = X.shape
y = torch.tensor(y_df.to_numpy(),dtype=torch.float32).reshape(m,1)

# We use an approximately 6:4 random train/test split
cases = ['train','test']
case_list = np.random.choice(cases,size=X.shape[0],replace=True,p=[0.6,0.4])
X_train = X[case_list=='train']
X_test = X[case_list=='test']
y_train = y[case_list=='train']
y_test = y[case_list=='test']
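The split above is random and unseeded, so every run of the notebook yields different train and test sets (and slightly different scores). A minimal sketch of a reproducible variant follows, assuming NumPy's `default_rng` generator; the seed value and the name `is_train` are illustrative choices, not part of the original notebook.

```python
import numpy as np

m = 569                               # number of samples in the breast cancer dataset
rng = np.random.default_rng(seed=0)   # the seed value is an arbitrary choice

# Each sample independently goes to the training set with probability 0.6,
# so the split is only approximately 6:4.
is_train = rng.random(m) < 0.6
print('train: {}, test: {}'.format(is_train.sum(), (~is_train).sum()))

# Re-creating the generator with the same seed reproduces the same split.
is_train_again = np.random.default_rng(seed=0).random(m) < 0.6
print('reproducible:', np.array_equal(is_train, is_train_again))
```

A boolean mask like `is_train` can be used in place of `case_list=='train'`, with `~is_train` for the test part.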

We first fit an ordinary logistic regression model as a baseline. We will use AUROC and average precision (AP) scores to test model performance.

In [157]:
h = torch.nn.Linear(
    in_features=n,
    out_features=1,
    bias=True
)
sigma = torch.nn.Sigmoid()

# Logistic model is linear+sigmoid
f = torch.nn.Sequential(
    h,
    sigma
)

J_BCE = torch.nn.BCELoss()
# We use the Adam optimizer, which is
# a variant of gradient descent with momentum and adaptive step sizes.
GD_optimizer = torch.optim.Adam(lr=0.01,params=f.parameters())

nIter = 5000
printInterval = 500

for i in range(nIter):
    GD_optimizer.zero_grad()
    pred = f(X_train)
    loss = J_BCE(pred,y_train)
    loss.backward()
    GD_optimizer.step()
    if i == 0 or ((i+1)%printInterval) == 0:
        print('Iter {}: average BCE loss is {:.3f}'.format(i+1,loss.item()))

with torch.no_grad():
    pred_test = f(X_test)

auroc = roc_auc_score(y_test,pred_test)
ap = average_precision_score(y_test,pred_test)
print('On test dataset: AUROC {:.3f}, AP {:.3f}'.format(auroc,ap))

Iter 1: average BCE loss is 0.922
Iter 500: average BCE loss is 0.054
Iter 1000: average BCE loss is 0.042
Iter 1500: average BCE loss is 0.034
Iter 2000: average BCE loss is 0.028
Iter 2500: average BCE loss is 0.023
Iter 3000: average BCE loss is 0.020
Iter 3500: average BCE loss is 0.017
Iter 4000: average BCE loss is 0.014
Iter 4500: average BCE loss is 0.012
Iter 5000: average BCE loss is 0.010
On test dataset: AUROC 0.981, AP 0.979
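To make the two reported metrics concrete, here is a small sketch that computes AUROC and AP by hand on four made-up predictions and compares them with `sklearn`. The toy labels and scores are assumptions for illustration, and the manual formulas assume there are no tied scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# AUROC: fraction of (positive, negative) pairs where the positive scores higher.
pos, neg = scores[y_true == 1], scores[y_true == 0]
auroc_manual = np.mean(pos[:, None] > neg[None, :])

# AP: walk down the ranking and average the precision at each positive hit.
order = np.argsort(-scores)
hits = y_true[order]
precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
ap_manual = precision_at_k[hits == 1].mean()

print(auroc_manual, roc_auc_score(y_true, scores))
print(ap_manual, average_precision_score(y_true, scores))
```

Both versions agree: AUROC is 0.75 (three of the four positive-negative pairs are ranked correctly) and AP is about 0.833.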


In [158]:
# Let us save the model parameters for later comparisons.
# Recall that PyTorch does automatic differentiation, so
# model parameters (e.g. h.weight) carry gradient information. This
# may cause problems in later analysis when gradients are not needed.
# Use .detach() to detach gradient information from the weights.

weight = h.weight.detach().squeeze().clone()
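The effect of each call in `.detach().squeeze().clone()` can be seen on a tiny layer. This is a sketch with a made-up 3-feature layer; the names `layer`, `w_live` and `w_copy` are illustrative.

```python
import torch

layer = torch.nn.Linear(in_features=3, out_features=1)

w_live = layer.weight                             # shape (1, 3), tracked by autograd
w_copy = layer.weight.detach().squeeze().clone()  # shape (3,), plain values

print(w_live.requires_grad, w_copy.requires_grad)  # True False
print(w_live.shape, w_copy.shape)

# Because of .clone(), changing the layer afterwards leaves the copy untouched.
before = w_copy.clone()
with torch.no_grad():
    layer.weight.add_(1.0)
print(torch.equal(w_copy, before))  # True
```

Without `.clone()`, the detached tensor would still share memory with the layer's weights, so later training of the same layer would silently change the saved values.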

### Logistic Regression with $L_2$ regularization

Logistic regression with $L_2$ regularization is also called *ridge logistic regression*. Recall that we add $L_2$ regularization to model parameters:

$$
J_{L_2}(\theta) =  J(\theta) + \lambda/2 ||\theta||_2^2,
$$

where large parameter values are penalized by the $L_2$ norm term

$$
||\theta||_2^2 = \sum_j \theta_j^2.
$$

In PyTorch, $L_2$ regularization of model parameters is supported by the built-in optimizers.
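The link between `weight_decay` and the penalty $\lambda/2\,||\theta||_2^2$ can be checked numerically. The sketch below uses random toy data and plain `SGD` (chosen here only because it makes the comparison clean) to train two identical models: one with `weight_decay=lam`, the other with the penalty added to the loss by hand. `torch.optim.Adam` also applies `weight_decay` by adding `weight_decay * param` to the gradient. Note that in both variants the bias is penalized as well.

```python
import copy
import torch

torch.manual_seed(0)
X = torch.randn(20, 4)                      # toy inputs (made up)
y = (torch.rand(20, 1) > 0.5).float()       # toy binary labels (made up)
lam, lr = 0.05, 0.1

model_a = torch.nn.Sequential(torch.nn.Linear(4, 1), torch.nn.Sigmoid())
model_b = copy.deepcopy(model_a)            # identical starting parameters
bce = torch.nn.BCELoss()

opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=lam)
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)

for _ in range(10):
    # (a) penalty handled inside the optimizer
    opt_a.zero_grad()
    bce(model_a(X), y).backward()
    opt_a.step()

    # (b) penalty (lam/2)*||theta||_2^2 added to the loss by hand
    opt_b.zero_grad()
    penalty = sum((p ** 2).sum() for p in model_b.parameters())
    (bce(model_b(X), y) + 0.5 * lam * penalty).backward()
    opt_b.step()

max_diff = max((pa - pb).abs().max().item()
               for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print('largest parameter difference: {:.2e}'.format(max_diff))
```

The two models stay equal up to floating-point rounding, since the gradient of $\lambda/2\,\theta_j^2$ is exactly $\lambda\theta_j$.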

In [159]:
h_L2 = torch.nn.Linear(
    in_features=n,
    out_features=1,
    bias=True
)
sigma = torch.nn.Sigmoid()

# Logistic model is linear+sigmoid
f_L2 = torch.nn.Sequential(
    h_L2,
    sigma
)

J_BCE = torch.nn.BCELoss()

# PyTorch optimizers support L2 regularization through
# the weight_decay parameter, which corresponds to
# the regularization strength.
GD_optimizer = torch.optim.Adam(lr=0.01,params=f_L2.parameters(),weight_decay=0.05)

nIter = 500
printInterval = 50

for i in range(nIter):
    GD_optimizer.zero_grad()
    pred = f_L2(X_train)
    loss = J_BCE(pred,y_train)
    loss.backward()
    GD_optimizer.step()
    if i == 0 or ((i+1)%printInterval) == 0:
        print('Iter {}: average BCE loss is {:.3f}'.format(i+1,loss.item()))

with torch.no_grad():
    pred_test = f_L2(X_test)

auroc = roc_auc_score(y_test,pred_test)
ap = average_precision_score(y_test,pred_test)
print('On test dataset: AUROC {:.3f}, AP {:.3f}'.format(auroc,ap))

Iter 1: average BCE loss is 0.715
Iter 50: average BCE loss is 0.130
Iter 100: average BCE loss is 0.114
Iter 150: average BCE loss is 0.110
Iter 200: average BCE loss is 0.109
Iter 250: average BCE loss is 0.108
Iter 300: average BCE loss is 0.108
Iter 350: average BCE loss is 0.108
Iter 400: average BCE loss is 0.108
Iter 450: average BCE loss is 0.108
Iter 500: average BCE loss is 0.108
On test dataset: AUROC 0.996, AP 0.997


In [160]:
weight_L2 = h_L2.weight.detach().squeeze().clone()

### Logistic Regression with $L_1$ Regularization

Logistic regression with $L_1$ regularization is also called *LASSO logistic regression*. For the LASSO logistic regression, we add $L_1$ regularization to model parameters:

$$
J_{L_1}(\theta) = J(\theta) + \lambda ||\theta||_1,
$$

where large model parameters are penalized by the $L_1$ norm term

$$
||\theta||_1 = \sum_j |\theta_j|.
$$

There is no built-in support for $L_1$ regularization in PyTorch optimizers, so we need to write our own code. The idea is to iterate over all parameters of our model, calculate the sum of their absolute values, and then let PyTorch do the differentiation and optimization.
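Before the full training loop, it helps to see what automatic differentiation returns for the absolute-value penalty. The following minimal sketch uses a made-up three-element parameter vector `theta`; the strength `lbd = 0.1` is an illustrative value.

```python
import torch

theta = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)
lbd = 0.1

penalty = lbd * theta.abs().sum()   # lbd * ||theta||_1 = 0.1 * 5.5
penalty.backward()

print(penalty.item())
print(theta.grad)                   # lbd * sign(theta)
```

The gradient has the same magnitude `lbd` for every non-zero parameter, regardless of its size. This constant pull is what drives small weights toward zero, whereas the $L_2$ gradient $\lambda\theta_j$ fades as $\theta_j$ shrinks.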

In [161]:
h_L1 = torch.nn.Linear(
    in_features=n,
    out_features=1,
    bias=True
)
sigma = torch.nn.Sigmoid()

# Logistic model is linear+sigmoid
f_L1 = torch.nn.Sequential(
    h_L1,
    sigma
)

J_BCE = torch.nn.BCELoss()

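# Note: weight_decay=0.05 also applies an L2 penalty inside the optimizer,
# so this cell (and the output below) combines L1 and L2 regularization.
# Set weight_decay=0 for a pure L1 (LASSO) fit.
GD_optimizer = torch.optim.Adam(lr=0.01,params=f_L1.parameters(),weight_decay=0.05)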

# Define L_1 regularization: lbd * (sum of absolute values of all parameters)
def L1_reg(model,lbd):
    result = torch.tensor(0.) # float scalar accumulator
    for param in model.parameters(): # iterate over all parameters (weights and bias)
        result = result + param.abs().sum()

    return lbd*result


nIter = 500
printInterval = 50
lbd = 0.05 # L1 reg strength

for i in range(nIter):
    GD_optimizer.zero_grad()
    pred = f_L1(X_train)
    loss = J_BCE(pred,y_train)
    (loss+L1_reg(f_L1,lbd)).backward()
    GD_optimizer.step()
    if i == 0 or ((i+1)%printInterval) == 0:
        print('Iter {}: average BCE loss is {:.3f}'.format(i+1,loss.item()))

with torch.no_grad():
    pred_test = f_L1(X_test)

auroc = roc_auc_score(y_test,pred_test)
ap = average_precision_score(y_test,pred_test)
print('On test dataset: AUROC {:.3f}, AP {:.3f}'.format(auroc,ap))

Iter 1: average BCE loss is 0.587
Iter 50: average BCE loss is 0.214
Iter 100: average BCE loss is 0.212
Iter 150: average BCE loss is 0.210
Iter 200: average BCE loss is 0.210
Iter 250: average BCE loss is 0.209
Iter 300: average BCE loss is 0.209
Iter 350: average BCE loss is 0.210
Iter 400: average BCE loss is 0.209
Iter 450: average BCE loss is 0.209
Iter 500: average BCE loss is 0.210
On test dataset: AUROC 0.991, AP 0.992


In [162]:
weight_L1 = h_L1.weight.detach().squeeze().clone()

Let us now visualize and compare the weights of the three models.

In [163]:
weight_df = pd.DataFrame(
    {
        'vanilla':weight,
        'L2':weight_L2,
        'L1':weight_L1
    }
).melt(id_vars=[],value_vars=['vanilla','L2','L1'])
weight_df

| | variable | value |
|---|---|---|
| 0 | vanilla | -0.928586 |
| 1 | vanilla | -0.225938 |
| 2 | vanilla | -0.931878 |
| 3 | vanilla | -1.983683 |
| 4 | vanilla | -2.196660 |
| ... | ... | ... |
| 85 | L1 | -0.006153 |
| 86 | L1 | -0.079893 |
| 87 | L1 | -0.381906 |
| 88 | L1 | -0.071407 |


In [164]:
fig = px.box(
    weight_df,
    y='value',
    facet_col='variable',
    color='variable',
    points='all',
    title='Logistic Regression Weights Distributions'
)
fig.update_yaxes(
    matches=None,
    showticklabels=True
)
fig.update_traces(jitter=0.5)

We observe the following:
1.  Using regularization makes training converge much faster.
    * The vanilla logistic regression has still not converged after 5000 iterations: its loss keeps decreasing.
    * Both regularized models reach a plateau within 500 iterations.
    * The vanilla logistic regression is not numerically stable.
        * If you run more iterations, you will see the model parameters grow larger and larger. This typically happens when the training data are (nearly) linearly separable, because the unpenalized loss can then always be lowered by scaling up the weights.
        * Neither regularization method suffers from this problem.
2.  The $L_2$ regularization reduced the range of the logistic regression weights.
3.  The $L_1$ regularization not only reduced the range of the weights, but also shrank many coefficients to (nearly) zero.

The last observation applies not only to logistic regression models with $L_1$ regularization. A common feature selection method is to use $L_1$ regularization and keep only the features that have non-zero coefficients. If you increase the regularization strength by enlarging the `lbd` variable in the code above, more parameters will be reduced to zero.
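The relation between regularization strength and the number of surviving features can be sketched with `sklearn`'s $L_1$-penalized logistic regression, whose `liblinear` solver produces exact zeros. Gradient-based training such as the Adam loop above usually leaves such coefficients very small rather than exactly zero, so a threshold would be needed there. The `C` values below are arbitrary illustrative choices, and the model is fit on the whole dataset for simplicity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

C_values = [1.0, 0.1, 0.01]   # smaller C means stronger regularization
n_nonzero = []
for C in C_values:
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear').fit(X, y)
    n_nonzero.append(int(np.count_nonzero(clf.coef_)))
    print('C = {:<5} non-zero coefficients: {} of {}'.format(C, n_nonzero[-1], X.shape[1]))
```

The features with non-zero coefficients at a given strength form the selected subset; `np.flatnonzero(clf.coef_)` returns their column indices.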

### Forward and backward selection methods for feature selection

Lastly, we demonstrate how to perform forward and backward feature selection. Implementing these selection techniques from scratch requires a lot of work (encapsulating model definition and training, writing a custom cross-validation dataloader, etc.). Therefore, in this section we borrow everything from `sklearn`, which reduces feature selection and model fitting to just a few lines of code. The disadvantage is that the implementation details are a black box: you need to dig into the `sklearn` source code to find out what happens inside!
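To give an idea of what happens inside the black box, here is a simplified sketch of greedy forward selection. It is not `sklearn`'s actual implementation: the helper `forward_select`, the use of cross-validated AUROC as the score, and the restriction to the first 10 features are all assumptions made to keep the example short and fast.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def forward_select(X, y, n_select, cv=5):
    """Greedy forward selection scored by cross-validated AUROC (illustrative helper)."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        scores = {}
        for j in remaining:
            # Try adding candidate feature j to the current subset.
            model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
            cv_scores = cross_val_score(model, X[:, selected + [j]], y,
                                        cv=cv, scoring='roc_auc')
            scores[j] = cv_scores.mean()
        best = max(scores, key=scores.get)   # feature giving the largest CV score
        selected.append(best)
        remaining.remove(best)
    return selected

data = load_breast_cancer()
X_small = data.data[:, :10]   # only the 10 "mean ..." features, to keep the demo fast
chosen = forward_select(X_small, data.target, n_select=3)
print([data.feature_names[j] for j in chosen])
```

Backward selection follows the same pattern in reverse: start from all features and repeatedly drop the one whose removal hurts the cross-validated score the least.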

In [165]:
from sklearn.linear_model import LogisticRegression as logit # use built-in logistic regression model in sklearn
from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.metrics import roc_curve, precision_recall_curve

In [166]:
# sklearn supports fitting models on pandas dataframes.
# We use the same train/test split as before.

X_df_train = X_df.iloc[case_list=='train',:]
X_df_test = X_df.iloc[case_list=='test',:]
y_df_train = y_df.iloc[case_list=='train']
y_df_test = y_df.iloc[case_list=='test']

In [188]:
model = logit(penalty='l1',C=1/10,solver='liblinear') # C: 1/(strength of L1 regularization)

# Forward feature selection.
forward_selection = SFS(
    model, n_features_to_select=5, direction="forward"
).fit(X_df_train, y_df_train)

# Backward feature selection.
backward_selection = SFS(
    model, n_features_to_select=5, direction="backward"
).fit(X_df_train, y_df_train)

In [189]:
forward_selection.get_feature_names_out()

array(['mean radius', 'mean texture', 'mean concavity', 'worst radius',
       'worst smoothness'], dtype=object)

In [190]:
backward_selection.get_feature_names_out()

array(['mean concave points', 'worst texture', 'worst perimeter',
       'worst smoothness', 'worst concavity'], dtype=object)

We see that forward and backward selection do not agree on the five selected features: only `worst smoothness` appears in both sets.
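The disagreement can be quantified with plain set operations on the two outputs printed above. The Jaccard similarity used here is just one possible overlap measure, chosen for illustration.

```python
forward = {'mean radius', 'mean texture', 'mean concavity',
           'worst radius', 'worst smoothness'}
backward = {'mean concave points', 'worst texture', 'worst perimeter',
            'worst smoothness', 'worst concavity'}

shared = forward & backward
jaccard = len(shared) / len(forward | backward)

print('shared features:', sorted(shared))            # ['worst smoothness']
print('Jaccard similarity: {:.3f}'.format(jaccard))  # 0.111
```

With many correlated features, such low overlap is not surprising: different subsets can carry nearly the same information.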

After feature selection, we can use `forward_selection` and `backward_selection` to automatically reduce the input dimension. Let us fit logistic regression models on 1) the full feature set and 2) the selected features, and visualize their performance on the test dataset.

In [191]:
# Full model
model.fit(X_df_train,y_df_train)
y_pred_full = model.predict_proba(X_df_test)

# Model with forward selected features
model.fit(forward_selection.transform(X_df_train),y_df_train)
y_pred_FS = model.predict_proba(forward_selection.transform(X_df_test))

# Model with backward selected features
model.fit(backward_selection.transform(X_df_train),y_df_train)
y_pred_BS = model.predict_proba(backward_selection.transform(X_df_test))

In [192]:
# roc_curve
fpr_full, tpr_full, _ = roc_curve(y_df_test,y_pred_full[:,1])
fpr_FS, tpr_FS, _ = roc_curve(y_df_test,y_pred_FS[:,1])
fpr_BS, tpr_BS, _ = roc_curve(y_df_test,y_pred_BS[:,1])

roc_df = pd.DataFrame(
    {
        'False Positive Rate':np.hstack([fpr_full,fpr_FS,fpr_BS]),
        'True Positive Rate':np.hstack([tpr_full,tpr_FS,tpr_BS]),
        'method':['Full Model']*len(fpr_full)+['Forward Selection']*len(fpr_FS)+['Backward Selection']*len(fpr_BS)
    }
)

In [193]:
# Visualize ROC curve
fig = px.line(roc_df,y='True Positive Rate',x='False Positive Rate',facet_col='method',color='method')
fig

In [194]:
# precision_recall_curve
p_full, r_full, _ = precision_recall_curve(y_df_test,y_pred_full[:,1])
p_FS, r_FS, _ = precision_recall_curve(y_df_test,y_pred_FS[:,1])
p_BS, r_BS, _ = precision_recall_curve(y_df_test,y_pred_BS[:,1])

pr_df = pd.DataFrame(
    {
        'Precision':np.hstack([p_full,p_FS,p_BS]),
        'Recall':np.hstack([r_full,r_FS,r_BS]),
        'method':['Full Model']*len(p_full)+['Forward Selection']*len(p_FS)+['Backward Selection']*len(p_BS)
    }
)

In [195]:
# Visualize precision recall curve
fig = px.line(pr_df,x='Recall',y='Precision',facet_col='method',color='method')
fig

From both the ROC and precision-recall curves, we see that the full model performs similarly to the models with only 5 features.
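A visual comparison of curves can be backed up with the scalar AUROC and AP scores used earlier. The helper `summarize` below is hypothetical, and the toy labels and scores are made up so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

def summarize(y_true, scores_by_model):
    """Return one row of AUROC / AP per model (hypothetical helper)."""
    rows = [
        {'method': name,
         'AUROC': roc_auc_score(y_true, s),
         'AP': average_precision_score(y_true, s)}
        for name, s in scores_by_model.items()
    ]
    return pd.DataFrame(rows).set_index('method')

# Toy labels and scores, made up for illustration only.
y_toy = np.array([0, 0, 1, 1])
summary = summarize(y_toy, {
    'model A': np.array([0.1, 0.2, 0.8, 0.9]),
    'model B': np.array([0.1, 0.4, 0.35, 0.8]),
})
print(summary.round(3))
```

In this notebook, the call would be `summarize(y_df_test, {'Full Model': y_pred_full[:,1], 'Forward Selection': y_pred_FS[:,1], 'Backward Selection': y_pred_BS[:,1]})`, which puts the three models side by side in one table.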