# SPH6004 Hands-On 2
## Logistic Regression with Regularization, Feature Selection, and Imbalanced Dataset

Contents:
  * Logistic regression with regularization
    * Adam optimizer
    * AUROC and average precision (AP) metrics
    * Comparison of $L_1$ and $L_2$ regularization effects
  * Feature selection methods with `sklearn`
    * Forward and backward feature selection using `sklearn`
    * Visualize model performance using ROC curve and precision-recall curve
  * Data with class imbalance and the **Synthetic Minority Oversampling Technique** (SMOTE)


In [1]:
# We will use the following packages

# PyTorch package and submodules
import torch
import torch.nn as nn
from torch.optim import SGD #gradient descent optimizer

# NumPy for math operations, and Pandas for processing tabular data.
import numpy as np
import pandas as pd

# Plotly plotting package
import plotly.graph_objects as go
import plotly.express as px

# We use toy datasets in scikit-learn package
from sklearn.datasets import load_breast_cancer

# We use AUROC and average precision (AP) scores from sklearn
from sklearn.metrics import roc_auc_score, average_precision_score

We use the breast cancer dataset from `sklearn` for demonstration. Recall that the independent variables have $30$ dimensions, many of which are correlated.

In [2]:
X_raw, y_df = load_breast_cancer(return_X_y=True, as_frame=True)
X_df = (X_raw-X_raw.mean())/X_raw.std()
X_df.describe().transpose()

feature,count,mean,std,min,25%,50%,75%,max
mean radius,569.0,-1.311195e-16,1.0,-2.027864,-0.688779,-0.214893,0.46898,3.967796
mean texture,569.0,6.243785e-17,1.0,-2.227289,-0.725325,-0.104544,0.583662,4.647799
mean perimeter,569.0,-1.248757e-16,1.0,-1.982759,-0.691347,-0.235773,0.499238,3.972634
mean area,569.0,-2.185325e-16,1.0,-1.453164,-0.666609,-0.294927,0.363188,5.245913
mean smoothness,569.0,-8.366672e-16,1.0,-3.109349,-0.710338,-0.03486,0.63564,4.766717
mean compactness,569.0,1.998011e-16,1.0,-1.608721,-0.746429,-0.221745,0.493423,4.564409
mean concavity,569.0,3.746271e-17,1.0,-1.113893,-0.743094,-0.341939,0.525599,4.239858
mean concave points,569.0,-4.995028e-17,1.0,-1.26071,-0.737295,-0.397372,0.646366,3.924477
mean symmetry,569.0,1.74826e-16,1.0,-2.741705,-0.702621,-0.071564,0.530313,4.480808
mean fractal dimension,569.0,4.838933e-16,1.0,-1.818265,-0.722004,-0.178123,0.470569,4.906602


In [3]:
# We convert dataframe to PyTorch tensor datatype,
# and then split it into training and testing parts.
X = torch.tensor(X_df.to_numpy(),dtype=torch.float32)
m,n = X.shape
y = torch.tensor(y_df.to_numpy(),dtype=torch.float32).reshape(m,1)

# We use an approximate 6:4 train-test split
cases = ['train','test']
case_list = np.random.choice(cases,size=X.shape[0],replace=True,p=[0.6,0.4])
X_train = X[case_list=='train']
X_test = X[case_list=='test']
y_train = y[case_list=='train']
y_test = y[case_list=='test']

We first fit an ordinary logistic regression model as a baseline. We will use AUROC and average precision (AP) scores to test model performance.
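
To make the two metrics concrete, here is a minimal NumPy sketch of what they measure. The helper names `auroc_pairwise` and `average_precision` are illustrative and not part of `sklearn`; the AP helper assumes there are no tied scores, so it may differ slightly from `average_precision_score` when ties occur.

```python
import numpy as np

def auroc_pairwise(y_true, scores):
    """AUROC as P(score of a random positive > score of a random negative)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-negative pairs
    # A tie between a positive and a negative counts as half a win.
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def average_precision(y_true, scores):
    """AP as the mean precision at the rank of each positive (assumes no ties)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)  # highest score first
    y_sorted = y_true[order]
    precision_at_k = np.cumsum(y_sorted) / np.arange(1, len(y_sorted) + 1)
    return float(precision_at_k[y_sorted == 1].mean())

y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
print(auroc_pairwise(y_toy, s_toy))     # 0.75
print(average_precision(y_toy, s_toy))  # 0.8333...
```

Both scores depend only on how the predictions rank the cases, not on any particular decision threshold.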

In [4]:
h = torch.nn.Linear(
    in_features=n,
    out_features=1,
    bias=True
)
sigma = torch.nn.Sigmoid()

# Logistic model is linear+sigmoid
f = torch.nn.Sequential(
    h,
    sigma
)

J_BCE = torch.nn.BCELoss()
# We use the Adam optimizer, which is a variant of gradient
# descent with momentum and per-parameter adaptive step sizes.
GD_optimizer = torch.optim.Adam(lr=0.01,params=f.parameters())

nIter = 5000
printInterval = 500

for i in range(nIter):
    GD_optimizer.zero_grad()
    pred = f(X_train)
    loss = J_BCE(pred,y_train)
    loss.backward()
    GD_optimizer.step()
    if i == 0 or ((i+1)%printInterval) == 0:
        print('Iter {}: average BCE loss is {:.3f}'.format(i+1,loss.item()))

with torch.no_grad():
    pred_test = f(X_test)

auroc = roc_auc_score(y_test,pred_test)
ap = average_precision_score(y_test,pred_test)
print('On test dataset: AUROC {:.3f}, AP {:.3f}'.format(auroc,ap))

Iter 1: average BCE loss is 0.757
Iter 500: average BCE loss is 0.046
Iter 1000: average BCE loss is 0.034
Iter 1500: average BCE loss is 0.027
Iter 2000: average BCE loss is 0.022
Iter 2500: average BCE loss is 0.018
Iter 3000: average BCE loss is 0.015
Iter 3500: average BCE loss is 0.012
Iter 4000: average BCE loss is 0.009
Iter 4500: average BCE loss is 0.007
Iter 5000: average BCE loss is 0.006
On test dataset: AUROC 0.971, AP 0.966
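
The training loop above relies on `torch.optim.Adam`. For intuition, the sketch below writes out the standard Adam update rule in NumPy and applies it to a one-dimensional quadratic. The function name `adam_minimize` and the toy objective are illustrative assumptions; this is not PyTorch's actual implementation, which has more options.

```python
import numpy as np

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=2000):
    """Minimize a function given its gradient, using the standard Adam update."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)  # running mean of gradients (momentum)
    v = np.zeros_like(x)  # running mean of squared gradients
    for t in range(1, n_iter + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2**t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return x

# Toy objective (x - 3)^2 with gradient 2 * (x - 3); the minimum is at x = 3.
x_min = adam_minimize(lambda x: 2 * (x - 3.0), x0=[0.0])
print(x_min)  # close to 3
```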


In [5]:
# Let us save the model parameters for later comparison.
# Recall that PyTorch performs automatic differentiation, so
# model parameters (e.g. h.weight) carry gradient information. This
# may cause problems in later analysis where gradients are not needed.
# Use .detach() to detach gradient information from the weights.

weight = h.weight.detach().squeeze().clone()

### Logistic Regression with $L_2$ regularization

Logistic regression with $L_2$ regularization is also called *ridge logistic regression*. Recall that we add $L_2$ regularization to model parameters:

$$
J_{L_2}(\theta) =  J(\theta) + \lambda/2 ||\theta||_2^2,
$$

where large parameter values are penalized by the $L_2$ norm term

$$
||\theta||_2^2 = \sum_j \theta_j^2.
$$

In PyTorch, $L_2$ regularization of model parameters is supported by the built-in optimizers.
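
The reason an optimizer can handle this penalty by itself is that the gradient of $\lambda/2\,||\theta||_2^2$ is simply $\lambda\theta$, so it is enough to add $\lambda\theta$ to the gradient of $J$ before each update. The sketch below checks this identity numerically with central finite differences; the helper names and the example values are illustrative only.

```python
import numpy as np

def l2_penalty(theta, lam):
    """The penalty term lambda/2 * ||theta||_2^2."""
    return 0.5 * lam * np.sum(theta**2)

def l2_penalty_grad(theta, lam):
    """Analytic gradient of the penalty: lambda * theta."""
    return lam * theta

theta = np.array([0.5, -2.0, 3.0])  # arbitrary example parameters
lam = 0.05
h = 1e-6

# Central finite differences along each coordinate direction.
numeric = np.array([
    (l2_penalty(theta + h * e, lam) - l2_penalty(theta - h * e, lam)) / (2 * h)
    for e in np.eye(len(theta))
])
print(np.allclose(numeric, l2_penalty_grad(theta, lam)))  # True
```

Note that an optimizer-level penalty applies to every parameter passed to the optimizer, which here includes the bias as well as the weights.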

In [6]:
h_L2 = torch.nn.Linear(
    in_features=n,
    out_features=1,
    bias=True
)
sigma = torch.nn.Sigmoid()

# Logistic model is linear+sigmoid
f_L2 = torch.nn.Sequential(
    h_L2,
    sigma
)

J_BCE = torch.nn.BCELoss()

# PyTorch optimizers support L2 regularization via
# the weight_decay parameter, which corresponds to
# the regularization strength.
GD_optimizer = torch.optim.Adam(lr=0.01,params=f_L2.parameters(),weight_decay=0.05)

nIter = 500
printInterval = 50

for i in range(nIter):
    GD_optimizer.zero_grad()
    pred = f_L2(X_train)
    loss = J_BCE(pred,y_train)
    loss.backward()
    GD_optimizer.step()
    if i == 0 or ((i+1)%printInterval) == 0:
        print('Iter {}: average BCE loss is {:.3f}'.format(i+1,loss.item()))

with torch.no_grad():
    pred_test = f_L2(X_test)

auroc = roc_auc_score(y_test,pred_test)
ap = average_precision_score(y_test,pred_test)
print('On test dataset: AUROC {:.3f}, AP {:.3f}'.format(auroc,ap))

Iter 1: average BCE loss is 0.882
Iter 50: average BCE loss is 0.135
Iter 100: average BCE loss is 0.114
Iter 150: average BCE loss is 0.108
Iter 200: average BCE loss is 0.105
Iter 250: average BCE loss is 0.104
Iter 300: average BCE loss is 0.104
Iter 350: average BCE loss is 0.103
Iter 400: average BCE loss is 0.103
Iter 450: average BCE loss is 0.103
Iter 500: average BCE loss is 0.103
On test dataset: AUROC 0.994, AP 0.995


In [7]:
weight_L2 = h_L2.weight.detach().squeeze().clone()

### Logistic Regression with $L_1$ Regularization

Logistic regression with $L_1$ regularization is also called *LASSO logistic regression*. For the LASSO logistic regression, we add $L_1$ regularization to model parameters:

$$
J_{L_1}(\theta) = J(\theta) + \lambda ||\theta||_1,
$$

where large model parameters are penalized by the $L_1$ norm term

$$
||\theta||_1 = \sum_j |\theta_j|.
$$

There is no built-in support for $L_1$ regularization in PyTorch optimizers, so we need to write our own code. The idea is to iterate over all parameters of our model, calculate the sum of their absolute values, and then let PyTorch do the differentiation and optimization.
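
To see what the differentiation produces, note that the derivative of $\lambda|\theta_j|$ is $\lambda\,\mathrm{sign}(\theta_j)$, with the value $0$ used at $\theta_j=0$. The sketch below applies plain gradient descent to the penalty alone; this is a simplifying assumption, since the notebook uses Adam and also includes the BCE term. It shows that a fixed step size makes a small weight hover around zero instead of landing exactly on it.

```python
import numpy as np

def l1_subgradient(theta, lam):
    # np.sign returns 0 at theta == 0, the same convention autograd uses for abs().
    return lam * np.sign(theta)

theta = np.array([0.025])  # a single small weight (illustrative value)
lam, lr = 1.0, 0.01        # illustrative penalty strength and step size

for _ in range(50):
    theta = theta - lr * l1_subgradient(theta, lam)

print(theta)  # magnitude about 0.005: small, but not exactly zero
```

This is why weights trained this way are usually compared against a small tolerance rather than tested for exact equality with zero.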

In [8]:
h_L1 = torch.nn.Linear(
    in_features=n,
    out_features=1,
    bias=True
)
sigma = torch.nn.Sigmoid()

# Logistic model is linear+sigmoid
f_L1 = torch.nn.Sequential(
    h_L1,
    sigma
)

J_BCE = torch.nn.BCELoss()

GD_optimizer = torch.optim.Adam(lr=0.01,params=f_L1.parameters())

# Define L_1 regularization
def L1_reg(model,lbd):
    result = torch.tensor(0.0) # float scalar that accumulates the penalty
    for param in model.parameters(): # iterate over all parameters of our model
        result = result + param.abs().sum()

    return lbd*result


nIter = 500
printInterval = 50
lbd = 0.03 # L1 reg strength

for i in range(nIter):
    GD_optimizer.zero_grad()
    pred = f_L1(X_train)
    loss = J_BCE(pred,y_train)
    (loss+L1_reg(f_L1,lbd)).backward()
    GD_optimizer.step()
    if i == 0 or ((i+1)%printInterval) == 0:
        print('Iter {}: average BCE loss is {:.3f}'.format(i+1,loss.item()))

with torch.no_grad():
    pred_test = f_L1(X_test)

auroc = roc_auc_score(y_test,pred_test)
ap = average_precision_score(y_test,pred_test)
print('On test dataset: AUROC {:.3f}, AP {:.3f}'.format(auroc,ap))

Iter 1: average BCE loss is 0.736
Iter 50: average BCE loss is 0.158
Iter 100: average BCE loss is 0.149
Iter 150: average BCE loss is 0.144
Iter 200: average BCE loss is 0.143
Iter 250: average BCE loss is 0.142
Iter 300: average BCE loss is 0.140
Iter 350: average BCE loss is 0.139
Iter 400: average BCE loss is 0.139
Iter 450: average BCE loss is 0.138
Iter 500: average BCE loss is 0.137
On test dataset: AUROC 0.994, AP 0.995


In [9]:
weight_L1 = h_L1.weight.detach().squeeze().clone()

Let us now visualize and compare the weights of the three models.

In [10]:
weight_df = pd.DataFrame(
    {
        'vanilla':weight,
        'L2':weight_L2,
        'L1':weight_L1
    }
).melt(id_vars=[],value_vars=['vanilla','L2','L1'])
weight_df

index,variable,value
0,vanilla,0.414906
1,vanilla,0.591732
2,vanilla,-0.200690
3,vanilla,0.029234
4,vanilla,-2.873187
...,...,...
85,L1,-0.002185
86,L1,-0.100998
87,L1,-0.475548
88,L1,-0.196790


In [11]:
fig = px.box(
    weight_df,
    y='value',
    facet_col='variable',
    color='variable',
    points='all',
    title='Logistic Regression Weights Distributions'
)
fig.update_yaxes(
    matches=None,
    showticklabels=True
)
fig.update_traces(jitter=0.5)

We observe the following:
1.  Using regularization makes training much faster.
    * The vanilla logistic regression has still not converged after 5000 iterations: its loss keeps decreasing.
    * Both regularized models converge within 500 iterations.
    * The vanilla logistic regression is not numerically stable.
        * If you run more iterations, you will see the model parameters grow larger and larger.
        * Neither regularization method suffers from this problem.
2.  The $L_2$ regularization reduced the range of the logistic regression weights.
3.  The $L_1$ regularization not only reduced the range of the weights, but also shrank many coefficients to (nearly) zero.

The last observation is not specific to logistic regression: $L_1$ regularization has the same sparsifying effect on other models. A common feature selection method is to use $L_1$ regularization and keep only the features with non-zero coefficients. If you increase the regularization strength by enlarging the `lbd` variable in the code above, more parameters will be reduced to zero.
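
Because gradient-based training tends to leave the suppressed weights very small rather than exactly zero, a practical version of this selection rule compares each weight against a small tolerance. Below is a minimal sketch; the function name `select_nonzero_features`, the tolerance `tol`, and the example weights are all hypothetical.

```python
import numpy as np

def select_nonzero_features(weights, names, tol=1e-2):
    """Keep the features whose absolute weight exceeds a small tolerance."""
    weights = np.asarray(weights, dtype=float)
    keep = np.abs(weights) > tol
    return [name for name, flag in zip(names, keep) if flag]

# Hypothetical weights for four features, for illustration only.
example_weights = [0.41, -0.002, 0.0005, -0.48]
example_names = ['f0', 'f1', 'f2', 'f3']
print(select_nonzero_features(example_weights, example_names))  # ['f0', 'f3']
```

With the notebook's variables, one could pass `weight_L1.numpy()` and `X_df.columns`. The right tolerance depends on the scale of the features, so treat `tol` as something to tune.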

### Forward and backward selection methods for feature selection

Here we demonstrate how to perform forward and backward feature selections.

Implementing these selection techniques from scratch requires a lot of work (encapsulating model definition and training, writing a custom cross-validation dataloader, etc.).

Therefore we will borrow tools from `sklearn`, which reduces feature selection and model fitting to just a few lines of code. The disadvantage is that the implementation details are a black box: you need to dig into the `sklearn` source code to find them out!
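
To make the black box a little less opaque, here is a minimal sketch of the greedy idea behind forward selection. It is not `sklearn`'s implementation: the function `forward_select` and the toy scoring function are illustrative assumptions. In practice `score_fn` would fit a model on the candidate feature subset and return a cross-validated score.

```python
def forward_select(n_features, k, score_fn):
    """Greedily add the feature that most improves score_fn until k are chosen."""
    selected = []
    remaining = list(range(n_features))
    while len(selected) < k:
        # Try each remaining feature together with those already selected.
        best = max(remaining, key=lambda j: score_fn(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: each feature contributes a fixed, additive gain (illustration only).
gains = [0.1, 0.5, 0.3, 0.2]
toy_score = lambda subset: sum(gains[j] for j in subset)
print(forward_select(n_features=4, k=2, score_fn=toy_score))  # [1, 2]
```

Backward selection follows the same pattern in reverse: start from all features and repeatedly drop the one whose removal hurts the score least.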

In [12]:
from sklearn.linear_model import LogisticRegression as logit # use the built-in logistic regression model in sklearn
from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.metrics import roc_curve, precision_recall_curve

In [13]:
# sklearn supports fitting models on pandas dataframes.
# We use the same train-test split as before.

X_df_train = X_df.iloc[case_list=='train',:]
X_df_test = X_df.iloc[case_list=='test',:]
y_df_train = y_df.iloc[case_list=='train']
y_df_test = y_df.iloc[case_list=='test']

In [14]:
model = logit(penalty='l1',C=1/10,solver='liblinear') # C: 1/(strength of L1 regularization)

# Forward feature selection.
forward_selection = SFS(
    model, n_features_to_select=3, direction="forward"
).fit(X_df_train, y_df_train)

# Backward feature selection.
backward_selection = SFS(
    model, n_features_to_select=3, direction="backward"
).fit(X_df_train, y_df_train)

In [15]:
forward_selection.get_feature_names_out()

array(['mean concave points', 'radius error', 'worst concave points'],
      dtype=object)

In [16]:
backward_selection.get_feature_names_out()

array(['mean concavity', 'worst radius', 'worst texture'], dtype=object)

We see that the forward and backward selections do not agree on the top-3 most important features.

After feature selection, we can use `forward_selection` and `backward_selection` to automatically reduce the input dimension. Let us fit logistic regression models on 1) the full feature set and 2) the selected features, and visualize their performance on the test dataset.

In [17]:
# Full model
model.fit(X_df_train,y_df_train)
y_pred_full = model.predict_proba(X_df_test)

# Model with forward selected features
model.fit(forward_selection.transform(X_df_train),y_df_train)
y_pred_FS = model.predict_proba(forward_selection.transform(X_df_test))

# Model with backward selected features
model.fit(backward_selection.transform(X_df_train),y_df_train)
y_pred_BS = model.predict_proba(backward_selection.transform(X_df_test))

In [18]:
# roc_curve
fpr_full, tpr_full, _ = roc_curve(y_df_test,y_pred_full[:,1])
fpr_FS, tpr_FS, _ = roc_curve(y_df_test,y_pred_FS[:,1])
fpr_BS, tpr_BS, _ = roc_curve(y_df_test,y_pred_BS[:,1])

roc_df = pd.DataFrame(
    {
        'False Positive Rate':np.hstack([fpr_full,fpr_FS,fpr_BS]),
        'True Positive Rate':np.hstack([tpr_full,tpr_FS,tpr_BS]),
        'method':['full_model']*len(fpr_full)+['FS']*len(fpr_FS)+['BS']*len(fpr_BS)
    }
)

In [19]:
# Visualize ROC curve
fig = px.line(roc_df,y='True Positive Rate',x='False Positive Rate',facet_col='method',color='method')
fig

In [20]:
# precision recall curves
p_full, r_full, _ = precision_recall_curve(y_df_test,y_pred_full[:,1])
p_FS, r_FS, _ = precision_recall_curve(y_df_test,y_pred_FS[:,1])
p_BS, r_BS, _ = precision_recall_curve(y_df_test,y_pred_BS[:,1])

pr_df = pd.DataFrame(
    {
        'Precision':np.hstack([p_full,p_FS,p_BS]),
        'Recall':np.hstack([r_full,r_FS,r_BS]),
        'method':['Full Model']*len(p_full)+['Forward Selection']*len(p_FS)+['Backward Selection']*len(p_BS)
    }
)

In [21]:
# Visualize precision recall curve
fig = px.line(pr_df,x='Recall',y='Precision',facet_col='method',color='method')
fig

From both the ROC and precision-recall curves we see that the full model performs similarly to the models with only 3 features.
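
For intuition about what the two curves contain, the sketch below sweeps a decision threshold over the predicted scores and records the true positive rate, false positive rate, and precision at each step. The helper `curve_points` is illustrative; `roc_curve` and `precision_recall_curve` return similar information but add extra end points and use their own conventions.

```python
import numpy as np

def curve_points(y_true, scores):
    """TPR, FPR, and precision when predicting positive for score >= threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    rows = []
    for thr in np.unique(scores)[::-1]:  # thresholds from high to low
        pred = scores >= thr
        tp = int((pred & (y_true == 1)).sum())
        fp = int((pred & (y_true == 0)).sum())
        rows.append({
            'threshold': float(thr),
            'TPR (recall)': tp / n_pos,
            'FPR': fp / n_neg,
            'precision': tp / (tp + fp),
        })
    return rows

for row in curve_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]):
    print(row)
```

Plotting TPR against FPR gives the ROC curve, and plotting precision against recall gives the precision-recall curve.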

### Data with imbalanced classes

One common problem in modeling is that the classes in the data are highly imbalanced, e.g. there are many more negative cases than positive cases. In this section we learn two techniques for handling imbalanced data.

#### Generate imbalanced data

In [22]:
# The original breast cancer data statistics.
y_df.value_counts()

1    357
0    212
Name: target, dtype: int64

For the breast cancer dataset there are more positive cases, but the ratio between the two class sizes is not too bad (<2).

We synthesize a highly imbalanced dataset with many more negative cases.
1.  We sample with replacement to generate new data with $10000$ records, where each positive record ($y=1$) has sampling weight 0.05 and each negative record ($y=0$) has sampling weight 0.95. After normalization over all records, roughly $8\%$ of the sampled records are expected to be positive.
2.  Note that data sampled in this way has many duplicate records. We add Gaussian noise to the independent variables so that duplicated records in the sampled dataset no longer share identical $x$ values.
3.  We do a $6:4$ train-test split on the synthesized data.
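
Since `DataFrame.sample` normalizes `weights` so that they sum to one, the weights 0.05 and 0.95 apply per record, not per class. The short calculation below uses the class counts of the original data to work out the expected share of positive cases in the resampled dataset.

```python
n_pos, n_neg = 357, 212    # class counts in the original data
w_pos, w_neg = 0.05, 0.95  # per-record sampling weights

# Probability that a single draw is a positive case, after normalization.
p_pos = n_pos * w_pos / (n_pos * w_pos + n_neg * w_neg)
print(round(p_pos, 4))  # 0.0814
```

So about 814 of the $10000$ sampled records are expected to be positive, not 500.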

In [23]:
# Per-record sampling weights: 0.05 for positive cases, 0.95 for negative cases.
p_array = np.where(y_df.to_numpy()==1,0.05,0.95)

# Normalize original data first.
X_df = (X_raw-X_raw.mean())/X_raw.std()

# Combine input and target.
Xy_df = pd.concat([X_df,y_df],axis=1)

# Sample 10000 records with prescribed probability.
Xy_df_resampled = Xy_df.sample(n=10000,replace=True,weights=p_array,random_state=12,axis=0)

# Split input and target from sampled data.
X_df_resampled, y_df_resampled = Xy_df_resampled.iloc[:,:-1], Xy_df_resampled.iloc[:,-1]

# Add noise to input.
X_np_resampled = X_df_resampled.to_numpy()+np.random.normal(scale=3,size=X_df_resampled.to_numpy().shape)
X_df_resampled = pd.DataFrame(X_np_resampled,columns=X_df_resampled.columns)

In [24]:
# The resulting synthetic dataset is highly imbalanced
y_df_resampled.value_counts()

0    9176
1     824
Name: target, dtype: int64

In [25]:
# We use an approximate 6:4 train-test split
np.random.seed(12)
cases = ['train','test']
case_list = np.random.choice(cases,size=len(X_df_resampled),replace=True,p=[0.6,0.4])

X_df_RTrain, y_df_RTrain = X_df_resampled.iloc[case_list=='train'], y_df_resampled.iloc[case_list=='train']
X_df_RTest, y_df_RTest = X_df_resampled.iloc[case_list=='test'], y_df_resampled.iloc[case_list=='test']

#### Vanilla logistic regression on synthesized data

We first fit a vanilla logistic regression on the synthesized data. Note that since this data is highly imbalanced, the accuracy metric can be misleading.

In [26]:
# We first fit a standard logistic regression.
model = logit(solver='liblinear')
model.fit(X_df_RTrain,y_df_RTrain)

In [27]:
pred = model.predict_proba(X_df_RTest)

We will visualize and compare the prediction results later.

#### Logistic regression with balanced class weight

In [28]:
model_BW = logit(solver='liblinear',class_weight='balanced')
model_BW.fit(X_df_RTrain,y_df_RTrain)

In [29]:
pred_BW = model_BW.predict_proba(X_df_RTest)
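
With `class_weight='balanced'`, `sklearn` weights each class by `n_samples / (n_classes * np.bincount(y))`, so errors on the rarer class count for more in the loss. The sketch below evaluates this formula for the training class counts that `y_df_RTrain.value_counts()` reports in this notebook (5758 negative, 297 positive).

```python
import numpy as np

counts = np.array([5758, 297])  # training counts for class 0 and class 1
class_weights = counts.sum() / (len(counts) * counts)
print(class_weights.round(3))  # roughly 0.526 for class 0 and 10.194 for class 1
```

Each positive case therefore carries about 19 times the weight of a negative case, which matches the ratio of the two class counts.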

#### Using SMOTE to oversample positive cases

Another way to cope with highly imbalanced data is to use SMOTE to oversample the minority class (in our example, the positive cases). At a high level, SMOTE generates new minority samples as follows:
1.  Randomly select a minority case $x$.
2.  Among all the minority cases, find the $k$ cases $x_1,x_2,\cdots,x_k$ closest to $x$. Here $k$ is a hyperparameter (a common choice is $k=5$).
3.  Randomly select $x_i$ from these $k$ nearest minority cases.
4.  Draw a line segment from $x$ to $x_i$, and generate a new minority sample by randomly selecting a point on this line segment.
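
The four steps above can be sketched in a few lines of NumPy. This is a simplified illustration that generates one synthetic point per call, assuming Euclidean distance; it is not the `imblearn` implementation, and the function name `smote_sample` and the toy data are hypothetical.

```python
import numpy as np

def smote_sample(X_min, k=5, rng=None):
    """Generate one synthetic minority sample from the minority cases X_min."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(X_min))              # step 1: pick a minority case x
    x = X_min[i]
    dist = np.linalg.norm(X_min - x, axis=1)  # Euclidean distances to x
    dist[i] = np.inf                          # exclude x itself
    neighbors = np.argsort(dist)[:k]          # step 2: the k nearest minority cases
    x_i = X_min[rng.choice(neighbors)]        # step 3: pick one neighbor at random
    u = rng.random()                          # step 4: random point on the segment
    return x + u * (x_i - x)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))  # toy minority cases in 2D
print(smote_sample(X_min, k=5, rng=rng))
```

Every synthetic point is a convex combination of two existing minority cases, so the new samples stay inside the region already covered by the minority class.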

In [30]:
# Use the imbalanced-learn package
from imblearn.over_sampling import SMOTE

In [31]:
# Use SMOTE to resample minority class.
smote_sampler = SMOTE(random_state=12,sampling_strategy='minority')
X_df_SMOTE, y_df_SMOTE = smote_sampler.fit_resample(X_df_RTrain, y_df_RTrain)

In [32]:
y_df_RTrain.value_counts()

0    5758
1     297
Name: target, dtype: int64

In [33]:
y_df_SMOTE.value_counts()

0    5758
1    5758
Name: target, dtype: int64

After applying SMOTE, we have an equal number of negative and positive cases in the training dataset.

WARNING: you should only apply SMOTE or any other resampling technique to the **training** dataset; the testing dataset should be kept untouched.

In [34]:
model_SMOTE = logit(solver='liblinear')
model_SMOTE.fit(X_df_SMOTE,y_df_SMOTE)

In [35]:
pred_SMOTE = model_SMOTE.predict_proba(X_df_RTest)

In [36]:
# Gather test results for imbalanced dataset

IM_results_df = pd.DataFrame(
    {
        'pred':np.vstack([pred,pred_BW,pred_SMOTE])[:,1],
        'label':pd.concat([y_df_RTest]*3),
        'method':['vanilla']*pred.shape[0]+['balanced weight']*pred_BW.shape[0]+['SMOTE']*pred_SMOTE.shape[0]
    }
)

threshold = 0.5

IM_results_df['binary pred'] = (IM_results_df['pred']>threshold).astype('int')
IM_results_df

index,pred,label,method,binary pred
353,0.034024,0,vanilla,0
501,0.046796,0,vanilla,0
492,0.006674,0,vanilla,0
533,0.003480,0,vanilla,0
260,0.004869,0,vanilla,0
...,...,...,...,...
498,0.071034,0,SMOTE,0
501,0.597778,0,SMOTE,1
262,0.596512,0,SMOTE,1
502,0.484966,1,SMOTE,0


In [37]:
IM_results_df.groupby(['label','method'])['binary pred'].value_counts().unstack()

label,method,binary pred=0,binary pred=1
0,SMOTE,2882,536
0,balanced weight,2848,570
0,vanilla,3404,14
1,SMOTE,115,412
1,balanced weight,107,420
1,vanilla,466,61


From the above table we see that vanilla logistic regression performed very badly: it predicted most of the cases with a positive label (466 of 527) as negative.

For such highly imbalanced data, it is very important to choose a proper metric to measure model performance. When the minority class is the positive class (i.e. labeled 1), you can use the F1 score. Try for yourself what happens if you use average precision, accuracy, or AUROC as the metric.

(The F1 score takes values in $[0,1]$ and is the harmonic mean of precision and recall at the given threshold. The higher the value, the better the model performance.)

In [38]:
from sklearn.metrics import f1_score



for method in ['vanilla','balanced weight','SMOTE']:
    score = f1_score(IM_results_df.query('method==@method')['label'],IM_results_df.query('method==@method')['binary pred'])
    print('F1 score is {:.4f} for method {} '.format(score,method))


F1 score is 0.2027 for method vanilla 
F1 score is 0.5537 for method balanced weight 
F1 score is 0.5586 for method SMOTE 
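
As a check on the definition, the same scores can be recomputed by hand from the confusion counts in the table above, where the rows with label 1 give the true positives and false negatives and the rows with label 0 give the false positives. The helper `f1_from_counts` is illustrative.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# (true positives, false positives, false negatives) read from the table above
confusion_counts = {
    'vanilla': (61, 14, 466),
    'balanced weight': (420, 570, 107),
    'SMOTE': (412, 536, 115),
}
for method, (tp, fp, fn) in confusion_counts.items():
    print('{}: F1 = {:.4f}'.format(method, f1_from_counts(tp, fp, fn)))
```

The vanilla model has high precision (61 of its 75 positive predictions are correct) but recalls only 61 of the 527 positive cases, and the harmonic mean is dominated by the smaller of the two.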
