Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# REGULARIZATION AND SGD

Regularization is a technique that helps us reduce overfitting by penalizing large feature weights. Several classifiers, such as [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html), let us choose which regularization term to use.
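
To make the idea of a penalty term concrete, here is a minimal sketch that computes L1 and L2 penalties for two made-up weight vectors. The helper names `l1_penalty` and `l2_penalty` and the example weights are illustrative assumptions, not part of the exercise; each scikit-learn estimator scales and combines these terms in its own way.

```python
import numpy as np


def l1_penalty(w):
    """Sum of absolute weights (the L1 term)."""
    return np.sum(np.abs(w))


def l2_penalty(w):
    """Sum of squared weights (the L2 term)."""
    return np.sum(w ** 2)


# hypothetical weight vectors: same total absolute weight, spread differently
w_spread = np.array([0.5, -0.5, 0.5, -0.5])
w_peaked = np.array([2.0, 0.0, 0.0, 0.0])

for name, w in [("spread", w_spread), ("peaked", w_peaked)]:
    print(f"{name}: L1 = {l1_penalty(w):.2f}, L2 = {l2_penalty(w):.2f}")
```

Both vectors get the same L1 penalty (2.00), while L2 charges the peaked vector four times as much (4.00 versus 1.00). This is one way to see why L2 discourages individual large weights, whereas L1 is indifferent to how the weight is distributed and therefore tolerates sparse solutions.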


In this notebook we'll explore the usage of different regularization terms. For that, we'll use a restaurant reviews classification task.


In [30]:
# Importing the required packages

import pandas as pd
import re

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score,
    accuracy_score,
)

In [31]:
# Importing the dataset
dataset = pd.read_csv("../data/restaurant_reviews.tsv", delimiter="\t", quoting=3)

print(dataset["Liked"].value_counts())
dataset.head()

1    500
0    500
Name: Liked, dtype: int64


Unnamed: 0,Review,Liked
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [32]:
# Cleaning the text
corpus = []
ps = PorterStemmer()
# build the stop word set once, instead of on every iteration
stop_words = set(stopwords.words("english"))
for i in range(len(dataset)):
    # get review and remove non alpha chars
    review = re.sub("[^a-zA-Z]", " ", dataset["Review"][i])
    # to lower-case and tokenize
    review = review.lower().split()
    # stemming and stop word removal
    review = " ".join([ps.stem(w) for w in review if w not in stop_words])
    corpus.append(review)

In [33]:
# Creating a bag-of-words model
vectorizer = CountVectorizer(max_features=1500)
X = vectorizer.fit_transform(corpus).toarray()
y = dataset["Liked"]

print(X.shape, y.shape)

(1000, 1500) (1000,)


In [34]:
# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

print(y_train.value_counts())
print(y_test.value_counts())

(800, 1500) (800,)
(200, 1500) (200,)
1    400
0    400
Name: Liked, dtype: int64
0    100
1    100
Name: Liked, dtype: int64


In [35]:
def get_features_weights(model, vectorizer, clean=False):
    """
    Returns a dataframe with the features and their weights
    """

    weights = model.coef_[0]
    features = vectorizer.get_feature_names_out()
    df = pd.DataFrame({"feature": features, "weight": weights})
    df = df.sort_values("weight", ascending=False)

    if clean:
        df = df[df["weight"] != 0]

    df.reset_index(drop=True, inplace=True)
    return df

In [36]:
def evaluate(y_test, y_pred):
    """
    Evaluates the model, given the test set and the model predictions
    """

    print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred)}\n")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"Precision: {precision_score(y_test, y_pred)}")
    print(f"Recall: {recall_score(y_test, y_pred)}")
    print(f"F1: {f1_score(y_test, y_pred)}\n")

## Logistic Regression

Scikit-learn's [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) supports both L1 and L2 regularization. L2 is the default.


In [37]:
clf_l2 = LogisticRegression(penalty="l2")  # l2 regularization is the default
clf_l2.fit(X_train, y_train)
y_pred = clf_l2.predict(X_test)

evaluate(y_test, y_pred)

Confusion matrix:
[[83 17]
 [22 78]]

Accuracy: 0.805
Precision: 0.8210526315789474
Recall: 0.78
F1: 0.8



Print the feature weights that we've obtained.


In [38]:
# your code here
fw_l2 = get_features_weights(clf_l2, vectorizer)
print(f"Number of features: {fw_l2.shape[0]}")
fw_l2.head()


Number of features: 1500


Unnamed: 0,feature,weight
0,great,2.831205
1,delici,2.020061
2,amaz,1.673413
3,fantast,1.664433
4,awesom,1.489705


How many features are actually being used? (I.e., how many non-zero weights are there?)


In [39]:
# your code here
fw_l2_clean = get_features_weights(clf_l2, vectorizer, True)
print(f"Number of features actually being used (w/ a non-zero weight): {fw_l2_clean.shape[0]}")

Number of features actually being used (w/ a non-zero weight): 1311


L1 regularization typically obtains sparser weight vectors. Try using L1 regularization (check the documentation for additional changes you might need). How many non-zero weights do you have now?


In [40]:
# your code here
clf_l1 = LogisticRegression(penalty="l1", solver="liblinear")
clf_l1.fit(X_train, y_train)
y_pred = clf_l1.predict(X_test)

evaluate(y_test, y_pred)

fw_l1 = get_features_weights(clf_l1, vectorizer, True)
print(f"Number of features actually being used (w/ a non-zero weight): {fw_l1.shape[0]}")

Confusion matrix:
[[89 11]
 [30 70]]

Accuracy: 0.795
Precision: 0.8641975308641975
Recall: 0.7
F1: 0.7734806629834253

Number of features actually being used (w/ a non-zero weight): 149


You can also try using a mix of L1 and L2 (check the documentation for how to do it).


In [41]:
# your code here
clf_l12 = LogisticRegression(
    penalty="elasticnet", 
    solver="saga", 
    l1_ratio=0.5, 
    max_iter=1000
)
clf_l12.fit(X_train, y_train)
y_pred = clf_l12.predict(X_test)

evaluate(y_test, y_pred)

fw_l12 = get_features_weights(clf_l12, vectorizer, True)
print(f"Number of features actually being used (w/ a non-zero weight): {fw_l12.shape[0]}")

Confusion matrix:
[[84 16]
 [28 72]]

Accuracy: 0.78
Precision: 0.8181818181818182
Recall: 0.72
F1: 0.7659574468085107

Number of features actually being used (w/ a non-zero weight): 380


## SVM

Scikit-learn's linear [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) also supports both L1 and L2 regularization. L2 is the default.


In [42]:
clf = LinearSVC(penalty="l2")  # l2 regularization is the default
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

evaluate(y_test, y_pred)

Confusion matrix:
[[82 18]
 [20 80]]

Accuracy: 0.81
Precision: 0.8163265306122449
Recall: 0.8
F1: 0.8080808080808082



How many features are actually being used? (I.e., how many non-zero weights are there?)


In [43]:
# your code here
fw_svc_l2 = get_features_weights(clf, vectorizer, True)  # this model uses L2
print(f"Number of features actually being used (w/ a non-zero weight): {fw_svc_l2.shape[0]}")

Number of features actually being used (w/ a non-zero weight): 1084


Try using L1 regularization (check the documentation for additional changes you might need). How many non-zero weights do you have now?


In [44]:
# your code here
clf = LinearSVC(penalty="l1", dual=False, max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

evaluate(y_test, y_pred)

fw_svc_l1 = get_features_weights(clf, vectorizer, True)  # this model uses L1
print(f"Number of features actually being used (w/ a non-zero weight): {fw_svc_l1.shape[0]}")

Confusion matrix:
[[86 14]
 [27 73]]

Accuracy: 0.795
Precision: 0.8390804597701149
Recall: 0.73
F1: 0.7807486631016043

Number of features actually being used (w/ a non-zero weight): 421


## SGD Classifier

Scikit-learn's [SGD Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) implements regularized linear models (such as SVM and Logistic Regression) with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing learning rate.

Several loss functions can be used, namely _hinge loss_ (which corresponds to SVM) and _log loss_ (which corresponds to Logistic Regression). And as before, you can use L1 and/or L2 regularization.
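
As a rough illustration of how the two losses differ, the sketch below evaluates both on a few raw decision scores. It assumes labels encoded as -1/+1 (the usual convention for writing these formulas), and the helper names `hinge_loss` and `logistic_loss` are hypothetical, not scikit-learn functions.

```python
import numpy as np


def hinge_loss(y, score):
    """Hinge loss for labels y in {-1, +1} and raw decision scores."""
    return np.maximum(0.0, 1.0 - y * score)


def logistic_loss(y, score):
    """Logistic (log) loss for labels y in {-1, +1} and raw decision scores."""
    return np.log1p(np.exp(-y * score))


# made-up decision scores for examples whose true label is +1
scores = np.array([-2.0, 0.0, 0.5, 2.0])
labels = np.ones_like(scores)

for s, h, l in zip(scores, hinge_loss(labels, scores), logistic_loss(labels, scores)):
    print(f"score = {s:+.1f}  hinge = {h:.3f}  log = {l:.3f}")
```

The hinge loss is exactly zero once an example is classified with a margin of at least 1, while the log loss keeps shrinking but never reaches zero.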

The _max_iter_ parameter allows you to set the maximum number of epochs, where an epoch corresponds to going through the whole dataset for training. Also, _learning_rate_ allows you to set a learning rate schedule.
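
The sketch below shows one possible way of comparing learning rate schedules. It uses a small synthetic dataset from `make_classification` rather than the reviews data, and the `eta0` and `power_t` values are arbitrary choices made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# small synthetic problem (an assumption, used only to keep the example self-contained)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# schedule name -> extra parameters that schedule needs
schedules = {
    "optimal": {},
    "constant": {"eta0": 0.01},
    "invscaling": {"eta0": 0.01, "power_t": 0.5},
    "adaptive": {"eta0": 0.01},
}

results = {}
for name, extra in schedules.items():
    clf_demo = SGDClassifier(
        learning_rate=name, max_iter=50, tol=1e-3, random_state=0, **extra
    )
    clf_demo.fit(X_demo, y_demo)
    # n_iter_ holds the number of epochs actually run
    results[name] = (clf_demo.n_iter_, clf_demo.score(X_demo, y_demo))
    print(f"{name}: epochs = {results[name][0]}, train accuracy = {results[name][1]:.3f}")
```

If a schedule does not converge within `max_iter` epochs, scikit-learn emits a `ConvergenceWarning`; raising `max_iter` or adjusting `eta0` is the usual response.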

Several parameters allow you to define stopping criteria: _tol_ specifies the minimum improvement in the loss that counts as progress, while _n_iter_no_change_ indicates the number of consecutive epochs without such an improvement that should be observed before stopping; _early_stopping_ allows us to use a validation set (a fraction _validation_fraction_ of the training data) on which the stopping criterion will be checked, using the validation score instead of the loss on the training data.
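
A minimal sketch of how these stopping parameters fit together, again on synthetic data (an assumption made to keep the example self-contained); the specific values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

clf_es = SGDClassifier(
    loss="hinge",
    max_iter=1000,            # upper bound on the number of epochs
    tol=1e-3,                 # minimum improvement that counts as progress
    n_iter_no_change=5,       # patience, in epochs
    early_stopping=True,      # monitor a held-out validation split
    validation_fraction=0.2,  # fraction of the training data held out
    random_state=0,
)
clf_es.fit(X_demo, y_demo)

print(f"Epochs run: {clf_es.n_iter_} (max_iter was {clf_es.max_iter})")
```

Comparing `n_iter_` with `max_iter` tells you whether training stopped because of the stopping criterion or because it ran out of epochs.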

The _verbose_ parameter allows you to set a verbosity (output) level.

Try using SGD, and explore different parameters!


In [45]:
# your code here
clf = SGDClassifier(loss="perceptron", verbose=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

evaluate(y_test, y_pred)

fw_sgd = get_features_weights(clf, vectorizer, True)
print(f"Number of features actually being used (w/ a non-zero weight): {fw_sgd.shape[0]}")

-- Epoch 1
Norm: 177.62, NNZs: 724, Bias: -0.120071, T: 800, Avg. loss: 3.764285
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 144.12, NNZs: 876, Bias: -0.092645, T: 1600, Avg. loss: 1.403807
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 121.34, NNZs: 931, Bias: -0.104234, T: 2400, Avg. loss: 0.444371
Total training time: 0.01 seconds.
-- Epoch 4
Norm: 103.37, NNZs: 945, Bias: -0.207467, T: 3200, Avg. loss: 0.212196
Total training time: 0.01 seconds.
-- Epoch 5
Norm: 90.53, NNZs: 965, Bias: -0.341325, T: 4000, Avg. loss: 0.145401
Total training time: 0.01 seconds.
-- Epoch 6
Norm: 80.35, NNZs: 972, Bias: -0.415416, T: 4800, Avg. loss: 0.094352
Total training time: 0.02 seconds.
-- Epoch 7
Norm: 72.64, NNZs: 985, Bias: -0.479108, T: 5600, Avg. loss: 0.086488
Total training time: 0.02 seconds.
-- Epoch 8
Norm: 66.13, NNZs: 989, Bias: -1.848487, T: 6400, Avg. loss: 0.047598
Total training time: 0.02 seconds.
-- Epoch 9
Norm: 60.75, NNZs: 992, Bias: 0.709440, T: 7200, Avg. loss

Stochastic gradient descent updates the model weights based on one example at a time. Instead, we can compute the gradient over batches of training instances before updating the weights.

SGDClassifier lets us train on mini-batches via [_partial_fit_](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit), which performs a single epoch of stochastic gradient descent on the set of examples it is given. To use this method properly, we need to split our data into mini-batches and then iterate through them for as many epochs as we want. Matters such as objective convergence, early stopping, and learning rate adjustments must be handled manually.

Try it out!


In [48]:
# your code here
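import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

n_epochs = 10  # number of passes over the training set
batch_size = 10


def batch(iter_X, iter_y, n=1):
    """Yields consecutive mini-batches of (at most) n examples."""
    n_samples = len(iter_X)
    for ndx in range(0, n_samples, n):
        yield iter_X[ndx : min(ndx + n, n_samples)], iter_y[ndx : min(ndx + n, n_samples)]


# note: partial_fit() runs a single epoch per call, regardless of max_iter
clf = SGDClassifier(
    alpha=0.0001,
    loss="log_loss",
    penalty="l2",
    n_jobs=-1,
    shuffle=True,
    max_iter=100,
    verbose=0,
    tol=0.001,
)

for _ in range(n_epochs):
    for chunk_X, chunk_y in batch(X_train, y_train, batch_size):
        # classes is required on the first call; repeating it afterwards is harmless
        clf.partial_fit(chunk_X, chunk_y, classes=[0, 1])
        # evaluate on the test set after every mini-batch update
        y_predicted = clf.predict(X_test)
        evaluate(y_test, y_predicted)
        print("-----------------------------")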

Confusion matrix:
[[ 6 94]
 [ 1 99]]

Accuracy: 0.525
Precision: 0.5129533678756477
Recall: 0.99
F1: 0.6757679180887373

-----------------------------
Confusion matrix:
[[85 15]
 [73 27]]

Accuracy: 0.56
Precision: 0.6428571428571429
Recall: 0.27
F1: 0.3802816901408451

-----------------------------
Confusion matrix:
[[99  1]
 [86 14]]

Accuracy: 0.565
Precision: 0.9333333333333333
Recall: 0.14
F1: 0.24347826086956526

-----------------------------
Confusion matrix:
[[28 72]
 [24 76]]

Accuracy: 0.52
Precision: 0.5135135135135135
Recall: 0.76
F1: 0.6129032258064517

-----------------------------
Confusion matrix:
[[32 68]
 [24 76]]

Accuracy: 0.54
Precision: 0.5277777777777778
Recall: 0.76
F1: 0.6229508196721312

-----------------------------
Confusion matrix:
[[94  6]
 [65 35]]

Accuracy: 0.645
Precision: 0.8536585365853658
Recall: 0.35
F1: 0.49645390070921985

-----------------------------
Confusion matrix:
[[80 20]
 [57 43]]

Accuracy: 0.615
Precision: 0.6825396825396826
Recall: 0.4