# Foundations of AI/ML by IIIT-Hyderabad & Talent Sprint
# Lab08 Experiment 04

## Regularization ##

Regularization involves adding an extra term to the loss function, which penalizes certain parameter configurations.

In principle, adding a regularization term to the loss will encourage smooth network mappings in a neural network (by penalizing large values of the parameters, which decreases the amount of nonlinearity that the network models).

The regularization term in the loss acts as a weight decay factor during backpropagation. It controls the complexity of the model (to prevent overfitting) by penalizing weights with large magnitude.
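
The weight decay view can be checked numerically for a squared-magnitude (L2) penalty: its gradient is $\lambda w$, so one gradient step first shrinks every weight by the factor $(1 - \eta\lambda)$ and then applies the usual data-loss step. The snippet below is a minimal sketch; `lr`, `lam`, and the random stand-in for the data gradient are illustrative values, not taken from this experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)          # illustrative weights
grad_data = rng.normal(size=5)  # stand-in for the gradient of the data loss
lr, lam = 0.1, 0.5              # hypothetical learning rate and regularization strength

# Gradient step on the regularized loss: data loss + (lam / 2) * ||w||^2
w_reg = w - lr * (grad_data + lam * w)

# Equivalent "weight decay" form: shrink the weights, then take the plain step
w_decay = (1 - lr * lam) * w - lr * grad_data

print(np.allclose(w_reg, w_decay))  # True
```

Both updates are algebraically identical, which is why the L2 term is often described as weight decay.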

L1 regularization shrinks some parameters exactly to zero, so the corresponding variables play no role in the model. L1 regularization can therefore be seen as a way to select features. However, if too few non-zero parameters remain, the model is no longer able to learn complex patterns. Here, for each weight $w$ we add the term $\lambda |w|$ to the objective.
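
The sparsity effect can be illustrated on a handful of weights. The sketch below computes the L1 penalty and its subgradient $\lambda\,\mathrm{sign}(w)$, then compares one penalty-only update under L1 and under a squared (L2) penalty. Note the assumptions: the L1 step uses soft-thresholding (shrink the magnitude by `lr * lam` and stop at zero), which is a common way to apply an L1 penalty but is not what the training loop in this experiment does, and all numeric values are made up for illustration.

```python
import numpy as np

def l1_penalty(w, lam):
    """L1 penalty: lam * sum(|w|)."""
    return lam * np.sum(np.abs(w))

def l1_subgradient(w, lam):
    """Subgradient of the L1 penalty: lam * sign(w), with 0 used at w == 0."""
    return lam * np.sign(w)

w = np.array([-2.0, -0.05, 0.0, 0.03, 1.5])  # illustrative weights
lam, lr = 1.0, 0.1                            # hypothetical strength and learning rate

print("penalty:", l1_penalty(w, lam))
print("subgradient:", l1_subgradient(w, lam))

# Penalty-only step with soft-thresholding: shrink |w| by lr * lam, stopping at 0.
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
# Penalty-only step for a squared penalty: multiplicative shrinkage, never exactly 0.
w_l2 = (1 - lr * lam) * w

print("non-zero weights after L1 step:", np.count_nonzero(w_l1))
print("non-zero weights after L2 step:", np.count_nonzero(w_l2))
```

The L1 pull has the same magnitude for every non-zero weight, so small weights are driven all the way to zero, while the squared penalty only scales them down.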

In this experiment, we implement L1 regularization in combination with L2 regularization. L2 regularization adds a penalty proportional to the sum of the squared values of the coefficients. It forces the parameters to be relatively small: the larger the penalty, the smaller (and typically the more robust) the coefficients are. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight $w$ in the network, we add the term $\frac{1}{2}\lambda w^2$ to the objective, where $\lambda$ is the regularization strength.
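
Putting the two penalties together, the regularized objective and its (sub)gradient can be written as follows. This is a sketch in notation introduced here: $L_{\text{data}}$ stands for the cross-entropy loss summed over the training set, and $\lambda_1$, $\lambda_2$ denote the L1 and L2 strengths (`regL1` and `regL2` in the code).

$$
J(W) = L_{\text{data}}(W) + \lambda_1 \sum_{w} |w| + \frac{\lambda_2}{2} \sum_{w} w^2
$$

$$
\frac{\partial J}{\partial w} = \frac{\partial L_{\text{data}}}{\partial w} + \lambda_1\,\operatorname{sign}(w) + \lambda_2\, w
$$

The absolute value is not differentiable at $w = 0$; taking $\operatorname{sign}(0) = 0$ there is a common convention.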

In [115]:
import numpy as np
from scipy import ndimage
from matplotlib import pyplot as plt
from sklearn import manifold, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_openml    ## replaces fetch_mldata, which was removed in scikit-learn 0.22

# Load the scikit-learn digits dataset (8x8 images, a small MNIST-like set)
digits = datasets.load_digits(n_class=10)
X = digits.data
Y = digits.target
print(X.shape, Y.shape)
num_examples = X.shape[0]      ## training set size
nn_input_dim = X.shape[1]      ## input layer dimensionality
nn_output_dim = len(np.unique(Y))       ## output layer dimensionality

params = {
    "lr":0.0001,        ## learning_rate
    "max_iter":500,
    "h_dimn":50,     ## hidden_layer_size
    "regL1":1,
    "regL2":1,
}
print(np.unique(Y))

(1797, 64) (1797,)
[0 1 2 3 4 5 6 7 8 9]


In [116]:
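def softmax(x):
    # Subtract the row-wise maximum before exponentiating to avoid overflow;
    # this leaves the resulting probabilities unchanged.
    exp_scores = np.exp(x - np.max(x, axis=1, keepdims=True))
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return probs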

def build_model():
    hdim = params["h_dimn"]
    # Initialize the parameters to random values.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, hdim) / np.sqrt(nn_input_dim)
    b1 = np.random.randn(1, hdim)
    W2 = np.random.randn(hdim, nn_output_dim) / np.sqrt(hdim)
    b2 = np.random.randn(1, nn_output_dim)

    model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    return model

def feedforward(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    probs = softmax(z2)
    return a1, probs

def backpropagation(model, x, y, a1, probs):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    
    delta3 = probs.copy()    # copy so the caller's probabilities are not modified in place
    delta3[range(y.shape[0]), y] -= 1
    dW2 = (a1.T).dot(delta3)
    db2 = np.sum(delta3, axis=0, keepdims=True)
    delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
    dW1 = np.dot(x.T, delta2)
    db1 = np.sum(delta2, axis=0)
    return dW2, db2, dW1, db1

def calculate_loss(model, params, x, y):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    
    # Forward propagation to calculate predictions
    _, probs = feedforward(model, x)
    
    # Calculating the cross entropy loss
    correct_logprobs = -np.log(probs[range(y.shape[0]), y])
    data_loss = np.sum(correct_logprobs)
#     data_loss = -sum(np.matmul(np.log(probs).T,y))
    
    # Add regularization terms to loss  =  reg_factor*(1/2)*(||W||^2) + reg_factor*|W|
    data_loss += params["regL2"]/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    # Sum of absolute values; np.linalg.norm(W, ord=1) on a matrix would give the max column sum instead
    data_loss += params["regL1"] * (np.sum(np.abs(W1)) + np.sum(np.abs(W2)))
    
    return 1./y.shape[0] * data_loss

def test(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate predictions
    _, probs = feedforward(model, x)
    preds = np.argmax(probs, axis=1)
    return preds

def train(model, X_train, X_test, Y_train, Y_test, verbose=True):
    # Full-batch gradient descent: every iteration uses the whole training set
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    for i in range(0, params["max_iter"]):

        # Forward propagation
        a1, probs = feedforward(model, X_train)

        # Backpropagation
        dW2, db2, dW1, db1 = backpropagation(model, X_train, Y_train, a1, probs)

        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += params["regL2"] * W2    ## = derivative of [[ reg_factor*(1/2)*(||W||^2) ]]
        dW1 += params["regL2"] * W1
        dW1 += params["regL1"] * np.sign(W1)    ## (sub)derivative of [[ reg_factor*|W| ]] is reg_factor*sign(W)
        dW2 += params["regL1"] * np.sign(W2)
        
        # Gradient descent parameter update
        W1 += -params["lr"] * dW1
        b1 += -params["lr"] * db1
        W2 += -params["lr"] * dW2
        b2 += -params["lr"] * db2
        
        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        if verbose and i % 10 == 0:
            preds = test(model, X_test)
            print("Loss after iteration %i: %f" %(i, calculate_loss(model, params, X_train, Y_train)),
                  ", Test accuracy:", np.count_nonzero(Y_test==preds)/Y_test.shape[0], "\n")
    return model
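
A quick way to gain confidence in hand-written gradients like the ones above is a numerical gradient check. The sketch below rebuilds the same kind of network (tanh hidden layer, softmax output, L1 and L2 penalties on the weights) on a tiny random problem and compares the analytic gradients with central differences. Everything in it is illustrative: the function names `loss_and_grads` and `numerical_grad`, the toy shapes, and the regularization strengths are assumptions made for this example, and the L1 term is differentiated as `reg_l1 * np.sign(W)`.

```python
import numpy as np

def loss_and_grads(W1, b1, W2, b2, x, y, reg_l1, reg_l2):
    """Un-normalized regularized loss and weight gradients for a tanh/softmax network."""
    a1 = np.tanh(x.dot(W1) + b1)
    z2 = a1.dot(W2) + b2
    z2 = z2 - z2.max(axis=1, keepdims=True)          # numerically stable softmax
    probs = np.exp(z2) / np.exp(z2).sum(axis=1, keepdims=True)

    rows = np.arange(y.shape[0])
    loss = -np.sum(np.log(probs[rows, y]))           # cross entropy
    loss += reg_l2 / 2 * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    loss += reg_l1 * (np.sum(np.abs(W1)) + np.sum(np.abs(W2)))

    delta3 = probs.copy()
    delta3[rows, y] -= 1
    dW2 = a1.T.dot(delta3) + reg_l2 * W2 + reg_l1 * np.sign(W2)
    delta2 = delta3.dot(W2.T) * (1 - a1 ** 2)
    dW1 = x.T.dot(delta2) + reg_l2 * W1 + reg_l1 * np.sign(W1)
    return loss, dW1, dW2

def numerical_grad(f, W, eps=1e-5):
    """Central-difference gradient of the scalar f() with respect to W (perturbed in place)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        f_plus = f()
        W[idx] = old - eps
        f_minus = f()
        W[idx] = old                                  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))                  # 6 toy samples, 4 features
y = rng.integers(0, 3, size=6)               # 3 classes
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=(1, 5))
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=(1, 3))
reg_l1, reg_l2 = 0.3, 0.7                    # hypothetical strengths

f = lambda: loss_and_grads(W1, b1, W2, b2, x, y, reg_l1, reg_l2)[0]
_, dW1, dW2 = loss_and_grads(W1, b1, W2, b2, x, y, reg_l1, reg_l2)
err1 = np.max(np.abs(dW1 - numerical_grad(f, W1)))
err2 = np.max(np.abs(dW2 - numerical_grad(f, W2)))
print("max abs error, dW1:", err1, " dW2:", err2)
```

Both errors should be tiny compared with the gradient values themselves. A check like this is only reliable away from $w = 0$, where the L1 term is not differentiable.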

In [117]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.6)
reg = [0, 0.1, 1, 10]
params["regL1"] = 0    ## disable L1 for this sweep (L2 often works better than L1 in practice)
for i in range(4):
    params["regL2"] = reg[i]
    print(params)
    model = build_model()
    model = train(model, X_train, X_test, Y_train, Y_test, verbose=False)
    preds = test(model, X_test)
    test_acc = np.count_nonzero(Y_test==preds)/Y_test.shape[0]
    preds = test(model, X_train)
    train_acc = np.count_nonzero(Y_train==preds)/Y_train.shape[0]
    print("test accuracy", test_acc, "\n")

{'lr': 0.0001, 'max_iter': 500, 'h_dimn': 50, 'regL1': 0, 'regL2': 0}
test accuracy 0.9647822057460612 

{'lr': 0.0001, 'max_iter': 500, 'h_dimn': 50, 'regL1': 0, 'regL2': 0.1}
test accuracy 0.9647822057460612 

{'lr': 0.0001, 'max_iter': 500, 'h_dimn': 50, 'regL1': 0, 'regL2': 1}
test accuracy 0.9657089898053753 

{'lr': 0.0001, 'max_iter': 500, 'h_dimn': 50, 'regL1': 0, 'regL2': 10}
test accuracy 0.9712696941612604 



The effect of regularization is barely visible on the dataset above because the network has relatively few parameters and shows little overfitting.

Overfitting is a very common problem when the dataset is too small compared with the number of model parameters that need to be learned, which makes it particularly acute in deep neural networks. We therefore check the effect of regularization on a larger network, using the optimized MLP classifier provided by sklearn (`MLPClassifier`). The exercise above was mainly meant to show how regularization terms can be added to the objective function of a neural network. There are other forms of regularization too (dropout, for example), but we will not go into those details here.

In [None]:
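from sklearn.neural_network import MLPClassifier
from sklearn.datasets import fetch_openml    ## fetch_mldata was removed in scikit-learn 0.22
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, Y = mnist.data, mnist.target
Y = Y.astype(int)          ## fetch_openml returns the labels as strings
print(X.shape, Y.shape)    ## full dataset, before subsampling
X = X[::10,:]              ## keep every 10th sample to speed up training
Y = Y[::10]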
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.6)

reg = [10, 1, 0.001, 0]    ## values for alpha, the L2 penalty strength of MLPClassifier
for i in range(4):
    clf = MLPClassifier(hidden_layer_sizes = (350,120,50), solver = 'sgd', alpha=reg[i], max_iter=2000, shuffle = False)
    clf = clf.fit(X_train, Y_train)
    preds = clf.predict(X_test)
    test_acc = np.count_nonzero(Y_test==preds)/Y_test.shape[0]
    print("reg= ", reg[i], " , acc = ", test_acc)

(70000, 784) (70000,)
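
To put the remark about dataset size versus parameter count into numbers, the sketch below counts the weights and biases of both networks used in this experiment. The helper `count_mlp_parameters` is a hypothetical name introduced here. The training-set sizes are assumptions derived from the code above: roughly 718 examples for the digits network and roughly 2800 for the MNIST network (every 10th image, 40% kept for training).

```python
def count_mlp_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

small = count_mlp_parameters([64, 50, 10])              # hand-written network on the digits data
large = count_mlp_parameters([784, 350, 120, 50, 10])   # MLPClassifier on MNIST

n_train_small, n_train_large = 718, 2800                # assumed training-set sizes

print(small, large)  # 3760 323430
print("parameters per training example:",
      small / n_train_small, large / n_train_large)
```

Under these assumptions the larger network has on the order of a hundred parameters per training example, compared with about five for the small one, so it is far more prone to overfitting and regularization has more room to help.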
