<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/pytorch/t81_558_class_05_3_keras_l1_l2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 5: Regularization and Dropout**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 5 Material

* Part 5.1: Introduction to Regularization: Ridge and Lasso [[Video]](https://www.youtube.com/watch?v=jfgRtCYjoBs&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/pytorch/t81_558_class_05_1_reg_ridge_lasso.ipynb)
* Part 5.2: Using K-Fold Cross Validation with PyTorch [[Video]](https://www.youtube.com/watch?v=maiQf8ray_s&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/pytorch/t81_558_class_05_2_kfold.ipynb)
* **Part 5.3: Using L1 and L2 Regularization with PyTorch to Decrease Overfitting** [[Video]](https://www.youtube.com/watch?v=JEWzWv1fBFQ&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/pytorch/t81_558_class_05_3_keras_l1_l2.ipynb)
* Part 5.4: Drop Out for PyTorch to Decrease Overfitting [[Video]](https://www.youtube.com/watch?v=bRyOi0L6Rs8&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/pytorch/t81_558_class_05_4_dropout.ipynb)
* Part 5.5: Benchmarking PyTorch Deep Learning Regularization Techniques [[Video]](https://www.youtube.com/watch?v=1NLBwPumUAs&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/pytorch/t81_558_class_05_5_bootstrap.ipynb)


# Google CoLab Instructions

The following code detects whether the notebook is running in Google CoLab and sets the COLAB flag accordingly.

In [1]:
try:
    import google.colab  # only importable inside Google CoLab
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

Note: using Google CoLab


# Part 5.3: L1 and L2 Regularization to Decrease Overfitting

L1 and L2 regularization are two common regularization techniques that can reduce the effects of overfitting [[Cite:ng2004feature]](http://cseweb.ucsd.edu/~elkan/254spring05/Hammon.pdf). These algorithms can either work with an objective function or as a part of the backpropagation algorithm. In both cases, the regularization algorithm is attached to the training algorithm by adding an objective.  

These algorithms work by adding a weight penalty to the neural network training. This penalty encourages the neural network to keep the weights to small values. Both L1 and L2 calculate this penalty differently. You can add this penalty calculation to the calculated gradients for gradient-descent-based algorithms, such as backpropagation. The penalty is negatively combined with the objective score for objective-function-based training, such as simulated annealing.
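
Written out, the two penalties differ only in how each weight contributes. The following is a sketch using symbols that do not appear in the original text: $E_0$ for the unregularized loss, $w_i$ for the individual parameters, and $\lambda$ for the penalty strength. Some texts scale the L2 term by $\lambda/2$ instead; the form below matches the sum-of-squares code used later in this part.

$$
E_{L1} = E_0 + \lambda \sum_i |w_i|, \qquad E_{L2} = E_0 + \lambda \sum_i w_i^2
$$

Differentiating the penalty terms gives the amount added to each weight's gradient during backpropagation (for L1, wherever $w_i \neq 0$):

$$
\frac{\partial}{\partial w_i} \lambda \sum_j |w_j| = \lambda \, \operatorname{sign}(w_i), \qquad \frac{\partial}{\partial w_i} \lambda \sum_j w_j^2 = 2 \lambda w_i
$$

The L1 term pulls on every nonzero weight with the same strength, while the L2 pull grows in proportion to the weight itself.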

L1 and L2 differ in how they penalize the size of the weights. L2 pushes the weights toward a pattern similar to a Gaussian distribution; L1 pushes the weights toward a pattern similar to a Laplace distribution, as demonstrated in Figure 5.L1L2.

**Figure 5.L1L2: L1 vs L2**
![L1 vs L2](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_9_l1_l2.png "L1 vs L2")

As you can see, the L1 algorithm is more tolerant of weights further from 0, whereas the L2 algorithm is less tolerant. We will highlight other important differences between L1 and L2 in the following sections.
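
A small numeric comparison may make this tolerance concrete. The sketch below is only illustrative: the helper names `l1_penalty` and `l2_penalty`, the sample weights, and the value of `lam` are all assumptions, not part of the course code.

```python
def l1_penalty(w, lam):
    # L1 penalty for a single weight: grows linearly with |w|
    return lam * abs(w)

def l2_penalty(w, lam):
    # L2 penalty for a single weight: grows with the square of w
    return lam * w ** 2

lam = 0.001  # hypothetical penalty strength
for w in [0.1, 0.5, 1.0, 3.0]:  # hypothetical weights
    l1 = l1_penalty(w, lam)
    l2 = l2_penalty(w, lam)
    print(f"w={w}: L1={l1:.5f}  L2={l2:.5f}  L2/L1={l2 / l1:.1f}")
```

The ratio of the two penalties equals the magnitude of the weight, so L2 is the milder penalty for weights smaller than 1 in magnitude and the harsher one for weights larger than 1.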

We begin by accessing CUDA if a GPU is available; otherwise, we will use the CPU.

In [2]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


Next, we define the functions that calculate the L1 and L2 regularization loss. We calculate the loss across all weights and biases, which is not a universal practice. Some implementations calculate the loss over just the weights, excluding biases. Others apply it only to specific layers. For simplicity, we sum over all parameters.

In [3]:
def add_l2_norm_loss(model, l2_lambda = 0.001):
  l2_norm = sum(p.pow(2.0).sum()
    for p in model.parameters())
  return l2_lambda * l2_norm
  
def add_l1_norm_loss(model, l1_lambda = 0.001):
  l1_norm = sum(p.abs().sum()
    for p in model.parameters())
  return l1_lambda * l1_norm
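
For comparison, a weights-only variant might look like the following sketch. The helper name `l2_weights_only_loss` is hypothetical, and it assumes bias parameters can be recognized by a name ending in "bias", which holds for `nn.Linear` layers but is not guaranteed for every module.

```python
import torch
import torch.nn as nn

def l2_weights_only_loss(model, l2_lambda=0.001):
    # Hypothetical helper: sum the squared values of weight tensors only,
    # skipping any parameter whose name ends in "bias".
    l2_norm = sum(p.pow(2.0).sum()
                  for name, p in model.named_parameters()
                  if not name.endswith("bias"))
    return l2_lambda * l2_norm

# Tiny model with known parameters so the result can be checked by hand.
model = nn.Linear(2, 1)
with torch.no_grad():
    model.weight.copy_(torch.tensor([[3.0, 4.0]]))
    model.bias.fill_(10.0)

penalty = l2_weights_only_loss(model, l2_lambda=0.5)
print(penalty.item())  # 0.5 * (3^2 + 4^2) = 12.5; the bias is ignored
```

Summing over `model.parameters()` instead, as the functions above do, would also include the bias term in the penalty.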

In [4]:
import io
import copy

class EarlyStopping():
  def __init__(self, patience=5, min_delta=0, restore_best_weights=True):
    self.patience = patience
    self.min_delta = min_delta
    self.restore_best_weights = restore_best_weights
    self.best_model = None
    self.best_loss = None
    self.counter = 0
    self.status = ""
    
  def __call__(self, model, val_loss):
    if self.best_loss is None:
      self.best_loss = val_loss
      self.best_model = copy.deepcopy(model)
    elif self.best_loss - val_loss > self.min_delta:
      self.best_loss = val_loss
      self.counter = 0
      self.best_model.load_state_dict(model.state_dict())
    else:  # no improvement beyond min_delta (including an exact tie)
      self.counter += 1
      if self.counter >= self.patience:
        self.status = f"Stopped on {self.counter}"
        if self.restore_best_weights:
          model.load_state_dict(self.best_model.state_dict())
        return True
    self.status = f"{self.counter}/{self.patience}"
    return False

In [5]:
import pandas as pd
from scipy.stats import zscore

# Read the data set
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

# Generate dummies for job
df = pd.concat([df,pd.get_dummies(df['job'],prefix="job")],axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df,pd.get_dummies(df['area'],prefix="area")],axis=1)
df.drop('area', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product']) # Classification
products = dummies.columns
y = dummies.values

We now create a PyTorch network with L1 regularization.

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.autograd import Variable
from sklearn import preprocessing
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import KFold
from sklearn import metrics
import tqdm
import time

EPOCHS=500
BATCH_SIZE = 16

# Define the PyTorch Neural Network
class Net(nn.Module):
    def __init__(self, in_count, out_count):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(in_count, 50)
        self.fc2 = nn.Linear(50, 25)
        self.fc3 = nn.Linear(25, out_count)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
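        # Note: nn.CrossEntropyLoss applies log-softmax itself and expects raw
        # logits, so this extra softmax compresses the scores the loss sees.
        return self.softmax(self.fc3(x))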

# Cross-Validate
kf = KFold(5, shuffle=True, random_state=42) # Use for KFold classification
oos_y_list = []
oos_pred_list = []

fold = 0
for train, test in kf.split(x):
    fold+=1
    print(f"Fold #{fold}")
        
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]

    # Numpy to PyTorch
    x_train = torch.Tensor(x_train).float()
    y_train = torch.Tensor(y_train).float()

    x_test = torch.Tensor(x_test).float().to(device)
    y_test = torch.Tensor(y_test).float().to(device)

    # Create datasets
    dataset_train = TensorDataset(x_train, y_train)
    dataloader_train = DataLoader(dataset_train,\
      batch_size=BATCH_SIZE, shuffle=True)

    dataset_test = TensorDataset(x_test, y_test)
    dataloader_test = DataLoader(dataset_test,\
      batch_size=BATCH_SIZE, shuffle=True)

    # Train the network
    model = Net(x.shape[1],len(products)).to(device)

    # Define the loss function for classification
    loss_fn = nn.CrossEntropyLoss()# cross entropy loss

    # Define the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    es = EarlyStopping()

    epoch = 0
    done = False
    while epoch<EPOCHS and not done:
      epoch += 1
      steps = list(enumerate(dataloader_train))
      pbar = tqdm.tqdm(steps)
      model.train()
      for i, (x_batch, y_batch) in pbar:
        y_batch_pred = model(x_batch.to(device))
        loss = loss_fn(y_batch_pred, y_batch.to(device))
        loss += add_l1_norm_loss(model)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        loss, current = loss.item(), (i + 1)* len(x_batch)
        if i == len(steps)-1:
          model.eval()
          pred = model(x_test)
          vloss = loss_fn(pred, y_test)
          if es(model,vloss): done = True
          pbar.set_description(f"Epoch: {epoch}, tloss: {loss}, vloss: {vloss:>7f}, EStop:[{es.status}]")
        else:
          pbar.set_description(f"Epoch: {epoch}, tloss {loss:}")
    
    pred = model(x_test)
    
    oos_y_list.append(y_test.cpu().detach())
    oos_pred_list.append(pred.cpu().detach())    

    # Measure this fold's RMSE
    #score = np.sqrt(metrics.mean_squared_error(pred.cpu().detach(),y_test.cpu().detach()))
    #print(f"Fold score (RMSE): {score}")

    # Measure this fold's accuracy
    y_compare = np.argmax(y_test.cpu().detach(),axis=1) # For accuracy calculation
    pred = np.argmax(pred.cpu().detach(),axis=1) # For accuracy calculation
    score = metrics.accuracy_score(y_compare, pred)
    print(f"Fold score (accuracy): {score}")

Fold #1


Epoch: 1, tloss: 1.752314805984497, vloss: 1.685029, EStop:[0/5]: 100%|██████████| 100/100 [00:02<00:00, 44.64it/s]
Epoch: 2, tloss: 1.5662773847579956, vloss: 1.681054, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 91.07it/s]
Epoch: 3, tloss: 1.6772552728652954, vloss: 1.679820, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 74.93it/s]
Epoch: 4, tloss: 1.5350453853607178, vloss: 1.672797, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 81.93it/s]
Epoch: 5, tloss: 1.6077457666397095, vloss: 1.650317, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 96.04it/s]
Epoch: 6, tloss: 1.5640835762023926, vloss: 1.577708, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 92.39it/s]
Epoch: 7, tloss: 1.4834266901016235, vloss: 1.537069, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 76.59it/s]
Epoch: 8, tloss: 1.6142799854278564, vloss: 1.525642, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 74.16it/s]
Epoch: 9, tloss: 1.7124179601669312, vloss: 1.562716, EStop:[1/5]

Fold score (accuracy): 0.6575
Fold #2


Epoch: 1, tloss: 1.736241340637207, vloss: 1.693863, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 130.98it/s]
Epoch: 2, tloss: 1.7167763710021973, vloss: 1.693886, EStop:[1/5]: 100%|██████████| 100/100 [00:00<00:00, 126.23it/s]
Epoch: 3, tloss: 1.5063915252685547, vloss: 1.700532, EStop:[2/5]: 100%|██████████| 100/100 [00:00<00:00, 134.01it/s]
Epoch: 4, tloss: 1.5894280672073364, vloss: 1.647491, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 125.84it/s]
Epoch: 5, tloss: 1.700634479522705, vloss: 1.530557, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 134.82it/s]
Epoch: 6, tloss: 1.676051139831543, vloss: 1.520396, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 135.74it/s]
Epoch: 7, tloss: 1.474263072013855, vloss: 1.515694, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 134.37it/s]
Epoch: 8, tloss: 1.4682118892669678, vloss: 1.537373, EStop:[1/5]: 100%|██████████| 100/100 [00:00<00:00, 127.53it/s]
Epoch: 9, tloss: 1.4142489433288574, vloss: 1.514477, EStop:

Fold score (accuracy): 0.6675
Fold #3


Epoch: 1, tloss: 1.6426591873168945, vloss: 1.664748, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 128.59it/s]
Epoch: 2, tloss: 1.6889688968658447, vloss: 1.680957, EStop:[1/5]: 100%|██████████| 100/100 [00:00<00:00, 129.49it/s]
Epoch: 3, tloss: 1.6923986673355103, vloss: 1.681552, EStop:[2/5]: 100%|██████████| 100/100 [00:00<00:00, 127.29it/s]
Epoch: 4, tloss: 1.7985219955444336, vloss: 1.681252, EStop:[3/5]: 100%|██████████| 100/100 [00:00<00:00, 121.76it/s]
Epoch: 5, tloss: 1.7435836791992188, vloss: 1.681841, EStop:[4/5]: 100%|██████████| 100/100 [00:00<00:00, 129.93it/s]
Epoch: 6, tloss: 1.621694564819336, vloss: 1.681574, EStop:[Stopped on 5]: 100%|██████████| 100/100 [00:00<00:00, 129.94it/s]


Fold score (accuracy): 0.485
Fold #4


Epoch: 1, tloss: 1.6960283517837524, vloss: 1.665394, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 125.49it/s]
Epoch: 2, tloss: 1.7043899297714233, vloss: 1.663143, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 130.55it/s]
Epoch: 3, tloss: 1.6685619354248047, vloss: 1.660332, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 125.95it/s]
Epoch: 4, tloss: 1.6569678783416748, vloss: 1.644966, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 124.59it/s]
Epoch: 5, tloss: 1.5483955144882202, vloss: 1.585966, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 129.24it/s]
Epoch: 6, tloss: 1.6016675233840942, vloss: 1.551459, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 123.49it/s]
Epoch: 7, tloss: 1.5955419540405273, vloss: 1.572304, EStop:[1/5]: 100%|██████████| 100/100 [00:00<00:00, 120.14it/s]
Epoch: 8, tloss: 1.6457254886627197, vloss: 1.549885, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 126.09it/s]
Epoch: 9, tloss: 1.4217344522476196, vloss: 1.542323, ES

Fold score (accuracy): 0.63
Fold #5


Epoch: 1, tloss: 1.566257357597351, vloss: 1.688106, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 125.82it/s]
Epoch: 2, tloss: 1.8711503744125366, vloss: 1.689323, EStop:[1/5]: 100%|██████████| 100/100 [00:00<00:00, 132.07it/s]
Epoch: 3, tloss: 1.6826667785644531, vloss: 1.689814, EStop:[2/5]: 100%|██████████| 100/100 [00:00<00:00, 126.14it/s]
Epoch: 4, tloss: 1.92525315284729, vloss: 1.689516, EStop:[3/5]: 100%|██████████| 100/100 [00:00<00:00, 129.52it/s]
Epoch: 5, tloss: 1.551872730255127, vloss: 1.689397, EStop:[4/5]: 100%|██████████| 100/100 [00:00<00:00, 124.12it/s]
Epoch: 6, tloss: 1.801450252532959, vloss: 1.689434, EStop:[Stopped on 5]: 100%|██████████| 100/100 [00:00<00:00, 128.84it/s]

Fold score (accuracy): 0.475
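
The EStop counter in the log output above can be traced by hand. The following sketch replays only the patience logic on a list of validation losses. The function `stop_epoch` is a hypothetical stand-in, not the EarlyStopping class itself, and it assumes a tie counts as no improvement.

```python
def stop_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    counter = 0
    for epoch, vloss in enumerate(val_losses, start=1):
        if best is None or best - vloss > min_delta:
            best = vloss      # improvement: remember it and reset the counter
            counter = 0
        else:
            counter += 1      # no improvement this epoch
            if counter >= patience:
                return epoch
    return None

# Validation losses copied from the Fold #3 log.
fold3_vloss = [1.664748, 1.680957, 1.681552, 1.681252, 1.681841, 1.681574]
print(stop_epoch(fold3_vloss))  # 6
```

In Fold #3 the best validation loss occurs in the first epoch, so the counter climbs once per epoch and training stops at epoch 6 with a count of 5, matching the "Stopped on 5" status in the log.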





In [7]:
# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y_list)
oos_pred = np.concatenate(oos_pred_list)
oos_y = np.argmax(oos_y,axis=1)
oos_pred = np.argmax(oos_pred,axis=1)
score = metrics.accuracy_score(oos_y,oos_pred)
print(f"Final OOS score (accuracy): {score}")

Final OOS score (accuracy): 0.583
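
As a sanity check, the pooled out-of-sample accuracy can be compared with the per-fold scores. The sketch below assumes every test fold has the same number of rows, which the log suggests but does not state. Under that assumption, the pooled accuracy is the plain mean of the fold accuracies.

```python
# Fold accuracies copied from the log output above.
fold_scores = [0.6575, 0.6675, 0.485, 0.63, 0.475]

# With equal-sized test folds, pooled accuracy equals the mean fold accuracy.
pooled = sum(fold_scores) / len(fold_scores)
print(round(pooled, 3))  # 0.583
```

The wide spread between folds (0.475 to 0.6675) also shows how much a single train/test split could misstate the model's accuracy.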
