<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/pytorch/t81_558_class_05_4_dropout.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 5: Regularization and Dropout**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 5 Material

* Part 5.1: Introduction to Regularization: Ridge and Lasso [[Video]](https://www.youtube.com/watch?v=jfgRtCYjoBs&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_05_1_reg_ridge_lasso.ipynb)
* Part 5.2: Using K-Fold Cross Validation with Keras [[Video]](https://www.youtube.com/watch?v=maiQf8ray_s&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_05_2_kfold.ipynb)
* Part 5.3: Using L1 and L2 Regularization with Keras to Decrease Overfitting [[Video]](https://www.youtube.com/watch?v=JEWzWv1fBFQ&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_05_3_keras_l1_l2.ipynb)
* **Part 5.4: Drop Out for Keras to Decrease Overfitting** [[Video]](https://www.youtube.com/watch?v=bRyOi0L6Rs8&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_05_4_dropout.ipynb)
* Part 5.5: Benchmarking Keras Deep Learning Regularization Techniques [[Video]](https://www.youtube.com/watch?v=1NLBwPumUAs&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_05_5_bootstrap.ipynb)


# Google CoLab Instructions

The following code detects whether the notebook is running in Google CoLab, defines an early stopping helper, and selects a GPU if one is available.

In [1]:
import torch

try:
    # This import only succeeds inside Google CoLab.
    import google.colab
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

import io
import copy

# Define class for early stopping. For more information, see module 3.4.
class EarlyStopping():
  def __init__(self, patience=5, min_delta=0, restore_best_weights=True):
    self.patience = patience
    self.min_delta = min_delta
    self.restore_best_weights = restore_best_weights
    self.best_model = None
    self.best_loss = None
    self.counter = 0
    self.status = ""
    
  def __call__(self, model, val_loss):
    if self.best_loss is None:
      self.best_loss = val_loss
      self.best_model = copy.deepcopy(model)
    elif self.best_loss - val_loss > self.min_delta:
      self.best_loss = val_loss
      self.counter = 0
      self.best_model.load_state_dict(model.state_dict())
    else:
      # No improvement beyond min_delta, so count this epoch against patience.
      self.counter += 1
      if self.counter >= self.patience:
        self.status = f"Stopped on {self.counter}"
        if self.restore_best_weights:
          model.load_state_dict(self.best_model.state_dict())
        return True
    self.status = f"{self.counter}/{self.patience}"
    return False

# Make use of a GPU if one is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Note: using Google CoLab
Using device: cpu


# Part 5.4: Drop Out for PyTorch to Decrease Overfitting

Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov (2012) introduced the dropout regularization algorithm. [[Cite:srivastava2014dropout]](http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf) Although dropout works differently than L1 and L2, it accomplishes the same goal: the prevention of overfitting. However, the algorithm does so by removing neurons and connections, at least temporarily. Unlike L1 and L2, dropout adds no weight penalty and does not directly seek to train small weights.
Dropout works by causing hidden neurons of the neural network to be unavailable during part of the training. Dropping part of the neural network causes the remaining portion to be trained to still achieve a good score even without the dropped neurons. This technique decreases co-adaptation between neurons, which results in less overfitting. 

Most neural network frameworks implement dropout as a separate layer. Dropout layers function like a regular, densely connected neural network layer. The only difference is that the dropout layers will periodically drop some of their neurons during training. You can use dropout layers on regular feedforward neural networks. 

The program implements a dropout layer as a dense layer that can eliminate some of its neurons. Contrary to popular belief about the dropout layer, the program does not permanently remove these discarded neurons. A dropout layer does not lose any of its neurons during the training process, and it will still have the same number of neurons after training. In this way, the program only temporarily masks the neurons rather than dropping them. 
Figure 5.DROPOUT shows how a dropout layer might be situated with other layers.

**Figure 5.DROPOUT: Dropout Regularization**
![Dropout Regularization](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_9_dropout.png "Dropout Regularization")

The discarded neurons and their connections are shown as dashed lines. The input layer has two input neurons as well as a bias neuron. The second layer is a dense layer with three neurons and a bias neuron. The third layer is a dropout layer with six regular neurons even though the program has dropped 50% of them. While the program drops these neurons, it neither calculates nor trains them. However, the final neural network will use all of these neurons for the output. As previously mentioned, the program only temporarily discards the neurons. 
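To make the temporary nature of the masking concrete, the following is a minimal sketch that uses PyTorch's `nn.Dropout` on its own. The tensor of ones and the seed are illustrative assumptions and are not part of the course code. In training mode the layer zeroes a random subset of values and scales the survivors by 1/(1-p); in evaluation mode it passes every value through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed so the run is repeatable

p = 0.5
dropout = nn.Dropout(p)
x = torch.ones(1, 6)  # stand-in for the outputs of six regular neurons

# Training mode: each value is zeroed with probability p, and the
# surviving values are scaled by 1 / (1 - p).
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout is a no-op, so all six values are used.
dropout.eval()
eval_out = dropout(x)

print("train:", train_out)
print("eval: ", eval_out)
print("trainable parameters:", sum(t.numel() for t in dropout.parameters()))
```

The layer holds no trainable parameters of its own, which is why the network has the same number of weights before and after training with dropout.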

The program chooses different sets of neurons from the dropout layer during subsequent training iterations. Although we chose a probability of 50% for dropout, the computer will not necessarily drop three neurons. It is as if we flipped a coin for each of the dropout candidate neurons to decide whether to drop that neuron. Note that the program should never drop the bias neuron. Only the regular neurons on a dropout layer are candidates.
The implementation of the training algorithm influences the process of discarding neurons. The dropout set typically changes once per training iteration or batch. The program can also provide intervals where all neurons are present. Some neural network frameworks provide additional hyperparameters that let you specify the rate of this interval.
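The coin-flip description can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `sample_mask`, `NEURONS`, `P_DROP`, and the seed are hypothetical names and values chosen for this example, and real frameworks draw their masks internally. Each of the six regular neurons is dropped independently, so the number dropped in a given iteration varies around three rather than always equaling three.

```python
import random

random.seed(42)  # illustrative seed so the run is repeatable

NEURONS = 6   # regular neurons in the dropout layer (the bias is excluded)
P_DROP = 0.5  # probability that any one neuron is dropped

def sample_mask(n, p):
    """Return a keep-mask: 1 keeps a neuron, 0 drops it for this iteration."""
    return [0 if random.random() < p else 1 for _ in range(n)]

dropped_counts = []
for iteration in range(1, 9):
    # A fresh mask is drawn for every training iteration.
    mask = sample_mask(NEURONS, P_DROP)
    dropped = NEURONS - sum(mask)
    dropped_counts.append(dropped)
    print(f"iteration {iteration}: mask={mask}, dropped={dropped}")
```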

Why dropout is capable of decreasing overfitting is a common question. The answer is that dropout can reduce the chance of codependency developing between two neurons. Two neurons that develop codependency will not be able to operate effectively when one is dropped out. As a result, the neural network can no longer rely on the presence of every neuron, and it trains accordingly. This characteristic decreases its ability to memorize the information presented, thereby forcing generalization.

Dropout also decreases overfitting by forcing a bootstrapping process upon the neural network. Bootstrapping is a prevalent ensemble technique. Ensembling is a machine learning technique that combines multiple models to produce a better result than any individual model achieves. The term ensemble originates from musical ensembles, in which the music that the audience hears is the combination of many instruments.

Bootstrapping is one of the simplest ensemble techniques. The programmer simply trains several neural networks to perform precisely the same task. However, each neural network will perform differently because of some training techniques and the random numbers used in the neural network weight initialization. The difference in weights causes the performance variance. The output of this ensemble is the average of the outputs of its members. This process decreases overfitting through the consensus of differently trained neural networks.
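The averaging step of such an ensemble might look like the following sketch. It assumes a hypothetical helper `make_member`, a made-up layer layout, and random input rows; training of each member is omitted to keep the example short, so only the combination of member outputs is shown.

```python
import torch
import torch.nn as nn

def make_member(seed, in_count=4, out_count=3):
    """Build one ensemble member; the seed controls weight initialization."""
    torch.manual_seed(seed)
    return nn.Sequential(
        nn.Linear(in_count, 8),
        nn.ReLU(),
        nn.Linear(8, out_count),
        nn.Softmax(dim=1),
    )

# Five members that differ only in their random initial weights.
# Each member would normally be trained before its output is used.
members = [make_member(seed) for seed in range(5)]

torch.manual_seed(123)
x = torch.randn(2, 4)  # two hypothetical input rows

with torch.no_grad():
    member_outputs = torch.stack([m(x) for m in members])  # shape (5, 2, 3)
    ensemble_output = member_outputs.mean(dim=0)           # shape (2, 3)

print(ensemble_output)
```

Because every member produces class probabilities, their average is also a valid probability distribution for each row.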

Dropout works somewhat like bootstrapping. You might think of each neural network that results from a different set of neurons being dropped out as an individual member in an ensemble. As training progresses, the program creates more neural networks in this way. However, dropout does not require the same amount of processing as bootstrapping. The new neural networks created are temporary; they exist only for a training iteration. The final result is also a single neural network rather than an ensemble of neural networks to be averaged together.
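The ensemble analogy can be checked numerically in a simplified setting. The sketch below assumes a hypothetical tiny model (dropout followed by one linear layer), an input of ones, and an arbitrary sample count. Each row of the batch receives its own random mask, so a batch of identical inputs behaves like many thinned networks. For this linear case the average of their outputs approaches the output of the full network in evaluation mode; with nonlinear layers after the dropout, the match is only approximate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed so the run is repeatable

# Hypothetical tiny model: dropout followed by a single linear layer.
model = nn.Sequential(nn.Dropout(0.5), nn.Linear(6, 1))
x = torch.ones(1, 6)
SAMPLES = 20000  # number of randomly thinned networks to average

with torch.no_grad():
    # Training mode: every row is masked independently.
    model.train()
    thinned_outputs = model(x.repeat(SAMPLES, 1))  # shape (SAMPLES, 1)
    thinned_mean = thinned_outputs.mean().item()

    # Evaluation mode: the single full network with no neurons dropped.
    model.eval()
    full_output = model(x).item()

print(f"mean of thinned networks: {thinned_mean:.4f}")
print(f"full network (eval mode): {full_output:.4f}")
```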

The following animation shows how dropout works: [animation link](https://yusugomori.com/projects/deep-learning/dropout-relu)

In [2]:
import pandas as pd
from scipy.stats import zscore

# Read the data set
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

# Generate dummies for job
df = pd.concat([df,pd.get_dummies(df['job'],prefix="job")],axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df,pd.get_dummies(df['area'],prefix="area")],axis=1)
df.drop('area', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product']) # Classification
products = dummies.columns
y = dummies.values

Now we will see how to apply dropout to classification.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.autograd import Variable
from sklearn import preprocessing
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import KFold
from sklearn import metrics
import tqdm
import time

EPOCHS=500
BATCH_SIZE = 16

# Define the PyTorch Neural Network
class Net(nn.Module):
    def __init__(self, in_count, out_count):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(in_count, 50)
        self.fc2 = nn.Linear(50, 25)
        self.fc3 = nn.Linear(25, out_count)
        self.softmax = nn.Softmax(dim=1)
        self.dropout = nn.Dropout(0.25)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        return self.softmax(self.fc3(x))

# Cross-Validate
kf = KFold(5, shuffle=True, random_state=42) # Use for KFold classification
oos_y_list = []
oos_pred_list = []

fold = 0
for train, test in kf.split(x):
    fold+=1
    print(f"Fold #{fold}")
        
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]

    # Numpy to PyTorch
    x_train = torch.Tensor(x_train).float()
    y_train = torch.Tensor(y_train).float()

    x_test = torch.Tensor(x_test).float().to(device)
    y_test = torch.Tensor(y_test).float().to(device)

    # Create datasets
    dataset_train = TensorDataset(x_train, y_train)
    dataloader_train = DataLoader(dataset_train,\
      batch_size=BATCH_SIZE, shuffle=True)

    dataset_test = TensorDataset(x_test, y_test)
    dataloader_test = DataLoader(dataset_test,\
      batch_size=BATCH_SIZE, shuffle=True)

    # Train the network
    model = Net(x.shape[1],len(products)).to(device)

    # Define the loss function for classification
    loss_fn = nn.CrossEntropyLoss()# cross entropy loss

    # Define the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    es = EarlyStopping()

    epoch = 0
    done = False
    while epoch<EPOCHS and not done:
      epoch += 1
      steps = list(enumerate(dataloader_train))
      pbar = tqdm.tqdm(steps)
      model.train()
      for i, (x_batch, y_batch) in pbar:
        y_batch_pred = model(x_batch.to(device))
        loss = loss_fn(y_batch_pred, y_batch.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        loss, current = loss.item(), (i + 1)* len(x_batch)
        if i == len(steps)-1:
          model.eval()
          pred = model(x_test)
          vloss = loss_fn(pred, y_test)
          if es(model,vloss): done = True
          pbar.set_description(f"Epoch: {epoch}, tloss: {loss}, vloss: {vloss:>7f}, EStop:[{es.status}]")
        else:
          pbar.set_description(f"Epoch: {epoch}, tloss {loss:}")
    
    pred = model(x_test)
    
    oos_y_list.append(y_test.cpu().detach())
    oos_pred_list.append(pred.cpu().detach())    

    # Measure this fold's RMSE
    #score = np.sqrt(metrics.mean_squared_error(pred.cpu().detach(),y_test.cpu().detach()))
    #print(f"Fold score (RMSE): {score}")

    # Measure this fold's accuracy
    y_compare = np.argmax(y_test.cpu().detach(),axis=1) # For accuracy calculation
    pred = np.argmax(pred.cpu().detach(),axis=1) # For accuracy calculation
    score = metrics.accuracy_score(y_compare, pred)
    print(f"Fold score (accuracy): {score}")

Fold #1


Epoch: 1, tloss: 1.402353048324585, vloss: 1.644313, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 96.17it/s]
Epoch: 2, tloss: 1.4827700853347778, vloss: 1.514132, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 83.41it/s]
Epoch: 3, tloss: 1.3619990348815918, vloss: 1.483695, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 94.75it/s]
Epoch: 4, tloss: 1.5625371932983398, vloss: 1.489477, EStop:[1/5]: 100%|██████████| 100/100 [00:01<00:00, 90.33it/s] 
Epoch: 5, tloss: 1.4772801399230957, vloss: 1.511736, EStop:[2/5]: 100%|██████████| 100/100 [00:01<00:00, 93.77it/s]
Epoch: 6, tloss: 1.536592721939087, vloss: 1.523030, EStop:[3/5]: 100%|██████████| 100/100 [00:01<00:00, 71.57it/s]
Epoch: 7, tloss: 1.4883153438568115, vloss: 1.490681, EStop:[4/5]: 100%|██████████| 100/100 [00:01<00:00, 79.86it/s]
Epoch: 8, tloss: 1.6731081008911133, vloss: 1.516463, EStop:[Stopped on 5]: 100%|██████████| 100/100 [00:00<00:00, 114.82it/s]


Fold score (accuracy): 0.68
Fold #2


Epoch: 1, tloss: 1.3519165515899658, vloss: 1.704222, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 93.50it/s]
Epoch: 2, tloss: 1.5541309118270874, vloss: 1.683767, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 102.28it/s]
Epoch: 3, tloss: 1.4977548122406006, vloss: 1.606455, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 91.68it/s]
Epoch: 4, tloss: 1.5953760147094727, vloss: 1.591688, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 97.96it/s] 
Epoch: 5, tloss: 1.5990331172943115, vloss: 1.586289, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 95.39it/s]
Epoch: 6, tloss: 1.653628945350647, vloss: 1.660341, EStop:[1/5]: 100%|██████████| 100/100 [00:00<00:00, 115.44it/s]
Epoch: 7, tloss: 1.6890020370483398, vloss: 1.529499, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 101.93it/s]
Epoch: 8, tloss: 1.518317699432373, vloss: 1.525989, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 98.40it/s] 
Epoch: 9, tloss: 1.46916663646698, vloss: 1.582204, EStop:[1/

Fold score (accuracy): 0.6575
Fold #3


Epoch: 1, tloss: 1.665410041809082, vloss: 1.680460, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 71.28it/s]
Epoch: 2, tloss: 1.5456373691558838, vloss: 1.666368, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 104.22it/s]
Epoch: 3, tloss: 1.6036372184753418, vloss: 1.638480, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 140.57it/s]
Epoch: 4, tloss: 1.5593489408493042, vloss: 1.636862, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 131.07it/s]
Epoch: 5, tloss: 1.7288930416107178, vloss: 1.636376, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 137.92it/s]
Epoch: 6, tloss: 1.5479084253311157, vloss: 1.635964, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 145.53it/s]
Epoch: 7, tloss: 1.7553894519805908, vloss: 1.642339, EStop:[1/5]: 100%|██████████| 100/100 [00:00<00:00, 141.19it/s]
Epoch: 8, tloss: 1.7903996706008911, vloss: 1.635041, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 136.74it/s]
Epoch: 9, tloss: 1.6186096668243408, vloss: 1.637189, ESto

Fold score (accuracy): 0.53
Fold #4


Epoch: 1, tloss: 1.790421485900879, vloss: 1.662916, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 131.63it/s]
Epoch: 2, tloss: 1.727921962738037, vloss: 1.662842, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 136.79it/s]
Epoch: 3, tloss: 1.8544259071350098, vloss: 1.640962, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 147.57it/s]
Epoch: 4, tloss: 1.6669265031814575, vloss: 1.640880, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 143.79it/s]
Epoch: 5, tloss: 1.7237110137939453, vloss: 1.630943, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 147.12it/s]
Epoch: 6, tloss: 1.727921962738037, vloss: 1.629874, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 149.64it/s]
Epoch: 7, tloss: 1.790421962738037, vloss: 1.630765, EStop:[1/5]: 100%|██████████| 100/100 [00:00<00:00, 136.72it/s]
Epoch: 8, tloss: 1.6653416156768799, vloss: 1.629217, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 142.13it/s]
Epoch: 9, tloss: 1.4778081178665161, vloss: 1.627920, EStop:

Fold score (accuracy): 0.545
Fold #5


Epoch: 1, tloss: 1.790421962738037, vloss: 1.690422, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 98.15it/s]
Epoch: 2, tloss: 1.7279078960418701, vloss: 1.690422, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 96.23it/s]
Epoch: 3, tloss: 1.7920873165130615, vloss: 1.688936, EStop:[0/5]: 100%|██████████| 100/100 [00:01<00:00, 86.22it/s]
Epoch: 4, tloss: 1.4543278217315674, vloss: 1.566558, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 115.54it/s]
Epoch: 5, tloss: 1.3614859580993652, vloss: 1.560618, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 142.28it/s]
Epoch: 6, tloss: 1.5441558361053467, vloss: 1.571875, EStop:[1/5]: 100%|██████████| 100/100 [00:00<00:00, 142.49it/s]
Epoch: 7, tloss: 1.540421962738037, vloss: 1.607793, EStop:[2/5]: 100%|██████████| 100/100 [00:00<00:00, 146.30it/s]
Epoch: 8, tloss: 1.6661550998687744, vloss: 1.525931, EStop:[0/5]: 100%|██████████| 100/100 [00:00<00:00, 139.94it/s]
Epoch: 9, tloss: 1.540419340133667, vloss: 1.541427, EStop:[1

Fold score (accuracy): 0.6425
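
The cross-validation loop collects `oos_y_list` and `oos_pred_list` for every fold but does not combine them. One possible way to compute a single out-of-sample accuracy across all folds is sketched below. The small tensors are hypothetical stand-ins for the real fold results, used so the example runs on its own.

```python
import torch

# Hypothetical stand-ins for two folds of one-hot targets and model outputs.
oos_y_list = [
    torch.tensor([[1., 0., 0.], [0., 1., 0.]]),
    torch.tensor([[0., 0., 1.], [0., 1., 0.]]),
]
oos_pred_list = [
    torch.tensor([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]),
    torch.tensor([[0.2, 0.5, 0.3], [0.1, 0.8, 0.1]]),
]

# Stack the per-fold results into one set of out-of-sample rows.
oos_y = torch.cat(oos_y_list)
oos_pred = torch.cat(oos_pred_list)

# Convert one-hot targets and class scores to class indexes.
y_compare = oos_y.argmax(dim=1)
pred_class = oos_pred.argmax(dim=1)

overall_accuracy = (y_compare == pred_class).float().mean().item()
print(f"Overall out-of-sample accuracy: {overall_accuracy}")
```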
