RL Assignment 3 - **DL Training with Optimizations**



1.   Adding He initialization (Kaiming) and comparing training results with the base model
2.   Adding Nadam optimization and comparing training results with the base model
3.   Combining the two modifications into a single implementation and attempting to explain the results of the two enhancements



In [None]:
import torch 
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

In [None]:
# For reproducibility
torch.manual_seed(0)

<torch._C.Generator at 0x7f28a34c45a0>
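
As a small illustration of what the seed does, the sketch below reseeds the default generator and checks that the same random tensor comes out twice. It only demonstrates the CPU generator; some GPU kernels may still be non-deterministic, so identical seeds do not guarantee identical training runs.

```python
import torch

# Seeding the default generator makes subsequent random draws repeatable
torch.manual_seed(0)
a = torch.rand(3)

torch.manual_seed(0)   # reseed with the same value
b = torch.rand(3)      # same draw as `a`

print(torch.equal(a, b))  # True
```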

In [None]:
# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

In [None]:
# Hyperparameters
num_epochs = 5
num_classes = 10
batch_size = 100
learning_rate = 0.001

In [None]:
# Obtaining MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='../../data/',
                                           train=True, 
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='../../data/',
                                          train=False, 
                                          transform=transforms.ToTensor())

In [None]:
# Init dataloaders
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size, 
                                          shuffle=False)

In [None]:
# CNN (two convolutional layers)
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out
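
The `7*7*32` input size of the fully connected layer follows from the two pooling steps: each `padding=2`, `kernel_size=5` convolution keeps the spatial size, and each `MaxPool2d` halves it (28 to 14 to 7). A minimal sketch that traces the shapes on a dummy batch, rebuilding the two blocks with the same hyperparameters (the names `block1`, `block2` and the batch size of 4 are illustrative choices):

```python
import torch
import torch.nn as nn

# Two conv blocks with the same hyperparameters as layer1 and layer2
block1 = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2))
block2 = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2))

x = torch.zeros(4, 1, 28, 28)       # dummy batch of 4 MNIST-sized images
h1 = block1(x)                      # conv keeps 28x28, pooling halves it to 14x14
h2 = block2(h1)                     # 14x14 -> 7x7
flat = h2.reshape(h2.size(0), -1)   # flatten to (batch, 32*7*7) for the linear layer

print(h1.shape)    # torch.Size([4, 16, 14, 14])
print(h2.shape)    # torch.Size([4, 32, 7, 7])
print(flat.shape)  # torch.Size([4, 1568])
```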

In [None]:
mdl = ConvNet(num_classes).to(device)

Initializing the weights of the model with He initialization (Kaiming)

In [None]:
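def weights_init(m):
    # He (Kaiming) uniform initialization for the convolutional layers only;
    # BatchNorm and Linear layers keep their PyTorch defaults
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)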

In [None]:
# Applying He initialization
mdl.apply(weights_init)

ConvNet(
  (layer1): Sequential(
    (0): Conv2d(1, 16, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (layer2): Sequential(
    (0): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc): Linear(in_features=1568, out_features=10, bias=True)
)
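
To make the effect of `kaiming_uniform_` concrete, the sketch below applies it to a standalone copy of the first convolution and compares the sampled weights against the expected bound. It assumes the default arguments (`a=0`, `mode='fan_in'`), for which the gain is sqrt(2) and the weights are drawn from U(-bound, bound) with bound = sqrt(6 / fan_in); the standalone `conv` layer is an illustrative stand-in, not the trained model.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2)
nn.init.kaiming_uniform_(conv.weight)   # default arguments, as in weights_init
nn.init.zeros_(conv.bias)

# fan_in = in_channels * kernel_height * kernel_width
fan_in = conv.in_channels * conv.kernel_size[0] * conv.kernel_size[1]
# Default gain is sqrt(2), so bound = sqrt(2) * sqrt(3 / fan_in) = sqrt(6 / fan_in)
bound = math.sqrt(6.0 / fan_in)
max_abs_weight = conv.weight.abs().max().item()

print(fan_in)           # 25
print(round(bound, 4))  # 0.4899
print(max_abs_weight <= bound + 1e-6)
```

For the second convolution the same formula gives fan_in = 16 * 5 * 5 = 400, so its weights start in a much narrower range.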

In [None]:
# Loss 
criterion = nn.CrossEntropyLoss()
# Baseline optimizer (Adam); the training cell below replaces it with Nadam
optimizer = torch.optim.Adam(mdl.parameters(), lr=learning_rate)

Adding Nadam optimization to this model (Ref: https://github.com/pytorch/)

In [None]:
import math
from torch.optim.optimizer import Optimizer

class Nadam(Optimizer):
    """Implements Nadam algorithm.
    It has been proposed in `Incorporating Nesterov Momentum into Adam`_.
    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 2e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.975, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        schedule_decay (float, optional): beta1 decay factor (default: 0)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
    .. _Incorporating Nesterov Momentum into Adam
        https://openreview.net/pdf?id=OM0jvwB8jIp57ZJjtNEZ
    """

    def __init__(self, params, lr=2e-3, betas=(0.975, 0.999), eps=1e-8,
                 schedule_decay=0, weight_decay=0):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        schedule_decay=schedule_decay, weight_decay=weight_decay,
                        prod_beta1=1.)
        super(Nadam, self).__init__(params, defaults)

    def step(self, closure=None):
        """Performs a single optimization step.
        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            beta1, beta2 = group['betas']

            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(grad)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros_like(grad)

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                state['step'] += 1

                if group['weight_decay'] != 0:
                    # Keyword form; the positional-scalar overload is deprecated
                    grad = grad.add(p.data, alpha=group['weight_decay'])

                schedule_decay = group['schedule_decay']
                cur_beta1 = beta1 * (1. - 0.5 * (0.96 ** (state['step'] * schedule_decay)))
                next_beta1 = beta1 * (1. - 0.5 * (0.96 ** ((state['step'] + 1) * schedule_decay)))
                prod_beta1 = group['prod_beta1']
                prod_beta1 *= cur_beta1
                next_prod_beta1 = prod_beta1 * next_beta1
                bias_correction1 = (1 - cur_beta1) / (1 - prod_beta1)
                next_bias_correction1 = next_beta1 / (1 - next_prod_beta1)

                # Decay the first and second moment running average coefficient
                exp_avg.mul_(cur_beta1).add_(grad, alpha=1 - cur_beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                sqrt_bias_correction2 = math.sqrt((1 - beta2 ** state['step']) / beta2)
                step_size = group['lr'] * sqrt_bias_correction2

                denom = exp_avg_sq.sqrt().add_(group['eps'])

                # For memory efficiency, separate update into two
                p.data.addcdiv_(exp_avg, denom, value=-step_size * next_bias_correction1)
                p.data.addcdiv_(grad, denom, value=-step_size * bias_correction1)

                # update prod_beta1
                group['prod_beta1'] = prod_beta1

        return loss
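
Recent PyTorch releases also ship a built-in `torch.optim.NAdam`. Its defaults and momentum schedule differ from the class above (for example `betas=(0.9, 0.999)` and a `momentum_decay` parameter), so it is not guaranteed to reproduce the same numbers. The sketch below only shows the optimizer being used on a toy quadratic problem, which is an assumption made for illustration and not part of this assignment.

```python
import torch

torch.manual_seed(0)

# Toy problem (illustrative only): minimise ||w - target||^2
target = torch.tensor([1.0, -2.0, 3.0])
w = torch.zeros(3, requires_grad=True)

optimizer = torch.optim.NAdam([w], lr=0.05)

initial_loss = ((w - target) ** 2).sum().item()
for _ in range(500):
    optimizer.zero_grad()
    loss = ((w - target) ** 2).sum()
    loss.backward()
    optimizer.step()
final_loss = ((w - target) ** 2).sum().item()

print(initial_loss)  # 14.0
print(final_loss < initial_loss)
```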

Applying Nadam optimization and training the model

In [None]:
optimizer = Nadam(mdl.parameters(), lr=learning_rate)
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = mdl(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

Epoch [1/5], Step [100/600], Loss: 0.1423
Epoch [1/5], Step [200/600], Loss: 0.1485
Epoch [1/5], Step [300/600], Loss: 0.1188
Epoch [1/5], Step [400/600], Loss: 0.0834
Epoch [1/5], Step [500/600], Loss: 0.1503
Epoch [1/5], Step [600/600], Loss: 0.0579
Epoch [2/5], Step [100/600], Loss: 0.0927
Epoch [2/5], Step [200/600], Loss: 0.0527
Epoch [2/5], Step [300/600], Loss: 0.0666
Epoch [2/5], Step [400/600], Loss: 0.0888
Epoch [2/5], Step [500/600], Loss: 0.0490
Epoch [2/5], Step [600/600], Loss: 0.0474
Epoch [3/5], Step [100/600], Loss: 0.0338
Epoch [3/5], Step [200/600], Loss: 0.0321
Epoch [3/5], Step [300/600], Loss: 0.0265
Epoch [3/5], Step [400/600], Loss: 0.0364
Epoch [3/5], Step [500/600], Loss: 0.0206
Epoch [3/5], Step [600/600], Loss: 0.0232
Epoch [4/5], Step [100/600], Loss: 0.0623
Epoch [4/5], Step [200/600], Loss: 0.0209
Epoch [4/5], Step [300/600], Loss: 0.0590
Epoch [4/5], Step [400/600], Loss: 0.0174
Epoch [4/5], Step [500/600], Loss: 0.0094
Epoch [4/5], Step [600/600], Loss:

In [None]:
# Model testing
mdl.eval() 
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = mdl(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))

Test Accuracy of the model on the 10000 test images: 98.87 %
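
Because the same evaluation is needed for the base model and for each modification, the testing loop can be wrapped in a helper. The sketch below is one possible version; the name `evaluate` and the toy linear model used to check it are hypothetical and not part of the original notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, device):
    """Return the top-1 accuracy (in percent) of `model` over `loader`."""
    model.eval()  # BatchNorm layers use their running statistics
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            predicted = model(inputs).argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return 100.0 * correct / total

# Toy check: an identity linear layer predicts the index of the larger input
toy_model = nn.Linear(2, 2)
with torch.no_grad():
    toy_model.weight.copy_(torch.eye(2))
    toy_model.bias.zero_()
inputs = torch.tensor([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
labels = torch.tensor([0, 1, 0, 0])   # the last label is deliberately wrong
toy_loader = DataLoader(TensorDataset(inputs, labels), batch_size=2)

toy_accuracy = evaluate(toy_model, toy_loader, torch.device('cpu'))
print(toy_accuracy)  # 75.0
```

With the notebook's objects, the equivalent call would be `evaluate(mdl, test_loader, device)`.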


In [None]:
# Model checkpoint saving
torch.save(mdl.state_dict(), 'mdl.ckpt')

## Conclusion

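Base model accuracy: 99.06% (default PyTorch initialization and the Adam optimizer)

1.   Adding He initialization (Kaiming) resulted in an accuracy of 98.87%, a drop from the base model accuracy
2.   Adding the Nadam optimizer resulted in an accuracy of 98.83%, also a drop from the base model accuracy
3.   Combining the two modifications yielded an accuracy of 98.87% (also a drop)

All three variants are within 0.25 percentage points of the base model. Each figure appears to come from a single run with num_epochs = 5, so the differences may reflect run-to-run variation rather than the modifications themselves.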

