# NNIA Assignment 8

**DEADLINE: 11. 01. 2023 08:00 CET**
Submissions more than 10 minutes past the deadline will **not** be graded!

- Name & ID 1 (Teams username e.g. s8xxxxx):
- Name & ID 2 (Teams username e.g. s8xxxxx):
- Name & ID 3 (Teams username e.g. s8xxxxx):
- Hours of work per person:

# Submission Instructions

**IMPORTANT** Please make sure you read the following instructions carefully. If you are unclear about any part of the assignment, ask questions **before** the assignment deadline. All course-related questions can be addressed on the course **[Piazza Platform](https://piazza.com/class/kvc3vzhsvh55rt)**.

* Assignments are to be submitted in a **team of 2 or 3**.
* Please include your **names**, **IDs**, **Teams usernames**, and **approximate total time spent per person** at the beginning of the Notebook in the space provided.
* Make sure you appropriately comment your code wherever required.
* Your final submission should contain this completed Jupyter Notebook, including the bonus question (if you attempt it), and any necessary Python files.
* Do **not** submit any **data or cache files** (e.g. `__pycache__`, the dataset PyTorch downloads, etc.). 
* Upload the **zipped** folder (*.zip* is the only accepted extension) in **Teams**.
* Only **one member** of the group should make the submission.
* **Important**: please name the submitted zip folder `Name1_id1_Name2_id2.zip`. The Jupyter Notebook should also be named `Name1_id1_Name2_id2.ipynb`. This is **very important** for our internal organization; students repeatedly fail to do this.

## 1 SGD, Batch, Mini-Batch (1.5 pts)

Typically neural networks are large and are trained with millions of data points. It is thus often infeasible to compute the gradient $\nabla_{\theta} \tilde J(\theta)$ that requires the accumulation of the gradient over the entire training set. 

There are various online resources on Stochastic, Batch, and Mini-Batch Gradient Descent methods in addition to what was covered during the lecture. Here are a few:

- [Medium: Batch , Mini-Batch and Stochastic gradient descent](https://sweta-nit.medium.com/batch-mini-batch-and-stochastic-gradient-descent-e9bc4cacd461)
- [DeepLearningAI: Batch vs Mini-Batch](https://youtu.be/4qJaSmvhxi8)

**Discuss pros and cons of (1) stochastic (m=1), (2) batch (m = size of dataset) and (3) mini-batch gradient descent** (m is the number of points passed at a time).
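
To make the role of $m$ concrete, here is a minimal sketch on a toy one-parameter regression problem. The data, the function names (`batch_gradient`, `gradient_descent`) and the hyperparameters are illustrative assumptions, not part of the assignment; the only thing that changes between the three variants is how many points feed each update.

```python
import random

def batch_gradient(w, xs, ys):
    """Mean gradient of 0.5 * (w*x - y)^2 with respect to w over the given points."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, m, lr=0.1, epochs=50, seed=0):
    """Run gradient descent using m points per update.

    m = 1 is stochastic, m = len(xs) is batch, anything in between is mini-batch.
    Returns the final weight and the number of parameter updates performed.
    """
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    n_updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # new random order every epoch
        for start in range(0, len(indices), m):
            chunk = indices[start:start + m]
            grad = batch_gradient(w, [xs[i] for i in chunk], [ys[i] for i in chunk])
            w -= lr * grad
            n_updates += 1
    return w, n_updates

# Toy data (hypothetical): y = 3x exactly, so the optimum is w = 3
xs = [i / 10 for i in range(1, 21)]
ys = [3.0 * x for x in xs]

for m in (1, 5, len(xs)):
    w, n_updates = gradient_descent(xs, ys, m)
    print(f"m={m:2d}: w={w:.4f} after {n_updates} updates")
```

Note that for the same number of epochs the number of parameter updates differs by a factor of the dataset size between the two extremes.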

## <font color="red">To Do</font>


### 1

- Stochastic: 

- Batch: 

- Mini-Batch: 
  

## 2 Possible Problems (2.5 pts)

1. One of the optimization challenges is ill-conditioning. To answer the following questions read [this article](https://medium.com/@shaikhz94/understanding-ill-conditioning-in-deep-neural-networks-2396d6fb0098) (6 min read). Answer the questions in your own words. (1.5 pts) 
  - Read part [8.2.1 Ill-Conditioning](https://www.deeplearningbook.org/contents/optimization.html) of the Deep Learning Book and explain why even very small steps can increase the cost function when the Hessian matrix is ill-conditioned. Start from the equation of the second-order Taylor series expansion of the cost function.  
  - In practice, how can we spot ill-conditioning?
  - What can we do to solve the problem of ill-conditioning?
		
2. Explain what the exploding gradient problem is and when it occurs. What can be done to solve the exploding gradient problem? (1 pt)
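
As a purely numerical illustration of what "ill-conditioned" means (it does not answer the questions above), the sketch below runs plain gradient descent on two quadratic bowls whose Hessians have very different condition numbers. The functions `condition_number` and `descend` and the chosen curvatures are assumptions made for this example.

```python
def condition_number(eigenvalues):
    """Ratio of largest to smallest absolute eigenvalue of a symmetric Hessian."""
    magnitudes = [abs(e) for e in eigenvalues]
    return max(magnitudes) / min(magnitudes)

def descend(curvatures, steps, lr):
    """Gradient descent on f(x) = 0.5 * sum(c_i * x_i^2), starting from all ones.

    The Hessian of f is diagonal with entries c_i, so its eigenvalues are the c_i
    and the gradient in coordinate i is c_i * x_i.
    """
    x = [1.0 for _ in curvatures]
    for _ in range(steps):
        x = [xi - lr * c * xi for xi, c in zip(x, curvatures)]
    return x

well = [1.0, 2.0]      # condition number 2
ill = [1.0, 1000.0]    # condition number 1000

for name, curv in (("well-conditioned", well), ("ill-conditioned", ill)):
    lr = 1.0 / max(curv)  # step size limited by the steepest direction
    x = descend(curv, steps=100, lr=lr)
    print(name, condition_number(curv), [round(v, 4) for v in x])
```

With the step size capped by the steepest direction, the flat direction of the ill-conditioned bowl has barely moved after 100 steps, while the well-conditioned bowl has essentially converged.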


## <font color="red">To Do</font>


### 2

1.  

2. 
  

## 3 Implementation (6 points)

Now you will be implementing and testing different approaches to optimize the training of a neural network. Here you will be working with the [CIFAR10](https://pytorch.org/vision/stable/generated/torchvision.datasets.CIFAR10.html#cifar10) dataset, which consists of images of airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Once again [cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) will be used as the loss criterion.

### 3.1 Early Stopping (2 points)

Although not covered explicitly in the last lecture as an optimization strategy, early stopping can be used as both a regularization technique and an optimization technique.

Some stopping rule is necessary because training minimizes empirical risk and does not halt on its own; it is simply stopped once a convergence criterion has been met. Up until this point we have used a very crude strategy: limiting training to a fixed number of epochs or steps.

In this exercise, you are tasked with implementing a simple version of early stopping. The usual way to implement early stopping is to stop training once a specific metric has not changed by more than an arbitrary value $\varepsilon$ over $n$ consecutive epochs (or steps), i.e.: $\forall i \in \{1, \dots, n\}: \left| M_i - M_{i-1} \right| < \varepsilon$, where $M_i$ is the value of the metric at epoch $i$.

**Note**: The $n$ value above is usually called `patience` in common implementations.
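
To see the criterion in action, here is a small stateless sketch that applies it to a list of past metric values. The helper `has_plateaued` and the accuracy values are made up for illustration; it deliberately ignores the `lower_better` option and is not the stateful class you are asked to implement below.

```python
def has_plateaued(history, epsilon, patience):
    """Return True if each of the last `patience` consecutive changes is below epsilon."""
    if len(history) < patience + 1:
        return False  # not enough epochs observed yet
    recent = history[-(patience + 1):]
    return all(abs(curr - prev) < epsilon for prev, curr in zip(recent, recent[1:]))

# Made-up accuracy values, one per epoch
accuracies = [0.30, 0.42, 0.47, 0.475, 0.478, 0.479]

for epoch in range(1, len(accuracies) + 1):
    stop = has_plateaued(accuracies[:epoch], epsilon=0.01, patience=3)
    print(f"epoch {epoch}: stop={stop}")
```

In this toy run the check only fires at the sixth epoch, the first point at which three consecutive changes are all smaller than $\varepsilon = 0.01$.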

In [None]:
!pip install -q torcheval

In [None]:
# Let's import some libraries
import torch
import torchvision
import torchvision.transforms as transforms
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torcheval.metrics import MulticlassAccuracy

import numpy as np
from tqdm import tqdm

#### CIFAR-10: Some notes about our data

This dataset presents a multi-class classification task. More details can be found [here](https://www.cs.toronto.edu/~kriz/cifar.html).

Important details:

* Each image is a color image of 32x32 pixels
* There are 10 classes (laid out in the following code)
* You should use the [MulticlassAccuracy](https://torchmetrics.readthedocs.io/en/stable/classification/accuracy.html#multiclassaccuracy) from the `torcheval.metrics` library (already installed and imported for you).

A quick overview of how to use metrics in PyTorch can be found [here](https://torchmetrics.readthedocs.io/en/stable/pages/overview.html).

The TL;DR of it all is to accumulate statistics per batch using the `update` method and then compute the final metric value with `compute`. 

In [None]:
# Let's load our data before getting into it
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 32

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

In [None]:
# A model is given for ease of reproducibility
class MyNetwork(nn.Module):
    def __init__(self, lr=0.0001):
        super(MyNetwork, self).__init__()
        self.learning_rate = lr

        self.network = nn.Sequential(
            nn.Linear(32*32*3, 64),
            nn.ReLU(),
            nn.Linear(64, 10),
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        return self.network(x)


In [None]:
# Skeleton code 
class EarlyStopChecker:
  def __init__(self,
               delta: float,
               patience: int,
               lower_better: bool = False):
    """
    Params:
    delta: The max difference allowed
    patience: Number of epochs to tolerate without significant changes
    lower_better: The metric value is better the lower it is if `True` is passed
    """
    # TODO: Implement
    raise NotImplementedError
  
  def check_early_stop(self,
                       metric_value: float) -> bool:
    """
    A function which upon receiving the latest metric value, determines
    whether training should be stopped (by returning `True`) or not 
    (by returning `False`)

    Params:
    metric_value: The value of the metric at this time step

    Returns `True` if training should stop, `False` otherwise.
    """
    # TODO: Implement
    raise NotImplementedError

In [None]:
# High learning rate so we can see the early stop effects quickly
net = MyNetwork(lr=0.5)

In [None]:
# Training loop code
epochs = 20
early_stopper = EarlyStopChecker(0.1, 3)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=net.learning_rate)
metric = MulticlassAccuracy()

for epoch in range(epochs):

  running_loss = 0.0
  for n_batch, mini_batch in (pbar := tqdm(enumerate(trainloader, 1),
                                           total=len(trainloader))):
    # Training code
    inputs, labels = mini_batch
    optimizer.zero_grad()

    # Do optimization step
    outputs = net(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()

    # Print some statistics
    running_loss += loss.item()
    pbar.set_description(f"Loss: {running_loss / n_batch:.4f}")
    pbar.refresh()

  # TODO: compute metrics on the test/validation dataset
  current_metric = ...

  # TODO: Use `early_stopper.check_early_stop()` to check
  # whether training should continue


  print(f"Accuracy for epoch #{epoch}: {current_metric.item()}")

  # Reset the metric in preparation for the next epoch
  metric.reset()

### 3.2 Trying different optimizers (2 points)

#### 3.2.1 Using different optimizers in PyTorch (1 point)

It's time to try out the different optimizers covered in the lecture.

Create a function that returns an optimizer object for each of the following:

1. [SGD with Momentum](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html): note that you should set the momentum parameter to $0.9$. This is a commonly used value for this hyperparameter, although other approaches to picking a value exist.
1. [AdaGrad](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)
1. [RMSProp](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html)
1. [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)
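
As a reminder of what the momentum parameter does, here is a minimal scalar sketch of a single update. The helper `sgd_momentum_step` is hypothetical; the formulation (velocity accumulates gradients, the parameter moves against the velocity) is, to our understanding, the one `torch.optim.SGD` uses when dampening is zero, but treat it as an illustration rather than a reference implementation.

```python
def sgd_momentum_step(param, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update for a single scalar parameter.

    Uses the formulation v <- momentum * v + grad, param <- param - lr * v.
    """
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity

# Minimise f(w) = w^2 (gradient 2w), starting at w = 1 with zero velocity
w, v = 1.0, 0.0
for step in range(200):
    w, v = sgd_momentum_step(w, grad=2.0 * w, velocity=v, lr=0.01)

print(f"w after 200 steps: {w:.6f}")
```

With `momentum=0.0` the same function reduces to plain gradient descent, which makes it easy to compare the two trajectories by hand.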


In [None]:
# Set our seed for reproducibility
torch.manual_seed(0)

In [None]:
def get_optimizer(optimizer_name: str,
                  model: nn.Module,
                  learning_rate: float):
  """
  Function that returns an optimizer object that will be used
  for training

  Params:
  optimizer_name: This will indicate the type of optimizer to return
  model: Model which has the parameters to be optimized
  learning_rate: learning rate to be used
  """
  # TODO: return the right object
  raise NotImplementedError

In [None]:
# Usage:
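# `get_optimizer` also needs the model and the learning rate;
# here we reuse the `net` object created above
sgd_mom_optimizer = get_optimizer('momentum', net, net.learning_rate)
adagrad_optimizer = get_optimizer('adagrad', net, net.learning_rate)
rms_optimizer = get_optimizer('rmsprop', net, net.learning_rate)
adam_optimizer = get_optimizer('adam', net, net.learning_rate)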

#### 3.2.2 Discussion (1 point)

What results do you expect from each of the optimizers? Motivate your answers while keeping them concise.

## <font color="red">To Do</font>


#### 3.2.2

### 3.3 Analyze your results (2 points)

Modify your training code from previous assignments (and the previous exercise if you wish to use early stopping) and experiment with the different optimization approaches on the CIFAR-10 dataset.

Plot the training loss for each of the optimizers and discuss your results.

How do they compare with your expectations from the previous exercise?

You will be using the same dataset as above. You should also use the same model class (`MyNetwork`) for this exercise as in the previous exercise.

## <font color="red">To Do</font>


### 3.3

In [None]:
# Train for each optimizer object
for epoch in range(10):
  for minibatch in trainloader:
    # TODO: Training code here
    # Train with each of the optimizers listed above
    # Track your results (number of epochs until convergence, final loss, final accuracy)
    pass  # placeholder so the cell parses; replace with your training code

In [None]:
# Plot the resulting losses