# NHR Summer School – Data-Parallel Neural Networks with `PyTorch`
#### Dr. Charlotte Debus (charlotte.debus@kit.edu), Dr. Marie Weiel (marie.weiel@kit.edu), and David Li (david.li@kit.edu)
## Agenda

| When              | What                                                    |
| :-----------------| :------------------------------------------------------ |
| **09:00 - 10:15** | **Introduction to Neural Networks**                     |  
|                   | Backpropagation and Stochastic Gradient Descent (SGD)   |  
|                   | Layer Architectures                                     |  
|                   | Training a Neural Network                               |  
| **10:30 - 12:00** | **Hands-on Session: Neural Networks with `PyTorch`**    |  
| 12:00 - 13:00     | *Enjoy your lunch break!*                               |  
| **13:00 - 14:15** | **Data-Parallel Neural Networks**                       |  
|                   | Parallelization Strategies for Neural Networks          |  
|                   | Distributed SGD                                         |  
|                   | IID and Large Minibatch Effects                         |  
| **14:30 - 16:00** | **Hands-on Session: `PyTorch DistributedDataParallel`** |


## Hands-on Session: Neural Networks with `PyTorch`
In this hands-on tutorial, you will learn how to train a neural network in `PyTorch`. This exercise serves as a prerequisite for training the network in a data-parallel fashion later on. We will use the example of *AlexNet*, a convolutional neural network (CNN) for image classification.

### Background and Introduction

#### Convolutional Neural Networks 

In deep learning, CNNs are a class of artificial neural networks most commonly applied to analyze visual images. 
A CNN consists of one or more convolutional layers, each typically followed by a so-called pooling layer. This unit can be repeated as often as you wish within the network. 
Compared to fully-connected layers, there are three major differences:
- 2D or 3D arrangement of neurons
- Shared weights
- Local connectivity

You can find a nice overview of CNNs here: https://www.youtube.com/watch?v=YRhxdVk_sIs&list=PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU&index=19  
##### What is a convolutional layer?

Usually, the input is a 2D or 3D matrix representing the pixels of a single grayscale or color image sample. The neurons in a convolutional layer are arranged accordingly. 
Each neuron's input is calculated via a discrete convolution by moving a small convolution matrix, the so-called filter kernel, stepwise over the input. The input corresponds to the inner product of the filter kernel with the currently underlying image section. 
Neighboring neurons in the convolutional layer thus react to overlapping areas in local environments of the input. 
Similar to the biological receptive field, a neuron only responds to stimuli in a local neighborhood of the previous layer. 
The values in the kernel correspond to the weights and are learned independently. 
They are the same for all neurons of a layer (shared weights), which is why a convolutional layer is translation-equivariant: a pattern that is shifted in the input produces a correspondingly shifted response. As a result, each neuron in the first convolutional layer typically encodes, for example, the intensity of edges in a certain local area of the input.

To handle the border regions of the input, various padding methods exist. 
You can find an overview of zero padding in CNNs here: https://www.youtube.com/watch?v=qSTv_m-KFk0&list=PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU&index=23 
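
The properties described above (local connectivity, shared weights, and zero padding) can be checked on a toy input. The following is a minimal sketch, assuming a single random 8 x 8 grayscale image and one 3 x 3 filter kernel; these sizes are chosen for illustration only and are not part of *AlexNet*.

```python
import torch

torch.manual_seed(0)

# One grayscale 8 x 8 image in PyTorch's (batch, channels, height, width) layout.
image = torch.randn(1, 1, 8, 8)

# A single 3 x 3 filter kernel, once without and once with zero padding.
conv_no_pad = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=0)
conv_zero_pad = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)

out_no_pad = conv_no_pad(image)
out_zero_pad = conv_zero_pad(image)
print(out_no_pad.shape)    # torch.Size([1, 1, 6, 6]): the border is lost
print(out_zero_pad.shape)  # torch.Size([1, 1, 8, 8]): padding preserves the size

# Shared weights: 3 * 3 kernel values + 1 bias, independent of the image size.
num_params = sum(p.numel() for p in conv_no_pad.parameters())
print(num_params)  # 10

# Local connectivity: the top-left output is the inner product of the kernel
# with the top-left 3 x 3 image patch (plus the bias).
patch = image[0, 0, :3, :3]
manual = (patch * conv_no_pad.weight[0, 0]).sum() + conv_no_pad.bias[0]
matches = torch.allclose(manual, out_no_pad[0, 0, 0, 0], atol=1e-5)
print(matches)
```

Note that `torch.nn.Conv2d` does not flip the kernel, which is why the plain element-wise product with the image patch reproduces the layer's output.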

After determining each neuron's input as described above, it is transformed into the output by an activation function, for CNNs usually the **Re**ctified **L**inear **U**nit, $\text{ReLU}\left(x\right) = \text{max}\left(0,x\right)$. 
ReLU is not differentiable at $x=0$. Since backpropagation requires the computation of gradients, implementations simply assign a fixed value to the derivative at this point (`PyTorch` uses 0). A smooth, differentiable approximation of ReLU is the softplus function $f\left(x\right)=\text{ln}\left(1+e^x\right)$.
Such non-linear activations are what create the complex mappings between the model's inputs and outputs. They are applied after linear transformations to introduce nonlinearity, helping neural networks learn a wide variety of phenomena.
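
To make the two functions concrete, here is a small sketch that evaluates ReLU and $\text{ln}\left(1+e^x\right)$ (available as `softplus` in `torch.nn.functional`) on three hand-picked values and inspects the gradient that autograd reports for ReLU. The input values are arbitrary and only serve as an illustration.

```python
import math

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 3.0], requires_grad=True)

# ReLU clips negative inputs to zero.
relu_out = torch.relu(x)
relu_out.sum().backward()
relu_grad = x.grad.clone()
print(relu_out.detach())  # tensor([0., 0., 3.])
print(relu_grad)          # tensor([0., 0., 1.]): the derivative at x = 0 is set to 0

# Softplus, ln(1 + exp(x)), is smooth everywhere and approaches ReLU for large |x|.
softplus_out = F.softplus(x.detach())
print(softplus_out)
softplus_at_zero = softplus_out[1].item()  # equals ln(2)
```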

##### What is a pooling layer?
The subsequent step, pooling, is used to discard unnecessary information. 
For example, for object recognition in images, an edge's exact position is not important - the approximate location of a feature is sufficient. The biological counterpart to pooling is lateral inhibition in the visual cortex. 
Different types of pooling exist. By far the most common is max-pooling, where only the activity of the most active neuron is kept from each 2 x 2 square of neurons in the convolutional layer. 
The remaining neurons' activities are thrown away. Alternatives like mean pooling turned out to be less effective in practice. 
Despite the data reduction, the network's performance is generally not reduced; in fact, pooling even offers some significant advantages:

- Reduced memory requirements and increased computational speed; thus, the ability to create deeper networks that can solve more complex tasks
- Analogously to the visual cortex, automatic growth of the size of the receptive fields (without explicitly increasing the size of the filter kernels) and increasing complexity of recognized features, e.g., parts of a face, in deeper convolutional layers
- Prevention of overfitting


![Max Pooling.](Max_pooling.png "Max Pooling with a 2×2 filter and step size = 2. The step size indicates by how many pixels the filter is shifted per operation.")  

Source: By Aphex34 - own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45673581  
Useful overview of max-pooling: https://www.youtube.com/watch?v=ZjM_XQa5s6s&list=PLZbbT5o_s2xq7LwI2y8_QtvuXZedL6tQU&index=24
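
As a quick illustration of the figure above, the sketch below applies 2 x 2 max-pooling with step size 2 to a hand-written 4 x 4 feature map and compares it with mean (average) pooling. The numbers in the feature map are made up for this example.

```python
import torch

# A made-up 4 x 4 feature map in (batch, channels, height, width) layout.
feature_map = torch.tensor([[1.0, 3.0, 2.0, 4.0],
                            [5.0, 6.0, 1.0, 2.0],
                            [7.0, 2.0, 9.0, 1.0],
                            [3.0, 4.0, 5.0, 6.0]]).reshape(1, 1, 4, 4)

# Max-pooling keeps only the largest activation of each 2 x 2 square.
max_pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)
max_pooled = max_pool(feature_map)
print(max_pooled[0, 0])
# tensor([[6., 4.],
#         [7., 9.]])

# Mean pooling averages each 2 x 2 square instead.
avg_pool = torch.nn.AvgPool2d(kernel_size=2, stride=2)
avg_pooled = avg_pool(feature_map)
print(avg_pooled[0, 0])
# tensor([[3.7500, 2.2500],
#         [4.0000, 5.2500]])
```

In both cases, the 4 x 4 input is reduced to 2 x 2, i.e., to a quarter of the original number of activations.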

#### AlexNet

*AlexNet* is a CNN architecture developed by Alex Krizhevsky together with Ilya Sutskever and Geoffrey Hinton. 
The network solves the problem of image classification, originally on the ImageNet dataset. 
The input is an RGB image of size 256 x 256 of one of 1000 classes (e.g., cats, dogs, etc.) and the output is a vector of 1000 numbers that sum up to 1. 
Thus, the $i^\text{th}$ element of the output vector can be interpreted as the probability that the input image belongs to the $i^\text{th}$ class. 
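
An output vector with these properties is typically obtained by applying the softmax function to the raw scores (logits) of the last layer, as also mentioned in the abstract below. A minimal sketch with three instead of 1000 classes and made-up logits:

```python
import torch

# Made-up raw scores (logits) of the last layer for a single image and three classes.
logits = torch.tensor([2.0, 1.0, 0.1])

# Softmax maps the logits to non-negative values that sum up to 1.
probs = torch.softmax(logits, dim=0)
total = probs.sum().item()
predicted_class = probs.argmax().item()

print(probs)            # largest probability for class 0
print(total)            # approximately 1
print(predicted_class)  # 0
```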

In the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a software competition in image classification, AlexNet achieved a top-5 error of 15.3%, which is more than 10.8 percentage points lower than that of the runner-up. 
The original work's main finding was that the model's depth is crucial to its high performance, which requires the extensive use of computational resources, but was made possible by training the network on graphics processing units (GPUs).  

*AlexNet* consists of eight layers: the first five are convolutional layers, some followed by max-pooling, the last three are fully-connected layers. It uses the non-saturating ReLU activation function, which offers better training performance than the tanh and sigmoid activation functions.
The *AlexNet* paper is one of the most influential in the field of computer vision, as it provided the impulse for many other papers that used CNNs and GPUs to accelerate deep learning. 
According to Google Scholar, it has been cited over 131,000 times (as of May 2, 2023). 


**Imagenet classification with deep convolutional neural networks**

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. *Advances in neural information processing systems* 25 (2012): 1097-1105  
https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf  

*Abstract*  
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. 
On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, respectively, which is considerably better than the previous state-of-the-art. 
The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. 
To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. 
To reduce overfitting in the fully connected layers we employed a recently developed regularization method called "dropout" that proved to be very effective. 
We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 

![Architecture of AlexNet.](AlexNet-1.png "Architecture of AlexNet: AlexNet consists of eight layers: the first five are convolutional layers, some followed by max-pooling layers, the last three are fully connected layers. It uses the non-saturating ReLU activation function.")  

Source: https://learnopencv.com/understanding-alexnet/

#### ImageNet

ImageNet (https://image-net.org/index.php) is a database of images released in 2009 at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) that is used to train CNNs. 
Each image is associated with a noun. 
The nouns are arranged hierarchically by the WordNet project. 
For each noun, there are more than 500 images on average. 
For more than 14 million images, the objects depicted were documented by hand. 
In at least one million of the images, these objects are additionally marked with bounding boxes. 
ImageNet contains more than 20,000 categories in English, with typical categories such as "balloon" or "strawberry". 
The database of third-party image URL annotations is freely accessible directly through ImageNet, although the actual images are not owned by ImageNet.

Since 2010, the ImageNet project has held an annual software competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). 
Here, software systems from the fields of deep learning and object recognition compete to correctly classify and recognize objects and scenes. 
In this competition, a reduced list of a thousand non-overlapping classes is used.

#### CIFAR-10
Since the ImageNet dataset is relatively large, we use CIFAR-10 here. 
This dataset contains 60,000 color images of size 32 x 32 from 10 classes, where each class holds 6000 images. 
The dataset is divided into five training batches and one test batch, each containing 10,000 images. 
The test batch contains exactly 1000 randomly selected images from each class. 
The training batches contain the remaining images in random order, and some training batches may contain more images from one class than others. 
In total, the training batches contain exactly 5000 images from each class.

Below you can see the classes of the dataset and 10 random images from each class:  
![CIFAR-10-Dataset.](Cifar.png " ")  

Source: https://www.cs.toronto.edu/~kriz/cifar.html

### What you will do now
Implement *AlexNet*, along with a sequential training loop, to classify the CIFAR10 dataset in `PyTorch`. 
Below, you will find 

- the backbone of the neural network, 
- some helper functions, and 
- a function for loading the data.  

Fill in the gaps in the code to create functional Python code with all needed class and function definitions as well as the main part. 
**Normal comments with '#' describe the code as usual, in lines with '##' you have to add code.** 

### 1. Define your model
Neural networks consist of layers or modules that perform operations on data. 
`PyTorch`'s `torch.nn` namespace provides all the building blocks you need to build your own neural network [doc](https://pytorch.org/tutorials/beginner/basics/buildmodel_tutorial.html#). 
A neural network is a module itself that consists of other modules (layers). 
This nested structure allows you to build complex architectures easily. 
Every neural network module in `PyTorch` should subclass the `nn.Module` base class. 
We thus start with implementing *AlexNet* as a custom model subclass of `nn.Module`. 

Furthermore, every custom model class in `PyTorch` needs to override the `__init__` and the `forward` method. 
The network's layers are initialized in `__init__`, which defines our model's architecture by providing type and order of its layers. Operations on the input data are implemented in the `forward` method, which defines the computation performed at every call, i.e., the so-called forward pass.  
*Side note: Although the recipe for the forward pass is defined in the* `forward` *method, you should call the* `nn.Module` *instance itself in your actual code, i.e.,* `model(x)`*. This takes care of running any registered hooks, whereas calling* `model.forward(x)` *directly silently ignores them.* 
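
The pattern described above can be sketched on a deliberately small model. `TinyNet` below is a hypothetical toy network made up for this illustration (it is not *AlexNet* and not part of the exercise); the sketch also registers a forward hook to show the difference between calling the instance and calling `forward` directly.

```python
import torch


class TinyNet(torch.nn.Module):  # Hypothetical toy model for illustration only.

    def __init__(self, in_features=4, num_classes=3):
        super().__init__()
        # Layers are created once in __init__ ...
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(in_features, 8),
            torch.nn.ReLU(),
            torch.nn.Linear(8, num_classes),
        )

    def forward(self, x):
        # ... and applied to the input in forward.
        return self.layers(x)


model = TinyNet()

# Register a forward hook that records the output shape of every hooked call.
hook_calls = []
handle = model.register_forward_hook(
    lambda module, inputs, output: hook_calls.append(tuple(output.shape))
)

x = torch.randn(5, 4)     # Mini-batch of 5 samples with 4 features each.
logits = model(x)         # Runs forward() AND the registered hook.
_ = model.forward(x)      # Runs forward() only; the hook is skipped.
handle.remove()

print(logits.shape)     # torch.Size([5, 3])
print(len(hook_calls))  # 1
```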

In addition, the `torchvision` package provides popular datasets (such as the CIFAR10 dataset we will use here), model architectures, and common image transformations for computer vision [doc](https://pytorch.org/vision/main/index.html).

To define the *AlexNet* model in `__init__`, you will need the following layers as building blocks:
- Convolutional layer `torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)` [doc](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html)
- ReLU activation function `torch.nn.ReLU()` [doc](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html)
- Max-pooling layer `torch.nn.MaxPool2d(kernel_size, stride)` [doc](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html)
- Dropout `torch.nn.Dropout(p)` [doc](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html)
- Fully connected layer `torch.nn.Linear(in_features, out_features)` [doc](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)
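
To get a feeling for how these building blocks change the shape of the data, the sketch below pushes a random mini-batch through the first convolutional block that is already given in the model code further down, followed by a flatten, dropout, and linear layer. The input size of 64 x 64 and the 10 output features are assumptions made for this illustration only.

```python
import torch

# First convolutional block as given in the model skeleton of this notebook.
first_block = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    torch.nn.ReLU(inplace=True),
    torch.nn.MaxPool2d(kernel_size=3, stride=2),
)

# Toy head for illustration (NOT the AlexNet classifier).
toy_head = torch.nn.Sequential(
    torch.nn.Dropout(p=0.5),
    torch.nn.Linear(64 * 7 * 7, 10),
)

x = torch.randn(2, 3, 64, 64)  # Mini-batch of two random RGB "images".

with torch.no_grad():
    after_conv = first_block[0](x)       # Convolution only.
    after_pool = first_block(x)          # Convolution + ReLU + max-pooling.
    flat = torch.flatten(after_pool, 1)  # Keep batch dimension, flatten the rest.
    out = toy_head(flat)

print(after_conv.shape)  # torch.Size([2, 64, 15, 15])
print(after_pool.shape)  # torch.Size([2, 64, 7, 7])
print(flat.shape)        # torch.Size([2, 3136])
print(out.shape)         # torch.Size([2, 10])
```

The spatial sizes follow from $\lfloor (n + 2p - k) / s \rfloor + 1$ for input size $n$, padding $p$, kernel size $k$, and stride $s$.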

Last but not least: Always remember, [Google](https://www.google.com) is your best friend.

In [1]:
# IMPORTS
import torch
import torchvision
import os
import time
import numpy as np
import random

In [2]:
# MODEL
# Define neural network by subclassing PyTorch's nn.Module. 
class AlexNet(torch.nn.Module):
    
    # Initialize neural network layers in __init__. 
    def __init__(self, num_classes = 1000, dropout = 0.5):
        super().__init__()
        self.features = torch.nn.Sequential(
            # AlexNet has 8 layers: 5 convolutional layers, some followed by max-pooling (see figure),
            # and 3 fully connected layers. In this model, we use nn.ReLU between our layers, 
            # but there are other activations to introduce non-linearity in a model.
            # nn.Sequential is an ordered container of modules. 
            # The data is passed through all the modules in the same order as defined. 
            # You can use sequential containers to put together a quick network.
            #
            # IMPLEMENT FEATURE-EXTRACTOR PART OF ALEXNET HERE!
            # 1st convolutional layer (+ max-pooling)
            torch.nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            torch.nn.ReLU(inplace=True),
            torch.nn.MaxPool2d(kernel_size=3, stride=2),
            ## 2nd convolutional layer (+ max-pooling)
            ## 3rd + 4th convolutional layer
            ## 5th convolutional layer (+ max-pooling)
        )
        # Average pooling to downscale possibly larger input images.
        self.avgpool = torch.nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = torch.nn.Sequential( 
            # IMPLEMENT FULLY CONNECTED MULTI-LAYER PERCEPTRON PART HERE!
            # 6th, 7th + 8th fully connected layer 
            # The linear layer is a module that applies a linear transformation 
            # on the input using its stored weights and biases.
            # 6th fully connected layer (+ dropout)
            torch.nn.Dropout(p=dropout),
            torch.nn.Linear(256 * 6 * 6, 4096),
            torch.nn.ReLU(inplace=True),
            ## 7th fully connected layer (+ dropout)
            # 8th (output) layer
            torch.nn.Linear(4096, num_classes),
        )
    # Forward pass: Implement operations on the input data, i.e., apply model to input x.
    def forward(self, x):
        # IMPLEMENT OPERATIONS ON INPUT DATA x HERE!
        ## Apply feature-extractor part to input.
        ## Apply average-pooling part.
        x = torch.flatten(x, 1) # Flatten.
        ## Apply fully connected multilayer perceptron part.
        return x

### 2. Define useful helper functions
Next, we will define some useful helper functions:
- `set_all_seeds`: Set all random seeds to a fixed value. This is important if you want to make your experiments reproducible, which is helpful when debugging code.
- `compute_accuracy`: Compute the accuracy of your model's predictions on a given dataset. You will need this function for validating your model during training and for testing it on a held-out test dataset after the training is done. 
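
Both ideas can be tried out on toy tensors first. The following sketch shows that re-seeding reproduces the same random numbers and how an accuracy in percent can be derived from logits; the logits and targets are made up, and no model or dataloader is involved.

```python
import torch

# Seeding: the same seed yields the same sequence of random numbers.
torch.manual_seed(123)
a = torch.rand(3)
torch.manual_seed(123)
b = torch.rand(3)
print(torch.equal(a, b))  # True

# Accuracy: made-up logits for four samples and three classes.
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.9, 0.8, 0.1],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 1, 1, 2])

predicted = torch.argmax(logits, dim=1)        # Class with highest score per sample.
correct = (predicted == targets).sum().item()  # Number of correct predictions.
accuracy = 100.0 * correct / targets.size(0)   # Accuracy in percent.

print(predicted)  # tensor([0, 1, 0, 2])
print(accuracy)   # 75.0
```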

In [None]:
# NOTE: To run the code on a cluster, it is beneficial to put the helper
# functions in a separate module file called, e.g., helper.py and import them via:
# from helper import set_all_seeds

def set_all_seeds(seed):
    """
    Set all seeds to make experiments reproducible.
    
    Params
    ------
    seed : int
           seed to use
    """
    os.environ["PL_GLOBAL_SEED"] = str(seed) # seed passed to spawned subprocesses
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    return


def compute_accuracy(model, data_loader, device):
    """
    Compute accuracy of model predictions on given labeled data.
    
    Params
    ------
    model : torch.nn.Module
            Model.
    data_loader : torch.utils.data.DataLoader
                  Dataloader.
    device : torch.device
             device to use
    
    Returns
    -------
    float : The model's accuracy on the given dataset in percent.
    """
    with torch.no_grad():  # Disable gradient calculation to reduce memory consumption.

        # Initialize number of correctly predicted samples + overall number of samples.
        correct_pred, num_examples = 0, 0

        for i, (features, targets) in enumerate(data_loader):
            # CONVERT DATASET TO USED DEVICE.
            ## features = ...
            ## targets = ...
            #
            # CALCULATE PREDICTIONS OF CURRENT MODEL ON FEATURES OF INPUT DATA.
            ## logits = ...
            ## Determine class with highest score.
            ## Compare predictions to actual labels to determine number of correctly predicted samples.
            ## Determine overall number of samples.
            
    # RETURN ACCURACY AS PERCENTAGE OF CORRECTLY PREDICTED SAMPLES.
    ## acc = ...
    ## return acc    

### 3. Define dataloaders
Next, you need to load the data. 
As code for pre-processing data can get messy fast, we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. 
`PyTorch` provides two data primitives that allow you to use pre-loaded datasets as well as your own data [doc](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html):
- `torch.utils.data.Dataset`: Stores the data samples and their corresponding labels (targets).
- `torch.utils.data.DataLoader`: Wraps an iterable around the `Dataset` to enable easy access to the samples. 

As the data does not always come in its final processed form required for training your model, you can use transforms with a `Dataset` to manipulate the data and make it suitable for training. 
The resulting `Dataset` retrieves the dataset's features and labels one sample at a time. 
However, you have already learned that when training a model, we typically want to pass samples in mini-batches and reshuffle the data at every epoch to reduce model overfitting.
`DataLoader` is an iterable that abstracts this complexity for us in an easy API. 
Simply speaking, it combines the plain dataset with a sampling strategy. 
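
The interplay of dataset, sampler, and dataloader can be sketched without downloading anything. The example below is an illustration under stated assumptions: it uses random tensors wrapped in a `TensorDataset` instead of CIFAR10, and the split sizes and batch size are arbitrary.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

torch.manual_seed(0)

# 100 random "images" of size 3 x 8 x 8 with random labels from 10 classes.
features = torch.randn(100, 3, 8, 8)
targets = torch.randint(0, 10, (100,))
dataset = TensorDataset(features, targets)

# Index-based split: 10 samples for validation, 90 for training.
indices = torch.randperm(len(dataset)).tolist()
valid_indices, train_indices = indices[:10], indices[10:]

train_loader = DataLoader(dataset, batch_size=16, drop_last=True,
                          sampler=SubsetRandomSampler(train_indices))
valid_loader = DataLoader(dataset, batch_size=16,
                          sampler=SubsetRandomSampler(valid_indices))

batch_features, batch_targets = next(iter(train_loader))
print(batch_features.shape)  # torch.Size([16, 3, 8, 8])
print(batch_targets.shape)   # torch.Size([16])
print(len(train_loader))     # 5 full mini-batches; the incomplete last one is dropped
print(len(valid_loader))     # 1
```

`SubsetRandomSampler` draws the given indices in a new random order each time the dataloader is iterated, which provides the reshuffling at every epoch.
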
We use the CIFAR10 `Dataset` provided by `torchvision` and define the dataloaders for our classification problem below:

In [None]:
def get_dataloaders_cifar10(batch_size, 
                            num_workers=0,
                            root='data',
                            validation_fraction=0.1,
                            train_transforms=None,
                            test_transforms=None):
    """
    Get CIFAR10 dataloaders for training, validation, and testing.
    
    Params
    ------
    batch_size : int
                 batch size
    num_workers : int
                  How many workers to use for data loading.
    root : str
           path to data dir
    validation_fraction : float
                          fraction of train dataset used for validation
    train_transforms : torchvision.transforms.<transformation>
                       How to preprocess the training data.
    test_transforms : torchvision.transforms.<transformation>
                      How to preprocess the test data.
                      
    Returns
    -------
    torch.utils.data.DataLoader : training dataloader
    torch.utils.data.DataLoader : validation dataloader
    torch.utils.data.DataLoader : testing dataloader
    """
    if train_transforms is None:
        train_transforms = torchvision.transforms.ToTensor()

    if test_transforms is None:
        test_transforms = torchvision.transforms.ToTensor()

    train_dataset = torchvision.datasets.CIFAR10(root=root,
                                                 train=True,
                                                 transform=train_transforms,
                                                 download=True)

    valid_dataset = torchvision.datasets.CIFAR10(root=root,
                                                 train=True,
                                                 transform=test_transforms)

    test_dataset = torchvision.datasets.CIFAR10(root=root,
                                                train=False,
                                                transform=test_transforms)

    # Perform index-based train-validation split of original training data. 
    total = len(train_dataset) # Get overall number of samples in original training data.
    idx = list(range(total)) # Make index list.
    np.random.shuffle(idx) # Shuffle indices.
    vnum = int(validation_fraction * total) # Determine number of validation samples from validation split.
    train_indices, valid_indices = idx[vnum:], idx[0:vnum] # Extract train and validation indices.
    
    train_sampler = torch.utils.data.SubsetRandomSampler(train_indices)
    valid_sampler = torch.utils.data.SubsetRandomSampler(valid_indices)

    valid_loader = torch.utils.data.DataLoader(dataset=valid_dataset,
                                               batch_size=batch_size,
                                               num_workers=num_workers,
                                               sampler=valid_sampler)

    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=batch_size,
                                               num_workers=num_workers,
                                               drop_last=True,
                                               sampler=train_sampler)

    test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                              batch_size=batch_size,
                                              num_workers=num_workers,
                                              shuffle=False)

    return train_loader, valid_loader, test_loader

### 4. Define the training loop
Now that we have our `AlexNet` model and the CIFAR10 data, we want to actually train, validate, and test it by optimizing its parameters on our data. 
When training neural networks, the most frequently used algorithm is backpropagation, where the parameters, or model weights, are adjusted according to the gradient of the loss function with respect to the given parameter. 
To compute those gradients automatically, `PyTorch` has a built-in differentiation engine called `torch.autograd`. 
Training a model is an iterative process. In each iteration, the model predicts the output for a given input, calculates the error in its prediction as quantified by the loss function, collects the derivatives of the loss w.r.t. its parameters, and optimizes these parameters using gradient descent. 
For a more detailed walkthrough of this process, check out this video on backpropagation from [3Blue1Brown](https://www.youtube.com/watch?v=tIeHLnjs5U8).

The training is influenced by adjustable hyperparameters (HPs) that let you control the model optimization process.
Before training your model, you need to set the following HPs:
- **Number of epochs:** How many times to iterate over the complete dataset.
- **Batch size:** How many data samples to propagate through the network before updating the parameters.
- **Learning rate (LR):** How much to change the model parameters at each update step. Smaller values yield slow learning speed, while large values may result in unpredictable behavior during training.

Once you set these HPs, you can train and optimize your model within an optimization loop, where each iteration is an epoch consisting of two main parts:
- **Training loop:** Iterate over the mini-batched training dataset and try to converge to optimal parameters.
- **Validation loop:** Iterate over the validation dataset to check if the model's performance is improving on unseen data.

The most important concepts used in the training loop are explained in more detail below. 
If you are already familiar with these things, you can just skip those parts.

##### Loss function
When presented with some training data, an untrained network is unlikely to give the correct answer. 
We use a loss function to measure the degree of dissimilarity of a prediction to the actual target, and it is the loss function that we want to minimize during training. 
To calculate the loss, we make a prediction using the inputs of our given data sample and compare it with the true label. 
A common loss function for classification is cross entropy, which normalizes the logits and computes the prediction error.
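
As a small sanity check of what "normalizes the logits and computes the prediction error" means, the sketch below compares `torch.nn.functional.cross_entropy` with a manual computation via log-softmax. The logits and targets are made up for this illustration.

```python
import torch
import torch.nn.functional as F

# Made-up logits for two samples and three classes, plus the true class indices.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

# Built-in cross-entropy loss, averaged over the mini-batch.
loss = F.cross_entropy(logits, targets)

# Manual version: normalize the logits with log-softmax, pick the
# log-probability of the true class, negate, and average.
log_probs = torch.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(len(targets)), targets].mean()

same = torch.allclose(loss, manual_loss)
print(loss.item(), manual_loss.item())
print(same)
```

Because the normalization is part of the loss, the model itself should output raw logits rather than probabilities when this loss is used.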

##### Optimizer
Optimization is the process of adjusting the model's parameters to reduce the model error in each training step. Optimization algorithms define how this process is performed. 
The `torch.optim` package provides various optimization algorithms [doc](https://pytorch.org/docs/stable/optim.html). 
To use `torch.optim`, you have to construct an `optimizer` object that encapsulates all optimization logic. 
We initialize the `optimizer` by registering the model’s parameters that need to be trained and passing it the LR hyperparameter.
During the training, this `optimizer` will hold the current state and update the parameters based on the computed gradients. 

Here, we use stochastic gradient descent, `torch.optim.SGD`. Many other optimizers are available in `PyTorch`, such as Adam and RMSprop, which may work better for other kinds of models and data.

##### Training loop
Inside the training loop, optimization technically happens in three steps:
- Call `optimizer.zero_grad()` to reset the gradients of your model's parameters before processing a new mini-batch. Gradients by default add up; to prevent double-counting, we explicitly zero them at each iteration.
- Backpropagate the prediction loss with a call to `loss.backward()`. `PyTorch` deposits the gradients of the loss w.r.t. each parameter. 
- Once we have our gradients, call `optimizer.step()` to adjust the parameters by the gradients collected in the backward pass.
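
The three steps above can be sketched on a toy problem. Note the assumptions: instead of *AlexNet* and cross entropy, this illustration fits a single linear layer to the made-up relation $y = 3x + 1$ with a mean-squared-error loss, and it uses the full dataset as one batch.

```python
import torch

torch.manual_seed(0)

# Made-up regression data following y = 3x + 1.
x = torch.randn(64, 1)
y = 3.0 * x + 1.0

model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

losses = []
for step in range(200):
    prediction = model(x)          # Forward pass.
    loss = loss_fn(prediction, y)  # Prediction error.
    optimizer.zero_grad()          # 1. Reset gradients from the previous step.
    loss.backward()                # 2. Backpropagate the loss.
    optimizer.step()               # 3. Update the parameters.
    losses.append(loss.item())

weight = model.weight.item()
bias = model.bias.item()
print(f"first loss: {losses[0]:.4f} | last loss: {losses[-1]:.6f}")
print(f"learned weight: {weight:.3f} | learned bias: {bias:.3f}")
```

The learned weight and bias should end up close to 3 and 1, respectively.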

##### Adjust the learning rate
When training deep neural networks, it is often useful to reduce the LR as the training progresses. 
So-called learning rate schedulers adjust the LR during training by reducing it according to a pre-defined schedule. 
`torch.optim.lr_scheduler` provides several methods to do so based on the number of epochs. 
In addition, `torch.optim.lr_scheduler.ReduceLROnPlateau` allows reducing the LR dynamically based on a validation measurement, like the model's accuracy on the validation dataset.
LR scheduling should be applied after the optimizer's update at the end of each epoch.
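
How such a scheduler reacts to a stalling validation accuracy can be sketched as follows. The accuracy values are made up, no actual training takes place, and the choices `factor=0.1` and `patience=2` are assumptions for this illustration; `mode="max"` is used because a higher accuracy is better.

```python
import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Reduce the LR by a factor of 10 once the monitored value has not
# improved for more than `patience` epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=2
)

fake_valid_acc = [50.0, 60.0, 60.0, 60.0, 60.0, 60.0]  # Made-up accuracies per epoch.
lrs = []
for epoch, acc in enumerate(fake_valid_acc):
    optimizer.step()     # Placeholder for the training part of an epoch.
    scheduler.step(acc)  # Scheduler step after the optimizer update.
    lrs.append(optimizer.param_groups[0]["lr"])
    print(f"Epoch {epoch + 1}: validation accuracy {acc:.1f}%, LR {lrs[-1]:.4f}")
```

With these made-up accuracies, the LR stays at 0.1 while the accuracy improves and drops to about 0.01 once the accuracy has stalled for more than `patience` epochs.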

We define all of this functionality in the `train_model` function below. 

In [3]:
def train_model(model, num_epochs, train_loader,
                valid_loader, test_loader, optimizer,
                device, logging_interval=50,
                scheduler=None):
    """
    Train your model.
    
    Params
    ------
    model : torch.nn.Module
            model to train
    num_epochs : int
                 number of epochs to train
    train_loader : torch.utils.data.DataLoader
                   training dataloader
    valid_loader : torch.utils.data.DataLoader
                   validation dataloader
    test_loader : torch.utils.data.DataLoader
                  testing dataloader
    optimizer : torch.optim.Optimizer
                optimizer to use
    device : torch.device
             device to train on
    logging_interval : int
                       logging interval
    scheduler : torch.optim.lr_scheduler.<scheduler>
                optional learning rate scheduler
                
    Returns
    -------
    [float] : loss history
    [float] : training accuracy history
    [float] : validation accuracy history
    """
    ## start = ... # Start timer to measure training time.

    # Initialize history lists for loss, training accuracy, and validation accuracy.
    loss_history, train_acc_history, valid_acc_history = [], [], []

    # ACTUAL TRAINING STARTS HERE.    
    for epoch in range(num_epochs): # Loop over epochs.

        # IMPLEMENT TRAINING LOOP HERE.
        #
        ## Set model to training mode.
        #  Thus, layers like dropout which behave differently on train and  
        #  test procedures know what is going on and can behave accordingly. 
        
        for batch_idx, (features, targets) in enumerate(train_loader): # Loop over mini batches.

            # CONVERT DATASET TO USED DEVICE.
            ## features = ...
            ## targets = ...
            #
            # FORWARD & BACKWARD PASS
            ## logits = ... # Get predictions of model with current parameters.
            ## loss = ...   # Calculate cross-entropy loss on current mini-batch.
            ## Zero out gradients.
            ## Backward pass on loss.
            ## Perform single optimization step to update model parameters via optimizer.
            #
            # LOGGING
            ## Append loss to history list.
            
            if not batch_idx % logging_interval:
                print(f'Epoch: {epoch+1:03d}/{num_epochs:03d} '
                      f'| Batch {batch_idx:04d}/{len(train_loader):04d} '
                      f'| Loss: {loss:.4f}')

        # VALIDATION STARTS HERE.
        #
        ## Set model to evaluation mode.
        
        with torch.no_grad(): # Disable gradient calculation to reduce memory consumption.
            
            # COMPUTE ACCURACY OF CURRENT MODEL PREDICTIONS ON TRAINING + VALIDATION DATASETS.
            ## train_acc = compute_accuracy(...)
            ## valid_acc = compute_accuracy(...)
            
            print(f'Epoch: {epoch+1:03d}/{num_epochs:03d} '
                  f'| Train: {train_acc :.2f}% '
                  f'| Validation: {valid_acc :.2f}%')
            
            ## APPEND ACCURACY VALUES TO CORRESPONDING HISTORY LISTS.
            
        ## elapsed = ... # Stop timer and calculate training time elapsed after epoch.
        ## Print training time elapsed after epoch.
        
        if scheduler is not None:
            # Passing a metric assumes a metric-based scheduler such as ReduceLROnPlateau.
            scheduler.step(valid_acc_history[-1])
        
    ## elapsed = ... # Stop timer and calculate total training time.
    ## Print overall training time.
    
    # FINAL TESTING STARTS HERE.
    #
    ## test_acc = compute_accuracy(...) # Compute accuracy on test data.
    ## Print test accuracy.

    ## Return history lists for loss, training accuracy, and validation accuracy.
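
The order of operations in the training loop above (forward pass, loss, zeroing gradients, backward pass, optimizer step) can be tried out on a toy problem first. The following is a minimal sketch, assuming a hypothetical linear classifier and random data rather than *AlexNet* and CIFAR10; all names and sizes are illustrative only.

```python
import torch

torch.manual_seed(0)  # Fixed seed so the toy run is repeatable.

# ASSUMPTION: toy stand-ins for the real model and data (not AlexNet / CIFAR-10).
model = torch.nn.Linear(4, 3)               # 4 input features, 3 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
features = torch.randn(8, 4)                # One mini-batch of 8 samples
targets = torch.randint(0, 3, (8,))         # Integer class labels

model.train()                               # Set model to training mode.
losses = []
for _ in range(20):                         # Repeat the same step on the same batch.
    logits = model(features)                # Forward pass
    loss = torch.nn.functional.cross_entropy(logits, targets)
    optimizer.zero_grad()                   # Zero out gradients of the previous step.
    loss.backward()                         # Backward pass
    optimizer.step()                        # Update model parameters.
    losses.append(loss.item())              # Store a plain float, not the tensor.

print(f'First loss: {losses[0]:.4f} | Last loss: {losses[-1]:.4f}')
```

Appending `loss.item()` instead of `loss` keeps the history list free of references to the computation graph.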

### 5. General settings before we are good to go
Before training your model in the actual main script, you need to configure a few general settings, such as the random seed and the training hyperparameters (HPs). 
As we want to train our model on a hardware accelerator, i.e., a GPU, we first check whether CUDA is available via `torch.cuda.is_available()`; otherwise, we fall back to the CPU. 
For reproducibility, we set all seeds to a fixed value `seed`.
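
As an illustration of the device check and seeding described above, here is a minimal sketch. Which random number generators need to be seeded depends on the libraries your script actually uses; seeding Python's `random` module in addition to `torch` is an assumption made here.

```python
import random

import torch

seed = 123

# Use the GPU if CUDA is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using {device} device.')

# Seed the random number generators used in this sketch.
random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

# Quick check: re-seeding reproduces the same random numbers.
torch.manual_seed(seed)
first = torch.rand(3)
torch.manual_seed(seed)
second = torch.rand(3)
print('Reproducible:', torch.equal(first, second))
```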

In [4]:
# SETTINGS
seed = 123 # Random seed
e = 100    # Number of epochs
b = 256    # Batch size
lr = 0.1   # Learning rate

# Get device used for training, e.g., check via torch.cuda.is_available().
## device = ...
## Print used device.
## Set all seeds to chosen random seed.

Using cpu device.


### 6. Load the data
We define some transforms for data preprocessing and use our dataloader function `get_dataloaders_cifar10` from above to load the data. 
As *AlexNet* was originally designed to classify the 256 x 256 RGB images of the ImageNet dataset, we need to make some adjustments for the very same architecture to work on the much smaller 32 x 32 RGB images of the CIFAR10 dataset. 
If you used the original 32 x 32 CIFAR10 images as inputs, the strided convolutions and pooling layers would shrink the feature maps until nothing is left at the last convolutional layer. 
To fix this problem, you can upsample the CIFAR10 samples to 64 x 64 pixels before passing them as inputs to *AlexNet*. 
The transforms below do this by resizing each image to 70 x 70 pixels and then cropping it to 64 x 64 pixels (randomly for training, centrally for validation and testing), followed by conversion to a tensor and normalization. 
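
To see why the input size matters, you can trace the spatial size of the feature maps layer by layer. The sketch below assumes a torchvision-style *AlexNet* feature extractor; the kernel sizes, strides, and paddings listed are an assumption and may differ from the model defined earlier in this notebook.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return max(0, (size + 2 * padding - kernel) // stride + 1)


# ASSUMPTION: torchvision-style AlexNet layers as (name, kernel, stride, padding).
ALEXNET_FEATURES = [
    ('conv1', 11, 4, 2), ('pool1', 3, 2, 0),
    ('conv2', 5, 1, 2), ('pool2', 3, 2, 0),
    ('conv3', 3, 1, 1), ('conv4', 3, 1, 1), ('conv5', 3, 1, 1),
    ('pool3', 3, 2, 0),
]


def trace_sizes(input_size):
    """Return the feature-map side length after each layer."""
    sizes = []
    size = input_size
    for name, kernel, stride, padding in ALEXNET_FEATURES:
        size = out_size(size, kernel, stride, padding)
        sizes.append((name, size))
    return sizes


for input_size in (32, 64):
    print(f'Input {input_size} x {input_size}:', trace_sizes(input_size))
```

Under these assumed layer settings, a 32 x 32 input ends with a side length of 0 after the final pooling step, whereas a 64 x 64 input still leaves a 1 x 1 feature map.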

In [5]:
# DATASET
# Transforms on your data allow you to take it from its source state and transform it into ready-for-training data.

# Transforms applied to training data (randomness to make network more robust against overfitting)
train_transforms = torchvision.transforms.Compose([ # Compose several transforms together.
    torchvision.transforms.Resize((70, 70)), # Upsample CIFAR-10 images to make them work with AlexNet.
    torchvision.transforms.RandomCrop((64, 64)), # Randomly crop image to make NN more robust against overfitting.
    torchvision.transforms.ToTensor(), # Convert image into torch tensor.
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) # Normalize to [-1,1] via (image-mean)/std.
                                      ])
# Transforms applied to validation and test data (no randomness for actual predictions!)
test_transforms = torchvision.transforms.Compose([ # Compose several transforms together.
    torchvision.transforms.Resize((70, 70)), # Upsample CIFAR-10 images to make them work with AlexNet.
    torchvision.transforms.CenterCrop((64, 64)), # Crop images centrally.      
    torchvision.transforms.ToTensor(), # Convert image to torch tensor.     
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) # Normalize to [-1,1] via (image-mean)/std.

# GET PYTORCH DATALOADERS FOR TRAINING, TESTING, AND VALIDATION DATASET.
## train_loader, valid_loader, test_loader = get_dataloaders_cifar10(...)

# Check loaded dataset.
for images, labels in train_loader:  
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    print('Class labels of 10 examples:', labels[:10])
    break

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
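
The `Normalize` transform used in the cell above computes `(image - mean) / std` per channel. Since `ToTensor` scales pixel values to [0, 1], a mean and standard deviation of 0.5 map them to [-1, 1]. A minimal sketch of this arithmetic on single pixel values, using a hypothetical helper `normalize`:

```python
def normalize(value, mean=0.5, std=0.5):
    """Apply the same formula as Normalize to a single pixel value."""
    return (value - mean) / std


# Darkest, mid-gray, and brightest pixel value after ToTensor.
for pixel in (0.0, 0.5, 1.0):
    print(pixel, '->', normalize(pixel))
```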



### 7. Set up your model and train it
We create an instance of `AlexNet`, move it to the `device`, and print its structure. 
Remember that to use a model, we simply call it on the input data, e.g., `model(features)`. This executes the model's `forward` method along with some background operations, such as registered hooks. Do not call `model.forward()` directly!
Furthermore, we set up an `optimizer` and `scheduler` instance as described above.
To start the training, we call our `train_model` function and pass it the model, the training HPs, the dataloaders, the optimizer, and the scheduler objects as inputs.
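
The difference between calling the model and calling `forward` directly can be made visible with a forward hook. The following is a minimal sketch, assuming a hypothetical toy module `TinyNet` rather than *AlexNet*.

```python
import torch


class TinyNet(torch.nn.Module):  # Hypothetical toy model for illustration only.
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)


model = TinyNet()
hook_calls = []

# The hook records the output shape each time it is triggered.
model.register_forward_hook(
    lambda module, inputs, output: hook_calls.append(tuple(output.shape))
)

x = torch.randn(3, 4)

model(x)                                 # Runs forward() and the registered hook.
calls_after_call = len(hook_calls)

model.forward(x)                         # Runs forward() only; the hook is skipped.
calls_after_forward = len(hook_calls)

print(calls_after_call, calls_after_forward)
```

The hook counter only increases for `model(x)`, which is why calling the model instance is the recommended way to run a forward pass.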

In [6]:
## model = ... # Build instance of AlexNet with 10 classes for CIFAR-10 and convert it to used device.
## Print model.

# Set up stochastic gradient descent optimizer using torch.optim package.
# Use a momentum of 0.9 and a learning rate of 0.1.
## optimizer = ... 

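# Set up LR scheduler: Reduce the LR by a factor of 0.1 once the monitored
# validation accuracy stops increasing (hence mode='max').
# Note: `verbose` is deprecated in recent PyTorch releases; if it triggers a warning,
# drop it and query scheduler.get_last_lr() instead.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, mode='max', verbose=True)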

# TRAIN MODEL.
## loss_list, train_acc_list, valid_acc_list = train_model(...)

# Save history lists for loss, training accuracy, and validation accuracy to files.
torch.save(loss_list, 'loss.pt')
torch.save(train_acc_list, 'train_acc.pt')
torch.save(valid_acc_list, 'valid_acc.pt')
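
To get a feeling for how `ReduceLROnPlateau` with `mode='max'` reacts to a stagnating validation accuracy, you can feed it a constant metric. The sketch below assumes a single dummy parameter and a shortened `patience=2` (the cell above uses the default patience); the accuracy values are made up for illustration.

```python
import torch

# ASSUMPTION: one dummy parameter instead of a real model.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1, patience=2
)

lrs = []
for valid_acc in [50.0, 50.0, 50.0, 50.0]:  # Made-up, non-improving accuracy
    scheduler.step(valid_acc)               # Pass the monitored metric.
    lrs.append(optimizer.param_groups[0]['lr'])

# The LR is reduced once the metric has not improved for more than
# `patience` consecutive calls to scheduler.step().
print(lrs)
```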

### Congratulations! 
You have successfully trained a deep neural network in `PyTorch`. To analyze your results visually, you can now plot the evolution of the loss, training accuracy, and validation accuracy over the course of training, e.g., with `matplotlib.pyplot`.
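
One possible way to do this is sketched below. The helper `plot_history` is hypothetical and assumes that the history lists contain plain numbers (e.g., appended via `loss.item()`), with one loss value per mini-batch and one accuracy value per epoch.

```python
import matplotlib.pyplot as plt


def plot_history(loss_list, train_acc_list, valid_acc_list):
    """Plot the loss and accuracy histories side by side and return the figure."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))

    ax_loss.plot(loss_list)
    ax_loss.set_xlabel('Mini-batch')
    ax_loss.set_ylabel('Cross-entropy loss')

    ax_acc.plot(train_acc_list, label='Training')
    ax_acc.plot(valid_acc_list, label='Validation')
    ax_acc.set_xlabel('Epoch')
    ax_acc.set_ylabel('Accuracy / %')
    ax_acc.legend()

    fig.tight_layout()
    return fig
```

With the files saved above, you could call it as `plot_history(torch.load('loss.pt'), torch.load('train_acc.pt'), torch.load('valid_acc.pt'))` and display or save the returned figure.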