# Overview of the AlexNet Unit

AlexNet can be broken down into a repeated basic unit. The AlexNet Unit can be described as follows:

<div>
<img src="./assets/AlexNetUnit.png" width = 800px>
</div>

## Local Response Normalisation (LRNorm)
AlexNet employs local response normalisation. The output $a^i_{x, y}$ of the $i$-th convolution kernel at position $(x, y)$ is normalised by the outputs of $n$ adjacent kernels at the same position, using the following formula:

$$ b^i_{x, y} = a^i_{x, y} / \biggl( k + \alpha\sum_{j = \max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} (a^j_{x, y})^2 \biggr)^{\beta} $$

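Essentially, LRNorm divides each activation by a term built from the squared activations of its $n$ neighbouring kernels, out of the $N$ convolution kernels in the layer; $k$, $n$, $\alpha$ and $\beta$ are hyperparameters. One efficient way to implement this is to 3D average pool the squared activations over the channel dimension of the tensor, as implemented in PyTorch.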
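As a rough illustration of the formula above, the sketch below applies it in plain Python to the activations of $N$ channels at a single spatial position. The function name `local_response_norm` and the sample activations are made up for this example; the defaults are the constants set in the `AlexNetUnit` class later in this notebook. Note that the PyTorch documentation for `nn.LocalResponseNorm` writes the formula with $\alpha/n$ in place of $\alpha$ (the squares are averaged rather than summed), so its output should differ slightly from this literal reading.

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Apply the LRN formula to the activations of N channels at one (x, y) position.

    `a` is a list of floats, one activation per channel (kernel).
    """
    N = len(a)
    out = []
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        # Sum of squares over the window of neighbouring channels (inclusive bounds).
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sq_sum) ** beta)
    return out


# Hypothetical activations of 6 kernels at the same spatial position.
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
normalised = local_response_norm(activations)
print(normalised)
```

Channels near the edges (here channels 0, 1, 4 and 5) have fewer than $n$ neighbours, which is what the `max` and `min` bounds in the sum account for.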

(Todo: add LRNorm animation)


## Overlapping Max Pooling

Traditional max pooling layers usually have a stride equal to the kernel size, so the pooling windows do not overlap.

In AlexNet, max pooling layers with stride = 2 and kernel size = 3 were used, so adjacent pooling windows overlap. The paper reports that, in their training experiments, models with overlapping pooling were slightly harder to overfit.
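To make the overlap concrete, here is a small sketch that lists which input indices each pooling window covers along one dimension. The helper names `pool_output_size` and `pool_windows` are invented for this example, and the sketch assumes no padding and floor rounding of the output size.

```python
def pool_output_size(size, kernel, stride):
    """Output length of a pooling layer along one dimension (no padding, floor mode)."""
    return (size - kernel) // stride + 1


def pool_windows(size, kernel, stride):
    """Input indices covered by each pooling window along one dimension."""
    n_out = pool_output_size(size, kernel, stride)
    return [list(range(i * stride, i * stride + kernel)) for i in range(n_out)]


# Non-overlapping pooling: kernel 2, stride 2 on a length-7 input.
print(pool_windows(7, kernel=2, stride=2))  # [[0, 1], [2, 3], [4, 5]]

# Overlapping pooling as in AlexNet: kernel 3, stride 2.
print(pool_windows(7, kernel=3, stride=2))  # [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
```

With kernel 3 and stride 2, each window shares one index with its neighbour (indices 2 and 4 above).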

## Overcoming 2012 GPU Constraints

AlexNet had to work around the GPU memory and compute limits of 2012. To train in parallel, the kernels in each layer of the network were split equally between 2 GPUs.

Unless they are explicitly concatenated, the tensors on each GPU are passed to the next layer on the same GPU and do not interact with the other GPU.

<div>
<img src="./assets/AlexNet_Parallel.png" width = 800px>
</div>
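One consequence of this split is that a kernel on one GPU only sees half of the previous layer's channels. The sketch below counts the weights of such a layer; `conv_weight_count` is a helper invented for this example, and its `groups` parameter is meant to mirror the `groups` argument of `nn.Conv2d`, which is probably the closest single-device analogue of the two-GPU split. The layer sizes are those of the second convolution layer in the original paper (256 kernels of size 5x5x48).

```python
def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights (excluding biases) in a 2D convolution with square kernels."""
    assert in_channels % groups == 0 and out_channels % groups == 0
    # Each kernel only sees the input channels that belong to its own group.
    return out_channels * (in_channels // groups) * kernel_size * kernel_size


# Second convolution layer of the original AlexNet: 96 -> 256 channels, 5x5 kernels.
full = conv_weight_count(96, 256, 5)             # every kernel sees all 96 inputs
split = conv_weight_count(96, 256, 5, groups=2)  # each kernel sees only 48 inputs

print(full, split)  # 614400 307200
```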

In [0]:
import torch
import torch.nn as nn

In [0]:
# This is the abstraction for an AlexNet Unit: Conv2d -> ReLU -> optional LRNorm -> optional overlapping max pool
# It takes a single input tensor; the 2-GPU split of the original network is not reproduced here
class AlexNetUnit(torch.nn.Module):
    _NORM_N = 5
    _NORM_k = 2
    _NORM_a = 10**-4
    _NORM_b = 0.75
    _POOL_stride = 2
    _POOL_kernel = 3

    def __init__(self, in_channels, out_channels, kernel_size, 
        stride, padding, hasLRNorm = False, hasOLPool = False):
        super().__init__()
        self.hasLRNorm = hasLRNorm
        self.hasOLPool = hasOLPool

        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        self.relu = nn.ReLU()

        if self.hasLRNorm:
            self.lrNorm = nn.LocalResponseNorm(self._NORM_N, self._NORM_a, self._NORM_b, self._NORM_k)
        
        if self.hasOLPool:
            self.olPool = nn.MaxPool2d(self._POOL_kernel, self._POOL_stride)
    
    def forward(self, input):
        out = self.conv(input)
        out = self.relu(out)
        
        if self.hasLRNorm:
            out = self.lrNorm(out)
        if self.hasOLPool:
            out = self.olPool(out)
        
        return out

## The Original AlexNet Structure

The original AlexNet has 5 AlexNet Units, 2 fully connected layers, and 1 output layer. The channels of each layer are split in half so that training runs in parallel over 2 GPUs. A condensed diagram of the network is shown below:

<div>
<img src="./assets/AlexNet-2012.png" width = 800px>
</div>

## A Smaller AlexNet for CIFAR-10

As a demonstration, we will train a simplified AlexNet with half the network size. We will also not be training on the 256x256 RGB images in ImageNet. Instead, we will be training on the smaller 32x32 RGB images of the CIFAR-10 dataset.

In [0]:
class AlexNet(torch.nn.Module):
    def __init__(self, nClass):
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(
            AlexNetUnit(3, 48, (7,7), 1, 2, True, True),
            AlexNetUnit(48, 128, (5,5), 1, 2, True, True),
            AlexNetUnit(128, 192, (3,3), 1, 1),
            AlexNetUnit(192, 128, (3,3), 1, 1, hasOLPool=True)
        ) 

        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(128*2*2, 2048),
            nn.ReLU(inplace = True),
            nn.Linear(2048, 2048),
            nn.Dropout(0.5),
            nn.ReLU(inplace = True),
            nn.Linear(2048, nClass)
        )
        
    def forward(self, input):
        out = self.features(input)
        out = out.reshape(-1, 128*2*2)
        out = self.classifier(out)

        return out
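The flattened size `128*2*2` used in the classifier follows from the spatial size of the feature map after the four units. As a quick check, the sketch below traces that size for a 32x32 input using the standard output-size formulas for convolution and pooling; the helper names `conv_out` and `pool_out` are made up for this example.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=3, stride=2):
    """Output length of the overlapping max pool (no padding, floor mode)."""
    return (size - kernel) // stride + 1


size = 32                                 # CIFAR-10 images are 32x32
size = pool_out(conv_out(size, 7, 1, 2))  # unit 1: 32 -> 30 -> 14
size = pool_out(conv_out(size, 5, 1, 2))  # unit 2: 14 -> 14 -> 6
size = conv_out(size, 3, 1, 1)            # unit 3: 6 -> 6 (no pooling)
size = pool_out(conv_out(size, 3, 1, 1))  # unit 4: 6 -> 6 -> 2

flat_features = 128 * size * size
print(size, flat_features)  # 2 512
```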



## CIFAR-10 Dataset

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

Source: [https://www.cs.toronto.edu/~kriz/cifar.html](https://www.cs.toronto.edu/~kriz/cifar.html)

![CIFAR-10 Dataset Sample](./assets/CIFAR10.png)




In [0]:
from torch.optim import Adam
from torch.nn.init import kaiming_normal_, normal_
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor


In [0]:
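def initWeight(unit):
    # Applied to every submodule via model.apply(); only layers with weights are touched
    if isinstance(unit, (torch.nn.Linear, torch.nn.Conv2d)):
        # He (Kaiming) normal initialisation, suited to ReLU activations
        kaiming_normal_(unit.weight, nonlinearity='relu')
        # Biases are drawn from a standard normal distribution (mean 0, std 1)
        normal_(unit.bias)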
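For a sense of scale, `kaiming_normal_` with its default `fan_in` mode and `nonlinearity='relu'` draws weights with standard deviation $\sqrt{2 / \text{fan\_in}}$. The sketch below evaluates that for two layers of the smaller AlexNet; `kaiming_normal_std` is a helper written for this example, not a PyTorch function.

```python
import math


def kaiming_normal_std(fan_in, gain=math.sqrt(2.0)):
    """Standard deviation used by Kaiming normal init in fan_in mode (ReLU gain by default)."""
    return gain / math.sqrt(fan_in)


# First convolution: 3 input channels, 7x7 kernels -> fan_in = 3 * 7 * 7 = 147.
conv1_std = kaiming_normal_std(3 * 7 * 7)

# First fully connected layer: 128 * 2 * 2 = 512 input features.
fc1_std = kaiming_normal_std(128 * 2 * 2)

print(round(conv1_std, 4), fc1_std)  # 0.1166 0.0625
```

By comparison, `normal_` with its default arguments draws the biases with standard deviation 1, which is considerably larger than these weight scales.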
        

In [0]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

hyperparam = {
    'nEpoch': 30,
    'batchSize': 256
}

In [7]:
from torchvision.datasets import CIFAR10

trainset = CIFAR10(
    root = "./data",
    train = True,
    download = True,
    transform = ToTensor()
)

testset = CIFAR10(
    root = "./data",
    train = False,
    download = True,
    transform = ToTensor()
)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz



Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified


In [0]:
trainloader = DataLoader(trainset, batch_size=hyperparam['batchSize'], shuffle=True, num_workers=0)
# Shuffling is not needed for evaluation; accuracy does not depend on batch order
testloader = DataLoader(testset, batch_size=hyperparam['batchSize'], shuffle=False, num_workers=0)
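The number of steps per epoch follows directly from the dataset and batch sizes. The sketch below works it out for the 50000 training and 10000 test images with a batch size of 256, assuming the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`); `batches_per_epoch` is a helper made up for this example.

```python
import math


def batches_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of batches produced when iterating once over n_samples."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)


train_steps = batches_per_epoch(50000, 256)
test_steps = batches_per_epoch(10000, 256)
last_train_batch = 50000 - (train_steps - 1) * 256

print(train_steps, test_steps, last_train_batch)  # 196 40 80
```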

In [0]:
model = AlexNet(10)
model.apply(initWeight)
model.to(device)

lossFn = nn.CrossEntropyLoss()
# Sticking to Adam's default hyperparameters from the original paper
optim = Adam(model.parameters())
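A useful reference point when reading training losses: `nn.CrossEntropyLoss` takes raw logits and computes the negative log of the softmax probability of the true class. A model that spreads its probability uniformly over 10 classes therefore has a loss of $\ln 10 \approx 2.3026$. The plain-Python sketch below shows the per-sample computation; the function `cross_entropy` and the sample logits are illustrative only.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw logits: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Uniform prediction over 10 classes: the loss equals ln(10).
uniform_loss = cross_entropy([0.0] * 10, target=3)

# A confident, correct prediction gives a loss close to 0.
confident_loss = cross_entropy([10.0] + [0.0] * 9, target=0)

print(round(uniform_loss, 4))  # 2.3026
```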

In [10]:
for epoch in range(hyperparam['nEpoch']):
    for i, (x, y) in enumerate(trainloader):
        x = x.to(device)
        y = y.to(device)
        
        # Forward pass
        outputs = model(x)
        loss = lossFn(outputs, y)
        
        # Backward pass
        optim.zero_grad()
        loss.backward()

        # Parameter update
        optim.step()

        # Console log Progress
        if (i+1) % 50 == 0 or i + 1 == len(trainloader):
            print(f'Epoch [{epoch + 1}/{hyperparam["nEpoch"]}], Step [{i+1}/{len(trainloader)}], Loss: {loss:.4f}')
                   

Epoch [1/30], Step [50/196], Loss: 2.3322
Epoch [1/30], Step [100/196], Loss: 2.2207
Epoch [1/30], Step [150/196], Loss: 2.0320
Epoch [1/30], Step [196/196], Loss: 1.9698
Epoch [2/30], Step [50/196], Loss: 1.9745
Epoch [2/30], Step [100/196], Loss: 1.9818
Epoch [2/30], Step [150/196], Loss: 1.7343
Epoch [2/30], Step [196/196], Loss: 1.8631
Epoch [3/30], Step [50/196], Loss: 1.8131
Epoch [3/30], Step [100/196], Loss: 1.6246
Epoch [3/30], Step [150/196], Loss: 1.5707
Epoch [3/30], Step [196/196], Loss: 1.6057
Epoch [4/30], Step [50/196], Loss: 1.4629
Epoch [4/30], Step [100/196], Loss: 1.6090
Epoch [4/30], Step [150/196], Loss: 1.6101
Epoch [4/30], Step [196/196], Loss: 1.5129
Epoch [5/30], Step [50/196], Loss: 1.4618
Epoch [5/30], Step [100/196], Loss: 1.5029
Epoch [5/30], Step [150/196], Loss: 1.5305
Epoch [5/30], Step [196/196], Loss: 1.4581
Epoch [6/30], Step [50/196], Loss: 1.3011
Epoch [6/30], Step [100/196], Loss: 1.4397
Epoch [6/30], Step [150/196], Loss: 1.3049
Epoch [6/30], Ste

In [11]:
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for x, y in testloader:
        x = x.to(device)
        y = y.to(device)
        logits = model(x)  # raw class scores, not probabilities
        predicted = logits.argmax(dim=1)
        total += x.size(0)
        correct += (predicted == y).sum().item()

    print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))
    

Test Accuracy of the model on the 10000 test images: 70.82 %
