In [None]:
"""
Notes from the paper:

The AlexNet paper used a convolutional neural network to win the ImageNet competition (ILSVRC) in 2012.

Goal:
Image Classification

Dataset Used:
ImageNet-1000
ImageNet is a dataset of over 15 million labelled images in roughly 22,000 categories. The images are high-resolution relative to MNIST (28 x 28); their sizes vary, and the paper downsamples them to 256 x 256.
The 1000-category subset was used for this paper.

Method Used:
Convolution layers, occasionally followed by max-pooling layers. The final layers are fully connected layers, with Dropout layers in between.
Ends with a 1000-way softmax layer.

Convolution dimension calculation:
https://madebyollin.github.io/convnet-calculator/

Architecture:
Input (3, 256, 256)
Note: the paper feeds 224 x 224 crops to the network. The output dims below are for the 256 x 256 input used in this notebook, with fractional sizes rounded down as PyTorch does.
The paper splits every layer evenly across two GPUs, so each GPU holds half of the layer's filters or neurons.
- Convolutional Layer 1 (96 filters in total)
    GPU1 - (48 filters, 11 x 11, stride 4, padding 0) -> (output dim: (48, 62, 62))
    GPU2 - (48 filters, 11 x 11, stride 4, padding 0) -> (output dim: (48, 62, 62))
- Max Pooling Layer 1
    GPU1 - (3 x 3, stride 2) -> (output dim: (48, 30, 30))
    GPU2 - (3 x 3, stride 2) -> (output dim: (48, 30, 30))
- Convolutional Layer 2 (256 filters in total)
    GPU1 - (128 filters, 5 x 5, stride 1, padding 2) -> (output dim: (128, 30, 30))
    GPU2 - (128 filters, 5 x 5, stride 1, padding 2) -> (output dim: (128, 30, 30))
- Max Pooling Layer 2
    GPU1 - (3 x 3, stride 2) -> (output dim: (128, 14, 14))
    GPU2 - (3 x 3, stride 2) -> (output dim: (128, 14, 14))
- Convolutional Layer 3 (384 filters in total, connected to the maps on both GPUs)
    GPU1 - (192 filters, 3 x 3, stride 1, padding 1) -> (output dim: (192, 14, 14))
    GPU2 - (192 filters, 3 x 3, stride 1, padding 1) -> (output dim: (192, 14, 14))
- Convolutional Layer 4 (384 filters in total)
    GPU1 - (192 filters, 3 x 3, stride 1, padding 1) -> (output dim: (192, 14, 14))
    GPU2 - (192 filters, 3 x 3, stride 1, padding 1) -> (output dim: (192, 14, 14))
- Convolutional Layer 5 (256 filters in total)
    GPU1 - (128 filters, 3 x 3, stride 1, padding 1) -> (output dim: (128, 14, 14))
    GPU2 - (128 filters, 3 x 3, stride 1, padding 1) -> (output dim: (128, 14, 14))
- Max Pooling Layer 3
    GPU1 - (3 x 3, stride 2) -> (output dim: (128, 6, 6))
    GPU2 - (3 x 3, stride 2) -> (output dim: (128, 6, 6))
- Fully Connected Layer 1 (4096 neurons in total)
    GPU1 - (2048 neurons) -> (output dim: (2048))
    GPU2 - (2048 neurons) -> (output dim: (2048))
- Fully Connected Layer 2 (4096 neurons in total)
    GPU1 - (2048 neurons) -> (output dim: (2048))
    GPU2 - (2048 neurons) -> (output dim: (2048))
- Fully Connected Layer 3
    1000 Neurons -> (output dim: (1000))
- Softmax Layer
    1000-way softmax over the Fully Connected Layer 3 outputs -> (output dim: (1000))

Instead of using 2 GPUs, the implementation in this notebook merges the two halves into a single stream that holds the total filter count of each layer.

Keep in mind this:

```
Now we are ready to describe the overall architecture of our CNN. As depicted in Figure 2, the net
contains eight layers with weights; the first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces
a distribution over the 1000 class labels. Our network maximizes the multinomial logistic regression
objective, which is equivalent to maximizing the average across training cases of the log-probability
of the correct label under the prediction distribution.
The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel
maps in the previous layer which reside on the same GPU (see Figure 2). The kernels of the third
convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully-connected layers are connected to all neurons in the previous layer. Response-normalization layers
follow the first and second convolutional layers. Max-pooling layers, of the kind described in Section
3.4, follow both response-normalization layers as well as the fifth convolutional layer. The ReLU
non-linearity is applied to the output of every convolutional and fully-connected layer.
```

Training Parameters / Hyperparameters:
- Data Augmentation: Random 224x224 patches are cropped from the 256x256 images and randomly mirrored horizontally.
    - At test time, the image is resized to 256x256, then five 224x224 patches (the four corners and the center) are cropped and mirrored, giving 10 patches. The network is run on all of them, and the final prediction is the average of the 10 softmax outputs.
- They wrote a CUDA ConvNet implementation from scratch to train the network. BASED
- SGD with momentum 0.9 and weight decay 0.0005
- Batch Size: 128

Metrics Defined:
Error Rate
- Number of misclassified test samples / Total number of test samples

Top-1 vs top-5 error rate
- Top-1 error rate is the fraction of test samples for which the correct label is not the model's highest-scoring prediction
- Top-5 error rate is the fraction of test samples for which the correct label is not among the 5 highest-scoring predictions

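Results (these figures are from the ILSVRC-2010 test set):
- Top-1 error rate: 37.5%
- Top-5 error rate: 17.0%
- The ILSVRC-2012 competition entry reached a top-5 test error rate of 15.3%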
"""

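The top-1 and top-5 error rates defined in the notes can be computed directly from a batch of class scores. The following is a minimal sketch, not the paper's evaluation code; `topk_error` is a hypothetical helper name, and the scores and labels are made-up toy values.

```python
import torch


def topk_error(logits: torch.Tensor, targets: torch.Tensor, k: int = 1) -> float:
    # Fraction of samples whose true label is NOT among the k highest-scoring classes.
    # logits: (N, num_classes) scores, targets: (N,) integer class labels.
    topk = logits.topk(k, dim=1).indices                 # (N, k) class indices
    hit = (topk == targets.unsqueeze(1)).any(dim=1)      # (N,) True if label is in the top k
    return 1.0 - hit.float().mean().item()


# Toy example: 3 samples, 3 classes (values are illustrative only)
logits = torch.tensor([[0.1, 0.9, 0.0],
                       [0.8, 0.15, 0.05],
                       [0.2, 0.3, 0.5]])
targets = torch.tensor([1, 1, 0])

top1 = topk_error(logits, targets, k=1)
top2 = topk_error(logits, targets, k=2)
print(top1, top2)
```

With 1000 classes the same function gives the top-5 error by passing `k=5`.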

# Model Architecture
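The spatial output sizes of the convolution and pooling layers can be checked by hand with the usual formula, floor((size + 2 * padding - kernel) / stride) + 1. The sketch below applies it to the kernel, stride, and padding values from the notes; `conv_out` and `trace` are hypothetical helper names, and the rounding assumes PyTorch's defaults (dilation 1, `ceil_mode=False`).

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    # Output length along one spatial axis for a convolution or pooling window
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) for every layer that changes or keeps the spatial size
LAYERS = [
    ("conv1", 11, 4, 0), ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2), ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1), ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]


def trace(size: int) -> dict:
    # Follow one spatial dimension through the whole feature extractor
    sizes = {}
    for name, kernel, stride, padding in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        sizes[name] = size
    return sizes


sizes_256 = trace(256)
sizes_224 = trace(224)
print(sizes_256)
print(sizes_224)
```

For a 256 x 256 input the last pooling layer comes out at 6 x 6, so the first fully connected layer sees 256 * 6 * 6 = 9216 features. With these padding values a 224 x 224 crop ends at 5 x 5 instead, so the first fully connected layer has to be sized for whichever input resolution is actually fed to the network.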

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Single-stream AlexNet: the paper's two GPU halves are merged, so each layer holds the
# total filter count. Local response normalization and dropout are not implemented here.
class AlexNet(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Conv2d takes 2D kernel sizes; the kernel depth comes from in_channels
        self.conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=0)
        self.conv2 = nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2)
        self.conv3 = nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1)
        self.conv4 = nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1)
        self.conv5 = nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1)
        # For a 256 x 256 input the last pooling layer outputs (256, 6, 6)
        self.fc1 = nn.Linear(256 * 6 * 6, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, 1000)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = F.relu(self.conv5(x))
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        x = torch.flatten(x, 1)  # (N, 256 * 6 * 6), keeps the batch dimension
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # Raw class scores; the softmax is applied by the loss (e.g. cross entropy)
        x = self.fc3(x)
        return x
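The quoted passage says that conv2, conv4, and conv5 only see the kernel maps on their own GPU, that response normalization follows conv1 and conv2, and the notes mention dropout between the fully connected layers. One way to approximate this on a single device is sketched below. It is an assumption-laden variant, not the authors' implementation: `AlexNetGrouped` is a hypothetical name, `groups=2` stands in for the two-GPU split, the normalization constants (size 5, alpha 1e-4, beta 0.75, k 2) are the ones reported in the paper, and PyTorch's `LocalResponseNorm` scales alpha slightly differently from the paper's formula.

```python
import torch
import torch.nn as nn


class AlexNetGrouped(nn.Module):
    # Hypothetical single-device variant: groups=2 mimics the per-GPU connectivity
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Each group of 128 filters sees only 48 of the 96 input maps
            nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # conv3 is connected to all maps of conv2, so no grouping here
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            # 256 * 6 * 6 assumes a 256 x 256 input
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)


model = AlexNetGrouped().eval()  # eval() disables dropout for the shape check
with torch.no_grad():
    out = model(torch.randn(1, 3, 256, 256))
print(out.shape)
```

Grouping halves the input depth of each grouped kernel (for example 5 x 5 x 48 in conv2), which matches the kernel sizes the paper reports for its two-GPU layout.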

In [None]:
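# Fall back to the CPU when no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AlexNet().to(device)
image = torch.randn(1, 3, 256, 256, device=device)
output = model(image).cpu()
print(output.size()) # torch.Size([1, 1000]) as expected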
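The training settings from the notes (SGD, momentum 0.9, weight decay 0.0005, batch size 128) map onto `torch.optim.SGD` as sketched below. This is a minimal sketch under stated assumptions: `make_optimizer` and `train_step` are hypothetical helpers, the learning rate of 0.01 is the paper's initial value, and a tiny stand-in model with random data replaces AlexNet and ImageNet so the example runs quickly. Cross entropy is used because it is the multinomial logistic regression objective mentioned in the quoted passage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_optimizer(model: nn.Module, lr: float = 0.01) -> torch.optim.SGD:
    # Momentum and weight decay values are the ones listed in the notes
    return torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=0.0005)


def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               images: torch.Tensor, labels: torch.Tensor) -> float:
    # One SGD update on a single mini-batch; returns the loss before the update
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
# Stand-in model and data (assumptions for illustration, not AlexNet or ImageNet)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
optimizer = make_optimizer(toy_model)
images = torch.randn(128, 3, 8, 8)       # batch size 128, as in the notes
labels = torch.randint(0, 10, (128,))

losses = [train_step(toy_model, optimizer, images, labels) for _ in range(20)]
print(losses[0], losses[-1])
```

Swapping `toy_model` for the network defined above and the random tensors for batches from a `DataLoader` gives the real training loop.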

# Getting ImageNet

In [None]:
!mkdir -p ./data && mkdir -p ./data/Imagenet
!wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_devkit_t12.tar.gz -O ./data/Imagenet/ILSVRC2012_devkit_t12.tar.gz
# !tar -xvf ./data/Imagenet/ILSVRC2012_devkit_t12.tar.gz
!wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar -O ./data/Imagenet/ILSVRC2012_img_train.tar
# !tar -xvf ./data/Imagenet/ILSVRC2012_img_train.tar
!wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar -O ./data/Imagenet/ILSVRC2012_img_val.tar
# !tar -xvf ./data/Imagenet/ILSVRC2012_img_val.tar

# Run the following in a tmux session on the server to download in the background
#mkdir -p ./data && mkdir -p ./data/Imagenet && wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_devkit_t12.tar.gz -O ./data/Imagenet/ILSVRC2012_devkit_t12.tar.gz && tar -xvf ./data/Imagenet/ILSVRC2012_devkit_t12.tar.gz && wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar -O ./data/Imagenet/ILSVRC2012_img_train.tar && tar -xvf ./data/Imagenet/ILSVRC2012_img_train.tar && wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar -O ./data/Imagenet/ILSVRC2012_img_val.tar && tar -xvf ./data/Imagenet/ILSVRC2012_img_val.tar

In [None]:
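from torchvision.datasets import ImageNet

# torchvision cannot download ImageNet itself: it expects the devkit, train, and val
# archives fetched above to sit in `root` and unpacks them on first use.
train_data = ImageNet(root='./data/Imagenet', split='train')
val_data = ImageNet(root='./data/Imagenet', split='val')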
