# Convolutional Neural Networks: InceptionNet

In this notebook, we will design and train a simplified InceptionNet on the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html), which is readily available through [`torchvision.datasets`](https://pytorch.org/vision/stable/datasets.html#torchvision.datasets.CIFAR10). 
Note that InceptionNet was originally trained on the ImageNet dataset, which is much larger and requires more computational resources.
**To prevent overfitting, we will scale down the InceptionNet and introduce data augmentation.**


*Below is the description of this dataset, as given on the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) website.*

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

Here are the classes in the dataset, as well as 10 random images from each:

<img src="img/cifar.png" width=500>

The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.

### Setup

For this tutorial, we will need the following Python packages:

1. numpy
2. matplotlib
3. torch
4. torchvision

Please follow the instructions [here](https://pytorch.org/) to install the last two libraries. 

In [None]:
# basic imports
import numpy as np
import matplotlib.pyplot as plt
import math
import pathlib

import torch
import torch.nn as nn
import torchvision
from torchvision import transforms

from matplotlib.lines import Line2D

from collections import defaultdict

# fix seed for reproducibility 
rng = np.random.RandomState(1)
torch.manual_seed(rng.randint(np.iinfo(int).max))

# it is a good practice to define `device` globally
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU:", device)
else:
    device = torch.device("cpu")
    print("No GPU -> using CPU:", device)


### Data

PyTorch has simple-to-use functions that download and load datasets. We will use these functions to streamline our deep learning pipeline.

Check out other image datasets at [torchvision.datasets](https://pytorch.org/vision/stable/datasets.html).

In [None]:
# download data in data folder. It will create this folder if it doesn't exist
torchvision.datasets.CIFAR10(root="./data/", download=True)

### Explore Data

We will carry out the same simple investigations as in the denoising autoencoder practical. 
Specifically, we are interested in the following:

What does the data look like?

- What is the `type` of data?
- What does each element of data represent?
- What are the constituent parts of each element?
- How is the image represented?
- What do we use to plot an image?
- How do we use the image in our model?
- What is the range of input data? Do we need to normalize it? 

In [None]:
# load data 
data = torchvision.datasets.CIFAR10(root="./data/", train=True) # only load training data

For ease of reference, we will create a dictionary mapping each category number to its category label.

In [None]:
CATEGORY_MAPPING = {
    0: "airplane",
    1: "automobile",
    2: "bird",
    3: "cat",
    4: "deer",
    5: "dog",
    6: "frog",
    7: "horse",
    8: "ship",
    9: "truck",
}

In [None]:
# print("Type of data is the class\n", type(data))
# print("\nEach element of the data is\n", type(data[0]))
# print("\nA single element is\n", data[0],"\n\nfirst element is the image and the second element is the category")
# print(f"\nImage size in array form is accessible through data.data[idx].shape e.g.: {data.data[0].shape}")

# what else do you want to know about data?

In [None]:
# original sized image as displayed in jupyter notebook
data[1][0]

### Image data

We will use the `.data` attribute of `torchvision.datasets.CIFAR10` to access the images as a NumPy array of shape `(N, 32, 32, 3)` with `uint8` values.
We will use [`torchvision.transforms`](https://pytorch.org/vision/stable/transforms.html) to transform the data into the form expected by our CNN. 
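
As a rough illustration of what such a transform does, the sketch below mimics the layout and range change performed by `transforms.ToTensor` using only NumPy. The image here is synthetic random data (an assumption made so the snippet runs without downloading anything), and `rng_demo`, `img_hwc`, and `img_chw` are illustrative names.

```python
import numpy as np

# Synthetic stand-in for one CIFAR-10 image:
# height x width x channels, uint8 values in [0, 255].
rng_demo = np.random.RandomState(0)
img_hwc = rng_demo.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Same layout and range change that transforms.ToTensor performs:
# HWC uint8 in [0, 255] -> CHW float32 in [0, 1].
img_chw = img_hwc.transpose(2, 0, 1).astype(np.float32) / 255.0

print(img_hwc.shape, img_hwc.dtype)
print(img_chw.shape, img_chw.dtype)
```

The channel-first layout is what `nn.Conv2d` expects, while `plt.imshow` expects the channel-last layout stored in `data.data`.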

In [None]:
idx = 100

img = data.data[idx]

# What is the range of data? 
# print(f"Min value: {img.min()} \t Max value:{img.max()}")

# how to display the image 
plt.imshow(img)
print(f"Category number: {data[idx][1]} \t Category label: {CATEGORY_MAPPING[data[idx][1]]}")

## Visualize 

In [None]:
reverse_cat_map = defaultdict(list)
for idx, (_, c) in enumerate(data):
    reverse_cat_map[c].append(idx)

In [None]:
# visualize
n_samples = len(data)

ncats = np.random.choice(list(CATEGORY_MAPPING.keys()), size=5, replace=False) # 5 distinct categories, chosen at random
ncols = 10 # images per category
fig, axs = plt.subplots(nrows=len(ncats), ncols=ncols, figsize=(12,6), dpi=100)

for row, cat in enumerate(ncats):
    idxs = reverse_cat_map[cat]
    idxs = np.random.choice(idxs, size=ncols, replace=False) # distinct images within a row
    
    # title 
    axs[row][0].set_ylabel(CATEGORY_MAPPING[cat], fontweight="bold", rotation=0, labelpad=50)
    for i in range(ncols):
        ax = axs[row][i]
        ax.imshow(data[idxs[i]][0])
        ax.axis('off')
    axs[row][0].axis('on')
    axs[row][0].xaxis.set_ticks([])
    axs[row][0].yaxis.set_ticks([])

_ = fig.suptitle("CIFAR-10 dataset")

## Inception Block

In this tutorial, instead of reproducing GoogLeNet exactly, we will focus on the basics of InceptionNet. 
Specifically, we will learn how to build an Inception block and stack several of these blocks to form an InceptionNet. 

As a refresher, the inception block described by Szegedy et al. consists of parallel 1x1, 3x3, and 5x5 convolutions along with a max-pooling layer, whose outputs are concatenated along the channel dimension. 
This allows the network to process the image with several receptive field sizes at once. 


<img src="img/inception_block.svg" width=1000>

[[Szegedy et al. 2014] Going deeper with convolutions](https://arxiv.org/pdf/1409.4842.pdf)
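
For the branch outputs to be combined, every branch has to produce the same spatial size. The sketch below works through the standard output-size formula for convolution and pooling layers (without dilation) to show which padding keeps a 32 x 32 input unchanged for each kernel size. The helper name `conv_out_size` is illustrative and not part of PyTorch.

```python
def conv_out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1


n = 32

# kernel size -> padding that preserves the spatial size at stride 1
same_padding = {1: 0, 3: 1, 5: 2}
sizes = {k: conv_out_size(n, k, stride=1, padding=p) for k, p in same_padding.items()}

# a 3x3 max pool with stride 1 and padding 1 also preserves the size
pool_size = conv_out_size(n, kernel=3, stride=1, padding=1)

# with stride 2 instead, the same 3x3 window halves the size
halved = conv_out_size(n, kernel=3, stride=2, padding=1)

print(sizes, pool_size, halved)
```

In general, an odd kernel size `k` at stride 1 needs padding `(k - 1) // 2` to preserve the input size.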

In [None]:
class InceptionBlock(nn.Module):
    """
    Implements a basic inception block 
    
    Args:
        c_in (int): Number of channels in the input 
        c_reduce (dict): dictionary mapping dimension reduction for each of the 3x3 and 5x5 convolutions. Keys are '3x3' and '5x5'.
        c_out (dict): dictionary mapping final number of channels for each of the convolutions. Keys are '1x1', '3x3', '5x5', and 'maxpool'
    """
    def __init__(self, c_in, c_reduce, c_out):
        super(InceptionBlock, self).__init__()
        
        # 1x1
        self.conv1x1 = nn.Sequential(         
            ### YOUR CODE HERE
            ### Use conv, batchnorm, activation
        )
        
        # 3x3 
        self.conv3x3 = nn.Sequential(
            ### YOUR CODE HERE
            ### Use conv, batchnorm, activation along with dimension reduction using 1x1 conv as above
        )
        
        # 5x5 
        self.conv5x5 = nn.Sequential(
            ### YOUR CODE HERE
            ### Use conv, batchnorm, activation along with dimension reduction using 1x1 conv as above
        )
        
        # max-pool
        self.pool = nn.Sequential(
            ### YOUR CODE HERE
        )
    
    def forward(self, x):
        ### YOUR CODE HERE
        # x_out = 
        return x_out
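
As a hint for the exercise above, here is a minimal sketch of the two ideas the block relies on: a 1x1 convolution that reduces channels before a larger convolution, and concatenation of branch outputs along the channel dimension. `TwoBranchToy` is a hypothetical toy module with only two branches and arbitrary channel counts. It is not the intended solution.

```python
import torch
import torch.nn as nn


class TwoBranchToy(nn.Module):
    """Hypothetical two-branch block: a 1x1 branch and a reduced 3x3 branch."""

    def __init__(self, c_in, c_1x1, c_3x3_reduce, c_3x3):
        super().__init__()
        # 1x1 branch: conv -> batchnorm -> activation
        self.branch1x1 = nn.Sequential(
            nn.Conv2d(c_in, c_1x1, kernel_size=1),
            nn.BatchNorm2d(c_1x1),
            nn.ReLU(),
        )
        # 3x3 branch: 1x1 channel reduction first, then the 3x3 convolution.
        # padding=1 keeps the spatial size unchanged.
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(c_in, c_3x3_reduce, kernel_size=1),
            nn.BatchNorm2d(c_3x3_reduce),
            nn.ReLU(),
            nn.Conv2d(c_3x3_reduce, c_3x3, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_3x3),
            nn.ReLU(),
        )

    def forward(self, x):
        # concatenate along dim=1, the channel dimension of an NCHW tensor
        return torch.cat([self.branch1x1(x), self.branch3x3(x)], dim=1)


toy = TwoBranchToy(c_in=8, c_1x1=4, c_3x3_reduce=2, c_3x3=6)
out = toy(torch.randn(2, 8, 32, 32))
print(out.shape)
```

The number of output channels is the sum of the branch outputs (here 4 + 6), which is the `c_in` the next block has to accept.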

## (Shallow) InceptionNet 

Szegedy et al. used the following configuration for GoogLeNet, which has about 7M parameters. Its inputs are ImageNet images of size 3 x 224 x 224.

<img src="img/inception-net.png" width=1000>

[[Szegedy et al. 2014] Going deeper with convolutions](https://arxiv.org/pdf/1409.4842.pdf)

Since we are training on the smaller CIFAR-10 dataset with images of size 3 x 32 x 32, we will use a smaller InceptionNet, configured as described below.


### YOUR MODEL DETAILS HERE

We will **reduce the number of output channels in the above table by a factor of 4** and **restrict the depth to 1 for all blocks**. This results in the following configuration (**fill in the ?**):

**Input size: 32 x 32 x 3**


|      type     	| patch size <br>/ stride 	|  output <br>  size  	| depth 	| # 1x1 	|  #3x3 <br>reduce 	| #3x3 	|  #5x5 <br>reduce 	| #5x5 	| pool<br>proj 	|
|:-------------:	|:-----------------------:	|:-------------------:	|:-----:	|:-----:	|:----------------:	|:----:	|:----------------:	|:----:	|:------------:	|
|  convolution  	|          3x3/?          	|     32 x 32 x ?    	|   1   	|   -   	|         -        	|   -  	|         -        	|   -  	|       -      	|
|  convolution  	|          ?x?/1          	|     32 x 32 x ?    	|   1   	|   -   	|         -        	|   -  	|         -        	|   -  	|       -      	|
| inception(3a) 	|            -            	|     32 x 32 x ?    	|   1   	|   ?  	|        ?        	|  ?  	|         ?        	|   ?  	|       8      	|
| inception(3b) 	|            -            	|     32 x 32 x ?    	|   1   	|   ?  	|        ?        	|  ?  	|         ?        	|  ?  	|      16      	|
|    maxpool    	|          3x3/?          	|    16 x 16 x ?    	|   -   	|   -   	|         -        	|   -  	|         -        	|   -  	|       -      	|
| inception(4a) 	|                         	|    16 x 16 x ?    	|   1   	|   ?  	|        ?        	|  ?  	|         ?        	|  ?  	|      16      	|
| inception(4b) 	|                         	|    16 x 16 x ?    	|   1   	|   ?  	|        ?        	|  ?  	|         ?        	|  ?  	|      16      	|
| inception(4c) 	|                         	|    16 x 16 x ?    	|   1   	|   ?  	|        ?        	|  ?  	|         ?        	|  ?  	|      16      	|
| inception(4d) 	|                         	|    16 x 16 x ?    	|   1   	|   ?  	|        ?        	|  ?  	|         ?        	|  ?  	|      16      	|
| inception(4e) 	|                         	|    16 x 16 x ?    	|   1   	|   ?  	|        ?        	|  ?  	|         ?        	|  ?  	|      32      	|
|    maxpool    	|          ?x?/2          	|     8 x 8 x ?     	|   -   	|   -   	|         -        	|   -  	|         -        	|   -  	|       -      	|
| inception(5a) 	|                         	|     8 x 8 x ?     	|   1   	|   ?  	|        ?        	|  ?  	|         ?        	|  ?  	|      32      	|
| inception(5b) 	|                         	|     8 x 8 x ?     	|   1   	|   ?  	|        ?        	|  ?  	|        ?        	|  ?  	|      32      	|
|    avg pool   	|          ?x?/1          	|     1 x 1 x ?     	|       	|       	|                  	|      	|                  	|      	|              	|
|  dropout(40%) 	|                         	|     1 x 1 x ?     	|       	|       	|                  	|      	|                  	|      	|              	|
|     linear    	|                         	|     1 x 1 x ?     	|       	|       	|                  	|      	|                  	|      	|              	|
|    softmax    	|                         	|      1 x 1 x ?     	|       	|       	|                  	|      	|                  	|      	|              	|
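
As a worked example of the scaling rule, the sketch below divides one row by 4 and derives the block's output channel count. The `googlenet_3a` values are the inception(3a) row as I read it from the GoogLeNet table in Szegedy et al. (2014), so please check them against the figure above. The dictionary keys are illustrative names.

```python
# inception(3a) row of the GoogLeNet table (to be checked against the figure)
googlenet_3a = {
    "1x1": 64,
    "3x3_reduce": 96,
    "3x3": 128,
    "5x5_reduce": 16,
    "5x5": 32,
    "pool_proj": 32,
}

# reduce every channel count by a factor of 4
scaled_3a = {name: c // 4 for name, c in googlenet_3a.items()}

# the block output concatenates the four branch outputs;
# the "reduce" layers are internal and do not count
branch_outputs = ("1x1", "3x3", "5x5", "pool_proj")
c_out_3a = sum(scaled_3a[name] for name in branch_outputs)

print(scaled_3a)
print(c_out_3a)
```

The scaled `pool_proj` value agrees with the 8 already filled in for inception(3a), and `c_out_3a` is the number of input channels that inception(3b) receives.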

In [None]:
class ShallowInceptionNet(nn.Module):
    def __init__(self, c_in):
        super().__init__()
        
        self.input_args = [c_in]
        
        # input convolutions
        self.input_conv = nn.Sequential(
            ## conv1 
            ### YOUR CODE HERE
            ### Use conv, batchnorm, activation
            
            ## conv2
            ### YOUR CODE HERE
            ### Use conv, batchnorm, activation
        )
        
        # inception blocks 
        self.inception_stack = nn.Sequential(
            InceptionBlock(??, c_reduce={'3x3': ??, '5x5': ??}, c_out={'1x1':??, '3x3':??, '5x5':??, 'maxpool':??}), # 3a
            InceptionBlock(??, c_reduce={'3x3': ??, '5x5':??}, c_out={'1x1':??, '3x3':??, '5x5':??, 'maxpool':??}), # 3b
            nn.MaxPool2d(kernel_size=??, stride=??, padding=??), # maxpool-1
            InceptionBlock(??, c_reduce={'3x3': ??, '5x5':??}, c_out={'1x1':??, '3x3':??, '5x5':??, 'maxpool':??}), # 4a
            InceptionBlock(??, c_reduce={'3x3': ??, '5x5':??}, c_out={'1x1':??, '3x3':??, '5x5':??, 'maxpool':??}), # 4b
            InceptionBlock(??, c_reduce={'3x3': ??, '5x5':??}, c_out={'1x1':??, '3x3':??, '5x5':??, 'maxpool':??}), # 4c
            InceptionBlock(??, c_reduce={'3x3': ??, '5x5':??}, c_out={'1x1':??, '3x3':??, '5x5':??, 'maxpool':??}), # 4d
            InceptionBlock(??, c_reduce={'3x3': ??, '5x5':??}, c_out={'1x1':??, '3x3':??, '5x5':??, 'maxpool':??}), # 4e
            nn.MaxPool2d(kernel_size=??, stride=??, padding=??), # maxpool-2
            InceptionBlock(??, c_reduce={'3x3': ??, '5x5':??}, c_out={'1x1':??, '3x3':??, '5x5':??, 'maxpool':??}), # 5a
            InceptionBlock(??, c_reduce={'3x3': ??, '5x5':??}, c_out={'1x1':??, '3x3':??, '5x5':??, 'maxpool':??}), # 5b                    
        )
        
        # output module 
        self.output_softmax = nn.Sequential(
            nn.AdaptiveAvgPool2d(output_size=(1,1)), # kernel_size and stride are automatically inferred: https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html
            nn.Dropout(0.4),
            nn.Flatten(),
            nn.Linear(??, ??),
            nn.Softmax(dim=1)
        )
    
    def forward(self, x):
        x = self.input_conv(x)
        x = self.inception_stack(x)
        x = self.output_softmax(x)
        return x
        


In [None]:
input_shape = data.data[0].shape # height, width, number of channels
print(input_shape)

We need functions that report the size and memory footprint of our model.
When comparing different models, it is important to ensure that they all have a similar number of learnable parameters, so that the comparison of their performance is fair.
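
To make the counting concrete, the sketch below compares a hand-computed parameter count for a small convolution plus batchnorm with what PyTorch reports. The layer sizes are arbitrary values chosen for illustration, and the byte count assumes PyTorch's default `float32` parameters.

```python
import torch.nn as nn

c_in, c_out, k = 3, 16, 3
layer = nn.Sequential(
    nn.Conv2d(c_in, c_out, kernel_size=k, padding=1),
    nn.BatchNorm2d(c_out),
)

# convolution: one k x k filter per (input, output) channel pair, plus one bias per output channel
n_conv = c_out * c_in * k * k + c_out
# batchnorm: one learnable scale and one learnable shift per channel
n_bn = 2 * c_out

n_counted = sum(p.numel() for p in layer.parameters())
param_bytes = sum(p.numel() * p.element_size() for p in layer.parameters())

print(n_conv + n_bn, n_counted, param_bytes)
```

Batchnorm also keeps running statistics as buffers rather than parameters, which is why a memory estimate typically adds `model.buffers()` to `model.parameters()`.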

In [None]:
def mem_size(model):
    """
    Get model size in GB (as str: "N GB")
    """
    mem_params = sum(
        [param.nelement() * param.element_size() for param in model.parameters()]
    )
    mem_bufs = sum([buf.nelement() * buf.element_size() for buf in model.buffers()])
    mem = mem_params + mem_bufs
    return f"{mem / 1e9:.4f} GB"

def num_params(model):
    """
    Print number of parameters in model's named children
    and total
    """
    s = "Number of parameters:\n"
    n_params = 0
    for name, child in model.named_children():
        n = sum(p.numel() for p in child.parameters())
        s += f"  • {name:<15}: {n}\n"
        n_params += n
    s += f"{'total':<19}: {n_params}"

    return s

def pp_model_summary(model):
    print(num_params(model))
    print(f"{'Total memory':<18} : {mem_size(model)}")

### Model size

How many parameters does your model have?

In [None]:
model = ShallowInceptionNet(c_in=input_shape[2])
pp_model_summary(model)

### Data Transformation


Standard image-dataset preprocessing involves two main components:

- **Data normalization**: It is standard practice to make sure that all input dimensions are on the same scale. We will use [`transforms.Normalize`](https://pytorch.org/vision/stable/transforms.html#conversion-transforms) for this purpose.
    
    
- **Data augmentation**: Given that an image's category stays the same if it is cropped (a little), flipped, or rotated, we want the network to be robust to these changes. Not only does this make for a better classifier, it also prevents the network from overfitting to the original dataset. Therefore, we will randomly crop the images, resize them back to the original size, and flip them horizontally. This is conveniently handled by `transforms.RandomResizedCrop` (a variant of [`transforms.RandomCrop`](https://pytorch.org/vision/stable/_modules/torchvision/transforms/transforms.html#RandomCrop) that also resizes) and [`transforms.RandomHorizontalFlip`](https://pytorch.org/vision/stable/_modules/torchvision/transforms/transforms.html#RandomHorizontalFlip). Note that we can use other transformations as well. 
    
**Note**: We don't need to apply the above augmentation transformations on the test dataset. 

We will define these transforms here for convenience. 

In [None]:
DATA_MEANS =  (data.data / 255.0).mean(axis=(0,1,2))
DATA_STD = (data.data / 255.0).std(axis=(0,1,2))

print(f"data means along three channels: {DATA_MEANS}")
print(f"data std along three channels: {DATA_STD}")

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(size=(input_shape[0], input_shape[1]), ratio=(0.95, 1.05)),
    transforms.ToTensor(),
    transforms.Normalize(mean=DATA_MEANS, std=DATA_STD)
])

test_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=DATA_MEANS, std=DATA_STD)
])
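
The effect of the normalization step can be checked without the dataset. The sketch below applies the same per-channel statistics to synthetic random images (an assumption made so the snippet runs standalone) and confirms that each channel ends up with roughly zero mean and unit standard deviation. `fake_data` and the other names are illustrative.

```python
import numpy as np

# Synthetic stand-in for data.data: N x H x W x C, uint8 values in [0, 255]
rng_demo = np.random.RandomState(0)
fake_data = rng_demo.randint(0, 256, size=(100, 32, 32, 3), dtype=np.uint8)

# scale to [0, 1], then compute one mean and one std per channel
scaled = fake_data / 255.0
means = scaled.mean(axis=(0, 1, 2))
stds = scaled.std(axis=(0, 1, 2))

# per-channel normalization, as transforms.Normalize does on CHW tensors
normalized = (scaled - means) / stds

print(normalized.mean(axis=(0, 1, 2)))
print(normalized.std(axis=(0, 1, 2)))
```

On real data the statistics should come from the training set only and then be reused unchanged for validation and test data.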

### Dataset loading

[`torchvision.datasets`](https://pytorch.org/vision/stable/datasets.html) conveniently accepts these transforms and applies them whenever an item is fetched. We then wrap the resulting dataset in a `DataLoader` to iterate over it in batches.

In [None]:
# train dataset with transforms
train_data = torchvision.datasets.CIFAR10(root="./data/", transform=train_transforms)

# split data into train and val; we will create train_dataloader at the start of every epoch
x_train, x_val = torch.utils.data.random_split(train_data, [45000, 5000]) # 90/10 train-val split
val_dataloader = torch.utils.data.DataLoader(x_val, batch_size=256, shuffle=True)

# test dataset with transforms; train=False selects the held-out test split
test_data = torchvision.datasets.CIFAR10(root="./data/", train=False, transform=test_transforms)
test_dataloader = torch.utils.data.DataLoader(test_data, batch_size=256, num_workers=4)

In [None]:
def process(model, dataloader, optimizer=None):
    n_samples = 0
    running_loss, running_acc = 0, 0
    for batch, labels in dataloader:
        # transfer to GPU if available
        batch = batch.to(device)
        labels = labels.to(device)

        n_samples += batch.shape[0]
        
        # forward pass
        ### YOUR CODE HERE
        # Compute model output
        # Compute loss
        # Compute model predictions
        
        # backward pass 
        if optimizer is not None:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        running_loss += loss.item()
        running_acc += ### YOUR CODE HERE compute accuracy from predictions
        
    return running_loss / n_samples, running_acc / n_samples
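
As a hint for the accuracy and loss bookkeeping, the sketch below shows one common way to turn class scores into predictions and per-sample averages. The scores and labels are made-up values. The snippet assumes raw, unnormalized scores (logits), because `F.cross_entropy` applies log-softmax internally, so outputs that have already passed through `nn.Softmax` would need a different treatment.

```python
import torch
import torch.nn.functional as F

# made-up scores for 4 samples and 3 classes
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 1.5],
                       [1.0, 3.0, 0.0],
                       [0.9, 0.1, 0.4]])
labels = torch.tensor([0, 2, 0, 0])

# predicted class = index of the largest score in each row
preds = logits.argmax(dim=1)
n_correct = (preds == labels).sum().item()

# summing the loss over the batch makes a later division by the
# total number of samples a true per-sample average
loss_sum = F.cross_entropy(logits, labels, reduction="sum").item()
loss_mean = F.cross_entropy(logits, labels, reduction="mean").item()

n = labels.shape[0]
print(n_correct / n, loss_sum / n, loss_mean)
```

If the loss uses the default mean reduction instead, accumulating `loss.item() * batch.shape[0]` gives the same per-sample average after dividing by `n_samples`.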
        
    

In [None]:
def train(model):
    model.to(device)
    
    # fix seed for reproducibility 
    rng = np.random.RandomState(1)
    torch.manual_seed(rng.randint(np.iinfo(int).max))

    # create a model directory to store the best model
    model_dir = pathlib.Path("./models").resolve()
    if not model_dir.exists():
        model_dir.mkdir()

    optimizer = torch.optim.Adam(model.parameters(),lr=0.001)

    epoch_size=200
    batch_size=128

    best_val_acc = 0
    train_losses, train_accs = [], []
    val_losses, val_accs = [], []
    n_epochs = 30
    no_improvement_cnt = 0
    for epoch in range(n_epochs):
        print(f"@ epoch {epoch}", end="")

        # training loss
        model.train()  # enable dropout and batchnorm statistics updates
        idxs = rng.choice(len(x_train), epoch_size * batch_size, replace=True)
        train_dataloader = torch.utils.data.DataLoader([x_train[idx] for idx in idxs], batch_size=batch_size, num_workers=4)
        train_loss, train_acc = process(model, train_dataloader, optimizer)

        # validation loss
        model.eval()  # disable dropout and use running batchnorm statistics
        with torch.no_grad():
            val_loss, val_acc = process(model, val_dataloader)

        # save the best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), model_dir / "best.ckpt")
            no_improvement_cnt = 0
        else:

            # if validation accuracy has not improved for 10 consecutive epochs, stop early
            no_improvement_cnt += 1

            if no_improvement_cnt >= 10:
                print("\nEarly stopping!")
                break

        # logging
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        train_accs.append(train_acc)
        val_accs.append(val_acc)
        print(f"\ttrain_loss: {train_loss: .5f}, train_acc:{100*train_acc: 2.3f}%,  val_loss: {val_loss:.5f}, val_acc:{100*val_acc: 2.3f}%,")

    print(f"best val acc: {100*best_val_acc:2.3f}%")

    # load the best model
    model = model.__class__(*model.input_args)
    model.load_state_dict(torch.load(model_dir / "best.ckpt"))
    model = model.to(device) 
    
    metrics = {
        'train_losses': train_losses,
        'val_losses': val_losses,
        'train_accs': train_accs,
        'val_accs': val_accs
    }
    return model, metrics
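
The checkpoint-and-stop logic in `train` is a patience-based early stopping scheme: remember the best validation accuracy, reset a counter on every improvement, and stop once the counter reaches the patience. The standalone sketch below isolates that logic on a made-up accuracy history. `epochs_run` and `history` are illustrative names.

```python
def epochs_run(val_accs, patience):
    """Return (epochs executed, best accuracy) under patience-based early stopping."""
    best, stale = float("-inf"), 0
    for epoch, acc in enumerate(val_accs, start=1):
        if acc > best:
            # improvement: remember it and reset the counter
            best, stale = acc, 0
        else:
            # no improvement: count consecutive stale epochs
            stale += 1
            if stale >= patience:
                return epoch, best
    return len(val_accs), best


# made-up validation accuracies, one per epoch
history = [0.50, 0.55, 0.54, 0.53, 0.60, 0.59, 0.58, 0.57]

print(epochs_run(history, patience=2))
print(epochs_run(history, patience=3))
```

With a patience of 2 the run stops before the later improvement to 0.60 is ever seen, which shows why the counter must be reset on every improvement and why the patience should not be too small.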


In [None]:
model = ShallowInceptionNet(c_in=input_shape[2])
model, metrics = train(model)

In [None]:
# plot losses
fig, axs = plt.subplots(nrows=1, ncols=1, figsize=(10,5), dpi=100)

axs.plot(metrics['train_losses'], color="#BDD9BF", marker="o", label="Train loss", linestyle=":", linewidth=2)
axs.plot(metrics['val_losses'], color="#A997DF", marker="o", label="Val loss", linestyle=":", linewidth=2)
axs.set_ylabel("loss", fontsize=20)

acc_axs = axs.twinx()
acc_axs.plot(list(map(lambda x: 100*x, metrics['train_accs'])), color="#BDD9BF", marker="x", label="train acc", linewidth=2)
acc_axs.plot(list(map(lambda x: 100*x, metrics['val_accs'])), color="#A997DF", marker="x", label="val acc", linewidth=2)
acc_axs.set_ylabel("accuracy", fontsize=20)

axs.set_xlabel("Epochs", fontsize=20)

# tick size
axs.tick_params(axis="both", labelsize=15)
acc_axs.tick_params(axis="y", labelsize=15)
    
    
# legend
legend = []
legend.append(Line2D([0,1], [1,0], color="#BDD9BF", label="Train", linewidth=5))
legend.append(Line2D([0,1], [1,0], color="#A997DF", label="Val", linewidth=5))
legend.append(Line2D([0,1], [1,0], color="black", label="Accuracy", linewidth=1))
legend.append(Line2D([0,1], [1,0], color="black", linestyle=":",label="loss", linewidth=1))
lgd = fig.legend(handles=legend, ncol=1, fontsize=15, loc="center right", fancybox=True, bbox_to_anchor=(1.0, 0.5, 0.2, 0))



### Test set performance

In [None]:
# test set performance 
model.eval()  # disable dropout and use running batchnorm statistics
with torch.no_grad():
    test_loss, test_acc = process(model, test_dataloader)

print(f"test dataset loss: {test_loss: 0.5f} \t accuracy: {100*test_acc: 2.3f}%")