# Assignment 4: Self-Attention for Vision

For this assignment, we're going to implement self-attention blocks in a convolutional neural network for CIFAR-10 Classification.

# Part I. Preparation

First, we load the CIFAR-10 dataset. This might take a couple minutes the first time you do it, but the files should stay cached after that.

In [23]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np

In [24]:
NUM_TRAIN = 49000

# The torchvision.transforms package provides tools for preprocessing data
# and for performing data augmentation; here we set up a transform to
# preprocess the data by subtracting the mean RGB value and dividing by the
# standard deviation of each RGB value; we've hardcoded the mean and std.
transform = T.Compose([
                T.ToTensor(),
                T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
            ])

# We set up a Dataset object for each split (train / val / test); Datasets load
# training examples one at a time, so we wrap each Dataset in a DataLoader which
# iterates through the Dataset and forms minibatches. We divide the CIFAR-10
# training set into train and val sets by passing a Sampler object to the
# DataLoader telling how it should sample from the underlying Dataset.
cifar10_train = dset.CIFAR10('./data/datasets', train=True, download=True,
                             transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64, 
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))

cifar10_val = dset.CIFAR10('./data/datasets', train=True, download=True,
                           transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64, 
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))

cifar10_test = dset.CIFAR10('./data/datasets', train=False, download=True, 
                            transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified


You have an option to **use GPU by setting the flag to True below**. It is not necessary to use GPU for this assignment. Note that if your computer does not have CUDA enabled, `torch.cuda.is_available()` will return False and this notebook will fallback to CPU mode.

The global variables `dtype` and `device` will control the data types throughout this assignment. 

In [25]:
USE_GPU = True

dtype = torch.float32 # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Constant to control how frequently we print train loss
print_every = 100

print('using device:', device)

using device: cuda


## Flatten Function

In [26]:
def flatten(x):
    N = x.shape[0] # read in N, C, H, W
    return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image

def test_flatten():
    x = torch.arange(12).view(2, 1, 3, 2)
    print('Before flattening: ', x)
    print('After flattening: ', flatten(x))

test_flatten()

Before flattening:  tensor([[[[ 0,  1],
          [ 2,  3],
          [ 4,  5]]],


        [[[ 6,  7],
          [ 8,  9],
          [10, 11]]]])
After flattening:  tensor([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11]])



### Check Accuracy Function


In [27]:
import torch.nn.functional as F  # useful stateless functions
def check_accuracy(loader, model):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
        return 100 * acc

### Training Loop

In [28]:
def train(model, optimizer, epochs=1):
    """
    Train a model on CIFAR-10 using the PyTorch Module API.
    
    Inputs:
    - model: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model
    - epochs: (Optional) A Python integer giving the number of epochs to train for
    
    Returns: Nothing, but prints model accuracies during training.
    """
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    acc_max = 0
    for e in range(epochs):
        for t, (x, y) in enumerate(loader_train):
            
            model.train()  # put model to training mode
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)

            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            # This is the backwards pass: compute the gradient of the loss with
            # respect to each  parameter of the model.
            loss.backward()

            # Actually update the parameters of the model using the gradients
            # computed by the backwards pass.
            optimizer.step()

            if t % print_every == 0:
                print('Epoch %d, Iteration %d, loss = %.4f' % (e, t, loss.item()))
                acc = check_accuracy(loader_val, model)
                if acc >= acc_max:
                    acc_max = acc
                print()
    print("Maximum accuracy attained: ", acc_max)

In [29]:
# We need to wrap `flatten` function in a module in order to stack it
# in nn.Sequential
class Flatten(nn.Module):
    def forward(self, x):
        return flatten(x)

## Vanilla CNN; No Attention
We implement the vanilla architecture for you here. Do not modify the architecture. You will use the same architecture in the following parts. Do not modify the hyper-parameters.

In [30]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3
num_classes = 10

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, padding=1, stride=1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, padding=1),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2*32*32, num_classes),
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)


train(model, optimizer, epochs=1)

Epoch 0, Iteration 0, loss = 2.3113
Checking accuracy on validation set
Got 103 / 1000 correct (10.30)

Epoch 0, Iteration 100, loss = 1.7587
Checking accuracy on validation set
Got 381 / 1000 correct (38.10)

Epoch 0, Iteration 200, loss = 1.7377
Checking accuracy on validation set
Got 440 / 1000 correct (44.00)

Epoch 0, Iteration 300, loss = 1.5314
Checking accuracy on validation set
Got 469 / 1000 correct (46.90)

Epoch 0, Iteration 400, loss = 1.6199
Checking accuracy on validation set
Got 478 / 1000 correct (47.80)

Epoch 0, Iteration 500, loss = 1.5121
Checking accuracy on validation set
Got 503 / 1000 correct (50.30)

Epoch 0, Iteration 600, loss = 1.3052
Checking accuracy on validation set
Got 533 / 1000 correct (53.30)

Epoch 0, Iteration 700, loss = 1.3668
Checking accuracy on validation set
Got 503 / 1000 correct (50.30)

Maximum accuracy attained:  53.300000000000004


## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.
You should be able to see atleast 55% accuracy

In [31]:
vanillaModel = model
check_accuracy(loader_test, vanillaModel)


Checking accuracy on test set
Got 5319 / 10000 correct (53.19)


53.190000000000005

## Part II Self-Attention

In the next section, you will implement an Attention layer which you will then use within a convnet architecture defined above for cifar 10 classification task.

A self-attention layer is formulated as following:

Input: $X$ of shape $(H\times W, C)$

Query, key, value linear transforms are $W_Q$, $W_K$, $W_V$, of shape $(C, C)$. We implement these linear transforms as 1x1 convolutional layers of the same dimensions.

$XW_Q$, $XW_K$, $XW_V$, represent the output volumes when input X is passed through the transforms.


Self-Attention is given by the formula: $Attention(X) = X + Softmax(\frac{XW_Q(XW_K)^\top}{\sqrt{C}})XW_V$

### Inline Question 1: Self-Attention is equivalent to which of the following: (5 points)
1. K-means clustering <br />
2. Non-local means <br />
3. Residual Block <br />
4. Gaussian Blurring <br />

Your Answer: Non-local means

### Here you implement the Attention module, and run it in the next section (40 points)

In [32]:
# Initialize the attention module as a nn.Module subclass
class Attention(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        
        # TODO: Implement the Key, Query and Value linear transforms as 1x1 convolutional layers
        # Hint: channel size remains constant throughout
        self.conv_query = nn.Conv2d( in_channels, in_channels, kernel_size =1, padding = 0, stride =1)
        self.conv_key = nn.Conv2d( in_channels, in_channels, kernel_size =1, padding = 0, stride =1)
        self.conv_value = nn.Conv2d( in_channels, in_channels, kernel_size =1, padding = 0, stride =1)

    def forward(self, x):
        N, C, H, W = x.shape
        # TODO: Pass the input through conv_query, reshape the output volume to (N, C, H*W)
        q = self.conv_query(x).reshape(N, C, H*W)
#RuntimeError: shape '[64, 64, 1024]' is invalid for input of size 4734976
        # TODO: Pass the input through conv_key, reshape the output volume to (N, C, H*W)
        k = self.conv_key(x).reshape(N, C, H*W)
        
        # TODO: Pass the input through conv_value, reshape the output volume to (N, C, H*W)
        v = self.conv_value(x).reshape(N, C, H*W)

#         x_r = x.reshape(N, C, H*W)
#         print(q.shape)
#         print(q.transpose(1, 2).shape)
#         print(k.shape)
        # TODO: Implement the above formula for attention using q, k, v, C
        # NOTE: The X in the formula is already added for you in the return line
        
        # Calculate attention scores
        attention_scores = torch.matmul(q.transpose(1, 2), k) / torch.sqrt(torch.tensor(C))
        attention_scores = torch.softmax(attention_scores, dim=-1)

#         print(attention_scores.shape)
#         print(v.shape)
        # Apply attention to values
        attention = torch.matmul(attention_scores, v.transpose(1, 2))

        # Reshape the output to (N, C, H, W) before adding to the input volume
        attention = attention.reshape(N, C, H, W)
        return x + attention

## Single Attention Block: Early attention; After the first conv layer. (10 points)

In [33]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after the first Convolutional layer.
# Essentially the architecture should be [Conv->Relu->Attention->Relu->Conv->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, 1 ,1),
    nn.ReLU(),
    Attention(channel_1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, 1, 1),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2 * 32 * 32, num_classes)
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.3087
Checking accuracy on validation set
Got 112 / 1000 correct (11.20)

Epoch 0, Iteration 100, loss = 1.5807
Checking accuracy on validation set
Got 402 / 1000 correct (40.20)

Epoch 0, Iteration 200, loss = 1.4572
Checking accuracy on validation set
Got 463 / 1000 correct (46.30)

Epoch 0, Iteration 300, loss = 1.3600
Checking accuracy on validation set
Got 520 / 1000 correct (52.00)

Epoch 0, Iteration 400, loss = 1.3550
Checking accuracy on validation set
Got 540 / 1000 correct (54.00)

Epoch 0, Iteration 500, loss = 1.4651
Checking accuracy on validation set
Got 557 / 1000 correct (55.70)

Epoch 0, Iteration 600, loss = 1.2878
Checking accuracy on validation set
Got 579 / 1000 correct (57.90)

Epoch 0, Iteration 700, loss = 1.2797
Checking accuracy on validation set
Got 583 / 1000 correct (58.30)

Epoch 1, Iteration 0, loss = 0.8786
Checking accuracy on validation set
Got 597 / 1000 correct (59.70)

Epoch 1, Iteration 100, loss = 1.0277
Checking acc

## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.
You should see improvement of about 2-3% over the vanilla convnet model. * Use this part to tune your Attention module and then move on to the next parts. *

In [34]:
earlyAttention = model
check_accuracy(loader_test, earlyAttention)

Checking accuracy on test set
Got 6243 / 10000 correct (62.43)


62.43

## Single Attention Block: Late attention; After the second conv layer. (10 points)

In [35]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after the Second Convolutional layer.
# Essentially the architecture should be [Conv->Relu->Conv->Relu->Attention->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, 1, 1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, 1, 1),
    nn.ReLU(),
    Attention(channel_2),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2 * 32 * 32, num_classes)
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.2879
Checking accuracy on validation set
Got 135 / 1000 correct (13.50)

Epoch 0, Iteration 100, loss = 1.6502
Checking accuracy on validation set
Got 411 / 1000 correct (41.10)

Epoch 0, Iteration 200, loss = 1.5258
Checking accuracy on validation set
Got 454 / 1000 correct (45.40)

Epoch 0, Iteration 300, loss = 1.6398
Checking accuracy on validation set
Got 470 / 1000 correct (47.00)

Epoch 0, Iteration 400, loss = 1.3946
Checking accuracy on validation set
Got 517 / 1000 correct (51.70)

Epoch 0, Iteration 500, loss = 1.4216
Checking accuracy on validation set
Got 533 / 1000 correct (53.30)

Epoch 0, Iteration 600, loss = 1.4931
Checking accuracy on validation set
Got 519 / 1000 correct (51.90)

Epoch 0, Iteration 700, loss = 1.3623
Checking accuracy on validation set
Got 538 / 1000 correct (53.80)

Epoch 1, Iteration 0, loss = 1.1168
Checking accuracy on validation set
Got 520 / 1000 correct (52.00)

Epoch 1, Iteration 100, loss = 1.2746
Checking acc

## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [36]:
lateAttention = model
check_accuracy(loader_test, lateAttention)

Checking accuracy on test set
Got 5974 / 10000 correct (59.74)


59.74

### Inline Question 2: Provide one example each of usage of self-attention and attention in computer vision. Explain the difference between the two. (5 points)


Your Answer:

Self Attention is used for attend to attend to different spatial locations of the same image, enabling the model to attend to the related regions and their interactions. This is also called spatial or intra atttention. One example use of self attention is Image Segmentation where self-attention can be used to capture long range dependencies between different pixels in the image. By attending to individual pixels and gathering information from their neighboring/ related pixels, the model can generate more precise predictions. This application of self-attention aids in accurately delineating object boundaries and enhancing the overall quality of the segmentation results.

On the other hand, Attention is used to attend to the information from one modality and pass on that information to attend to another modality. It allows the model to attend to relevant parts of one modality based on the information from another modality. For example, in image captioning with textual context, attention can be utilized to establish alignment between significant image regions and corresponding words in the accompanying text. The model selectively focuses on specific regions of the image, taking into account the textual context. This allows the model to generate captions that are more contextually relevant and appropriate.

To Differentiate, self-attention in computer vision emphasizes spatial relationships within a single input, enabling the model to concentrate on various regions and their interactions. In contrast, attention is employed to capture connections between distinct modalities, empowering the model to selectively attend to pertinent components of one modality by leveraging information from another modality.

## Double Attention Blocks: After conv layers 1 and 2 (10 points)

In [37]:
channel_1 = 64
channel_2 = 32
learning_rate = 1e-3

# TODO: Use the above Attention module after the Second Convolutional layer.
# Essentially the architecture should be [Conv->Relu->Attention->Relu->Conv->Relu->Attention->Relu->Linear]

model = nn.Sequential(
    nn.Conv2d(3, channel_1, 3, 1 ,1),
    nn.ReLU(),
    Attention(channel_1),
    nn.ReLU(),
    nn.Conv2d(channel_1, channel_2, 3, 1 ,1),
    nn.ReLU(),
    Attention(channel_2),
    nn.ReLU(),
    Flatten(),
    nn.Linear(channel_2 * 32 * 32, num_classes)
)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 2.3017
Checking accuracy on validation set
Got 133 / 1000 correct (13.30)

Epoch 0, Iteration 100, loss = 1.9606
Checking accuracy on validation set
Got 314 / 1000 correct (31.40)

Epoch 0, Iteration 200, loss = 1.5850
Checking accuracy on validation set
Got 419 / 1000 correct (41.90)

Epoch 0, Iteration 300, loss = 1.3390
Checking accuracy on validation set
Got 424 / 1000 correct (42.40)

Epoch 0, Iteration 400, loss = 1.5136
Checking accuracy on validation set
Got 484 / 1000 correct (48.40)

Epoch 0, Iteration 500, loss = 1.2485
Checking accuracy on validation set
Got 505 / 1000 correct (50.50)

Epoch 0, Iteration 600, loss = 1.3849
Checking accuracy on validation set
Got 510 / 1000 correct (51.00)

Epoch 0, Iteration 700, loss = 1.6672
Checking accuracy on validation set
Got 536 / 1000 correct (53.60)

Epoch 1, Iteration 0, loss = 1.2340
Checking accuracy on validation set
Got 506 / 1000 correct (50.60)

Epoch 1, Iteration 100, loss = 1.3478
Checking acc

## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [38]:
vanillaModel = model
check_accuracy(loader_test, vanillaModel)

Checking accuracy on test set
Got 5997 / 10000 correct (59.97)


59.97

## Resnet with Attention 

Now we will experiment with applying attention within the Resnet10 architecture that we implemented in Homework 2. Please note that for a deeper model such as Resnet we do not expect significant improvements in performance with Attention

## Vanilla Resnet, No Attention

The architecture for Resnet is given below, please train it and evaluate it on the test set.

In [46]:
import torch
import torch.nn as nn

class ResNet(nn.Module):

    def __init__(self, block, layers, img_channels=3, num_classes=100, batchnorm=False):
        super(ResNet, self).__init__() #layers = [1, 1, 1, 1] 
        self.in_channels = 64
        self.conv1 = nn.Conv2d(img_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.batchnorm = batchnorm
        self.layer1 = self.make_layer(block, layers[0], out_channels=64, stride=1, batchnorm=batchnorm)
        self.layer2 = self.make_layer(block, layers[1], out_channels=128, stride=1, batchnorm=batchnorm)
        self.layer3 = self.make_layer(block, layers[2], out_channels=256, stride=1, batchnorm=batchnorm)
        self.layer4 = self.make_layer(block, layers[3], out_channels=512, stride=2, batchnorm=batchnorm)

        self.averagepool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    
    def forward(self, x):

        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x) 
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.averagepool(x)
        x = x.reshape(x.shape[0], -1)
        x = x.reshape(x.shape[0], -1)
        x = self.fc(x)

        return x


        

    def make_layer(self, block, num_blocks, out_channels, stride, batchnorm=False):
        downsampler = None
        layers = []
        if stride != 1 or self.in_channels != out_channels:
            downsampler = nn.Sequential(nn.Conv2d(self.in_channels, out_channels, kernel_size = 1, stride = stride), nn.BatchNorm2d(out_channels))

        layers.append(block(self.in_channels, out_channels, downsampler, stride, batchnorm=batchnorm))

        self.in_channels = out_channels

        for i in range(num_blocks - 1):
            layers.append(block(self.in_channels, out_channels))

        
        return nn.Sequential(*layers)
        
class block(nn.Module):

    def __init__(self, in_channels, out_channels, downsampler = None, stride = 1, batchnorm=False):
        
        super(block, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size = 3, padding = 2)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size = 3, stride = stride)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsampler = downsampler
        self.relu = nn.ReLU()
        self.batchnorm = batchnorm

    
    def forward(self, x):

        residual = x
        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        if self.batchnorm:
            x = self.bn2(x)
        x = self.relu(x)
        
        if self.downsampler:
            residual = self.downsampler(residual)

        return self.relu(residual + x)
    


def ResNet10(num_classes = 100, batchnorm= False):

    return ResNet(block, [1, 1, 1, 1], num_classes=num_classes, batchnorm=batchnorm)


## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [47]:
learning_rate = 1e-3

model = ResNet10()

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

vanillaResnet = model
check_accuracy(loader_test, vanillaResnet)

Epoch 0, Iteration 0, loss = 4.5005
Checking accuracy on validation set
Got 119 / 1000 correct (11.90)

Epoch 0, Iteration 100, loss = 1.5761
Checking accuracy on validation set
Got 396 / 1000 correct (39.60)

Epoch 0, Iteration 200, loss = 1.4433
Checking accuracy on validation set
Got 406 / 1000 correct (40.60)

Epoch 0, Iteration 300, loss = 1.2680
Checking accuracy on validation set
Got 455 / 1000 correct (45.50)

Epoch 0, Iteration 400, loss = 0.9880
Checking accuracy on validation set
Got 500 / 1000 correct (50.00)

Epoch 0, Iteration 500, loss = 1.3753
Checking accuracy on validation set
Got 485 / 1000 correct (48.50)

Epoch 0, Iteration 600, loss = 0.9280
Checking accuracy on validation set
Got 567 / 1000 correct (56.70)

Epoch 0, Iteration 700, loss = 1.1729
Checking accuracy on validation set
Got 523 / 1000 correct (52.30)

Epoch 1, Iteration 0, loss = 0.9300
Checking accuracy on validation set
Got 577 / 1000 correct (57.70)

Epoch 1, Iteration 100, loss = 1.1176
Checking acc

74.63

In [48]:
Resnet = model
check_accuracy(loader_test, Resnet)

Checking accuracy on test set
Got 7463 / 10000 correct (74.63)


74.63

## Resnet with Attention (5 points)

In [42]:
import torch
import torch.nn as nn

class ResNet(nn.Module):

    def __init__(self, block, layers, img_channels=3, num_classes=100, batchnorm=False):
        super(ResNet, self).__init__() #layers = [1, 1, 1, 1] 
        self.in_channels = 64
        self.conv1 = nn.Conv2d(img_channels, 64, kernel_size=7, stride=2, padding=3)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.batchnorm = batchnorm
        self.layer1 = self.make_layer(block, layers[0], out_channels=64, stride=1, batchnorm=batchnorm)
        self.layer2 = self.make_layer(block, layers[1], out_channels=128, stride=1, batchnorm=batchnorm)
        self.layer3 = self.make_layer(block, layers[2], out_channels=256, stride=1, batchnorm=batchnorm)
        self.layer4 = self.make_layer(block, layers[3], out_channels=512, stride=2, batchnorm=batchnorm)

        self.averagepool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)
        
        self.attention = Attention(128)

    
    def forward(self, x):

        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x) 
        x = self.layer2(x)
        x = self.attention(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.averagepool(x)
        x = x.reshape(x.shape[0], -1)
        x = x.reshape(x.shape[0], -1)
        x = self.fc(x)

        return x


        

    def make_layer(self, block, num_blocks, out_channels, stride, batchnorm=False):
        downsampler = None
        layers = []
        if stride != 1 or self.in_channels != out_channels:
            downsampler = nn.Sequential(nn.Conv2d(self.in_channels, out_channels, kernel_size = 1, stride = stride), nn.BatchNorm2d(out_channels))

        layers.append(block(self.in_channels, out_channels, downsampler, stride, batchnorm=batchnorm))

        self.in_channels = out_channels

        for i in range(num_blocks - 1):
            layers.append(block(self.in_channels, out_channels))

        
        return nn.Sequential(*layers)
        
class block(nn.Module):

    def __init__(self, in_channels, out_channels, downsampler = None, stride = 1, batchnorm=False):
        
        super(block, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size = 3, padding = 2)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size = 3, stride = stride)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsampler = downsampler
        self.relu = nn.ReLU()
        self.batchnorm = batchnorm

    
    def forward(self, x):

        residual = x
        x = self.conv1(x)
        if self.batchnorm:
            x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        if self.batchnorm:
            x = self.bn2(x)
        x = self.relu(x)
        
        if self.downsampler:
            residual = self.downsampler(residual)

        return self.relu(residual + x)
    

def ResNet10(num_classes = 100, batchnorm= False):

    return ResNet(block, [1, 1, 1, 1], num_classes=num_classes, batchnorm=batchnorm)

In [43]:
## Resnet with Attention
learning_rate = 1e-3

# TODO: Use the above Attention module after the 2nd resnet block i.e. after self.layer2.

model = ResNet10()

optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train(model, optimizer, epochs=10)

Epoch 0, Iteration 0, loss = 4.6657
Checking accuracy on validation set
Got 119 / 1000 correct (11.90)

Epoch 0, Iteration 100, loss = 1.4973
Checking accuracy on validation set
Got 401 / 1000 correct (40.10)

Epoch 0, Iteration 200, loss = 1.6365
Checking accuracy on validation set
Got 419 / 1000 correct (41.90)

Epoch 0, Iteration 300, loss = 1.2889
Checking accuracy on validation set
Got 507 / 1000 correct (50.70)

Epoch 0, Iteration 400, loss = 1.3445
Checking accuracy on validation set
Got 506 / 1000 correct (50.60)

Epoch 0, Iteration 500, loss = 1.1996
Checking accuracy on validation set
Got 562 / 1000 correct (56.20)

Epoch 0, Iteration 600, loss = 1.1652
Checking accuracy on validation set
Got 595 / 1000 correct (59.50)

Epoch 0, Iteration 700, loss = 1.0385
Checking accuracy on validation set
Got 586 / 1000 correct (58.60)

Epoch 1, Iteration 0, loss = 0.8241
Checking accuracy on validation set
Got 584 / 1000 correct (58.40)

Epoch 1, Iteration 100, loss = 1.0126
Checking acc

## Test set -- run this only once

Now we test our model on the test set . Think about how this compares to your validation set accuracy.

In [45]:
AttentionResnet = model
check_accuracy(loader_test, AttentionResnet)

Checking accuracy on test set
Got 7391 / 10000 correct (73.91)


73.91

## Inline Question 3: Rank the above models based on their performance on test dataset (15 points)
( You are encouraged to run each of the experiments (training) at
least 3 times to get an average estimate )

Report the test accuracies alongside the model names. For example, 1. Vanilla CNN (57.45%, 57.99%).. etc

1. Resnet with Attention (75.9%, 75.79%, 74.01) <br />
2. Vanilla Resnet (74.63%, 75.98%, 74.94%) <br /> 
3. Single Attention Block: Early attention; After the first conv layer (63.3%, 61.84%, 62.43%) <br /> 
4. Double Attention Blocks (62.57%, 62.49%, 61.97%) <br /> 
5. Single Attention Block: Late attention; After the second conv layer (59.08%, 59.89%, 59.74%) <br />
6. Vanilla CNN (53.8%, 52.77%, 53.3%) <br /> 

### Bonus Question (Ungraded): Can you give a possible explanation that supports the rankings?
Your Answer: Attention helps in attending and understanding the different features in the model, thus gives a better rating using attention. Resnet gives a better prediction as they have a residual block that helps in improving the performance. As the number of attention block increases, the model performance doesn't have a greater improvement. The rankings can be explained based on the characteristics of each model. Resnet with Attention ranks first due to its combination of Resnet architecture, which addresses the vanishing gradient problem, and attention mechanisms, which allow the model to focus on important features. This combination results in improved performance. Vanilla Resnet comes in second as it utilizes residual blocks to enable deeper networks and learn complex representations. The single attention block with early attention ranks third, capturing important features at an early stage and providing a moderate performance boost. The double attention blocks follow, offering some improvement but not significantly more than a single attention block. The single attention block with late attention ranks fifth, capturing higher-level features but with a less pronounced impact on performance. Finally, the vanilla CNN without attention or residual connections ranks last, as it lacks the enhancements needed to capture complex patterns effectively. Overall, the rankings highlight the benefits of attention mechanisms and the importance of model architecture in achieving higher performance.