# Attentional Networks in Computer Vision

Prepared by comp411 Teaching Unit (TA Can Küçüksözen) in the context of Computer Vision with Deep Learning Course. Do not hesitate to ask in case you have any questions, contact me at: ckucuksozen19@ku.edu.tr

Up until this point, we have worked with deep fully-connected networks, convolutional networks and recurrent networks using them to explore different optimization strategies and network architectures. Fully-connected networks are a good testbed for experimentation because they are very computationally efficient, on the other hand, most successful image processing methods use convolutional networks. However recent state-of-the-art results on computer vision realm are acquired using Attentional layers and Transformer architectures.

First you will implement several layer types that are used in fully attentional networks. You will then use these layers to train an Attentional Image Classification network, specifically a smaller version of Vision Transformer (VIT) on the CIFAR-10 dataset. The original paper can be accessed via the following link: https://arxiv.org/pdf/2010.11929.pdf

# Part I. Preparation

First, we load the CIFAR-10 dataset. This might take a couple minutes the first time you do it, but the files should stay cached after that.

In previous parts of the assignment we had to write our own code to download the CIFAR-10 dataset, preprocess it, and iterate through it in minibatches; PyTorch provides convenient tools to automate this process for us.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.utils.data import sampler
import torch.nn.functional as F
from torch.autograd import Variable

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np

In [2]:
NUM_TRAIN = 49000

# The torchvision.transforms package provides tools for preprocessing data
# and for performing data augmentation; here we set up a transform to
# preprocess the data by subtracting the mean RGB value and dividing by the
# standard deviation of each RGB value; we've hardcoded the mean and std.
transform = T.Compose([
                T.ToTensor(),
                T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
            ])

# We set up a Dataset object for each split (train / val / test); Datasets load
# training examples one at a time, so we wrap each Dataset in a DataLoader which
# iterates through the Dataset and forms minibatches. We divide the CIFAR-10
# training set into train and val sets by passing a Sampler object to the
# DataLoader telling how it should sample from the underlying Dataset.
cifar10_train = dset.CIFAR10('./comp411/datasets', train=True, download=True,
                             transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64, 
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))

cifar10_val = dset.CIFAR10('./comp411/datasets', train=True, download=True,
                           transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64, 
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))

cifar10_test = dset.CIFAR10('./comp411/datasets', train=False, download=True, 
                            transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)

Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified


You have an option to **use GPU by setting the flag to True below**. It is not necessary to use GPU for this assignment. Note that if your computer does not have CUDA enabled, `torch.cuda.is_available()` will return False and this notebook will fallback to CPU mode.

The global variables `dtype` and `device` will control the data types throughout this assignment. 

In [3]:
USE_GPU = True

dtype = torch.float32 # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Constant to control how frequently we print train loss
print_every = 100

print('using device:', device)

using device: cpu


# Part II. Barebones Transformers: Self-Attentional Layer

Here you will complete the implementation of the Pytorch nn.module `SelfAttention`, which will perform the forward pass of a self-attentional layer. Our implementation of the SelfAttentional layer will include three distinct fully connected layers which will be responsible of:

1. A fully connected layer, `W_Q`, which will be used to project our input into `queries`
2. A fully connected layer, `W_K`, which will be used to project our input into `keys`
3. A fully connected layer, `W_V`, which will be used to project our input into `values`

After defining such three fully connected layers, and obtain our `queries, keys, and values` variables at the beginning of our forward pass, the following operations should be carried out in order to complete the attentional layer implementation.

1. Seperate each of `query, key, and value` projections into their respective heads. In other words, split the feature vector dimension of each matrix into necessarry number of chunks.

2. Compute the `Attention Scores` between each pair of sequence elements via conducting a scaled dot product operation between every pair of `queries` and `keys`. Note that `Attention Scores` matrix should have the size of `[# of queries , # of keys]`

3. Calculate the `Attention Weights` of each query by applying the non-linear `Softmax` normalization accross the `keys` dimension of the `Attention Scores` matrix.

4. Obtain the output combination of `values` by matrix multiplying `Attention Weights` with `values`

5. Reassemble heads into one flat vector and return the output.

**HINT**: For a more detailed explanation of the self attentional layer, examine the Appendix A of the original ViT manuscript here:  https://arxiv.org/pdf/2010.11929.pdf 

In [6]:
from re import X
class SelfAttention(nn.Module):
    
    def __init__(self, input_dims, head_dims=128, num_heads=2,  bias=False):
        super(SelfAttention, self).__init__()
        
        ## initialize module's instance variables
        self.input_dims = input_dims
        self.head_dims = head_dims
        self.num_heads = num_heads
        self.proj_dims = head_dims * num_heads
        
        ## Declare module's parameters
        self.W_Q = nn.Linear(input_dims, self.proj_dims,bias=bias)
        self.W_K = nn.Linear(input_dims, self.proj_dims,bias=bias)
        self.W_V = nn.Linear(input_dims, self.proj_dims,bias=bias)

        self.init_weights()
        
    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_normal_(m.weight.data)
                if m.bias is not None:
                    m.bias.data.fill_(0.1)

    def forward(self, x):
        ## Input of shape, [B, N, D] where:
        ## - B denotes the batch size
        ## - N denotes number of sequence elements. I.e. the number of patches + the class token 
        ## - D corresponds to model dimensionality
        b,n,d = x.shape
        
        ## Construct queries,keys,values
        q_ = self.W_Q(x)
        k_ = self.W_K(x)
        v_ = self.W_V(x)
        
        ## Seperate q,k,v into their corresponding heads,
        ## After this operation each q,k,v will have the shape: [B,H,N,D//H] where
        ## - B denotes the batch size
        ## - H denotes number of heads
        ## - N denotes number of sequence elements. I.e. the number of patches + the class token 
        ## - D//H corresponds to per head dimensionality
        q, k, v = map(lambda z: torch.reshape(z, (b,n,self.num_heads,self.head_dims)).permute(0,2,1,3), [q_,k_,v_])
       
        #########################################################################################
        # TODO: Complete the forward pass of the SelfAttention layer, follow the comments below #
        #########################################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        
        ## Compute attention logits. Note that this operation is conducted as a
        ## batched matrix multiplication between q and k, the output is scaled by 1/(D//H)^(1/2)
        ## inputs are queries and keys that are both of size [B,H,N,D//H]
        ## Output Attention logits should have the size: [B,H,N,N]
       
        alignment_scores = torch.matmul(q, k.transpose(-1, -2)) / np.sqrt(self.head_dims)

        ## Compute attention Weights. Note that this operation is conducted as a
        ## Softmax Normalization across the keys dimension. 
        ## Hint: You can apply the Softmax operation across the final dimension

        attention_scores = F.softmax(alignment_scores, dim=2)
       
        ## Compute output values. Note that this operation is conducted as a 
        ## batched matrix multiplication between the Attention Weights matrix and 
        ## the values tensor. After computing output values, the output should be reshaped
        ## Inputs are Attention Weights with size [B, H, N, N], values with size [B, H, N, D//H]
        ## Output should be of size [B, N, D]
        ## Hint: you should use torch.matmul, torch.permute, torch.reshape in that order
        
        attn_out = torch.matmul(attention_scores, v)
        attn_out = attn_out.permute(0,2,1,3)
        attn_out = torch.reshape(attn_out,(b,n,self.proj_dims))
        
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        ################################################################################
        #                                 END OF YOUR CODE                             
        ################################################################################
    
        return attn_out

After defining the forward pass of the Self-Attentional Layer above, run the following cell to test your implementation.

When you run this function, output should have shape [64,16,256]

In [8]:
def test_self_attn_layer():
    x = torch.zeros((64, 16, 32), dtype=dtype)  # minibatch size 64, sequence elements size 16, feature channels size 32
    layer = SelfAttention(32,64,4)
    out = layer(x)
    print(out.size())  # you should see [64,16,256]
test_self_attn_layer()

torch.Size([64, 16, 256])


# Part III. Barebones Transformers: Transformer Encoder Block

Here you will complete the implementation of the Pytorch nn.module `TransformerBlock`, which will perform the forward pass of a Transfomer Encoder Block. You can refer to Figure 1 of the original manuscript of ViT from this link: https://arxiv.org/pdf/2010.11929.pdf in order to get yourself familiar with the architecture.



In [9]:
## Implementation of a two layer GELU activated Fully Connected Network is provided for you below:

class MLP(nn.Module):
    def __init__(self, input_dims, hidden_dims, output_dims, bias=True):
        super().__init__()
        
        self.fc_1 = nn.Linear(input_dims, hidden_dims, bias=bias)
        self.fc_2 = nn.Linear(hidden_dims, output_dims, bias=bias)
        
        self.init_weights()
        
    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_normal_(m.weight.data)
                if m.bias is not None:
                    m.bias.data.fill_(0.1)
        
    def forward(self, x):
        o = F.elu(self.fc_1(x))
        o = self.fc_2(o)
        return o

In [11]:
## Build from scratch a TransformerBlock Module. Note that the architecture of this
## module follows a simple computational pipeline:
## input --> layernorm --> SelfAttention --> skip connection 
##       --> layernorm --> MLP ---> skip connection ---> output
## Note that the TransformerBlock module works on a single hidden dimension hidden_dims,
## in order to faciliate skip connections with ease. Be careful about the input arguments
## to the SelfAttention block.


class TransformerBlock(nn.Module):
    def __init__(self, hidden_dims, num_heads=4, bias=False):
        super(TransformerBlock, self).__init__()
        
    ###############################################################
    # TODO: Complete the consturctor of  TransformerBlock module  #
    ###############################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)****
        self.hidden_dims = hidden_dims
        self.num_heads = num_heads
        self.head_dims = self.hidden_dims // self.num_heads
        
        self.norm1 = nn.LayerNorm(self.hidden_dims) 
        self.attention = SelfAttention(self.hidden_dims,head_dims=self.head_dims,num_heads=self.num_heads)
        self.norm2 = nn.LayerNorm(self.hidden_dims)
        self.mlp = MLP(hidden_dims, hidden_dims, hidden_dims)
        
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###################################################################
    #                                 END OF YOUR CODE                #             
    ###################################################################
        
    def forward(self, x):
        
    ##############################################################
    # TODO: Complete the forward of TransformerBlock module      #
    ##############################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)****

        norm1_out = self.norm1(x)
        attention_out = self.attention(norm1_out) + x
        norm2_out = self.norm2(attention_out)
        mlp_out = self.mlp(norm2_out)
        out = mlp_out + attention_out

        return out

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ###################################################################
    #                                 END OF YOUR CODE                #             
    ###################################################################

After defining the forward pass of the Transformer Block Layer above, run the following cell to test your implementation.

When you run this function, output should have shape (64, 16, 64).

In [12]:
def test_transfomerblock_layer():
    x = torch.zeros((64, 16, 128), dtype=dtype)  # minibatch size 64, sequence elements size 16, feature channels size 128
    layer = TransformerBlock(128,4) # hidden dims size 128, heads size 4
    out = layer(x)
    print(out.size())  # you should see [64,16,128]
test_transfomerblock_layer()

torch.Size([64, 16, 128])


# Part IV The Vision Transformer (ViT)

The final implementation for the Pytorch nn.module `ViT` is given to you below, which will perform the forward pass of the Vision Transformer. Study it and get yourself familiar with the API.


In [13]:
class ViT(nn.Module):
    def __init__(self, hidden_dims, input_dims=3, output_dims=10, num_trans_layers = 4, num_heads=4, image_k=32, patch_k=4, bias=False):
        super(ViT, self).__init__()
                
        ## initialize module's instance variables
        self.hidden_dims = hidden_dims
        self.input_dims = input_dims
        self.output_dims = output_dims
        self.num_trans_layers = num_trans_layers
        self.num_heads = num_heads
        self.image_k = image_k
        self.patch_k = patch_k
        
        self.image_height = self.image_width = image_k
        self.patch_height = self.patch_width = patch_k
        
        assert self.image_height % self.patch_height == 0 and self.image_width % self.patch_width == 0,\
                'Image size must be divisible by the patch size.'

        self.num_patches = (self.image_height // self.patch_height) * (self.image_width // self.patch_width)
        self.patch_flat_len = self.patch_height * self.patch_width
        
        ## Declare module's parameters
        
        ## ViT's flattened patch embedding projection:
        self.linear_embed = nn.Linear(self.input_dims*self.patch_flat_len, self.hidden_dims)
        
        ## Learnable positional embeddings, an embedding is learned for each patch location and the class token
        self.pos_embedding = nn.Parameter(torch.randn(1, self.num_patches + 1, self.hidden_dims))
        
        ## Learnable classt token and its index among attention sequence elements.
        self.cls_token = nn.Parameter(torch.randn(1,1,self.hidden_dims))
        self.cls_index = torch.LongTensor([0])
        
        ## Declare cascaded Transformer blocks:
        transformer_encoder_list = []
        for _ in range(self.num_trans_layers):
            transformer_encoder_list.append(TransformerBlock(self.hidden_dims, self.num_heads, bias))
        self.transformer_encoder = nn.Sequential(*transformer_encoder_list)
        
        ## Declare the output mlp:
        self.out_mlp = MLP(self.hidden_dims, self.hidden_dims, self.output_dims)
         
    def unfold(self, x, f = 7, st = 4, p = 0):
        ## Create sliding window pathes using nn.Functional.unfold
        ## Input dimensions: [B,D,H,W] where
        ## --B : input batch size
        ## --D : input channels
        ## --H, W: input height and width
        ## Output dimensions: [B,N,H*W,D]
        ## --N : number of patches, decided according to sliding window kernel size (f),
        ##      sliding window stride and padding.
        b,d,h,w = x.shape
        x_unf = F.unfold(x, (f,f), stride=st, padding=p)    
        x_unf = torch.reshape(x_unf.permute(0,2,1), (b,-1,d,f*f)).transpose(-1,-2)
        n = x_unf.size(1)
        return x_unf,n
    
    def forward(self, x):
        b = x.size(0)
        ## create sliding window patches from the input image
        x_patches,n = self.unfold(x, self.patch_height, self.patch_height, 0)
        ## flatten each patch into a 1d vector: i.e. 3x4x4 image patch turned into 1x1x48
        x_patch_flat = torch.reshape(x_patches, (b,n,-1))
        ## linearly embed each flattened patch
        x_embed = self.linear_embed(x_patch_flat)
        
        ## retrieve class token 
        cls_tokens = self.cls_token.repeat(b,1,1)
        ## concatanate class token to input patches
        xcls_embed = torch.cat([cls_tokens, x_embed], dim=-2)
        
        ## add positional embedding to input patches + class token 
        xcls_pos_embed = xcls_embed + self.pos_embedding
        
        ## pass through the transformer encoder
        trans_out = self.transformer_encoder(xcls_pos_embed)
        
        ## select the class token 
        out_cls_token = torch.index_select(trans_out, -2, self.cls_index.to(trans_out.device))
        
        ## create output
        out = self.out_mlp(out_cls_token)
        
        return out.squeeze(-2)

After defining the forward pass of the ViT above, run the following cell to test your implementation.

When you run this function, output should have shape (64, 16, 64).

In [14]:
def test_vit():
    x = torch.zeros((64, 3, 32, 32), dtype=dtype)  # minibatch size 64, image size 3,32,32
    model = ViT(hidden_dims=128, input_dims=3, output_dims=10, num_trans_layers = 4, num_heads=4, image_k=32, patch_k=4)
    out = model(x)
    print(out.size())  # you should see [64,10]
test_vit()

torch.Size([64, 10])


# Part V. Train the ViT

### Check Accuracy
Given any minibatch of input data and desired targets, we can check the classification accuracy of a neural network. 

The check_batch_accuracy function is provided for you below:

In [15]:
def check_batch_accuracy(out, target,eps=1e-7):
    b, c = out.shape
    with torch.no_grad():
        _, pred = out.max(-1) 
        correct = np.sum(np.equal(pred.cpu().numpy(), target.cpu().numpy()))
    return correct, np.float(correct) / (b)

### Training Loop
As we have already seen in the Second Assignment, in our PyTorch based training loops, we use an Optimizer object from the `torch.optim` package, which abstract the notion of an optimization algorithm and provides implementations of most of the algorithms commonly used to optimize neural networks.

In [16]:
def train(network, optimizer, trainloader):
    """
    Train a model on CIFAR-10 using the PyTorch Module API for a single epoch
    
    Inputs:
    - network: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model
    - trainloader: Iterable DataLoader object that fetches the minibatches
    
    Returns: overall training accuracy for the epoch
    """
    print('\nEpoch: %d' % epoch)
    network.train()  # put model to training mode
    network = network.to(device=device)  # move the model parameters to CPU/GPU
    train_loss = 0
    correct = 0
    total = 0
    
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = Variable(inputs.to(device)), targets.to(device)  # move to device, e.g. GPU
            
        outputs = network(inputs)
        loss =  F.cross_entropy(outputs, targets)
            
        # Zero out all of the gradients for the variables which the optimizer
        # will update.
        optimizer.zero_grad() 

        # This is the backwards pass: compute the gradient of the loss with
        # respect to each  parameter of the model.
        loss.backward()
            
        # Actually update the parameters of the model using the gradients
        # computed by the backwards pass.
        optimizer.step()
            
        loss = loss.detach()
        train_loss += loss.item()
        correct_p, _ = check_batch_accuracy(outputs, targets) 
        correct += correct_p
        total += targets.size(0)

        print('Loss: %.3f | Acc: %.3f%% (%d/%d)'
        % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
        
    return 100.*correct/total

### Evaluation Loop
We have also prepared a Evaluation loop in order to determine our networks capabilities in terms of classification accuracy on a given dataset, either the training, or the validation split

In [17]:
def evaluate(network, evalloader):
    """
    Evaluate a model on CIFAR-10 using the PyTorch Module API for a single epoch
    
    Inputs:
    - network: A PyTorch Module giving the model to train.
    - evalloader: Iterable DataLoader object that fetches the minibatches
    
    Returns: overall evaluation accuracy for the epoch
    """
    network.eval() # put model to evaluation mode
    network = network.to(device=device)  # move the model parameters to CPU/GPU
    eval_loss = 0
    correct = 0
    total = 0
    print('\n---- Evaluation in process ----')
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(evalloader):
            inputs, targets = inputs.to(device), targets.to(device) # move to device, e.g. GPU
            outputs = network(inputs)
            loss = F.cross_entropy(outputs, targets)
            
            eval_loss += loss.item()
            correct_p, _ = check_batch_accuracy(outputs, targets)
            correct += correct_p
            total += targets.size(0)
            print('Loss: %.3f | Acc: %.3f%% (%d/%d)'
                % (eval_loss/(batch_idx+1), 100.*correct/total, correct, total))
    return 100.*correct/total

### Overfit a ViT
Now we are ready to run the training loop. A nice trick is to train your model with just a few training samples in order to see if your implementation is actually bug free. 

Simply pass the input size, hidden layer size, and number of classes (i.e. output size) to the constructor of `ViT`. 

You also need to define an optimizer that tracks all the learnable parameters inside `ViT`. We prefer to use `Adam` optimizer for this part.

You should be able to overfit small datasets, which will result in very high training accuracy and comparatively low validation accuracy.

In [18]:
sample_idx_tr = torch.randperm(len(cifar10_train))[:100]
sample_idx_val = torch.randperm(len(cifar10_train))[-100:]

trainset_sub = torch.utils.data.Subset(cifar10_train, sample_idx_tr)
valset_sub = torch.utils.data.Subset(cifar10_train, sample_idx_val)

print("For overfitting experiments, the subset of the dataset that is used has {} sample images".format(len(trainset_sub)))

batch_size_sub = 25
trainloader_sub = torch.utils.data.DataLoader(trainset_sub, batch_size=batch_size_sub, shuffle=True)
valloader_sub = torch.utils.data.DataLoader(valset_sub, batch_size=batch_size_sub, shuffle=False)

print('==> Data ready, batchsize = {}'.format(batch_size_sub))

For overfitting experiments, the subset of the dataset that is used has 100 sample images
==> Data ready, batchsize = 25


In [21]:
learning_rate = 0.002
input_dims = 3
hidden_dims = 128
output_dims = 10
num_trans_layers = 4
num_heads = 4
image_k = 32
patch_k = 4

model = None
optimizer = None

################################################################################
# TODO: Instantiate your ViT model and a corresponding optimizer #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

model = ViT(hidden_dims,input_dims,output_dims,num_trans_layers,num_heads,image_k,patch_k)

optimizer = optim.Adam(model.parameters(),lr=learning_rate,betas=(0.9,0.999))

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
################################################################################
#                                 END OF YOUR CODE                             
################################################################################

tr_accs=[]
eval_accs=[]
for epoch in range(10):
    tr_acc = train(model, optimizer, trainloader_sub)
    print('Epoch {} of training is completed, Training accuracy for this epoch is {}'\
              .format(epoch, tr_acc))  
    
    eval_acc = evaluate(model, valloader_sub)
    print('Evaluation of Epoch {} is completed, Validation accuracy for this epoch is {}'\
              .format(epoch, eval_acc))  
    tr_accs.append(tr_acc)
    eval_accs.append(eval_acc)
    
print("\nFinal train set accuracy is {}".format(tr_accs[-1]))
print("Final val set accuracy is {}".format(eval_accs[-1]))


Epoch: 0
Loss: 4.011 | Acc: 8.000% (2/25)
Loss: 6.532 | Acc: 4.000% (2/50)
Loss: 6.766 | Acc: 8.000% (6/75)
Loss: 6.544 | Acc: 9.000% (9/100)
Epoch 0 of training is completed, Training accuracy for this epoch is 9.0

---- Evaluation in process ----
Loss: 5.032 | Acc: 8.000% (2/25)
Loss: 5.127 | Acc: 6.000% (3/50)
Loss: 4.739 | Acc: 6.667% (5/75)
Loss: 4.756 | Acc: 7.000% (7/100)
Evaluation of Epoch 0 is completed, Validation accuracy for this epoch is 7.0

Epoch: 1
Loss: 4.199 | Acc: 16.000% (4/25)
Loss: 4.150 | Acc: 24.000% (12/50)
Loss: 3.954 | Acc: 20.000% (15/75)
Loss: 3.753 | Acc: 18.000% (18/100)
Epoch 1 of training is completed, Training accuracy for this epoch is 18.0

---- Evaluation in process ----
Loss: 3.721 | Acc: 24.000% (6/25)
Loss: 3.183 | Acc: 24.000% (12/50)
Loss: 3.337 | Acc: 20.000% (15/75)
Loss: 3.435 | Acc: 19.000% (19/100)
Evaluation of Epoch 1 is completed, Validation accuracy for this epoch is 19.0

Epoch: 2
Loss: 3.847 | Acc: 16.000% (4/25)
Loss: 3.079 | Acc:

## Train the net
By training the four-layer ViT network for three epochs, with untuned hyperparameters that are initialized as below,  you should achieve greater than 50% accuracy both on the training set and the test set:

In [22]:
learning_rate = 0.002
input_dims = 3
hidden_dims = 128
output_dims = 10
num_trans_layers = 4
num_heads = 4
image_k = 32
patch_k = 4

model = None
optimizer = None

################################################################################
# TODO: Instantiate your ViT model and a corresponding optimizer #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

model = ViT(hidden_dims,input_dims,output_dims,num_trans_layers,num_heads,image_k,patch_k)

optimizer = optim.Adam(model.parameters(),lr=learning_rate,betas=(0.9,0.999))

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
################################################################################
#                                 END OF YOUR CODE                             
################################################################################

tr_accs=[]
test_accs=[]
for epoch in range(3):
    tr_acc = train(model, optimizer, loader_train)
    print('Epoch {} of training is completed, Training accuracy for this epoch is {}'\
              .format(epoch, tr_acc))  
    
    test_acc = evaluate(model, loader_test)
    print('Evaluation of Epoch {} is completed, Test accuracy for this epoch is {}'\
              .format(epoch, test_acc))  
    
    tr_accs.append(tr_acc)
    test_accs.append(test_acc)
    
print("\nFinal train set accuracy is {}".format(tr_accs[-1]))
print("Final test set accuracy is {}".format(test_accs[-1]))


Epoch: 0
Loss: 5.124 | Acc: 12.500% (8/64)
Loss: 6.017 | Acc: 8.594% (11/128)
Loss: 6.308 | Acc: 9.375% (18/192)
Loss: 5.801 | Acc: 10.547% (27/256)
Loss: 5.613 | Acc: 11.875% (38/320)
Loss: 5.287 | Acc: 12.240% (47/384)
Loss: 5.055 | Acc: 12.054% (54/448)
Loss: 4.823 | Acc: 12.695% (65/512)
Loss: 4.684 | Acc: 11.979% (69/576)
Loss: 4.479 | Acc: 12.500% (80/640)
Loss: 4.350 | Acc: 12.642% (89/704)
Loss: 4.210 | Acc: 13.542% (104/768)
Loss: 4.095 | Acc: 13.702% (114/832)
Loss: 3.981 | Acc: 13.951% (125/896)
Loss: 3.878 | Acc: 14.062% (135/960)
Loss: 3.784 | Acc: 14.258% (146/1024)
Loss: 3.710 | Acc: 14.430% (157/1088)
Loss: 3.626 | Acc: 14.931% (172/1152)
Loss: 3.555 | Acc: 15.214% (185/1216)
Loss: 3.485 | Acc: 15.547% (199/1280)
Loss: 3.415 | Acc: 16.220% (218/1344)
Loss: 3.361 | Acc: 16.122% (227/1408)
Loss: 3.312 | Acc: 16.168% (238/1472)
Loss: 3.273 | Acc: 16.406% (252/1536)
Loss: 3.227 | Acc: 16.562% (265/1600)
Loss: 3.192 | Acc: 16.647% (277/1664)
Loss: 3.153 | Acc: 16.956% (293/