Homework Group A
## Osama Al Kamel / Mtr Num: 3141575
## Joshua Oldridge / Mtr Num: 3140770
---

# Introduction to Vision Transformers - Mini Project
---

Instructions are given in <span style="color:blue">blue</span> color.

**General Instructions and Hints**

* <span style="color:blue"> In your solution notebook, make it clear and explain what you did for which one of the tasks using markdown and / or commentary as appropriate.</span>
* You will be able to make use of or at least be inspired by some of the material already provided for other topics in this class
* <span style="color:red"> Whenever you use something from a specific source or by employing a specific tool <b>academic honesty demands</b> that you reference the original source!!!</span>

---

## Overview

- [Imports](#Imports)
- [Task 1: Inductive bias in transformer models](#Task-1:-Inductive-bias-in-transformer-models)
- [Task 2: Visualizing attention in ViTs](#Task-2:-Visualizing-attention-in-ViTs)
  - [Task 2.1: Modify our ViT implementation to output the attention matrix](#Task-2.1:-Modify-our-ViT-implementation-to-output-the-attention-matrix)
  - [Task 2.2: Write a function that creates an attention heatmap for a specific image](#Task-2.2:-Write-a-function-that-creates-an-attention-heatmap-for-a-specific-image)
  - [Task 2.3: Use the function to visualize attention maps for example images](#Task-2.3:-Use-the-function-to-visualize-attention-maps-for-example-images)
- [Task 3: Using the inductive bias of CNNs to support the training of ViTs](#Task-3:-Using-the-inductive-bias-of-CNNs-to-support-the-training-of-ViTs)
  - [Task 3.1: Training a teacher model](#Task-3.1:-Training-a-teacher-model)
  - [Task 3.2: Distillation loss](#Task-3.2:-Distillation-loss)
  - [Task 3.3: Changing the ViT architecture to allow distillation](#Task-3.3:-Changing-the-ViT-architecture-to-allow-distillation)
  - [Task 3.4: Train your own DeiT model](#Task-3.4:-Train-your-own-DeiT-model)

## Imports

In [7]:
import torch
from torchvision import datasets, transforms
import os
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import cv2

print(torch.__version__)
# check if CUDA is available
cuda_available = torch.cuda.is_available()

# pick the device and query the GPU name only when CUDA is present
# (torch.cuda.get_device_name(0) raises an error on CPU-only machines)
device = torch.device("cuda" if cuda_available else "cpu")
device_name = torch.cuda.get_device_name(0) if cuda_available else "none"
print(f"cuda available: {cuda_available}, using {device}, GPU name: {device_name}")
use_cuda = cuda_available

# set random seed
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

2.2.2
cuda available: True, using cuda, GPU name: NVIDIA GeForce RTX 3080 Laptop GPU


We can keep the same configuration parameters:

In [None]:
# Training parameters
EPOCHS = 10
WARMUP_EPOCHS = 10
BATCH_SIZE = 128
N_CLASSES = 10
N_WORKERS = 0
LR = 5e-4
OUTPUT_PATH = './outputs'

# Data parameters
DATASET = 'fmnist'
IMAGE_SIZE = 28
PATCH_SIZE = 4
N_CHANNELS = 1
DATA_PATH = './data/'

# ViT parameters
EMBED_DIM = 64
N_ATTENTION_HEADS = 4
FORWARD_MUL = 2
N_LAYERS = 6
DROPOUT = 0.1
MODEL_PATH = './model'
LOAD_MODEL = False

In this miniproject you will use the **same dataset for your experiments** as in the material notebook:

In [None]:
train_transform = transforms.Compose([transforms.Resize([IMAGE_SIZE, IMAGE_SIZE]),
                              transforms.RandomCrop(IMAGE_SIZE, padding=2), 
                              transforms.RandomHorizontalFlip(),
                              transforms.ToTensor(),
                              transforms.Normalize([0.5], [0.5])])
train = datasets.FashionMNIST(os.path.join(DATA_PATH, DATASET), train=True, download=True, transform=train_transform)

test_transform = transforms.Compose([transforms.Resize([IMAGE_SIZE, IMAGE_SIZE]), transforms.ToTensor(), transforms.Normalize([0.5], [0.5])])
test = datasets.FashionMNIST(os.path.join(DATA_PATH, DATASET), train=False, download=True, transform=test_transform)

train_loader = torch.utils.data.DataLoader(dataset=train,
                                             batch_size=BATCH_SIZE,
                                             shuffle=True,
                                             num_workers=N_WORKERS,
                                             drop_last=True)

test_loader = torch.utils.data.DataLoader(dataset=test,
                                            batch_size=BATCH_SIZE,
                                            shuffle=False,
                                            num_workers=N_WORKERS,
                                            drop_last=False)

## Task 1: Inductive bias in transformer models

In the first task of this miniproject you should develop a deeper understanding of why transformer models function well on various data structures and large data domains.

<span style="color:blue">
    Watch the <a href="https://www.youtube.com/watch?v=TrdevFK_am4">video</a> of Yannic Kilcher about the vision transformer <a href="https://arxiv.org/pdf/2010.11929">paper</a> and answer the following questions:
</span>

* What is the biggest weakness of transformer models from a complexity point of view and why can the use of image patches for vision transformers help with that?
<div style="color:lightblue"> </div>

* What is meant by the term "inductive bias" (sometimes he calls it "inductive prior")?
<div style="color:lightblue"> </div>

* What is the interplay between model bias and the amount of available data?
<div style="color:lightblue"> </div>

* If skip connections introduce an inductive bias, why are they needed in the transformer model?
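
As a rough numerical illustration for the first question (not a full answer): self-attention compares every token with every other token, so the attention matrix has one entry per token pair. The helper `attention_entries` below is a hypothetical name introduced only for this sketch; it counts tokens and attention entries per head and layer for the 28x28 images used in this notebook.

```python
def attention_entries(image_size, patch_size, extra_tokens=1):
    """Return (number of tokens, entries of one attention matrix)."""
    n_tokens = (image_size // patch_size) ** 2 + extra_tokens
    return n_tokens, n_tokens ** 2


# Every pixel as its own token (patch size 1, no class token)
pixel_tokens, pixel_entries = attention_entries(28, 1, extra_tokens=0)

# 4x4 patches plus one class token, as configured in this notebook
patch_tokens, patch_entries = attention_entries(28, 4)

print(pixel_tokens, pixel_entries)  # 784 614656
print(patch_tokens, patch_entries)  # 50 2500
```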

# Defining the Model
<div style="color:lightblue">
Before moving on to Task 2, let's define the model and its parameters, and train it.
</div>

## Task 2: Visualizing attention in ViTs

Attention in ViTs tells us which parts of an image are important for other image parts (or for the classification itself). This provides a form of inherent explainability. In this task, you should explore this option and visualize attention maps to understand what a ViT is looking at in the image. As a starting point, **read this excellent [blog post](https://jacobgil.github.io/deeplearning/vision-transformer-explainability) by Jacob Gildenblat** to get an understanding of how attention can be visualized in ViTs. Here are some more **implementations** that might help you with the task:

* https://github.com/mashaan14/VisionTransformer-MNIST/blob/main/VisionTransformer_MNIST.ipynb
* https://github.com/jacobgil/vit-explain/tree/main
* https://github.com/jeonsworld/ViT-pytorch/blob/main/visualize_attention_map.ipynb

### Task 2.1: Modify our ViT implementation to output the attention matrix 
<span style="color:blue">
    To visualize attention for a specific image, the model needs to output not only its prediction, but also the attention matrix. You should adapt the <code>VisionTransformer</code> class that we used in the material notebook in such a way that it outputs the attention matrix <code>x_attention</code>.
</span>

**Hints:**
* The attention map `x_attention` is created inside the `SelfAttention` class
* Since `SelfAttention` is used in the `Encoder` and `VisionTransformer` classes, they need to be adapted as well
* The model should not output the attention maps every time it is called, but only when we need them. Add a `return_attention=False` parameter to the respective `forward()` functions to make this output conditional
* Our model uses 4 attention heads and 6 encoder blocks. Furthermore, the model uses 49 image patches plus 1 class token. This means your attention map should have a shape of `[6,4,50,50]`. 
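
The hints above can be sketched as follows. This is a minimal illustration, not the `SelfAttention` class from the material notebook: the class name `SelfAttentionSketch`, its layer names, and the fused `qkv` projection are assumptions made for this example. It only shows how a `return_attention` flag can expose `x_attention` and how per-layer maps stack to `[6, 4, 50, 50]` for a single image.

```python
import torch
import torch.nn as nn


class SelfAttentionSketch(nn.Module):
    """Multi-head self-attention that can optionally return its attention weights."""

    def __init__(self, embed_dim=64, n_heads=4):
        super().__init__()
        assert embed_dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = embed_dim // n_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # hypothetical fused projection
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, return_attention=False):
        b, s, e = x.shape
        qkv = self.qkv(x).reshape(b, s, 3, self.n_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: [B, heads, S, head_dim]
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        x_attention = scores.softmax(dim=-1)  # [B, heads, S, S]
        out = (x_attention @ v).transpose(1, 2).reshape(b, s, e)
        out = self.out(out)
        if return_attention:
            return out, x_attention
        return out


# Collect one attention map per layer for a single image (50 tokens, 64 dims)
layers = nn.ModuleList([SelfAttentionSketch() for _ in range(6)])
x = torch.randn(1, 50, 64)
maps = []
with torch.no_grad():
    for layer in layers:
        x, attn = layer(x, return_attention=True)
        maps.append(attn[0])  # drop the batch dimension
attention = torch.stack(maps)
print(attention.shape)  # torch.Size([6, 4, 50, 50])
```

In the real model, the `Encoder` and `VisionTransformer` `forward()` functions would need to pass the flag down and hand the collected maps back up in the same way.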

### Task 2.2: Write a function that creates an attention heatmap for a specific image
To visualize the attention values as an image, the matrix must first be transformed. Remember, the attention matrix is of shape `[layers, attention_heads, num_patches+class_token, num_patches+class_token]`, but we want to view it as a grayscale image of size 28x28. Therefore, the function needs the following components:

* <span style="color:blue">Aggregate the attention weights across all heads. Just like in the <a href="https://jacobgil.github.io/deeplearning/vision-transformer-explainability">blog post</a> by Jacob Gildenblat you should implement a <code>mean</code>, <code>min</code>, and <code>max</code> aggregation! </span>
* <span style="color:blue">Again leaning on the idea by <a href="https://jacobgil.github.io/deeplearning/vision-transformer-explainability">Jacob Gildenblat</a>, you should implement a filter to discard attention values below a certain threshold. You can implement this as a function parameter <code>discard_ratio</code>.  </span>
* <span style="color:blue">The function should be able to <b>select specific layers</b>. This lets us later see how different network depths behave.</span>
* <span style="color:blue">The function needs to account for <b>residual connections</b> by adding an identity matrix and then re-normalizing the weights.</span>
* <span style="color:blue">Since we are less interested in the absolute attention values than in the flow and change of attention weights through the network, you need to <b>recursively multiply the attention weight matrices</b> from successive layers to trace how attention flows from the input to the output.</span>
* <span style="color:blue">Finally, you need to reshape the resulting <code>joint_attentions</code> to match the image size.</span>

**Hints:**
* Use the existing implementations that were mentioned before, if you need guidance!
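
One possible shape for such a function is sketched below, loosely following the attention rollout idea from the blog post. Several details are assumptions of this sketch rather than requirements of the task: the function name `attention_rollout`, keeping the class token's self-attention entry when discarding, normalizing the mask by its maximum, and upsampling the 7x7 patch grid to 28x28 with bilinear interpolation.

```python
import math

import torch
import torch.nn.functional as F


def attention_rollout(attention, head_fusion="mean", discard_ratio=0.0,
                      layers=None, image_size=28):
    """attention: [layers, heads, tokens, tokens] for one image -> [image_size, image_size]."""
    if layers is not None:
        attention = attention[layers]  # list of indices or a slice
    n_tokens = attention.shape[-1]
    eye = torch.eye(n_tokens, dtype=attention.dtype, device=attention.device)
    joint_attentions = eye.clone()
    for layer_attn in attention:
        # 1) aggregate the heads
        if head_fusion == "mean":
            fused = layer_attn.mean(dim=0)
        elif head_fusion == "min":
            fused = layer_attn.min(dim=0).values
        elif head_fusion == "max":
            fused = layer_attn.max(dim=0).values
        else:
            raise ValueError(f"unknown head_fusion: {head_fusion}")
        # 2) zero out the lowest attention values
        n_discard = int(fused.numel() * discard_ratio)
        if n_discard > 0:
            flat = fused.flatten()
            _, idx = flat.topk(n_discard, largest=False)
            idx = idx[idx != 0]  # assumption: never drop the class token's own entry
            flat[idx] = 0
            fused = flat.view(n_tokens, n_tokens)
        # 3) residual connection: add identity, then re-normalize each row
        fused = fused + eye
        fused = fused / fused.sum(dim=-1, keepdim=True)
        # 4) multiply with the rollout of the previous layers
        joint_attentions = fused @ joint_attentions
    # 5) attention from the class token (row 0) to the image patches
    mask = joint_attentions[0, 1:]
    grid = math.isqrt(mask.numel())
    mask = mask.reshape(grid, grid)
    mask = mask / mask.max().clamp_min(1e-12)
    mask = F.interpolate(mask[None, None], size=(image_size, image_size),
                         mode="bilinear", align_corners=False)
    return mask[0, 0]


# Usage with random stand-in attention of the expected shape
dummy = torch.rand(6, 4, 50, 50).softmax(dim=-1)
heatmap = attention_rollout(dummy, head_fusion="max", discard_ratio=0.9, layers=[0, 1, 2])
print(heatmap.shape)  # torch.Size([28, 28])
```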

### Task 2.3: Use the function to visualize attention maps for example images
Now you should finally put everything together and visualize some attention maps to understand what the model is looking at in the images!

* <span style="color:blue">Load the weights of the best model from the material notebook <code>ViT_model.pt</code> into your adapted <code>VisionTransformer</code></span>
* <span style="color:blue">Visualize multiple original images as well as their attention maps in a grid. </span>
* <span style="color:blue">Try different <b>aggregation methods</b> for attention heads, <b>discard ratios</b> and <b>layers</b>.  </span>
* <span style="color:blue">Interpret your findings!</span>

If everything works, it should look like this:

<img src="./img/attention_example.png" width=600/>

This specific configuration shows some sensible attention maps. Interestingly, the later layers seem to somewhat invert the attention of the earlier layers. More experiments and interpretations are expected here!
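
For the first step of this task, loading a saved state dict could look roughly like the sketch below. The helper name `load_weights` is hypothetical, and an `nn.Linear` stands in for the adapted `VisionTransformer` so that the example runs on its own with a temporary file. With `strict=True`, loading fails if parameter names in the adapted class no longer match those stored in the checkpoint, which is a quick way to notice accidental renames.

```python
import os
import tempfile

import torch
import torch.nn as nn


def load_weights(model, path, device="cpu"):
    """Load a saved state dict into `model`, move it to `device`, and switch to eval mode."""
    state_dict = torch.load(path, map_location=device)
    model.load_state_dict(state_dict, strict=True)
    return model.to(device).eval()


# Round trip with a stand-in model; in the notebook the path would be ViT_model.pt
source = nn.Linear(4, 2)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ViT_model.pt")
    torch.save(source.state_dict(), path)
    restored = load_weights(nn.Linear(4, 2), path)
```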

## Task 3: Using the inductive bias of CNNs to support the training of ViTs
As task 1 has shown, the lack of an inductive bias can be a blessing and a curse. While such models excel on huge datasets, the missing inductive bias is especially problematic on smaller datasets. In this task you should combine CNN and ViT technology to get *the best of both worlds*.

In the 2021 paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/pdf/2012.12877), Meta AI did exactly that: combining CNN and ViT to achieve a high-performing model with good convergence properties. This was done by using [knowledge distillation](https://en.wikipedia.org/wiki/Knowledge_distillation). A CNN model was trained first and used as a teacher for a ViT student model. With the teacher's guidance, the ViT student achieves better performance with shorter training times. **This is exactly what you should be doing in this task!** It is therefore strongly advised that you read the [original paper](https://arxiv.org/pdf/2012.12877). The [paper on knowledge distillation](https://arxiv.org/pdf/1503.02531) by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean could also help to understand this technology!

Here are some more **implementations** that might help you with the task:

* [DeiT implementation of Francesco Saverio](https://github.com/FrancescoSaverioZuppichini/DeiT)
* [DeiT implementation of Shuqi Huang](https://github.com/jiaowoguanren0615/Deit-Pytorch/blob/main/DeiT/models/deitv1.py)
* [Official DeiT implementation by Meta](https://github.com/facebookresearch/deit/blob/main/models_v2.py)

### Task 3.1: Training a teacher model
To distill knowledge from a strong teacher model, you first need to train such a model!

* <span style="color:blue">Set up a standard CNN model in PyTorch to train on the Fashion-MNIST dataset.</span>
* <span style="color:blue">You can use the CNN from the introduction to CNN models. It should already have the correct layer formats, since we originally used it on the MNIST dataset.</span>
* <span style="color:blue">Use any training loop to train and evaluate the CNN model. How does it compare to the ViT from the material notebook? <i>Hint: it needs to be better to actually work as a teacher model!</i> </span>
* <span style="color:blue">Save the model weights for later use in <code>teacher_model.pt</code></span>
* <span style="color:blue">Interpret your findings!</span>
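
As an illustration of the kind of teacher meant here, the following sketch defines a small CNN for 1x28x28 inputs plus an accuracy helper. It is not the CNN from the introduction notebook: the class name `TeacherCNN`, the layer sizes, and the `evaluate` helper are assumptions chosen for this example.

```python
import torch
import torch.nn as nn


class TeacherCNN(nn.Module):
    """Small CNN for 1x28x28 images (hypothetical architecture)."""

    def __init__(self, n_channels=1, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


def evaluate(model, loader, device):
    """Return the accuracy of `model` over all batches in `loader`."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total


# Saving the trained weights afterwards, as requested in the task:
# torch.save(teacher.state_dict(), "teacher_model.pt")
```

Calling `evaluate(teacher, test_loader, device)` after training gives a number that can be compared directly with the ViT accuracy from the material notebook.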

### Task 3.2: Distillation loss
To use knowledge distillation you need to use a new loss function. This function should combine the student and teacher loss.

* <span style="color:blue">Implement a <code>HardDistillationLoss</code> function, which combines the student and teacher loss and weighs them with 0.5 each.</span>
* <span style="color:blue">The function should use "hard" labels (see the <a href="https://arxiv.org/pdf/2012.12877">paper</a> for hints).</span>
* <span style="color:blue">Use <code>CrossEntropyLoss</code>.</span>

**Hints:**
* Use the existing implementations that were mentioned before, if you need guidance!
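
A minimal sketch of such a loss is shown below, assuming the student returns a pair `(cls_logits, dist_logits)` during training. The constructor and `forward` signatures are choices made for this example; the referenced implementations differ in detail. "Hard" here means the teacher's `argmax` prediction is used as the target for the distillation output instead of its soft probabilities.

```python
import torch
import torch.nn as nn


class HardDistillationLoss(nn.Module):
    """0.5 * CE(class output, true labels) + 0.5 * CE(distillation output, teacher labels)."""

    def __init__(self, teacher):
        super().__init__()
        self.teacher = teacher
        for p in self.teacher.parameters():
            p.requires_grad_(False)  # the teacher stays frozen
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, inputs, outputs, labels):
        cls_logits, dist_logits = outputs
        self.teacher.eval()
        with torch.no_grad():
            # hard labels: the class index the teacher predicts
            teacher_labels = self.teacher(inputs).argmax(dim=1)
        student_loss = self.criterion(cls_logits, labels)
        distill_loss = self.criterion(dist_logits, teacher_labels)
        return 0.5 * student_loss + 0.5 * distill_loss
```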

### Task 3.3: Changing the ViT architecture to allow distillation
The distillation procedure in the [paper](https://arxiv.org/pdf/2012.12877) works by introducing a distillation token, which plays the same role as the class token, except that it aims at reproducing the label estimated by the teacher. Both tokens interact in the transformer through attention. In this task you should change the ViT architecture to integrate this token!


* <span style="color:blue">Add a distillation token to the <code>EmbedLayer</code>. This works the same way as the classification token. You can also initialize it as a <code>nn.Parameter</code> with <code>torch.zeros</code>. </span>
* <span style="color:blue">Think about what needs to be changed for the <code>pos_embedding</code>!</span>
* <span style="color:blue">Change the <code>Classifier</code> class to use the classification token <i>and</i> distillation token for prediction. <b>Add a linear layer</b> to project the distillation token (you don't need two layers and an activation function here; a single linear layer is enough). During training, the <code>Classifier</code> should output the classification <b>and</b> the distillation projection, which will be used by the loss function from task 3.2. During inference, both outputs should be averaged. </span>
* <span style="color:blue">Name your new model <code>MyDeiT</code>. </span>

**Hints:**
* Use the existing implementations that were mentioned before, if you need guidance!
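
The two changes can be sketched in isolation as follows. The class names `EmbedLayerSketch` and `ClassifierSketch`, the convolutional patch embedding, and the two-layer `Tanh` class head are assumptions of this example and may not match the material notebook. The points it illustrates are the extra token, the positional embedding growing from 50 to 51 entries, and the train/eval switch in the classifier.

```python
import torch
import torch.nn as nn


class EmbedLayerSketch(nn.Module):
    """Patch embedding with a class token and a distillation token."""

    def __init__(self, n_channels=1, embed_dim=64, image_size=28, patch_size=4):
        super().__init__()
        n_patches = (image_size // patch_size) ** 2  # 49 for 28x28 images, 4x4 patches
        self.conv = nn.Conv2d(n_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # two extra positions: class token and distillation token
        self.pos_embedding = nn.Parameter(torch.zeros(1, n_patches + 2, embed_dim))

    def forward(self, x):
        b = x.shape[0]
        x = self.conv(x).flatten(2).transpose(1, 2)  # [B, 49, embed_dim]
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        x = torch.cat([cls, dist, x], dim=1)  # [B, 51, embed_dim]
        return x + self.pos_embedding


class ClassifierSketch(nn.Module):
    """Class head on token 0, single linear distillation head on token 1."""

    def __init__(self, embed_dim=64, n_classes=10):
        super().__init__()
        self.cls_head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.Tanh(), nn.Linear(embed_dim, n_classes)
        )
        self.dist_head = nn.Linear(embed_dim, n_classes)

    def forward(self, x):
        cls_out = self.cls_head(x[:, 0])
        dist_out = self.dist_head(x[:, 1])
        if self.training:
            return cls_out, dist_out  # both go into the distillation loss
        return (cls_out + dist_out) / 2  # averaged prediction at inference
```

Note that with two leading tokens, any code that slices off the class token to recover the patch grid (for example for attention maps) has to skip two positions instead of one.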

### Task 3.4: Train your own DeiT model

* <span style="color:blue">Initialize a student model based on your new <code>MyDeiT</code> class. Use the same parameter configuration as before. </span>
* <span style="color:blue">Train the student model with the <code>HardDistillationLoss</code> you built in task 3.2!</span>
* <span style="color:blue">Your training loop might need to account for the second output (the distillation token) when making batch predictions during <code>train()</code> mode. </span>
* <span style="color:blue">Interpret your results!</span>
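
A training epoch that handles the two outputs might look like the sketch below. To keep it runnable on its own, `TinyStudent` is a stand-in for `MyDeiT`, the teacher is a single linear layer, and the hard distillation loss is written inline; all of these, as well as the function name `train_one_epoch`, are assumptions of this example.

```python
import torch
import torch.nn as nn


class TinyStudent(nn.Module):
    """Stand-in for MyDeiT: returns (cls_logits, dist_logits) in train mode."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 32), nn.ReLU())
        self.cls_head = nn.Linear(32, n_classes)
        self.dist_head = nn.Linear(32, n_classes)

    def forward(self, x):
        h = self.backbone(x)
        cls_out, dist_out = self.cls_head(h), self.dist_head(h)
        if self.training:
            return cls_out, dist_out
        return (cls_out + dist_out) / 2


def train_one_epoch(student, teacher, loader, optimizer, device):
    """Run one epoch with hard distillation; return (mean loss, train accuracy)."""
    student.train()
    teacher.eval()
    ce = nn.CrossEntropyLoss()
    total_loss, correct, seen = 0.0, 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        cls_logits, dist_logits = student(x)  # two outputs in train() mode
        with torch.no_grad():
            teacher_labels = teacher(x).argmax(dim=1)
        loss = 0.5 * ce(cls_logits, y) + 0.5 * ce(dist_logits, teacher_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # batch predictions: average both heads, as at inference time
        preds = ((cls_logits + dist_logits) / 2).argmax(dim=1)
        total_loss += loss.item() * y.numel()
        correct += (preds == y).sum().item()
        seen += y.numel()
    return total_loss / seen, correct / seen


# Usage on random stand-in batches
student = TinyStudent()
teacher = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-4)
batches = [(torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))) for _ in range(3)]
epoch_loss, epoch_acc = train_one_epoch(student, teacher, batches, optimizer, "cpu")
```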