# Artificial Neural Networks and Deep Learning  
## Assignment 3.3 - Self-attention and Transformers

Prof. Dr. Ir. Johan A. K. Suykens     

In this file, we first understand the self-attention mechanism by implementing it both with ``NumPy`` and ``PyTorch``.
Then, we implement a 6-layer Vision Transformer (ViT) and train it on the MNIST dataset.

All training will be conducted on a single T4 GPU.


In [1]:
# Please first load your google drive
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
# Please go to Edit > Notebook settings > Hardware accelerator > choose "T4 GPU"
# Now check if you have loaded the GPU successfully
!nvidia-smi

Tue Jul 29 14:10:09 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   75C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# Self-attention Mechanism
Self-attention is the core mechanism in Transformer.

## Self-attention with NumPy
To have a better understanding of it, we first manually implement self-attention mechanism with ``numpy``. You can check the dimension of each variable during the matrix computation.

Feel free to change the dimensions of each variable and see how the output dimension will change accordingly.

In [3]:
import math
import numpy as np
from numpy.random import randn

# I. Define the input data X
# X consists out of 32 samples, each sample has dimensionality 256
n = 32   # no. of samples
d = 256  # dimensionality
X = randn(n, d) # (32, 256)

# II. Generate the projection weights
Wq = randn(d, d) # (256, 256)
Wk = randn(d, d)
Wv = randn(d, d)

# III. Project X to find its query, keys and values vectors
Q = np.dot(X, Wq) # (32, 256)
K = np.dot(X, Wk)
V = np.dot(X, Wv)

# IV. Compute the self-attention score, denoted by A
# A = softmax(QK^T / \sqrt{d})
# Define the softmax function
def softmax(z):
    z = np.clip(z, 100, -100) # clip in case softmax explodes
    tmp = np.exp(z)
    res = np.exp(z) / np.sum(tmp, axis=1)
    return res

A = softmax(np.dot(Q, K.transpose())/math.sqrt(d)) # (32, 32)

# V. Compute the self-attention output
# outputs = A * V
outputs = np.dot(A, V) # (32, 256)

print("The attention outputs are\n {}".format(outputs))

The attention outputs are
 [[ 7.15412073 -0.88118061  3.15121294 ... -0.35169155 -4.33388862
  -1.31002483]
 [ 7.15412073 -0.88118061  3.15121294 ... -0.35169155 -4.33388862
  -1.31002483]
 [ 7.15412073 -0.88118061  3.15121294 ... -0.35169155 -4.33388862
  -1.31002483]
 ...
 [ 7.15412073 -0.88118061  3.15121294 ... -0.35169155 -4.33388862
  -1.31002483]
 [ 7.15412073 -0.88118061  3.15121294 ... -0.35169155 -4.33388862
  -1.31002483]
 [ 7.15412073 -0.88118061  3.15121294 ... -0.35169155 -4.33388862
  -1.31002483]]


## Self-attention with PyTorch
Now, we implement self-attention with ``PyTorch``, which is commonly used when building Transformers.

Feel free to change the dimensions of each variable and see how the output dimension will change accordingly.

In [4]:
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim_input, dim_q, dim_v):
        '''
        dim_input: the dimension of each sample
        dim_q: dimension of Q matrix, should be equal to dim_k
        dim_v: dimension of V matrix, also the  dimension of the attention output
        '''
        super(SelfAttention, self).__init__()

        self.dim_input = dim_input
        self.dim_q = dim_q
        self.dim_k = dim_q
        self.dim_v = dim_v

        # Define the linear projection
        self.linear_q = nn.Linear(self.dim_input, self.dim_q, bias=False)
        self.linear_k = nn.Linear(self.dim_input, self.dim_k, bias=False)
        self.linear_v = nn.Linear(self.dim_input, self.dim_v, bias=False)
        self._norm_fact = 1 / math.sqrt(self.dim_k)

    def forward(self, x):
        batch, n, dim_q = x.shape

        q = self.linear_q(x) # (batchsize, seq_len, dim_q)
        k = self.linear_k(x) # (batchsize, seq_len, dim_k)
        v = self.linear_v(x) # (batchsize, seq_len, dim_v)
        print(f'x.shape:{x.shape} \n Q.shape:{q.shape} \n K.shape:{k.shape} \n V.shape:{v.shape}')

        dist = torch.bmm(q, k.transpose(1,2)) * self._norm_fact
        dist = torch.softmax(dist, dim=-1)
        print('attention matrix: ', dist.shape)

        outputs = torch.bmm(dist, v)
        print('attention outputs: ', outputs.shape)

        return outputs


batch_size = 32 # number of samples in a batch
dim_input = 128 # dimension of each item in the sample sequence
seq_len = 20 # sequence length for each sample
x = torch.randn(batch_size, seq_len, dim_input)
self_attention = SelfAttention(dim_input, dim_q=64, dim_v=32)

attention = self_attention(x)

print(attention)

x.shape:torch.Size([32, 20, 128]) 
 Q.shape:torch.Size([32, 20, 64]) 
 K.shape:torch.Size([32, 20, 64]) 
 V.shape:torch.Size([32, 20, 32])
attention matrix:  torch.Size([32, 20, 20])
attention outputs:  torch.Size([32, 20, 32])
tensor([[[-0.0522, -0.0343, -0.0016,  ...,  0.1116,  0.2659,  0.3607],
         [-0.1077, -0.1497, -0.0007,  ...,  0.1763,  0.2828,  0.3807],
         [-0.0022,  0.1339,  0.0745,  ...,  0.0398,  0.2234,  0.3782],
         ...,
         [-0.0061, -0.0247, -0.0637,  ...,  0.1624,  0.2627,  0.3511],
         [ 0.0633,  0.0740,  0.0030,  ...,  0.0613,  0.2983,  0.3456],
         [-0.0095, -0.0577, -0.0445,  ...,  0.1537,  0.2766,  0.3529]],

        [[-0.0813,  0.0984,  0.0846,  ..., -0.0186, -0.0353, -0.0045],
         [-0.1423,  0.0384,  0.0575,  ..., -0.0717, -0.1099, -0.0075],
         [-0.0491,  0.0757,  0.0706,  ..., -0.0022, -0.1252,  0.0119],
         ...,
         [-0.0893,  0.0734,  0.0769,  ..., -0.0244,  0.0077,  0.0065],
         [-0.0583, -0.0124,  0.1

# Transformers
In this section, we implement a 6-layer Vision Transformer (ViT) and trained it on the MNIST dataset.
We consider the classification tasks.
First, we load the MNIST dataset as follows:

In [5]:
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import torchvision
from torchvision import datasets, utils
from torchvision.datasets import MNIST

def get_mnist_loader(batch_size=100, shuffle=True):
    """

    :return: train_loader, test_loader
    """
    train_dataset = MNIST(root='../data',
                          train=True,
                          transform=torchvision.transforms.ToTensor(),
                          download=True)
    test_dataset = MNIST(root='../data',
                         train=False,
                         transform=torchvision.transforms.ToTensor(),
                         download=True)

    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=batch_size,
                                               shuffle=shuffle)
    test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                              batch_size=batch_size,
                                              shuffle=False)
    return train_loader, test_loader

In [6]:
# This package is needed to build the transformer
!pip install einops



## Build ViT from scratch
Recall that each Transformer block include 2 modules: the self-attention module, the feedforward module.

In [7]:
from einops import rearrange
import torch.nn.functional as F

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(x, **kwargs) + x

class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(), # Gaussian Error Linear Units is another type of activation function
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    def __init__(self, dim, heads=8, dropout = 0.):
        super().__init__()
        self.heads = heads
        self.scale = dim ** -0.5

        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Sequential(
            nn.Linear(dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x, mask = None):
        b, n, _, h = *x.shape, self.heads
        qkv = self.to_qkv(x)
        q, k, v = rearrange(qkv, 'b n (qkv h d) -> qkv b h n d', qkv=3, h=h)

        dots = torch.einsum('bhid,bhjd->bhij', q, k) * self.scale

        if mask is not None:
            mask = F.pad(mask.flatten(1), (1, 0), value = True)
            assert mask.shape[-1] == dots.shape[-1], 'mask has incorrect dimensions'
            mask = mask[:, None, :] * mask[:, :, None]
            dots.masked_fill_(~mask, float('-inf'))
            del mask

        attn = dots.softmax(dim=-1)

        out = torch.einsum('bhij,bhjd->bhid', attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        out =  self.to_out(out)
        return out

class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, mlp_dim, dropout = 0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                Residual(PreNorm(dim, Attention(dim, heads = heads, dropout = dropout))),
                Residual(PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout)))
            ]))

    def forward(self, x, mask=None):
        for attn, ff in self.layers:
            x = attn(x, mask=mask)
            x = ff(x)
        return x

class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, channels=3, dropout = 0.):
        super().__init__()
        assert image_size % patch_size == 0, 'image dimensions must be divisible by the patch size'
        num_patches = (image_size // patch_size) ** 2
        patch_dim = channels * patch_size ** 2

        self.patch_size = patch_size

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.patch_to_embedding = nn.Linear(patch_dim, dim)
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.transformer = Transformer(dim, depth, heads, mlp_dim, dropout)

        self.to_cls_token = nn.Identity()

        self.mlp_head = nn.Sequential(
            nn.Linear(dim, mlp_dim),
            nn.GELU(), # Gaussian Error Linear Units is another type of activation function
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, num_classes)
        )

    def forward(self, img, mask=None):
        p = self.patch_size

        x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)
        x = self.patch_to_embedding(x)

        cls_tokens = self.cls_token.expand(img.shape[0], -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding
        x = self.transformer(x, mask)

        x = self.to_cls_token(x[:, 0])
        return self.mlp_head(x)

## Training and test function


In [8]:
import torch.nn.functional as F

def train_epoch(model, optimizer, data_loader, loss_history):
    total_samples = len(data_loader.dataset)
    model.train()

    correct_samples = 0
    total_samples = 0
    last_loss = None

    for i, (data, target) in enumerate(data_loader):
        data = data.cuda()
        target = target.cuda()
        optimizer.zero_grad()
        output = F.log_softmax(model(data), dim=1)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        last_loss = loss.item()
        # Accuracy tracking
        _, pred = torch.max(output, dim=1)
        correct_samples += pred.eq(target).sum().item()
        total_samples += target.size(0)

        if i % 100 == 0:
            print('[' + '{:5}'.format(i * batch_size) + '/' + '{:5}'.format(total_samples) +
                  ' (' + '{:3.0f}'.format(100 * i / len(data_loader)) + '%)]  Train Loss: ' +
                  '{:6.4f}'.format(loss.item()) + '  Train Accuracy: ' +
                  '{}/{} ({:4.2f}%)'.format(correct_samples, total_samples, 100.0 * correct_samples / total_samples))
            loss_history.append(loss.item())

    # Final 100%
    print('[' + '{:5}'.format(total_samples) + '/' + '{:5}'.format(total_samples) +
          ' (100%)]  Train Loss: ' + '{:6.4f}'.format(last_loss) + '  Train Accuracy: ' +
          '{}/{} ({:4.2f}%)'.format(correct_samples, total_samples, 100.0 * correct_samples / total_samples))
    loss_history.append(last_loss)

In [9]:
def evaluate(model, data_loader, loss_history):
    model.eval()

    total_samples = len(data_loader.dataset)
    correct_samples = 0
    total_loss = 0

    # We do not need to remember the gradients when testing
    # This will help reduce memory
    with torch.no_grad():
        for data, target in data_loader:
            data = data.cuda()
            target = target.cuda()
            output = F.log_softmax(model(data), dim=1)
            loss = F.nll_loss(output, target, reduction='sum')
            _, pred = torch.max(output, dim=1)

            total_loss += loss.item()
            correct_samples += pred.eq(target).sum()

    avg_loss = total_loss / total_samples
    loss_history.append(avg_loss)
    print('\nAverage test loss: ' + '{:.4f}'.format(avg_loss) +
          '  Accuracy:' + '{:5}'.format(correct_samples) + '/' +
          '{:5}'.format(total_samples) + ' (' +
          '{:4.2f}'.format(100.0 * correct_samples / total_samples) + '%)\n')

## Let's start training!
Here, you can change the ViT structure by changing the hyper-parametrs inside ``ViT`` function.
The default settings are with 6 layers, 8 heads for the multi-head attention mechanism and embedding dimension of 64.
You can also increase the number of epochs to obtain better results.

In [10]:
import time

N_EPOCHS = 20

# You can change the architecture here
model = ViT(image_size=28, patch_size=7, num_classes=10, channels=1,
            dim=64, depth=6, heads=8, mlp_dim=128)
model = model.cuda()
print(model)

train_loader, test_loader = get_mnist_loader(batch_size=128, shuffle=True)
train_loss_history, test_loss_history = [], []

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Gradually reduce the learning rate while training
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

start_time = time.time()
for epoch in range(1, N_EPOCHS + 1):
    print('Epoch:', epoch,'LR:', scheduler.get_last_lr())
    train_epoch(model, optimizer, train_loader, train_loss_history)
    evaluate(model, test_loader, test_loss_history)
    scheduler.step()

print('Execution time:', '{:5.2f}'.format(time.time() - start_time), 'seconds')

ViT(
  (patch_to_embedding): Linear(in_features=49, out_features=64, bias=True)
  (transformer): Transformer(
    (layers): ModuleList(
      (0-5): 6 x ModuleList(
        (0): Residual(
          (fn): PreNorm(
            (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
            (fn): Attention(
              (to_qkv): Linear(in_features=64, out_features=192, bias=False)
              (to_out): Sequential(
                (0): Linear(in_features=64, out_features=64, bias=True)
                (1): Dropout(p=0.0, inplace=False)
              )
            )
          )
        )
        (1): Residual(
          (fn): PreNorm(
            (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
            (fn): FeedForward(
              (net): Sequential(
                (0): Linear(in_features=64, out_features=128, bias=True)
                (1): GELU(approximate='none')
                (2): Dropout(p=0.0, inplace=False)
                (3): Linear(in_features=12

In [11]:
import time
import torch
import numpy as np
from itertools import product

# Define hyperparameter grid
search_grid = {
    'dim': [32, 64, 128],
    'depth': [4, 6, 8],
    'heads': [4, 8],
    'mlp_dim': [64, 128, 256],
    'dropout': [0.0, 0.1, 0.2]
}

# Early stopping config
PATIENCE = 5
N_EPOCHS = 50

# Data loaders
train_loader, val_loader = get_mnist_loader(batch_size=128, shuffle=True)

best_val_loss = float('inf')
best_config = None

param_names = list(search_grid.keys())
default_config = {'dim': 64, 'depth': 6, 'heads': 8, 'mlp_dim': 128, 'dropout': 0.0}

# Iterate over each parameter separately
for param in param_names:
    for value in search_grid[param]:
        if value == default_config[param] and param != "dim":
            continue  # Skip default value; it will be tested in other iterations
        config = default_config.copy()
        config[param] = value

        dim = config['dim']
        depth = config['depth']
        heads = config['heads']
        mlp_dim = config['mlp_dim']
        dropout = config['dropout']

        print(f"\n\n=== Training ViT with dim={dim}, depth={depth}, heads={heads}, mlp_dim={mlp_dim}, dropout={dropout} ===")

        model = ViT(image_size=28, patch_size=7, num_classes=10, channels=1,
                    dim=dim, depth=depth, heads=heads, mlp_dim=mlp_dim, dropout=dropout)
        model = model.cuda()

        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
        scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

        train_loss_history = []
        val_loss_history = []

        start_time = time.time()
        patience_counter = 0
        best_epoch_train_loss = float('inf')
        best_epoch_val_loss = float('inf')

        for epoch in range(1, N_EPOCHS + 1):
            print(f"Epoch {epoch} | LR: {scheduler.get_last_lr()[0]:.5f}")

            train_epoch(model, optimizer, train_loader, train_loss_history)  # This appends to train_loss_history
            evaluate(model, val_loader, val_loss_history)  # This appends to val_loss_history
            scheduler.step()

            current_train_loss = train_loss_history[-1]  # Get the latest appended loss
            current_val_loss = val_loss_history[-1]  # Get the latest appended loss

            if current_val_loss < best_epoch_val_loss:
                best_epoch_train_loss = current_train_loss
                best_epoch_val_loss = current_val_loss
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= PATIENCE:
                    print("Early stopping triggered.")
                    break

        total_time = time.time() - start_time
        print(f"Training loss: {best_epoch_train_loss:.4f} | Validation loss: {best_epoch_val_loss:.4f} | Time: {total_time:.2f}s")

        if best_epoch_val_loss < best_val_loss:
            best_val_loss = best_epoch_val_loss
            best_config = {
                'dim': dim,
                'depth': depth,
                'heads': heads,
                'mlp_dim': mlp_dim,
                'dropout': dropout
            }

print("\n=== Best Configuration ===")
print(best_config)
print(f"Best validation loss: {best_val_loss:.4f}")



=== Training ViT with dim=32, depth=6, heads=8, mlp_dim=128, dropout=0.0 ===
Epoch 1 | LR: 0.00100

Average test loss: 0.3652  Accuracy: 8829/10000 (88.29%)

Epoch 2 | LR: 0.00095

Average test loss: 0.2283  Accuracy: 9272/10000 (92.72%)

Epoch 3 | LR: 0.00090

Average test loss: 0.1625  Accuracy: 9504/10000 (95.04%)

Epoch 4 | LR: 0.00086

Average test loss: 0.1263  Accuracy: 9611/10000 (96.11%)

Epoch 5 | LR: 0.00081

Average test loss: 0.1131  Accuracy: 9653/10000 (96.53%)

Epoch 6 | LR: 0.00077

Average test loss: 0.1078  Accuracy: 9655/10000 (96.55%)

Epoch 7 | LR: 0.00074

Average test loss: 0.1040  Accuracy: 9673/10000 (96.73%)

Epoch 8 | LR: 0.00070

Average test loss: 0.1057  Accuracy: 9673/10000 (96.73%)

Epoch 9 | LR: 0.00066

Average test loss: 0.0779  Accuracy: 9766/10000 (97.66%)

Epoch 10 | LR: 0.00063

Average test loss: 0.0829  Accuracy: 9739/10000 (97.39%)

Epoch 11 | LR: 0.00060

Average test loss: 0.1204  Accuracy: 9620/10000 (96.20%)

Epoch 12 | LR: 0.00057

Aver