# M2177.004300 Deep Learning Assignment #2<br> Part 2. Training Vision Transformers (PyTorch)

Copyright (C) Data Science & AI Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by Youngwoo Kimh, October 2025

**To understand this assignment, please read the provided PDF file carefully.**

Now, you're going to leave behind your own implementations and migrate to one of the popular deep learning frameworks, **PyTorch**. <br>
In this notebook, you will learn to understand and build the basic components of the Vision Transformer (ViT). Then, you will classify images in the FashionMNIST dataset and explore the effects of different ViT components.
<br>
There are **2 sections**, and in each section, you need to follow the instructions to complete the skeleton code and explain it.

**Note**: certain details are missing or ambiguous on purpose, in order to test your knowledge of the related materials. However, if you really feel that something essential is missing and you cannot proceed to the next step, contact the teaching staff with a clear description of your problem.

### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.

### Some helpful tutorials and references for assignment #2-2:
- [1] PyTorch official documentation. [[link]](https://pytorch.org/docs/stable/index.html)
- [2] Stanford CS231n lectures. [[link]](http://cs231n.stanford.edu/)
- [3] Alexey Dosovitskiy et al., "An Image is Worth 16 x 16 Words: Transformers for Image Recognition at Scale", ICLR 2021. [[pdf]](https://arxiv.org/pdf/2010.11929.pdf)

## 1. Building Vision Transformer
Here, you will build the basic components of the Vision Transformer (ViT). <br>

![Vision Transformer](imgs/ViT.png)

Using the explanation and code provided as guidance, <br>
define each component of ViT. <br>


#### ViT architecture:
* The ViT model consists of an input patch embedding, positional embeddings, a transformer encoder, etc.
* Patch embedding
* Positional embeddings
* Transformer encoder with
    * Attention module
    * MLP module

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%pwd # check the current path

'/content'

In [3]:
# Move to the Assignment2 directory
%cd /content/drive/MyDrive/Assignment2

/content/drive/MyDrive/Assignment2


In [4]:
import numpy as np
import torch
import torch.nn as nn
from torchvision import transforms
from torch.optim import AdamW

seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
g = torch.Generator()
g.manual_seed(seed)

<torch._C.Generator at 0x7bb2a7f8fe30>

##### Patch Embed

**Initialization**: When you create an instance of the PatchEmbed class, you specify img_size, patch_size, in_chans, and embed_dim. img_size is the height and width of the input image, patch_size is the size of each patch, in_chans is the number of input image channels (e.g., 3 for RGB images), and embed_dim is the dimension of each output token.

**Convolutional Projection**: Inside the PatchEmbed class, a 2D convolutional layer (nn.Conv2d) is used to perform a patch-based projection. This convolutional layer has a kernel size of patch_size, which defines the size of each patch, and a stride of patch_size, which ensures that patches do not overlap. The convolutional layer effectively extracts image patches and projects each one to an embedding vector.

**Reshaping**: After the convolutional projection, the output tensor is reshaped (for example with view, or with flatten followed by transpose). It is transformed from a 4D tensor with dimensions (batch_size, patch_dim, H / patch_size, W / patch_size) to a 3D tensor with dimensions (batch_size, num_patches, patch_dim). num_patches is the total number of non-overlapping patches in the image, and patch_dim is the number of output channels from the convolutional layer (embed_dim in the code below).
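
As a minimal sketch of the plain, non-overlapping patch projection described above: the class name `SimplePatchEmbed` is hypothetical, and the sizes (28x28 single-channel input, patch size 4, embedding width 64) are illustrative assumptions rather than required settings.

```python
import torch
import torch.nn as nn


class SimplePatchEmbed(nn.Module):
    """Hypothetical minimal patch embedding: one conv with kernel = stride = patch size."""

    def __init__(self, img_size=28, patch_size=4, in_chans=1, embed_dim=64):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)


embed = SimplePatchEmbed()
tokens = embed(torch.randn(2, 1, 28, 28))
print(tokens.shape)  # torch.Size([2, 49, 64])
```

With these assumed sizes, a 28x28 image is cut into a 7x7 grid, so each image becomes 49 tokens of width 64.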

In [5]:
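class PatchEmbed(nn.Module):
    """ConvStem patch embedding for small grayscale images.

    The stem downsamples by a fixed factor of 4 (7x7 tokens at 28x28 input,
    8x8 at 32x32); patch_size is stored but does not affect the token grid.
    """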
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.img_size = img_size
        self.patch_size = patch_size

        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        # Convolutional stem: three 3x3 convs; only the second one (stride 2) downsamples
        c1 = embed_dim // 4
        c2 = embed_dim // 2
        self.stem = nn.Sequential(
            nn.Conv2d(in_chans, c1, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(c1), nn.GELU(),
            nn.Conv2d(c1, c2, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c2), nn.GELU(),
            nn.Conv2d(c2, embed_dim, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(embed_dim), nn.GELU(),
        )

        # 2x2 projection with stride 2 halves the resolution again (total stride 4)
        self.proj = nn.Conv2d(embed_dim, embed_dim, kernel_size=2, stride=2)

        # Token grid after the total stride of 4 (e.g. 28 // 4 = 7 per side)
        out_side = img_size // 4
        self.grid_size = (out_side, out_side)
        self.num_patches = out_side * out_side
        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################
    def forward(self, x):
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        x = self.stem(x)
        x = self.proj(x)
        x = x.flatten(2).transpose(1, 2)
        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################
        return x
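
The token count produced by the convolutional stem above can be checked with the standard convolution output-size formula. This is a small sketch in plain Python; the helper names `conv_out` and `stem_tokens` are hypothetical, and the kernel, stride, and padding values are copied from the layers defined in the class.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_tokens(img_size):
    """Token grid side after the three stem convs and the final projection."""
    s = conv_out(img_size, kernel=3, stride=1, padding=1)
    s = conv_out(s, kernel=3, stride=2, padding=1)
    s = conv_out(s, kernel=3, stride=1, padding=1)
    s = conv_out(s, kernel=2, stride=2, padding=0)
    return s


side_28 = stem_tokens(28)
side_32 = stem_tokens(32)
print(side_28, side_28 ** 2)  # 7 49
print(side_32, side_32 ** 2)  # 8 64
```

For these two input sizes the result agrees with `img_size // 4`, and it does not depend on the `patch_size` argument, since the total stride of the stem is fixed.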


##### Attention

**Initialization**
* dim: The input dimension of the sequence. This is the dimensionality of the queries, keys, and values.
* num_heads: The number of attention heads to use. Multi-head attention allows the model to focus on different parts of the input simultaneously.

**Linear Projections (qkv and proj)**: The qkv linear layer takes the input sequence and projects it into three parts: queries (q), keys (k), and values (v). The output of this layer has a shape of (batch_size, sequence_length, 3 * dim).

**Forward Pass (forward method)**: In the forward pass, the input tensor x is processed through the attention mechanism. Here's what happens:<br>
* The linear projection qkv is applied to x, producing a tensor of shape (batch_size, sequence_length, 3 * dim).
* This tensor is reshaped to have dimensions (batch_size, sequence_length, 3, num_heads, head_dim). The permute operation rearranges the dimensions to (3, batch_size, num_heads, sequence_length, head_dim), making it suitable for multi-head attention.
* The three parts, q, k, and v, are extracted from the reshaped tensor.
* The attention scores are computed by taking the dot product of queries q and keys k. The result is scaled by self.scale.
* The attention scores are passed through a softmax activation along the last dimension (sequence_length), producing attention weights.
* The weighted sum of values v is computed using the attention weights.
* The result is transposed and reshaped to its original shape, and then passed through the proj linear layer.
* The final output is returned.
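
The steps above can be traced on a small random tensor to see the shapes at each stage. This is only a sketch: the sizes are illustrative assumptions, and the output projection and dropout are left out for brevity.

```python
import torch

torch.manual_seed(0)
B, N, C, num_heads = 2, 5, 16, 4   # illustrative sizes only
head_dim = C // num_heads

x = torch.randn(B, N, C)
qkv_proj = torch.nn.Linear(C, 3 * C)

qkv = qkv_proj(x)                                     # (B, N, 3*C)
qkv = qkv.reshape(B, N, 3, num_heads, head_dim)       # (B, N, 3, heads, head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4)                      # (3, B, heads, N, head_dim)
q, k, v = qkv[0], qkv[1], qkv[2]

attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5   # (B, heads, N, N)
attn = attn.softmax(dim=-1)                           # each row sums to 1
out = (attn @ v).transpose(1, 2).reshape(B, N, C)     # back to (B, N, C)

print(attn.shape, out.shape)
```

Each head attends over all N tokens independently, and the final reshape concatenates the per-head outputs back into a vector of width C.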

In [6]:
class Attention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        self.attn_drop = nn.Dropout(0.08)  # added dropout
        self.proj_drop = nn.Dropout(0.08)

    def forward(self, x):
        B, N, C = x.shape
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        qkv = self.qkv(x)
        qkv = qkv.reshape(B, N, 3, self.num_heads, C // self.num_heads)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)

        return x # output dimension must be: (batch size, number of patches, embed_dim)

##### MLP

The MLP module must consist of three layers:
* fully connected layer 1
* activation layer
* fully connected layer 2

In [7]:
class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features

        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(0.1)  # added dropout
        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################

    def forward(self, x):
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################
        return x # output dimension must be: (batch size, number of patches, out_features)

##### Transformer Block
The transformer block contains the attention module and the MLP module, each wrapped in a residual connection.
Refer to the following image and build the forward pass.

![Transformer Block](imgs/TransformerBlock.png)

In [8]:
class Block(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4., act_layer=nn.GELU, norm_layer=nn.LayerNorm):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(dim, num_heads=num_heads)
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim,
                       act_layer=act_layer)

    def forward(self, x):
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################
        return x


##### Vision Transformer

Using all the components that you built above, **complete** the vision transformer class.

In [9]:
class VisionTransformer(nn.Module):
    """ Vision Transformer """

    def __init__(self, img_size=28, patch_size=4, in_chans=1, num_classes=10, embed_dim=768, depth=12,
                 num_heads=12, mlp_ratio=4., norm_layer=nn.LayerNorm, ):
        super().__init__()
        self.num_features = self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.depth = depth

        self.patch_embed = PatchEmbed(
            img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
        num_patches = self.patch_embed.num_patches

        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        # similarly to cls_token, define a learnable positional embedding that matches the patchified input token size.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################

        self.blocks = nn.ModuleList([
            Block(
                dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio,  norm_layer=norm_layer)
            for i in range(depth)])
        self.norm = norm_layer(embed_dim)

        # Classifier head
        self.head = nn.Linear(
            embed_dim, num_classes) if num_classes > 0 else nn.Identity()

    def forward(self, x):
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        B = x.shape[0]

        # Patch Embedding
        x = self.patch_embed(x)

        # Concatenate class tokens to patch embedding
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)

        # Add positional embedding to patches
        x = x + self.pos_embed

        # Forward through encoder blocks
        for blk in self.blocks:
            x = blk(x)
        x = self.norm(x)

        # Use class token for classification
        cls_token_final = x[:, 0]

        # Classifier head
        x = self.head(cls_token_final)
        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################
        return x
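
The token bookkeeping in the forward pass above (prepend the class token, then add the positional embedding) can be sketched in isolation. The sizes here are illustrative assumptions, and plain tensors stand in for the learnable parameters.

```python
import torch

B, num_patches, embed_dim = 3, 49, 32         # illustrative sizes only
patches = torch.randn(B, num_patches, embed_dim)

cls_token = torch.zeros(1, 1, embed_dim)      # one token shared across the batch
pos_embed = torch.zeros(1, num_patches + 1, embed_dim)

cls_tokens = cls_token.expand(B, -1, -1)      # (B, 1, embed_dim), a view, no copy
x = torch.cat((cls_tokens, patches), dim=1)   # (B, num_patches + 1, embed_dim)
x = x + pos_embed                             # broadcasts over the batch dimension

print(x.shape)  # torch.Size([3, 50, 32])
```

The positional embedding needs `num_patches + 1` rows because the class token occupies position 0 of the sequence.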

## 2. Training a small ViT model on FashionMNIST dataset.

Define and train a vision transformer on the FashionMNIST dataset. **(You must reach above 85% for full points.)** <br>
Train with at least 5 different hyperparameter settings, varying the following ViT hyperparameters.
Report the setting with the best performance.

#### ViT hyperparameters:
* patch_size
* embed_dim
* depth
* num_heads
* mlp_ratio
* etc.
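
When varying these values, it may help to check the derived sizes first: the attention module splits embed_dim evenly across heads, and the MLP hidden width is `int(embed_dim * mlp_ratio)`. The helper below is a hypothetical sketch in plain Python, and the values passed to it are illustrative.

```python
def describe_config(embed_dim, num_heads, mlp_ratio):
    """Derive per-head and MLP widths the way the classes above compute them."""
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    return {
        "head_dim": embed_dim // num_heads,
        "mlp_hidden_dim": int(embed_dim * mlp_ratio),
    }


for dim in (128, 192, 256):
    print(dim, describe_config(dim, num_heads=8, mlp_ratio=3.5))
```

A combination such as embed_dim=100 with num_heads=8 is rejected here, because the reshape inside the attention module would fail for it.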


In [10]:
import numpy as np

from tqdm import tqdm, trange

import torch
import torch.nn as nn
from torch.optim import Adam
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader

from torchvision.transforms import ToTensor
from torchvision.datasets.mnist import FashionMNIST

import math
from torch.optim import AdamW  # optimizer
from torch.optim.lr_scheduler import LambdaLR  # scheduler

import copy

In [21]:
import torch, gc
gc.collect()
torch.cuda.empty_cache()

In [22]:
def Train():
    img_size   = 28
    patch_size = 4
    embed_dim  = 192
    depth      = 12
    num_heads  = 8
    mlp_ratio  = 3.5
    LR_MAX     = 3e-3

    N_EPOCHS   = 20
    WD         = 5e-2
    WARMUP_PCT = 0.30
    DIV        = 25.0
    FINAL_DIV  = 100.0
    CLIP_NORM  = 1.0

    # Dataset / augmentation
    train_transform = transforms.Compose([
        transforms.RandomCrop(28, padding=4),
        transforms.RandomHorizontalFlip(0.5),
        transforms.RandomRotation(8),
        transforms.ToTensor(),
        transforms.RandomErasing(p=0.10, scale=(0.02, 0.15), ratio=(0.3, 3.3)),
    ])
    test_transform = transforms.ToTensor()

    train_set = FashionMNIST(root='./data', train=True,  download=True, transform=train_transform)
    test_set  = FashionMNIST(root='./data', train=False, download=True, transform=test_transform)

    train_loader = DataLoader(train_set, shuffle=True,  batch_size=192,
                              num_workers=2, pin_memory=True, persistent_workers=True)
    test_loader  = DataLoader(test_set,  shuffle=False, batch_size=512,
                              num_workers=2, pin_memory=True, persistent_workers=True)

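    # Let cuDNN auto-tune conv algorithms for speed; this overrides the benchmark = False
    # set in the seeding cell and may reduce run-to-run reproducibility
    torch.backends.cudnn.benchmark = True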

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("Using device:", device, f"({torch.cuda.get_device_name(device)})" if torch.cuda.is_available() else "")

    model = VisionTransformer(patch_size=patch_size, embed_dim=embed_dim, depth=depth, num_heads=num_heads, mlp_ratio=mlp_ratio).to(device)
    model_path = './vit.pth'

    # Optimizer / scheduler
    optimizer = AdamW(model.parameters(), lr=LR_MAX, weight_decay=WD, betas=(0.9, 0.999))

    steps_per_epoch = len(train_loader)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=LR_MAX,
        epochs=N_EPOCHS, steps_per_epoch=steps_per_epoch,
        pct_start=WARMUP_PCT, anneal_strategy='cos',
        div_factor=DIV, final_div_factor=FINAL_DIV
    )

    criterion = CrossEntropyLoss()

    #Train loop
    for epoch in range(N_EPOCHS):
        model.train()
        train_loss = 0.0

        for x, y in tqdm(train_loader, desc=f"Epoch {epoch+1}/{N_EPOCHS}", leave=False, mininterval=2.0):
            x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
            optimizer.zero_grad(set_to_none=True)

            y_hat = model(x)
            loss  = criterion(y_hat, y)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
            optimizer.step()

            scheduler.step()
            train_loss += loss.detach().cpu().item() / len(train_loader)

        print(f"[{epoch+1:02d}/{N_EPOCHS}] loss={train_loss:.3f}")

    # Test loop
    with torch.no_grad():
        model.eval()
        correct, total = 0, 0
        test_loss = 0.0
        for batch in tqdm(test_loader, desc="Testing"):
            x, y = batch
            x, y = x.to(device), y.to(device)
            y_hat = model(x)
            loss = criterion(y_hat, y)
            test_loss += loss.detach().cpu().item() / len(test_loader)

            correct += torch.sum(torch.argmax(y_hat, dim=1) == y).detach().cpu().item()
            total += len(x)
        print(f"Test loss: {test_loss:.2f}")
        print(f"Test accuracy: {correct / total * 100:.2f}%")

    torch.save(model.state_dict(), model_path)
    print('Saved Trained Model.')

Train()

Using device: cuda (Tesla T4)

[01/20] loss=0.937
[02/20] loss=0.669
[03/20] loss=0.593
[04/20] loss=0.557
[05/20] loss=0.539
[06/20] loss=0.550
[07/20] loss=0.528
[08/20] loss=0.494
[09/20] loss=0.451
[10/20] loss=0.426
[11/20] loss=0.396
[12/20] loss=0.381
[13/20] loss=0.356
[14/20] loss=0.336
[15/20] loss=0.322
[16/20] loss=0.301
[17/20] loss=0.288
[18/20] loss=0.275
[19/20] loss=0.265
[20/20] loss=0.258

Testing: 100%|██████████| 20/20 [00:03<00:00,  5.90it/s]

Test loss: 0.25
Test accuracy: 91.22%
Saved Trained Model.
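
To see what learning rates the scheduler settings in the training cell imply, the schedule can be approximated in plain Python. This is a sketch, not PyTorch's implementation: `one_cycle_lr` is a hypothetical helper that follows the documented two-phase cosine behavior of `OneCycleLR` (starting LR = max_lr / div_factor, final LR = starting LR / final_div_factor), and the 60000 training images are an assumption about the FashionMNIST train split.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start, div_factor, final_div_factor):
    """Approximate two-phase cosine one-cycle schedule (warm-up, then anneal)."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1

    def cos_anneal(start, end, pct):
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    pct = (step - warmup_end) / (total_steps - 1 - warmup_end)
    return cos_anneal(max_lr, min_lr, pct)


steps_per_epoch = math.ceil(60000 / 192)   # assumed train size / batch size
total_steps = 20 * steps_per_epoch
args = dict(total_steps=total_steps, max_lr=3e-3, pct_start=0.30,
            div_factor=25.0, final_div_factor=100.0)

lr_start = one_cycle_lr(0, **args)
lr_peak = one_cycle_lr(round(0.30 * total_steps) - 1, **args)
lr_end = one_cycle_lr(total_steps - 1, **args)
print(f"{lr_start:.2e} {lr_peak:.2e} {lr_end:.2e}")
```

Under these assumptions the learning rate starts near 1.2e-4, reaches 3e-3 after roughly 30% of the steps, and decays to about 1.2e-6 by the last step.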


In [15]:
# For Google Colab
# Uncomment the next line and run. The compressed file will be in your assignment directory.
!bash CollectSubmission.sh [Your Student ID]

### Describe what you did and discovered here
In this cell you should write all the settings tried and performances you obtained. Report what you did and what you discovered from the trials.
You can write in Korean

1. Data augmentation: helps generalization
2. Used AdamW
3. Used a learning-rate scheduler
4. Used a convolutional stem in the patch embedding
5. Used dropout

#### Hyperparameters (common)

    N_EPOCHS   = 20
    WD         = 5e-2
    WARMUP_PCT = 0.30
    DIV        = 25.0
    FINAL_DIV  = 100.0
    CLIP_NORM  = 1.0

#### Base (Acc = 91.35%)

    patch_size = 4
    embed_dim  = 192
    depth      = 10
    num_heads  = 8
    mlp_ratio  = 3.5
    LR_MAX     = 3e-3

#### Variant 1 (Acc = 91.33%)

    patch_size = 4
    embed_dim  = 192
    depth      = 10
    num_heads  = 8
    mlp_ratio  = 3.5
    LR_MAX     = 2e-3

Lowered the LR to compare convergence speed. Almost no difference.

#### Variant 2 (Acc = 91.98%)

    patch_size = 4
    embed_dim  = 128
    depth      = 10
    num_heads  = 8
    mlp_ratio  = 3.5
    LR_MAX     = 3e-3

Reduced embed_dim to 128. Slight performance improvement (run on a different account due to GPU limits).

#### Variant 3 (Acc = 90.87%)

    patch_size = 4
    embed_dim  = 256
    depth      = 10
    num_heads  = 8
    mlp_ratio  = 3.5
    LR_MAX     = 3e-3

Increased embed_dim to 256. The training loss kept decreasing, but accuracy dropped (run on a different account due to GPU limits).

#### Variant 4 (Acc = 91.14%)

    patch_size = 4
    embed_dim  = 192
    depth      = 8
    num_heads  = 8
    mlp_ratio  = 3.5
    LR_MAX     = 3e-3

Reduced depth to 8. Negligible difference from Base (run on a different account due to GPU limits).

#### Variant 5 (Acc = 91.22%)

    patch_size = 4
    embed_dim  = 192
    depth      = 12
    num_heads  = 8
    mlp_ratio  = 3.5
    LR_MAX     = 3e-3

Increased depth to 12. Depths 8, 10, and 12 all showed only marginal differences in performance.

#### Conclusions

* Reducing embed_dim works better than increasing it.
* Depth makes almost no difference, though 10 appears slightly better than 12 or 8.
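
As a compact way to compare the runs, the accuracies reported above can be collected and sorted. This is only a sketch in plain Python: the numbers are copied from the report, and only the settings that vary between runs are listed.

```python
# Accuracies copied from the report above; patch_size, num_heads and mlp_ratio
# were the same (4, 8, 3.5) in every run and are omitted here.
results = [
    {"name": "Base",      "embed_dim": 192, "depth": 10, "lr_max": 3e-3, "acc": 91.35},
    {"name": "Variant 1", "embed_dim": 192, "depth": 10, "lr_max": 2e-3, "acc": 91.33},
    {"name": "Variant 2", "embed_dim": 128, "depth": 10, "lr_max": 3e-3, "acc": 91.98},
    {"name": "Variant 3", "embed_dim": 256, "depth": 10, "lr_max": 3e-3, "acc": 90.87},
    {"name": "Variant 4", "embed_dim": 192, "depth": 8,  "lr_max": 3e-3, "acc": 91.14},
    {"name": "Variant 5", "embed_dim": 192, "depth": 12, "lr_max": 3e-3, "acc": 91.22},
]

best = max(results, key=lambda r: r["acc"])
for r in sorted(results, key=lambda r: r["acc"], reverse=True):
    print(f'{r["name"]:<10} embed_dim={r["embed_dim"]:<4} depth={r["depth"]:<3} '
          f'lr_max={r["lr_max"]:.0e} acc={r["acc"]:.2f}%')
print("Best:", best["name"])  # Best: Variant 2
```

The spread between the best and worst run is about 1.1 percentage points, and each setting was run once, so small differences between settings may be within run-to-run noise.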