# M2177.004300 Deep Learning Assignment #2<br> Part 1. Transformer from scratch (PyTorch)

Copyright (C) Data Science & AI Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by Youngwoo Kimh, October 2025

**For understanding of this work, please carefully
look at given PDF file.**

In this notebook, you will learn to implement a transformer model from scratch. By doing so, you will understand the nuts and bolts of Transformers more clearly at a code level.
<br>
There are **5 sections**, and in each section, you need to follow the instructions to complete the skeleton codes and explain them.

**Note**: certain details are missing or ambiguous on purpose, in order to test your knowledge on the related materials. However, if you really feel that something essential is missing and cannot proceed to the next step, then contact the teaching staff with clear description of your problem.

### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.  

### Some helpful tutorials and references for assignment #2-1:
- [1] Original Transformer paper(Vaswani et al., 2017). [[link]](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)
- [2] Helpful instructions about how Transformer works. [[link]](https://github.com/jadore801120/attention-is-all-you-need-pytorch)     

### Check virtual env and import packages

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%pwd # 현재 경로 확인

'/content'

In [3]:
# Assighment1 경로로 이동
%cd /content/drive/MyDrive/Assignment2

/content/drive/MyDrive/Assignment2


In [4]:
import os

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math


if torch.cuda.is_available() is True:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
g = torch.Generator()
g.manual_seed(seed)

<torch._C.Generator at 0x7a9b3287ef50>

## Overview of the model

![encoder](./imgs/Model_small.png)

## 1. Positional Encoding

According to the original paper on Transformer, positional encoding is constructed by using sine functions to even dimensions and cosine functions to odd dimensions.

\begin{align*}
    PE_{(pos,2i)} = sin(pos / 10000^{2i/dim}) \\
    PE_{(pos,2i+1)} = cos(pos / 10000^{2i/dim})
\end{align*}

In [5]:
class PositionalEncoding(nn.Module):
    def __init__(self, dim, seq_len_max):
        super(PositionalEncoding, self).__init__()
        PE = torch.zeros(seq_len_max, dim)
        ######################### TO DO #########################
        pos = torch.arange(0, seq_len_max, dtype = torch.float).unsqueeze(1)
        expo = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) /dim))

        PE[:, 0::2] = torch.sin(pos * expo)
        PE[:, 1::2] = torch.cos(pos * expo)

        ######################### TO DO #########################

        ######################### DO NOT CHANGE #########################
        # Positional Encoding is not learnable parameters.
        self.register_buffer('PE', PE.unsqueeze(0))
        ######################### DO NOT CHANGE #########################

    def forward(self, X):
        return X + self.PE[:, :X.size(1)]

## 2. Multi-head attention

![multi_head_attention](./imgs/Attention.png)

In this section, we will implement MultiHeadAttention Class.  
The parameters of MultiHeadAttention class is defined as follows.
Note that according to the definition of multi-head attention, the dimension of the model is equal to the product
of the word dimension and the number of heads

$dim$:  dimension of the model  
$dim$ = dimension for a each word * $head\_num$  
$seq\_len$:  length of the input sequence

This module will get batched sequences x and return multi-head attention ouput.

X size:  $(batch\_num, seq\_len, dim)$  
mask: Tensor to indicate the words involved in score calculation  
output size:  $(batch\_num, seq\_len, dim)$

$W_q$ = linear transformation for query  
$W_k$ = linear transformation for key    
$W_v$ = linear transformation for value  
$W_o$ = linear transformation for concatenated heads

The model operates according to the following equation.  
It should select the values that will participate in score calculation based on the received mask.

$Q = X * W_q$  
$K = X * W_k$  
$V = X * W_v$  

$scores = \frac{QK^T}{\sqrt{word\_dim}}$  
$masked\_scores = mask(\frac{QK^T}{\sqrt{word\_dim}})$  
$probs = softmax(masked\_scores)$  
$heads = probsV$  
$output = heads * W_o$  




In [6]:
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, dim, head_num):
        super(MultiHeadAttention, self).__init__()

        self.dim = dim
        self.head_num = head_num
        self.word_dim = dim // head_num

        ######################### TO DO #########################
        self.W_q = nn.Linear(dim, dim)
        self.W_k = nn.Linear(dim, dim)
        self.W_v = nn.Linear(dim, dim)
        self.W_o = nn.Linear(dim, dim)
        ######################### TO DO #########################

    def scaled_dot_product(self, Q, K, V, mask=None):
        ######################### TO DO #########################
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.word_dim)
        if mask is not None :
          masked_scores = scores.masked_fill(mask ==0, float('-inf'))
        else :
          masked_scores = scores
        probs = torch.nn.functional.softmax(masked_scores, dim = -1)
        heads = torch.matmul(probs, V)

        ######################### TO DO #########################
        return heads

    def split(self, X):
        batch_num, seq_len, dim = X.size()
        return X.view(batch_num, seq_len, self.head_num, self.word_dim).transpose(1, 2)

    def combine(self, X):
        batch_num, _, seq_len, _ = X.size()
        return X.transpose(1, 2).contiguous().view(batch_num, seq_len, self.dim)

    def forward(self, X_Q, X_K, X_V, mask=None):
        Q = self.split(self.W_q(X_Q))
        K = self.split(self.W_k(X_K))
        V = self.split(self.W_v(X_V))

        heads = self.scaled_dot_product(Q, K, V, mask)
        output = self.W_o(self.combine(heads))
        return output



## 3. Encoder

Implement EncoderLayer class using **one MultiHeadAttention layer, one FNN layer and two normalization layer**.  
**Please apply dropout right after passing through multi-head attention and FFN layer.**

**HINT**  
**1. Normalization is a LayerNorm.**  
**2. LayerNorm layers have learnable parameters. Therefore, you should use two normalization layers.**

In [7]:
class FFN(nn.Module):
    def __init__(self, dim, FFN_dim):
        super(FFN, self).__init__()
        self.FFN_layer = nn.Sequential(nn.Linear(dim, FFN_dim),
                                       nn.ReLU(),
                                       nn.Linear(FFN_dim, dim))
    def forward(self, X):
        return self.FFN_layer(X)

class EncoderLayer(nn.Module):
    def __init__(self, dim, head_num, FFN_dim, dropout):
        super(EncoderLayer, self).__init__()
        ######################### TO DO #########################
        self.self_attn = MultiHeadAttention(dim, head_num)
        self.ffn = FFN(dim, FFN_dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

        self.dropout = nn.Dropout(dropout)

        ######################### TO DO #########################

    def forward(self, X, mask):
        ######################### TO DO #########################
        attn_X = self.self_attn(X, X, X, mask)
        X = X + self.dropout(attn_X)
        X = self.norm1(X)

        ffn_X = self.ffn(X)
        X = X + self.dropout(ffn_X)
        X = self.norm2(X)
        output = X
        ######################### TO DO #########################
        return output

## 4. Decoder

Implement DecoderLayer class using **two MultiHeadAttention layers(self-attention and cross-attention), one FNN layer and three normalization layers.**
**Please apply dropout right after passing through two multi-head attention layers and FFN layer.**

**HINT**  
**1. Normalization is a LayerNorm.**  
**2. LayerNorm layers have learnable parameters. Therefore, you should use three normalization layers.**  
**3. The first multi-head attention layer is a self attention layer, and the second attention layer is a cross attention layer. Choose the mask carefully.**

In [8]:
class DecoderLayer(nn.Module):
    def __init__(self, dim, head_num, FFN_dim, dropout):
        super(DecoderLayer, self).__init__()
        ######################### TO DO #########################

        self.self_attn = MultiHeadAttention(dim, head_num)
        self.cross_attn = MultiHeadAttention(dim, head_num)
        self.ffn = FFN(dim, FFN_dim)

        self.norm = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)

        ######################### TO DO #########################

    def forward(self, X, enc_output, cross_attn_mask, self_attn_mask):
        ######################### TO DO #########################
        self_attn_X = self.self_attn(X, X, X, self_attn_mask)
        X = X + self.dropout(self_attn_X)
        X = self.norm(X)

        cross_attn_X = self.cross_attn(X, enc_output, enc_output, cross_attn_mask)
        X = X + self.dropout(cross_attn_X)
        X = self.norm(X)

        ffn_X = self.ffn(X)
        X = X + self.dropout(ffn_X)
        X = self.norm(X)
        output = X
        ######################### TO DO #########################
        return output

## 5. Prepare sample data and Run model

In [9]:
class Transformer(nn.Module):
    def __init__(self, input_lib_size, output_lib_size, dim, head_num, layer_num, \
                 FFN_dim, seq_len_max, dropout):
        super(Transformer, self).__init__()
        self.enc_embeds = nn.Embedding(input_lib_size, dim)
        self.dec_embeds = nn.Embedding(output_lib_size, dim)
        self.pe = PositionalEncoding(dim, seq_len_max)

        self.encoder = nn.ModuleList([EncoderLayer(dim, head_num, FFN_dim, dropout) \
                                             for _ in range(layer_num)])
        self.decoder = nn.ModuleList([DecoderLayer(dim, head_num, FFN_dim, dropout) \
                                             for _ in range(layer_num)])
        self.Linear = nn.Linear(dim, output_lib_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        self_attn_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        cross_attn_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        nopeak_mask = nopeak_mask.to(device)
        cross_attn_mask = cross_attn_mask & nopeak_mask
        return self_attn_mask, cross_attn_mask

    def forward(self, src, tgt):
        self_attn_mask, cross_attn_mask = self.generate_mask(src, tgt)
        src_embeds = self.dropout(self.pe(self.enc_embeds(src)))
        tgt_embeds = self.dropout(self.pe(self.dec_embeds(tgt)))

        enc_output = src_embeds
        for enc_layer in self.encoder:
            enc_output = enc_layer(enc_output, self_attn_mask)

        dec_output = tgt_embeds
        for dec_layer in self.decoder:
            dec_output = dec_layer(dec_output, enc_output, self_attn_mask, cross_attn_mask)

        output = self.Linear(dec_output)
        return output

In [10]:
input_lib_size = 5000
output_lib_size = 5000
dim = 512
head_num = 4
layer_num = 3
FFN_dim = 2048
seq_len_max = 100
dropout = 0.1

transformer = Transformer(input_lib_size, output_lib_size, dim, head_num, layer_num, \
                          FFN_dim, seq_len_max, dropout)
transformer = transformer.to(device)

# Generate random sample data
src_data = torch.randint(1, input_lib_size, (64, seq_len_max)).to(device)
tgt_data = torch.randint(1, output_lib_size, (64, seq_len_max)).to(device)

In [11]:
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

transformer.train()

for epoch in range(100):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])
    loss = criterion(output.contiguous().view(-1, output_lib_size), tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

Epoch: 1, Loss: 8.677824020385742
Epoch: 2, Loss: 8.580459594726562
Epoch: 3, Loss: 8.498089790344238
Epoch: 4, Loss: 8.441888809204102
Epoch: 5, Loss: 8.392080307006836
Epoch: 6, Loss: 8.345414161682129
Epoch: 7, Loss: 8.300348281860352
Epoch: 8, Loss: 8.248345375061035
Epoch: 9, Loss: 8.195426940917969
Epoch: 10, Loss: 8.138192176818848
Epoch: 11, Loss: 8.078092575073242
Epoch: 12, Loss: 8.022431373596191
Epoch: 13, Loss: 7.9623589515686035
Epoch: 14, Loss: 7.908764362335205
Epoch: 15, Loss: 7.848241329193115
Epoch: 16, Loss: 7.788707256317139
Epoch: 17, Loss: 7.727346897125244
Epoch: 18, Loss: 7.668664932250977
Epoch: 19, Loss: 7.606161117553711
Epoch: 20, Loss: 7.549562931060791
Epoch: 21, Loss: 7.487461090087891
Epoch: 22, Loss: 7.426892280578613
Epoch: 23, Loss: 7.365617752075195
Epoch: 24, Loss: 7.303801536560059
Epoch: 25, Loss: 7.235662937164307
Epoch: 26, Loss: 7.177582740783691
Epoch: 27, Loss: 7.115059852600098
Epoch: 28, Loss: 7.051268100738525
Epoch: 29, Loss: 6.992549419

In [12]:
# For Google Colab
# Uncomment the next line and run. The compressed file will be in your assignment directory.
!bash CollectSubmission.sh [2023-18676]

Assignment2-1_Transformer_from_scratch.ipynb
tar: Assignment2-1_Transformer_from_scratch.ipynb: file changed as we read it
Assignment2-2_ViT.ipynb
tar: Assignment2-2_ViT.ipynb: file changed as we read it
vit_amp.pth
tar: vit_amp.pth: file changed as we read it
vit.pth
tar: vit.pth: file changed as we read it


### Describe what you did and discovered here
In this cell you should write all the settings tried and performances you obtained. Report what you did and what you discovered from the trials.
You can write in Korean

1. 포지셔널 인코딩을 구현해서 embedding과 더함
2. W_q, W_k 등 파라미터 선언, 스코어 계산, value 계산
3. 선형 FFN 인코더 레이어에 추가, LayerNorm, dropout 적용
4. 디코더 레이어 어텐션, ffn블럭, layernorm, dropout 추가