<a href="https://colab.research.google.com/github/kasier48/DeepLearning/blob/main/Pratice_Week_2_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Week 2] Advanced Assignment: Implementing a Sentiment Analysis Model with Multi-head Attention

- [ ]  Implement multi-head attention (MHA)
    - Extend the self-attention module to MHA. Here, MHA is implemented as follows.
        1. Generate $Q, K, V$ using the existing $W_q, W_k, W_v$. This part needs no code changes.
        2. Given $Q, K, V \in \mathbb{R}^{S \times D}$, reshape them to $Q, K, V \in \mathbb{R}^{S \times H \times D'}$. Here $H$ must be passed in as an argument named `n_heads`, and $D$ must be divisible by $H$, so that $D = H \times D'$.
        3. Transpose $Q, K, V$ to the shape $Q, K, V \in \mathbb{R}^{H \times S \times D'}$.
        4. Compute $A = QK^T/\sqrt{D'} \in \mathbb{R}^{H \times S \times S}$ exactly as in the existing self-attention. This part needs no code changes.
        5. Add the mask. Because the shape of $A$ differs from before, you need to work out how to align the mask's dimensions with it.
        6. Compute $\hat{x} = \textrm{Softmax}(A)V \in \mathbb{R}^{H \times S \times D'}$, then transpose and reshape it back to $\hat{x} \in \mathbb{R}^{S \times D}$.
        7. As before, finish by multiplying with $W_o$: $\hat{x} = \hat{x} W_o$. This also needs no code changes.
- [ ]  Implement layer normalization, dropout, and residual connections
    - Return to the `TransformerLayer` class for this part of the assignment.
    - Let $MHA$ denote the attention module and $FFN$ the feed-forward layer.
    - The existing implementation is as follows:
        
        ```python
        # x, mask is given
        
        x1 = MHA(x, mask)
        x2 = FFN(x1)
        
        return x2
        ```
        
    - Modify it as follows:
        
        ```python
        # x, mask is given
        
        x1 = MHA(x, mask)
        x1 = Dropout(x1)
        x1 = LayerNormalization(x1 + x)
        
        x2 = FFN(x1)
        x2 = Dropout(x2)
        x2 = LayerNormalization(x2 + x1)
        
        return x2
        ```
        
    - Here, the `x1 + x` and `x2 + x1` terms are called residual connections.
- [ ]  5-layer 4-head Transformer
    - Using the hyper-parameters from the earlier practice session and the Transformer implemented above, report the performance of a 5-layer 4-head Transformer.
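
The MHA steps 2 through 6 above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the assignment's reference solution: it adds a leading batch dimension `B`, takes already-projected `q`, `k`, `v`, and assumes the padding mask is a boolean `(B, S)` tensor that is `True` at padded positions. The helper names `split_heads`, `merge_heads`, and `mha_core` are hypothetical.

```python
import math
import torch

def split_heads(x: torch.Tensor, n_heads: int) -> torch.Tensor:
    """(B, S, D) -> (B, H, S, D'), where D = H * D'."""
    batch, seq_len, d_model = x.shape
    assert d_model % n_heads == 0, "D must be divisible by H"
    d_head = d_model // n_heads
    # Step 2: reshape to (B, S, H, D'); step 3: transpose to (B, H, S, D').
    return x.view(batch, seq_len, n_heads, d_head).transpose(1, 2)

def merge_heads(x: torch.Tensor) -> torch.Tensor:
    """(B, H, S, D') -> (B, S, D), the inverse of split_heads."""
    batch, n_heads, seq_len, d_head = x.shape
    return x.transpose(1, 2).reshape(batch, seq_len, n_heads * d_head)

def mha_core(q, k, v, pad_mask, n_heads):
    """Steps 2-6 for projected q, k, v of shape (B, S, D).

    pad_mask is an assumed bool tensor of shape (B, S), True at padding.
    """
    q, k, v = (split_heads(t, n_heads) for t in (q, k, v))
    d_head = q.size(-1)
    # Step 4: scaled dot-product scores, shape (B, H, S, S).
    scores = q @ k.transpose(-1, -2) / math.sqrt(d_head)
    # Step 5: (B, S) -> (B, 1, 1, S) broadcasts over heads and query positions.
    scores = scores + pad_mask[:, None, None, :] * -1e9
    weights = torch.softmax(scores, dim=-1)
    # Step 6: weighted sum per head, then merge heads back to (B, S, D).
    return merge_heads(weights @ v), weights

torch.manual_seed(0)
B, S, D, H = 2, 5, 8, 4
x = torch.randn(B, S, D)
pad_mask = torch.zeros(B, S, dtype=torch.bool)
pad_mask[1, 3:] = True  # the last two tokens of the second sequence are padding

out, weights = mha_core(x, x, x, pad_mask, H)
print(out.shape)      # torch.Size([2, 5, 8])
print(weights.shape)  # torch.Size([2, 4, 5, 5])
```

Note the order of operations in `split_heads`: the tensor is viewed as `(B, S, H, D')` first and only then transposed. Viewing `(B, S, D)` directly as `(B, H, S, D')` yields the same shape but mixes features from different sequence positions into one head.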

In [1]:
pip install datasets sacremoses

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)

In [None]:
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
# from transformers import BertTokenizerFast
# from tokenizers import (
#     decoders,
#     models,
#     normalizers,
#     pre_tokenizers,
#     processors,
#     trainers,
#     Tokenizer,
# )


ds = load_dataset("stanfordnlp/imdb")
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-uncased')

from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
  max_len = 400
  texts, labels = [], []
  for row in batch:
    # [MYCODE] Tokenize once. The label is input_ids[-2], the last word token (just before [SEP]);
    # the input is input_ids[:-2], the sentence with that last word (and [SEP]) removed.
    input_ids = tokenizer(row['text'], truncation=True, max_length=max_len).input_ids
    labels.append(input_ids[-2])
    texts.append(torch.LongTensor(input_ids[:-2]))

  texts = pad_sequence(texts, batch_first=True, padding_value=tokenizer.pad_token_id)
  labels = torch.LongTensor(labels)

  return texts, labels


train_loader = DataLoader(
    ds['train'], batch_size=64, shuffle=True, collate_fn=collate_fn
)
test_loader = DataLoader(
    ds['test'], batch_size=64, shuffle=False, collate_fn=collate_fn
)

from torch import nn
from math import sqrt

class MultiHeadAttention(nn.Module):
  def __init__(self, input_dim, d_model, num_heads):
    super().__init__()

    self.input_dim = input_dim
    self.d_model = d_model

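    # [MYCODE] d_k is the per-head dimension: d_model split evenly across num_heads
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    self.num_heads = num_heads
    self.d_k = d_model // num_heads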

    self.wq = nn.Linear(input_dim, d_model)
    self.wk = nn.Linear(input_dim, d_model)
    self.wv = nn.Linear(input_dim, d_model)
    self.dense = nn.Linear(d_model, d_model)

    self.softmax = nn.Softmax(dim=-1)

  def forward(self, x, mask):
    batch_size = x.size(0)
    seq_len = x.size(1)

    # [MYCODE] Split the projections into num_heads heads via split_heads.
    # (B, S, D) -> (B, H, S, D_K)
    q = self.__split_heads(self.wq(x))
    k = self.__split_heads(self.wk(x))
    v = self.__split_heads(self.wv(x))

    # (B, H, S, D_K) * (B, H, D_K, S) = (B, H, S, S)
    score = torch.matmul(q, k.transpose(-1, -2))

    # [MYCODE] Each head works in d_k = d_model / num_heads dimensions, so scale by sqrt(d_k)
    head_unit = sqrt(self.d_k)
    score = score / head_unit

    # [MYCODE] The scores now carry a head dimension, so shape the mask to match
    if mask is not None:
      # (B, 1, S) -> (B, 1, 1, S)
      mask = mask.unsqueeze(1)
      # (B, 1, 1, S) -> (B, H, S, S)
      mask = mask.expand(-1, self.num_heads, seq_len, seq_len)
      score = score + (mask * -1e9)

    # (B, H, S, S) * (B, H, S, D_K) = (B, H, S, D_K)
    score = self.softmax(score)
    result = torch.matmul(score, v)

    # [MYCODE] Concatenate the heads again to restore the d_model dimension.
    # (B, H, S, D_K) -> (B, S, H, D_K) -> (B, S, D)
    result = result.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

    # (B, S, D)
    result = self.dense(result)

    return result

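  def __split_heads(self, x):
    batch_size, seq_len, d_model = x.size()
    # Reshape to (B, S, H, D_K) first, then transpose to (B, H, S, D_K).
    # Viewing (B, S, D) directly as (B, H, S, D_K) would mix features
    # from different sequence positions into the same head.
    x = x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
    return x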

class TransformerLayer(nn.Module):
  def __init__(self, input_dim, d_model, dff, num_heads, dropout_rate=0.1):
    super().__init__()

    self.input_dim = input_dim
    self.d_model = d_model
    self.dff = dff

    # [MYCODE] Add layer_norm and dropout
    self.mha = MultiHeadAttention(input_dim, d_model, num_heads)
    self.layer_norm = nn.LayerNorm(d_model)
    self.dropout = nn.Dropout(p=dropout_rate)
    self.ffn = nn.Sequential(
      nn.Linear(d_model, dff),
      nn.ReLU(),
      nn.Linear(dff, d_model)
    )

  def forward(self, x, mask):
    # [MYCODE] Apply multi-head attention, dropout, layer_norm, and residual connections
    x1 = self.mha(x, mask)
    x1 = self.dropout(x1)
    x1 = self.layer_norm(x1 + x)

    x2 = self.ffn(x1)
    x2 = self.dropout(x2)
    x2 = self.layer_norm(x2 + x1)

    return x2

import numpy as np

def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, None], np.arange(d_model)[None, :], d_model)
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    pos_encoding = angle_rads[None, ...]

    return torch.FloatTensor(pos_encoding)


max_len = 400
print(positional_encoding(max_len, 256).shape)

class TextClassifier(nn.Module):
  def __init__(self, vocab_size, d_model, n_layers, dff, num_heads, dropout_rate=0.1):
    super().__init__()

    self.vocab_size = vocab_size
    self.d_model = d_model
    self.n_layers = n_layers
    self.dff = dff

    self.embedding = nn.Embedding(vocab_size, d_model)
    self.pos_encoding = nn.parameter.Parameter(positional_encoding(max_len, d_model), requires_grad=False)
    self.layers = nn.ModuleList([TransformerLayer(d_model, d_model, dff, num_heads, dropout_rate) for _ in range(n_layers)])

    # [MYCODE] We predict the last word, so the output size is the vocabulary size
    self.classification = nn.Linear(d_model, vocab_size)

  def forward(self, x):
    mask = (x == tokenizer.pad_token_id)
    mask = mask[:, None, :]
    seq_len = x.shape[1]

    x = self.embedding(x)
    x = x * sqrt(self.d_model)
    x = x + self.pos_encoding[:, :seq_len]

    for layer in self.layers:
      x = layer(x, mask)

    x = x[:, 0]
    x = self.classification(x)

    return x

device = torch.device('cuda')

# [MYCODE] Use 5 layers and 4 heads
token_len = len(tokenizer)
model = TextClassifier(vocab_size=token_len, d_model=32, n_layers=5, dff=32, num_heads=4, dropout_rate=0.1)

from torch.optim import Adam

lr = 0.001
model = model.to(device)

# [MYCODE] Predicting the last word is a multi-class problem, so use cross-entropy loss
loss_fn = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=lr)

import numpy as np
import matplotlib.pyplot as plt

def accuracy(model, dataloader):
  cnt = 0
  acc = 0

  for data in dataloader:
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)

    preds = model(inputs)

    # [MYCODE] Multi-class classification, so pick the highest-scoring token
    preds = torch.argmax(preds, dim=-1)
    # preds = (preds > 0).long()[..., 0]

    cnt += labels.shape[0]
    acc += (labels == preds).sum().item()

  return acc / cnt

n_epochs = 50

for epoch in range(n_epochs):
  total_loss = 0.
  model.train()
  for data in train_loader:
    model.zero_grad()
    inputs, labels = data
    inputs, labels = inputs.to(device), labels.to(device)

    preds = model(inputs)

    loss = loss_fn(preds, labels)
    loss.backward()
    optimizer.step()

    total_loss += loss.item()

  print(f"Epoch {epoch:3d} | Train Loss: {total_loss}")

  with torch.no_grad():
    model.eval()
    train_acc = accuracy(model, train_loader)
    test_acc = accuracy(model, test_loader)
    print(f"=========> Train acc: {train_acc:.3f} | Test acc: {test_acc:.3f}")

torch.Size([1, 400, 256])
Epoch   0 | Train Loss: 1541.9675447940826
Epoch   1 | Train Loss: 1066.5496199131012
Epoch   2 | Train Loss: 1037.178345799446
Epoch   3 | Train Loss: 1020.3218406438828
Epoch   4 | Train Loss: 1002.0137741565704
Epoch   5 | Train Loss: 982.331939458847
Epoch   6 | Train Loss: 955.7625530958176
Epoch   7 | Train Loss: 927.3441436290741
Epoch   8 | Train Loss: 891.8431397676468
Epoch   9 | Train Loss: 857.2826907634735
Epoch  10 | Train Loss: 820.7283426523209
Epoch  11 | Train Loss: 777.0744856595993
Epoch  12 | Train Loss: 740.554391503334
Epoch  13 | Train Loss: 698.4117406606674
Epoch  14 | Train Loss: 657.5186665654182
Epoch  15 | Train Loss: 616.6323562860489
Epoch  16 | Train Loss: 579.2211511731148
Epoch  17 | Train Loss: 541.9733017086983
Epoch  18 | Train Loss: 509.8935802578926
Epoch  19 | Train Loss: 475.93106758594513
Epoch  20 | Train Loss: 443.2163539528847
Epoch  21 | Train Loss: 415.82473039627075
Epoch  22 | Train Loss: 390.05713230371475
Epo