<a href="https://colab.research.google.com/github/mystlee/2024_CSU_AI/blob/main/chapter5/torch_RNN_Attention_Encoder_Decoder_(question).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0. Importing libraries for model construction and training   
### torch   
  - The PyTorch framework   

### torch.nn
  - nn = neural network   
  - Building blocks for deep learning models  
  - Includes fully-connected layers, convolutional layers, and so on  

### torch.nn.functional   
  - Functional versions of deep learning operations   
  - Includes activation functions such as Softmax and ReLU

### torch.optim
  - Optimizers for model training   
  - Includes SGD, AdaGrad, RMSProp, Adam, and others

### datasets
  - Hugging Face package for downloading and managing datasets   

### transformers
  - Hugging Face package for transformer models   
  - It contains many transformer-related modules, but this example uses it only for its tokenizers

In [None]:
!pip install -q datasets sentencepiece transformers

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from datasets import load_dataset
from transformers import AutoTokenizer
import random

## 1-1. Dataset conversion   
Convert the dataset into the format expected by the model

In [None]:
DATASET = 'ted_hrlr'

if DATASET.lower() == "ted_hrlr":
  # Load the TED Talks dataset (Portuguese <-> English)
  raw_datasets = load_dataset('ted_hrlr', 'pt_to_en', trust_remote_code = True)
  # Training, validation, and test splits
  train_data = raw_datasets['train']
  valid_data = raw_datasets['validation']
  test_data = raw_datasets['test']
  src_lang = 'pt'
  tgt_lang = 'en'


elif DATASET.lower() == "tatoeba":
  def sample_dataset(dataset, num_samples, seed = 42):
    shuffled = dataset.shuffle(seed = seed)
    num_samples = min(num_samples, len(shuffled))
    return shuffled.select(range(num_samples))

  src_lang = 'en'
  tgt_lang = 'fr'
  raw_datasets = load_dataset('tatoeba', lang1 = src_lang, lang2 = tgt_lang,
                              trust_remote_code = True)
  train_dataset = raw_datasets['train']

  train_valid_split = train_dataset.train_test_split(test_size = 0.2, seed = 42)
  train_data = train_valid_split['train']
  temp_split = train_valid_split['test']

  validation_test_split = temp_split.train_test_split(test_size = 0.5, seed = 42)
  valid_data = validation_test_split['train']
  test_data = validation_test_split['test']

  train_data = sample_dataset(train_data, num_samples = 50000)
  valid_data = sample_dataset(valid_data, num_samples = 5000)
  test_data = sample_dataset(test_data, num_samples = 5000)

raw_datasets = {
    'train': train_data,
    'validation': valid_data,
    'test': test_data
}

print(f"Train samples: {len(train_data)}")
print(f"Validation samples: {len(valid_data)}")
print(f"Test samples: {len(test_data)}")

Train samples: 51786
Validation samples: 1194
Test samples: 1804


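## 1-2. Dataset conversion (text tokenizing)   
Use a tokenizer to convert text into numbers (indices)   
The tokenizers used for BERT (a type of transformer) are reused here   
 - BOS (\<s\>): beginning of sentence -> marks the start of a sentence   
 - EOS (\</s\>): end of sentence -> marks the end of a sentence   
 - PAD: used to fill the empty positions of a sentence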


In [None]:
# Tokenizer setup (using the default BERT tokenizers)
if DATASET.lower() == "ted_hrlr":
  tokenizer_src = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
  tokenizer_tgt = AutoTokenizer.from_pretrained('bert-base-uncased')
elif DATASET.lower() == "tatoeba":
  tokenizer_src = AutoTokenizer.from_pretrained('bert-base-uncased')
  tokenizer_tgt = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

# Define the special tokens
bos_token = '[BOS]'
eos_token = '[EOS]'
pad_token = '[PAD]'

# Add the special tokens to each tokenizer's vocabulary
special_tokens = {'bos_token': bos_token,
                  'eos_token': eos_token,
                  'pad_token': pad_token}

tokenizer_src.add_special_tokens(special_tokens)
tokenizer_tgt.add_special_tokens(special_tokens)

2

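## 1-3. Dataset conversion (text data loader)   
Match the format of the input and output sentences   
 - A BOS token is inserted at the start of each sentence   
 - An EOS token is inserted at the end of each sentence   
 - The code below wraps both the input and the output sentences with both tokens   

Another kind of special token
 - In translation and other NLP tasks, special tokens are also used to mark the input/output domain
 - The current example is translation (input: Portuguese, output: English)   

Other settings
 - Maximum sentence length: max_length   
 - Sentence truncation option: truncation   
 - Padding: padding   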
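
As a rough illustration of what BOS/EOS wrapping, truncation, and padding do to a sentence, here is a minimal pure-Python sketch. The vocabulary `toy_vocab` and the helper `encode` are made up for this example and are not part of the Hugging Face API; the real tokenizers work on subwords and handle these options internally.

```python
# Toy illustration of BOS/EOS wrapping, truncation, and padding.
# The vocabulary and ids are hypothetical, chosen only for this sketch.
toy_vocab = {'[PAD]': 0, '[BOS]': 1, '[EOS]': 2, 'i': 3, 'like': 4, 'tea': 5}

def encode(sentence, max_length):
    # Wrap the whitespace-split sentence with BOS and EOS
    tokens = ['[BOS]'] + sentence.lower().split() + ['[EOS]']
    ids = [toy_vocab[t] for t in tokens]
    # Truncation: keep at most max_length ids (this naive cut can drop EOS)
    ids = ids[:max_length]
    # Padding: fill the remaining positions with the PAD id
    ids = ids + [toy_vocab['[PAD]']] * (max_length - len(ids))
    return ids

padded = encode('I like tea', max_length=8)
truncated = encode('I like tea', max_length=4)
print(padded)     # [1, 3, 4, 5, 2, 0, 0, 0]
print(truncated)  # [1, 3, 4, 5]
```

Every encoded sentence ends up with the same length, which is what allows a batch to be stacked into a single tensor.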

In [None]:
from torch.utils.data import DataLoader

max_length = 64

def filter_max_length(example):
    source_tokens = tokenizer_src(bos_token + ' ' + example['translation'][src_lang] + ' ' + eos_token)
    target_tokens = tokenizer_tgt(bos_token + ' ' + example['translation'][tgt_lang] + ' ' + eos_token)
    input_length = len(source_tokens['input_ids'])
    target_length = len(target_tokens['input_ids'])
    return input_length <= max_length and target_length <= max_length

# Drop sentences that are too long
train_data = raw_datasets['train'].filter(filter_max_length)
valid_data = raw_datasets['validation'].filter(filter_max_length)
test_data = raw_datasets['test'].filter(filter_max_length)

print(f"After filtering, train samples: {len(train_data)}")
print(f"After filtering, validation samples: {len(valid_data)}")
print(f"After filtering, test samples: {len(test_data)}")

# Preprocessing function
def preprocess_function(examples):
    inputs = [bos_token + ' ' + ex[src_lang] + ' ' + eos_token
              for ex in examples['translation']]
    targets = [bos_token + ' ' + ex[tgt_lang] + ' ' + eos_token
               for ex in examples['translation']]

    model_inputs = tokenizer_src(inputs, max_length = max_length,
                                 truncation = True, padding = 'max_length')

    labels = tokenizer_tgt(targets, max_length = max_length,
                           truncation = True, padding = 'max_length')

    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# Preprocess the datasets
train_dataset = train_data.map(preprocess_function, batched = True,
                               remove_columns = train_data.column_names)
valid_dataset = valid_data.map(preprocess_function, batched = True,
                               remove_columns = valid_data.column_names)
test_dataset = test_data.map(preprocess_function, batched = True,
                             remove_columns=test_data.column_names)

def collate_fn(batch):
    input_ids = torch.tensor([item['input_ids'] for item in batch], dtype = torch.long)
    labels = torch.tensor([item['labels'] for item in batch], dtype = torch.long)
    return {'input_ids': input_ids, 'labels': labels}

batch_size = 64
train_dataloader = DataLoader(train_dataset, batch_size = batch_size,
                              shuffle = True, collate_fn = collate_fn)
valid_dataloader = DataLoader(valid_dataset, batch_size = batch_size,
                              collate_fn = collate_fn)
test_dataloader = DataLoader(test_dataset, batch_size = batch_size,
                             collate_fn = collate_fn)


After filtering, train samples: 49466
After filtering, validation samples: 1148
After filtering, test samples: 1731


## 2. Model structure   
### Writing the model structure   
- Typically, the modules are declared in the class's \_\_init\_\_ function,
- and the overall flow is then written in the forward function   
- The most widely used encoder-decoder structure is shown below

<img src = "https://nkw011.github.io/assets/image/seq_to_seq/seq2.png" width = "80%" height = "70%">   

Source: <https://nkw011.github.io/nlp/seqtoseq/>   
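
To make the `__init__` / `forward` split concrete, the following is a minimal sketch with tiny, arbitrary dimensions. `TinySeq2Seq` is a hypothetical name used only here; it follows the same pattern as the model defined next, but shares a single embedding for brevity.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        # Modules are declared here...
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src, tgt):
        # ...and the data flow is written here.
        # Encoder: keep only the final hidden state, (1, batch, hidden_dim)
        _, hidden = self.rnn(self.embed(src))
        # Decoder: start from the encoder's hidden state
        outputs, _ = self.rnn(self.embed(tgt), hidden)  # (batch, tgt_len, hidden_dim)
        return self.fc(outputs)                         # (batch, tgt_len, vocab_size)

tiny = TinySeq2Seq(vocab_size=20, embed_dim=8, hidden_dim=16)
src = torch.randint(0, 20, (2, 5))  # batch of 2, source length 5
tgt = torch.randint(0, 20, (2, 7))  # batch of 2, target length 7
logits = tiny(src, tgt)
print(logits.shape)  # torch.Size([2, 7, 20])
```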


In [None]:
# Model definition (baseline seq2seq RNN model)
class Seq2SeqRNN(nn.Module):
    def __init__(self,
                 input_dim,
                 output_dim,
                 embed_dim,
                 hidden_dim,
                 padding_idx):

        super(Seq2SeqRNN, self).__init__()
        self.encoder = nn.Embedding(input_dim, embed_dim,
                                    padding_idx = padding_idx)
        self.decoder = nn.Embedding(output_dim, embed_dim,
                                    padding_idx = padding_idx)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first = True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, src, tgt):
        # Encoder
        embedded_src = self.encoder(src)
        _, hidden = self.rnn(embedded_src)
        # Decoder
        embedded_tgt = self.decoder(tgt)
        outputs, _ = self.rnn(embedded_tgt, hidden)
        outputs = self.fc(outputs)
        return outputs

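## 3-1. Defining the training function, loss function, and optimizer   
Training function   
- Updates are performed batch by batch
- The error and gradients are computed, then the parameters are updated
- Classification task -> cross entropy loss is used
 - The model is trained to classify (predict) the correct token
 - When computing the cross entropy, padding_idx is passed so that padded positions are excluded from the loss

- Adam is used as the optimizer
 - [pytorch optimizer] <https://pytorch.org/docs/stable/optim.html>
 - Other optimizers such as SGD and RMSprop can be used as well


Train the model using the loss and the optimizer!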
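
The effect of excluding padded positions can be checked on a small example. This is a sketch with random logits and a made-up `pad_id` of 0; it only shows that `ignore_index` gives the same loss as dropping the padded positions by hand.

```python
import torch
import torch.nn as nn

pad_id = 0                            # assumed padding id for this sketch
logits = torch.randn(4, 5)            # 4 token positions, 5 classes
targets = torch.tensor([2, 3, 0, 0])  # the last two positions are padding

# Loss with padded positions ignored
loss_ignore = nn.CrossEntropyLoss(ignore_index=pad_id)(logits, targets)
# Loss computed only on the two real tokens
loss_manual = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(torch.allclose(loss_ignore, loss_manual))  # True
```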



In [None]:
input_dim = len(tokenizer_src)
output_dim = len(tokenizer_tgt)
embed_dim = 256
hidden_dim = 512
padding_idx = tokenizer_src.pad_token_id

# Create the model object
model = Seq2SeqRNN(input_dim, output_dim, embed_dim, hidden_dim, padding_idx)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss(ignore_index = padding_idx)
optimizer = optim.Adam(model.parameters(), lr = 0.001)

# Training function
def train(model, dataloader, optimizer, criterion, epoch, log_interval = 100):
    model.train()
    total_loss = 0
    epoch_loss = 0
    total_steps = 0

    interval_loss = 0
    interval_correct = 0
    interval_total = 0

    total_batches = len(dataloader)

    for batch_idx, batch in enumerate(dataloader):
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device)

        optimizer.zero_grad()
        output = model(src, tgt[:, :-1])  # decoder input: drop the last token
        output = output.reshape(-1, output_dim)
        tgt = tgt[:, 1:].reshape(-1)  # prediction target: drop the first token

        loss = criterion(output, tgt)
        loss.backward()
        optimizer.step()

        loss_value = loss.item()
        total_loss += loss_value
        epoch_loss += loss_value
        total_steps += 1

        interval_loss += loss_value

        predictions = output.argmax(dim=1)
        mask = tgt != padding_idx
        correct = (predictions == tgt) & mask
        interval_correct += correct.sum().item()
        interval_total += mask.sum().item()

        if total_steps % log_interval == 0:
            avg_loss = interval_loss / log_interval
            accuracy = interval_correct / interval_total if interval_total > 0 else 0
            progress = ((batch_idx + 1) / total_batches) * 100
            print(f'Epoch [{epoch+1}/{num_epochs}], '
                  f'Progress: {progress:.2f}%, Loss: {avg_loss:.4f}, '
                  f'Accuracy: {accuracy*100:.2f}%')
            interval_loss = 0
            interval_correct = 0
            interval_total = 0

    avg_epoch_loss = epoch_loss / total_steps
    print(f'Epoch {epoch+1} Completed. Average Loss: {avg_epoch_loss:.4f}')

    # Validation
    model.eval()
    total_val_loss = 0
    total_correct = 0
    total_tokens = 0
    with torch.no_grad():
        for batch in valid_dataloader:
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            output = model(src, tgt[:, :-1])
            output = output.reshape(-1, output_dim)
            tgt = tgt[:, 1:].reshape(-1)

            loss = criterion(output, tgt)
            total_val_loss += loss.item()

            # Measure accuracy
            predictions = output.argmax(dim=1)
            mask = tgt != padding_idx
            correct = (predictions == tgt) & mask
            total_correct += correct.sum().item()
            total_tokens += mask.sum().item()

    avg_val_loss = total_val_loss / len(valid_dataloader)
    accuracy = total_correct / total_tokens if total_tokens > 0 else 0
    print(f"Epoch {epoch + 1}/{num_epochs}, "
          f"Validation Loss: {avg_val_loss:.4f}, "
          f"Validation Accuracy: {accuracy*100:.2f}%")

    return avg_epoch_loss

# Run on a T4 GPU
num_epochs = 10
for epoch in range(num_epochs):
    loss = train(model, train_dataloader, optimizer, criterion, epoch, log_interval = 50)

Epoch [1/10], Progress: 6.47%, Loss: 6.1769, Accuracy: 20.75%
Epoch [1/10], Progress: 12.94%, Loss: 5.0650, Accuracy: 27.08%
Epoch [1/10], Progress: 19.40%, Loss: 4.8000, Accuracy: 29.06%
Epoch [1/10], Progress: 25.87%, Loss: 4.7222, Accuracy: 29.56%
Epoch [1/10], Progress: 32.34%, Loss: 4.6514, Accuracy: 29.86%
Epoch [1/10], Progress: 38.81%, Loss: 4.5525, Accuracy: 30.84%
Epoch [1/10], Progress: 45.28%, Loss: 4.5143, Accuracy: 30.97%
Epoch [1/10], Progress: 51.75%, Loss: 4.5202, Accuracy: 30.78%
Epoch [1/10], Progress: 58.21%, Loss: 4.4559, Accuracy: 31.12%
Epoch [1/10], Progress: 64.68%, Loss: 4.3982, Accuracy: 31.50%
Epoch [1/10], Progress: 71.15%, Loss: 4.3522, Accuracy: 31.87%
Epoch [1/10], Progress: 77.62%, Loss: 4.3578, Accuracy: 31.59%
Epoch [1/10], Progress: 84.09%, Loss: 4.3491, Accuracy: 31.71%
Epoch [1/10], Progress: 90.56%, Loss: 4.2964, Accuracy: 32.33%
Epoch [1/10], Progress: 97.02%, Loss: 4.2612, Accuracy: 32.74%
Epoch 1 Completed. Average Loss: 4.6214
Epoch 1/10, Vali

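### Applying attention (<b>assignment</b>)
Let's apply attention to the existing seq2seq RNN!

How do we apply attention?
1. Compute the attention scores   
  - The product of the encoder outputs and the decoder outputs is used as the score
2. Compute the attention weights   
  - Apply softmax to the attention scores so that the weights sum to 1   
3. Compute the context vectors   
  - Multiply the attention weights with the encoder outputs (matrix multiplication)
  - This focuses (attends) only on the parts of the encoder outputs that matter
4. Combine the context vectors with the decoder outputs   
  - Among the many ways to combine vectors, concatenation is used here   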
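
The four steps above can be sketched on random tensors, outside of any model. This is one possible reading of the steps (plain dot-product attention); the tensor sizes are arbitrary and the variable names are chosen for this example only.

```python
import torch
import torch.nn.functional as F

batch, src_len, tgt_len, hidden = 2, 5, 3, 4  # arbitrary sizes
encoder_outputs = torch.randn(batch, src_len, hidden)
decoder_outputs = torch.randn(batch, tgt_len, hidden)

# 1. Scores: dot product of every decoder step with every encoder step
scores = torch.bmm(decoder_outputs, encoder_outputs.transpose(1, 2))  # (B, tgt, src)
# 2. Weights: softmax over the source positions, so each row sums to 1
weights = F.softmax(scores, dim=-1)
# 3. Context: weighted sum of the encoder outputs
context = torch.bmm(weights, encoder_outputs)                         # (B, tgt, hidden)
# 4. Combine: concatenate along the feature dimension
combined = torch.cat((context, decoder_outputs), dim=2)               # (B, tgt, 2 * hidden)

print(scores.shape, context.shape, combined.shape)
```

The doubled feature size after concatenation is why the final linear layer of an attention model takes `hidden_dim * 2` inputs.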

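### Implementation notes
#### What does torch.bmm do? (batch matrix-matrix product)
 - For 3-dimensional tensors whose first dimension is the batch, it multiplies matrix A by matrix B within each batch entry
 - In the 2-dimensional case, given matrix A (N x M) and matrix B (M x P),
  - multiplying the two gives (N x M) * (M x P) -> (N x P)
    - N, M, and P denote the sizes of the dimensions
 - torch.bmm <u>performs this multiplication for every entry in the batch</u>
  - Given 3-dimensional tensors A (B x N x M) and B (B x M x P),   
    - where the leading B is the batch size   
  - multiplying the two gives (B x N x M) * (B x M x P) -> (B x N x P)   
    - each output element is a dot product (inner product) of a row of A and a column of B   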

<img src = "https://velog.velcdn.com/images/kyungmin1029/post/7c0d9be2-48f1-436e-83ea-db94bf158269/image.png" width = "80%" height = "70%">   
Source: <https://velog.io/@kyungmin1029/NLP-RNN-based-Encoder-decoder>
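
A quick shape check of `torch.bmm`, as a small sketch with arbitrary sizes: the result for each batch entry matches an ordinary 2-dimensional matrix product.

```python
import torch

A = torch.randn(4, 2, 3)  # (B x N x M): batch of 4 matrices, each 2 x 3
B = torch.randn(4, 3, 5)  # (B x M x P): batch of 4 matrices, each 3 x 5

C = torch.bmm(A, B)       # (B x N x P)
print(C.shape)            # torch.Size([4, 2, 5])

# The first batch entry equals the plain matrix product A[0] @ B[0]
same = torch.allclose(C[0], A[0] @ B[0], atol=1e-6)
print(same)
```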


In [None]:
class Seq2SeqAttnRNN(nn.Module):
    def __init__(self,
                 input_dim,
                 output_dim,
                 embed_dim,
                 hidden_dim,
                 padding_idx):

        super(Seq2SeqAttnRNN, self).__init__()
        self.encoder = nn.Embedding(input_dim, embed_dim,
                                    padding_idx = padding_idx)
        self.decoder = nn.Embedding(output_dim, embed_dim,
                                    padding_idx = padding_idx)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)  # Adjusted for attention

    def forward(self, src, tgt):
        # Encoder
        embedded_src = self.encoder(src)
        encoder_outputs, hidden = self.rnn(embedded_src)

        # Decoder
        embedded_tgt = self.decoder(tgt)
        decoder_outputs, _ = self.rnn(embedded_tgt, hidden)

        #### Apply the attention mechanism
        # 1. Compute the attention scores
        # Multiply the decoder outputs by the encoder outputs
        attention_scores = torch.bmm(decoder_outputs, encoder_outputs.transpose(1, 2))

        # 2. Compute the attention weights
        # Apply softmax to the attention scores
        # WRITE YOUR CODE HERE!
        attention_weights = "WRITE YOUR CODE!!!"

        # 3. Apply attention to get the context vectors
        # Batched matrix product
        # Multiply the attention weights by the encoder outputs to compute the context vectors
        # WRITE YOUR CODE HERE!
        context = torch.bmm("WRITE YOUR CODE!!!", "WRITE YOUR CODE!!!")

        # 4. Combine the context vectors with the decoder outputs
        # Concatenation is used to combine them
        # WRITE YOUR CODE HERE!
        concat = torch.cat(("WRITE YOUR CODE!!!", "WRITE YOUR CODE!!!"), dim = 2)

        # Final output
        outputs = self.fc(concat)  # (batch_size, tgt_seq_len, output_dim)

        return outputs

Train the encoder-decoder model with attention applied   
Create a new object from Seq2SeqAttnRNN and train it!


In [None]:
input_dim = len(tokenizer_src)
output_dim = len(tokenizer_tgt)
embed_dim = 256
hidden_dim = 512
padding_idx = tokenizer_src.pad_token_id

# Create the model object
att_model = Seq2SeqAttnRNN(input_dim, output_dim, embed_dim, hidden_dim, padding_idx)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
att_model.to(device)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss(ignore_index = padding_idx)
optimizer = optim.Adam(att_model.parameters(), lr = 0.001)

# Training function
def train(model, dataloader, optimizer, criterion, epoch, log_interval = 100):
    model.train()
    total_loss = 0
    epoch_loss = 0
    total_steps = 0

    interval_loss = 0
    interval_correct = 0
    interval_total = 0

    total_batches = len(dataloader)

    for batch_idx, batch in enumerate(dataloader):
        src = batch['input_ids'].to(device)
        tgt = batch['labels'].to(device)

        optimizer.zero_grad()
        output = model(src, tgt[:, :-1])  # decoder input: drop the last token
        output = output.reshape(-1, output_dim)
        tgt = tgt[:, 1:].reshape(-1)  # prediction target: drop the first token

        loss = criterion(output, tgt)
        loss.backward()
        optimizer.step()

        loss_value = loss.item()
        total_loss += loss_value
        epoch_loss += loss_value
        total_steps += 1

        interval_loss += loss_value

        predictions = output.argmax(dim=1)
        mask = tgt != padding_idx
        correct = (predictions == tgt) & mask
        interval_correct += correct.sum().item()
        interval_total += mask.sum().item()

        if total_steps % log_interval == 0:
            avg_loss = interval_loss / log_interval
            accuracy = interval_correct / interval_total if interval_total > 0 else 0
            progress = ((batch_idx + 1) / total_batches) * 100
            print(f'Epoch [{epoch+1}/{num_epochs}], '
                  f'Progress: {progress:.2f}%, Loss: {avg_loss:.4f}, '
                  f'Accuracy: {accuracy*100:.2f}%')
            interval_loss = 0
            interval_correct = 0
            interval_total = 0

    avg_epoch_loss = epoch_loss / total_steps
    print(f'Epoch {epoch+1} Completed. Average Loss: {avg_epoch_loss:.4f}')

    # Validation
    model.eval()
    total_val_loss = 0
    total_correct = 0
    total_tokens = 0
    with torch.no_grad():
        for batch in valid_dataloader:
            src = batch['input_ids'].to(device)
            tgt = batch['labels'].to(device)

            output = model(src, tgt[:, :-1])
            output = output.reshape(-1, output_dim)
            tgt = tgt[:, 1:].reshape(-1)

            loss = criterion(output, tgt)
            total_val_loss += loss.item()

            # Measure accuracy
            predictions = output.argmax(dim=1)
            mask = tgt != padding_idx
            correct = (predictions == tgt) & mask
            total_correct += correct.sum().item()
            total_tokens += mask.sum().item()

    avg_val_loss = total_val_loss / len(valid_dataloader)
    accuracy = total_correct / total_tokens if total_tokens > 0 else 0
    print(f"Epoch {epoch + 1}/{num_epochs}, "
          f"Validation Loss: {avg_val_loss:.4f}, "
          f"Validation Accuracy: {accuracy*100:.2f}%")

    return avg_epoch_loss

# Run on a T4 GPU
num_epochs = 10
for epoch in range(num_epochs):
    loss = train(att_model, train_dataloader, optimizer, criterion, epoch, log_interval = 50)

Epoch [1/10], Progress: 6.47%, Loss: 6.0960, Accuracy: 21.57%
Epoch [1/10], Progress: 12.94%, Loss: 5.0692, Accuracy: 27.19%
Epoch [1/10], Progress: 19.40%, Loss: 4.8333, Accuracy: 29.21%
Epoch [1/10], Progress: 25.87%, Loss: 4.6692, Accuracy: 30.78%
Epoch [1/10], Progress: 32.34%, Loss: 4.5474, Accuracy: 32.60%
Epoch [1/10], Progress: 38.81%, Loss: 4.4500, Accuracy: 33.86%
Epoch [1/10], Progress: 45.28%, Loss: 4.3781, Accuracy: 35.19%
Epoch [1/10], Progress: 51.75%, Loss: 4.2522, Accuracy: 37.13%
Epoch [1/10], Progress: 58.21%, Loss: 4.1890, Accuracy: 37.83%
Epoch [1/10], Progress: 64.68%, Loss: 4.0805, Accuracy: 39.63%
Epoch [1/10], Progress: 71.15%, Loss: 4.0483, Accuracy: 39.45%
Epoch [1/10], Progress: 77.62%, Loss: 4.0119, Accuracy: 40.50%
Epoch [1/10], Progress: 84.09%, Loss: 3.9720, Accuracy: 40.74%
Epoch [1/10], Progress: 90.56%, Loss: 3.8613, Accuracy: 41.78%
Epoch [1/10], Progress: 97.02%, Loss: 3.9081, Accuracy: 41.18%
Epoch 1 Completed. Average Loss: 4.4060
Epoch 1/10, Vali

In [None]:
def translate(model, sentence):
    model.eval()
    with torch.no_grad():
        src = bos_token + ' ' + sentence + ' ' + eos_token
        src_tokenized = tokenizer_src(src, return_tensors = 'pt',
                                      max_length = 128,
                                      truncation = True,
                                      padding = 'max_length')
        # Move the input to the same device as the model
        src_input_ids = src_tokenized['input_ids'].to(device) # shape: [1, src_seq_len]

        # Build the context vector with the encoder
        embedded_src = model.encoder(src_input_ids)
        _, hidden = model.rnn(embedded_src)

        # Initial decoder input (the BOS token)
        tgt_input = torch.tensor([[tokenizer_tgt.bos_token_id]],
                                 dtype = torch.long, device = device)
        translated_tokens = []

        for _ in range(50):  # maximum generation length
            # shape: [1, seq_len, embed_dim]
            embedded_tgt = model.decoder(tgt_input)
            # The whole prefix is re-fed at each step, so always start from the
            # encoder's hidden state; output shape: [1, seq_len, hidden_dim]
            output, _ = model.rnn(embedded_tgt, hidden)
            # Output at the last time step, shape: [1, output_dim]
            output = model.fc(output[:, -1, :])
            pred_token = output.argmax(1)  # shape: [1]

            translated_tokens.append(pred_token.item())
            if pred_token.item() == tokenizer_tgt.eos_token_id:
                break

            # Reshape pred_token to [1, 1]
            pred_token = pred_token.unsqueeze(1)  # shape: [1, 1]
            # Append pred_token to tgt_input (the tokens predicted so far)
            tgt_input = torch.cat([tgt_input, pred_token], dim=1)  # shape: [1, seq_len + 1]

        translated_sentence = tokenizer_tgt.decode(translated_tokens, skip_special_tokens=True)
        return translated_sentence


# Translate an example sentence
sample_sentence = "Olá, como você está?"
translation = translate(model, sample_sentence)
print(f'Translation: {translation}')


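The code above is only meant as a simple hands-on demonstration of autoregressive decoding   
The translation quality will be very poor...   
For better results,
1. the model size needs to be much larger
2. the model needs to be trained for much longer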
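
For readers new to the term, autoregressive decoding means that each predicted token is fed back in to predict the next one. The loop can be sketched without any trained model; the score table `next_token_scores` below is a made-up stand-in for a model, and the token ids are hypothetical.

```python
import torch

# Hypothetical toy "model": row i holds the next-token scores after token i.
BOS, EOS = 0, 3
next_token_scores = torch.tensor([
    [0.0, 5.0, 1.0, 0.0],  # after BOS, token 1 scores highest
    [0.0, 0.0, 5.0, 1.0],  # after token 1, token 2 scores highest
    [0.0, 1.0, 0.0, 5.0],  # after token 2, EOS scores highest
    [0.0, 0.0, 0.0, 5.0],  # after EOS (never reached)
])

def greedy_decode(max_steps=10):
    tokens = [BOS]
    for _ in range(max_steps):
        # Pick the highest-scoring next token given the last generated token
        next_id = next_token_scores[tokens[-1]].argmax().item()
        tokens.append(next_id)
        if next_id == EOS:  # stop once the end-of-sentence token is produced
            break
    return tokens

decoded = greedy_decode()
print(decoded)  # [0, 1, 2, 3]
```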