# Spanish-to-English translation with a sequence-to-sequence Transformer

- Source Language: Spanish (`¿A dónde conduce este camino?`)
- Target Language: English (`Where does this road lead?`)


[Original Keras Code](https://keras.io/examples/nlp/neural_machine_translation_with_transformer/) : **English-to-Spanish** translation with a sequence-to-sequence Transformer 


To make the results easier to interpret, this notebook reverses the translation direction and implements Spanish-to-English.

</br>


References
- https://pytorch.org/text/0.11.0/vocab.html#build-vocab-from-iterator
- https://github.com/hwk0702/keras2torch/blob/main/Natural_Language_Processing/Extra/TorchText_introduction_KJS.ipynb
- https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html

## 0. Setup, Import

Required packages: `torchtext`, `spacy`, and the spaCy models `en_core_web_sm` and `es_core_news_sm`.

In [None]:
#!pip install torchtext
#!pip install -U spacy
#!python -m spacy download en_core_web_sm
#!python -m spacy download es_core_news_sm

In [1]:
import string
import re
import os
import random
import pathlib
import numpy as np

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from torchtext.utils import download_from_url

---

## 1. Downloading the data

Anki: https://www.manythings.org/anki/

- Sentence-pair datasets that pair English with another language

- Line format: English + TAB + The Other Language + TAB + Attribution

> This work isn't easy.	この仕事は簡単じゃない。	CC-BY 2.0 (France) Attribution: tatoeba.org #3737550 (CK) & #7977622 (Ninja)

> Those are sunflowers.	それはひまわりです。	CC-BY 2.0 (France) Attribution: tatoeba.org #441940 (CK) & #205407 (arnab)
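A minimal sketch of how one line in this format can be split into its fields. `parse_anki_line` is a hypothetical helper that is not part of the notebook, and it assumes the tab-separated layout shown above. The attribution field is treated as optional, because the `spa.txt` file loaded in section 2.1 is unpacked into two fields only.

```python
def parse_anki_line(line: str):
    """Split one tab-separated line into (english, other, attribution).

    Hypothetical helper: attribution is None when the line has only two fields.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"expected at least 2 tab-separated fields, got {len(fields)}")
    english, other = fields[0], fields[1]
    attribution = fields[2] if len(fields) > 2 else None
    return english, other, attribution


sample = ("Those are sunflowers.\tそれはひまわりです。\t"
          "CC-BY 2.0 (France) Attribution: tatoeba.org #441940 (CK) & #205407 (arnab)")
english, other, attribution = parse_anki_line(sample)
print(english)      # the English sentence
print(other)        # the sentence in the other language
print(attribution)  # the license and attribution text
```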


### 1.1. Create a folder for the data

In [2]:
import os

path = 'data/' # data folder inside the current directory

os.makedirs(path, exist_ok=True) # create it only if it does not exist yet

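### 1.2. Download the data archive

Use `download_from_url` from `torchtext.utils` to fetch `spa-eng.zip`. Note that the URL used below points to a copy hosted on storage.googleapis.com rather than to the Anki site itself.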

In [3]:
from torchtext.utils import download_from_url

url = 'http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip'
download_from_url(url, path=None, root=path, overwrite=False, hash_value=None, hash_type='sha256')

100%|██████████████████████████████████████████████████████████████████████████████| 2.64M/2.64M [00:00<00:00, 11.1MB/s]


'/home/yookyung/codes/data/spa-eng.zip'

### 1.3. Extract the archive with zipfile

In [4]:
import zipfile

# path is the 'data/' folder defined above; the with-block closes the archive afterwards
with zipfile.ZipFile(path + "spa-eng.zip") as zip_file:
    zip_file.extractall(path)

---

## 2. Data preprocessing

### 2.1. Loading the data

Read the txt file line by line and build a tuple for each Spanish-English pair.

In [2]:
import pathlib

path = 'data/'
pth = path+ "spa-eng/spa.txt"
text_file = pathlib.Path(pth)

with open(text_file, encoding="utf-8") as f: # explicit encoding so accented characters load the same on every platform
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    text_pairs.append((spa, eng))

In [3]:
for _ in range(5):
    print(random.choice(text_pairs))

('Anoche tuve un sueño extraño.', 'I had a strange dream last night.')
('Tal vez venga.', 'Perhaps he will come.')
('Esta estufa quema aceite.', 'This stove burns oil.')
('Tom no quiere estar cerca de Mary.', "Tom doesn't want to be around Mary.")
('Oí a alguien decir mi nombre desde detrás.', 'I heard someone call my name from behind.')


Each sentence pair is stored as a (Spanish, English) tuple.

### 2.2. Split into train, validation, and test sets

In [4]:
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

118964 total pairs
83276 training pairs
17844 validation pairs
17844 test pairs
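The split sizes printed above follow directly from the arithmetic in the cell: 15% of the pairs (rounded down) go to validation, the same number to test, and the rest to training. A small sketch that reproduces the numbers; `split_sizes` is a hypothetical helper introduced only for illustration.

```python
def split_sizes(n_total: int, val_fraction: float = 0.15):
    """Return (n_train, n_val, n_test) using the same arithmetic as the cell above."""
    n_val = int(val_fraction * n_total)   # rounded down
    n_train = n_total - 2 * n_val         # validation and test get n_val each
    n_test = n_total - n_train - n_val
    return n_train, n_val, n_test


print(split_sizes(118964))  # (83276, 17844, 17844)
```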


In [3]:
test_pairs[:10]

[('La primavera está a la vuelta de la esquina.',
  'Spring is just around the corner.'),
 ('Ellos no son mis verdaderos padres.', "They aren't my real parents."),
 ('Él no tenía más opción que huir.', 'He had no choice but to run away.'),
 ('Por favor, esperarme.', 'Please wait for me.'),
 ('No tengo ni idea de lo que está pasando.',
  "I've no idea what's happening."),
 ('No tengo tanto dinero como crees.',
  "I don't have as much money as you think."),
 ('Él mantuvo su promesa.', 'He kept his promise.'),
 ('La Guerra de 1812 había comenzado.', 'The War of 1812 had begun.'),
 ('Alguien llamó a la puerta.', 'Somebody knocked at the door.'),
 ('Quiero más dinero.', 'I want more money.')]

### 2.3. Define the tokenizers and build the vocabularies

Recent `torchtext` versions: `get_tokenizer`, `build_vocab_from_iterator`
- For the differences from older versions (which define a Field and preprocess step by step), see this [github notebook](https://github.com/hwk0702/keras2torch/blob/main/Natural_Language_Processing/Extra/TorchText_introduction_KJS.ipynb)

In [None]:
#!pip install -U spacy
#!python -m spacy download en_core_web_sm
#!python -m spacy download es_core_news_sm

In [6]:
from torchtext.data.utils import get_tokenizer

src_lang = 'spanish'
trg_lang = 'english'

tokenizer = {}

tokenizer[src_lang] = get_tokenizer('spacy', language = "es_core_news_sm") #spanish #español
tokenizer[trg_lang] = get_tokenizer('spacy', language = "en_core_web_sm") # english

### spaCy tokenizer reference: https://yujuwon.tistory.com/entry/spaCy-%EC%82%AC%EC%9A%A9%ED%95%98%EA%B8%B0-tokenization

`build_vocab_from_iterator`: Build a Vocab from an iterator.


```python
torchtext.vocab.build_vocab_from_iterator(iterator: Iterable, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True) 
```

- iterator – Iterator used to build Vocab. Must yield list or iterator of tokens.

- min_freq – The minimum frequency needed to include a token in the vocabulary.

- specials – Special symbols to add. The order of supplied tokens will be preserved.

- special_first – Indicates whether to insert symbols at the beginning or at the end.
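To make the roles of `min_freq`, `specials`, and `special_first` concrete, here is a pure-Python sketch of the idea. It is not torchtext's implementation: `build_vocab_sketch` is a hypothetical function, and the ordering it uses (descending frequency, ties broken alphabetically) is an assumption of the sketch.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def build_vocab_sketch(token_lists: Iterable[List[str]], min_freq: int = 1,
                       specials: Sequence[str] = (), special_first: bool = True) -> Dict[str, int]:
    """Map each kept token to an integer index (illustrative only)."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    for special in specials:
        counter.pop(special, None)  # specials are added explicitly below

    # Assumption of this sketch: most frequent first, ties in alphabetical order
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, freq in ordered if freq >= min_freq]

    itos = list(specials) + kept if special_first else kept + list(specials)
    return {tok: idx for idx, tok in enumerate(itos)}


corpus = [["el", "gato", "duerme"], ["el", "perro", "duerme"], ["el", "gato"]]
stoi = build_vocab_sketch(corpus, min_freq=2, specials=["<unk>", "<pad>"])
print(stoi)  # {'<unk>': 0, '<pad>': 1, 'el': 2, 'duerme': 3, 'gato': 4}
```

`perro` occurs only once, so `min_freq=2` drops it; at lookup time such a token would fall back to `<unk>`.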

In [7]:
from typing import Iterable, Iterator, List

# generator: a function that creates an iterator
### an iterator class has to implement __iter__ and __next__ (or __getitem__),
### whereas a generator only needs the yield keyword inside a function
### useful for large loops, because items are produced one at a time instead of being held in memory all at once
##### reference 1: https://dojang.io/mod/page/view.php?id=2412
##### reference 2: https://nirsa.tistory.com/118

def yield_tokens(data: Iterable, language: str) -> Iterator[List[str]]:
    
    language_index = {src_lang: 0, trg_lang: 1}
    
    for sample in data:
        yield tokenizer[language](sample[language_index[language]])
        
        # each sample is a (Spanish, English) sentence pair
        # ex) tokenizer[src_lang](sample[0]) tokenizes the Spanish sentence of the pair
    

In [8]:
result = yield_tokens(train_pairs[:5], 'spanish')
print(result)
print(type(result))

<generator object yield_tokens at 0x7fefe036f040>
<class 'generator'>


In [9]:
# print the tokenization results
for i in range(5):
    print(next(result))

['Esto', 'es', 'totalmente', 'inaceptable', '.']
['Pensé', 'que', 'eras', 'un', 'hombre', 'de', 'honor', '.']
['Somos', 'las', 'primeras', 'en', 'llegar', '.']
['No', 'nos', 'desanimemos', 'ahora', '.']
['Dependemos', 'de', 'usted', '.']
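The generator above does no work until a value is requested. A minimal sketch of that lazy behaviour, using whitespace splitting as a stand-in for the spaCy tokenizer (an assumption made so the example runs without any model download); `lazy_tokens` is a hypothetical name.

```python
def lazy_tokens(sentences):
    """Yield one token list per sentence, tokenizing only when asked."""
    for sentence in sentences:
        print(f"tokenizing: {sentence!r}")  # shows when the work actually happens
        yield sentence.split()              # stand-in for the spaCy tokenizer


gen = lazy_tokens(["Esto es todo .", "Dependemos de usted ."])
# Nothing has been printed yet: creating the generator runs no code in its body.

first = next(gen)  # tokenizes only the first sentence
rest = list(gen)   # tokenizes the remaining sentences on demand
print(first)
print(rest)
```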


In [10]:
from torchtext.vocab import build_vocab_from_iterator

vocab_dict = {} # vocabularies, one per language

special_tokens = ['<unk>', '<pad>', '<bos>', '<eos>']
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3

for language in [src_lang, trg_lang]:
    # create a torchtext Vocab object for each language
    vocab_dict[language] = build_vocab_from_iterator(yield_tokens(train_pairs, language), min_freq=2, specials=special_tokens, special_first=True)

In [11]:
# set UNK_IDX as the default index -> returned when a token is not found in the vocabulary
## without this setting, looking up an unknown token raises a RuntimeError

for language in [src_lang, trg_lang]:
    vocab_dict[language].set_default_index(UNK_IDX) #unk_idx=0

In [12]:
vocab_dict

{'spanish': Vocab(), 'english': Vocab()}

A Vocab instance is created for each language (reference: https://pytorch.org/text/0.11.0/vocab.html#torchtext.vocab.Vocab )
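The fallback configured with `set_default_index` above can be pictured with a plain dictionary. This is only a sketch of the behaviour, not how `Vocab` is implemented; `toy_stoi` and `lookup_indices` are hypothetical names.

```python
UNK_IDX = 0
toy_stoi = {"<unk>": 0, "<pad>": 1, "<bos>": 2, "<eos>": 3, "I": 4, "want": 5, "money": 6}


def lookup_indices(tokens, stoi, default_index=UNK_IDX):
    """Map tokens to indices; out-of-vocabulary tokens get default_index."""
    return [stoi.get(token, default_index) for token in tokens]


ids = lookup_indices(["I", "want", "sunflowers"], toy_stoi)
print(ids)  # [4, 5, 0] -> 'sunflowers' is not in the toy vocabulary
```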

In [5]:
eng_vocab = vocab_dict[trg_lang].get_stoi()

sorted(eng_vocab.items(), key = lambda x: x[1])[:16]

[('<unk>', 0),
 ('<pad>', 1),
 ('<bos>', 2),
 ('<eos>', 3),
 ('.', 4),
 ('I', 5),
 ('to', 6),
 ('the', 7),
 ('Tom', 8),
 ('you', 9),
 ('a', 10),
 ('?', 11),
 ('is', 12),
 ("n't", 13),
 ("'s", 14),
 ('in', 15)]

Special tokens such as \<unk> and \<pad> occupy the first indices.

In [14]:
print(f"Spanish vocabulary size: {len(vocab_dict[src_lang])}")
print(f"English vocabulary size: {len(vocab_dict[trg_lang])}")

Spanish vocabulary size: 13640
English vocabulary size: 8319


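A likely reason the Spanish vocabulary is larger: Spanish is more heavily inflected, so endings and word forms vary more (for example, with grammatical person).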

### 2.4. Converting data to tensors (vectorization)

In [15]:
from torch.nn.utils.rnn import pad_sequence

def txt_to_tensor(text, language):
    tokens = tokenizer[language](text)
    token_ids = vocab_dict[language](tokens)
    return torch.cat((torch.tensor([BOS_IDX]), torch.tensor(token_ids), torch.tensor([EOS_IDX])))

`collate_fn`: collate lists of samples into batches

- custom collate_fn can be used to customize collation, e.g., padding sequential data to max length of a batch

- Reference: https://pytorch.org/docs/stable/data.html#dataloader-collate-fn
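As a small illustration of padding to the longest sequence in a batch, the sketch below pads three toy id sequences with `pad_sequence`. The token ids are made up; `batch_first=True` is used here for brevity, which gives the same layout as padding with the default and then transposing.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # same padding index as in the notebook

# three already-encoded sequences of different lengths (toy token ids)
seqs = [
    torch.tensor([2, 10, 11, 3]),
    torch.tensor([2, 12, 3]),
    torch.tensor([2, 13, 14, 15, 3]),
]

padded = pad_sequence(seqs, batch_first=True, padding_value=PAD_IDX)
print(padded.shape)  # torch.Size([3, 5]): (batch_size, longest sequence in the batch)
print(padded)
```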

In [16]:
# custom collate_fn to be used by the DataLoader

def custom_collate_fn(batch):
    src_batch, trg_batch = [],[]
    
    for src_sample, trg_sample in batch:
        src_batch.append(txt_to_tensor(src_sample, src_lang))
        trg_batch.append(txt_to_tensor(trg_sample, trg_lang))
    
    # padding
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    trg_batch = pad_sequence(trg_batch, padding_value=PAD_IDX)
    
    src_batch = src_batch.transpose(0,1)
    trg_batch = trg_batch.transpose(0,1)
    
    return src_batch, trg_batch

In [17]:
from torch.utils.data import DataLoader

## collate_fn reference: https://pytorch.org/docs/stable/data.html
train_dataloader = DataLoader(train_pairs, batch_size = 128, collate_fn = custom_collate_fn)

In [18]:
batch_ex = next(iter(train_dataloader))
print(batch_ex)
print(batch_ex[0].shape)

(tensor([[   2,  177,   15,  ...,    1,    1,    1],
        [   2,  228,    6,  ...,    1,    1,    1],
        [   2,  806,   38,  ...,    1,    1,    1],
        ...,
        [   2,   17, 1963,  ...,    1,    1,    1],
        [   2,    7,   53,  ...,    1,    1,    1],
        [   2,  116,   15,  ...,    1,    1,    1]]), tensor([[  2,  59,  12,  ...,   1,   1,   1],
        [  2,   5, 151,  ...,   1,   1,   1],
        [  2,  40,  37,  ...,   1,   1,   1],
        ...,
        [  2,  18,  66,  ...,   1,   1,   1],
        [  2,   8,  19,  ...,   1,   1,   1],
        [  2,  59,  12,  ...,   1,   1,   1]]))
torch.Size([128, 19])


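Shape: (batch_size, seq_length); bos_idx=2, pad_idx=1

In PyTorch there is no need to turn the integer index sequence into one-hot vectors before the embedding layer. The embedding layer takes the integer indices directly and returns the matching rows of its lookup table as embedding vectors, so one-hot encoding is unnecessary.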
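A short check of this equivalence, using a tiny made-up vocabulary of 6 tokens and 4-dimensional embeddings: looking up indices with `nn.Embedding` gives the same result as multiplying one-hot vectors by the embedding weight matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=6, embedding_dim=4)  # toy sizes
token_ids = torch.tensor([[2, 5, 3]])                        # (batch_size=1, seq_len=3)

looked_up = embedding(token_ids)                        # direct index lookup -> (1, 3, 4)
one_hot = F.one_hot(token_ids, num_classes=6).float()   # (1, 3, 6)
via_matmul = one_hot @ embedding.weight                 # (1, 3, 6) @ (6, 4) -> (1, 3, 4)

print(looked_up.shape)
print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids building the `(batch_size, seq_len, vocab_size)` one-hot tensor, which matters when the vocabulary has thousands of entries.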

In [19]:
print(batch_ex[0].shape[1])
print(batch_ex[1].shape[1])

19
20


---

## 3. Modeling

In [22]:
from IPython.display import Image

### Transformer

:A seq2seq model composed of an encoder and a decoder

:Architecture
    
![transformer](transformer.png)
    
- Input(source-spanish) Embedding + Positional Encoding
- Encoder block
    - **Multi-head Self-Attention**
        - The dimensions before (input) and after MSA are the same
    - Residual Connection
    - Layer Norm
    
    - **Position-wise FFN**
        - Mixes the per-head outputs
    - Residual Connection
    - Layer Norm

- Output(target-english) Embedding + Positional Encoding

- Decoder block
    - Same as the encoder, plus a few additional components
    
    - **Masked Multi-Head Self-Attention**
        - Prevents attending to future positions during decoding
    - Residual Connection
    - Layer Norm
    
    - **Encoder-Decoder Attention**
        - Query: decoder output, Key&Value: encoder output
    - Residual Connection
    - Layer Norm
    
    - **Position-wise FFN**
    - Residual Connection
    - Layer Norm
    


> I wanted to implement all of this from scratch, but for lack of time this notebook uses nn.Transformer.

In [20]:
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import Transformer
import math

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [21]:
!nvidia-smi

Sun Nov 14 15:44:39 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.00    Driver Version: 470.82.00    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 31%   37C    P8    23W / 250W |    615MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [22]:
## GPU check
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.cuda.device_count())

True
NVIDIA GeForce RTX 2080 Ti
1


### 3.1. Positional Embedding (input, output)
Token Embedding + Positional Encoding

In [43]:
class PositionalEmbedding(nn.Module):
    def __init__(self, vocab_size, emb_size, device=DEVICE, maxlen=5000):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, emb_size) # (batch_size, seq_len) -> (batch_size, seq_len, emb_size)
        
        pos = torch.arange(0,maxlen).reshape(maxlen,1).to(device) # positions, (maxlen, 1)
        two_i = torch.arange(0,emb_size,2).to(device) # even dimension indices 2i
        
        # fixed (non-trainable) sinusoidal table, (maxlen, emb_size)
        self.pos_encoding = torch.zeros((maxlen, emb_size))
        self.pos_encoding = self.pos_encoding.to(device)
        self.pos_encoding.requires_grad= False
        
        self.pos_encoding[:,0::2] = torch.sin(pos/ (10000**(two_i/emb_size))) # even indices
        self.pos_encoding[:,1::2] = torch.cos(pos/ (10000**(two_i/emb_size))) # odd indices
        
    def forward(self, x):
        # x holds token ids: (batch_size, seq_len), or (seq_len,) for a single unbatched sequence
        sen_length = x.shape[1] if x.dim() > 1 else x.shape[0]
        return self.token_embedding(x) + self.pos_encoding[:sen_length,:]
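The table built in the class follows the sinusoidal formulas PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), where d is the embedding size. A standalone sketch of just that table, on the CPU and assuming an even `emb_size`; `sinusoidal_table` is a hypothetical helper name.

```python
import torch


def sinusoidal_table(maxlen: int, emb_size: int) -> torch.Tensor:
    """Return a (maxlen, emb_size) table of sinusoidal positional encodings."""
    pos = torch.arange(maxlen, dtype=torch.float32).unsqueeze(1)  # (maxlen, 1)
    two_i = torch.arange(0, emb_size, 2, dtype=torch.float32)     # (emb_size / 2,)
    angles = pos / (10000 ** (two_i / emb_size))                  # (maxlen, emb_size / 2)

    table = torch.zeros(maxlen, emb_size)
    table[:, 0::2] = torch.sin(angles)  # even dimensions
    table[:, 1::2] = torch.cos(angles)  # odd dimensions
    return table


table = sinusoidal_table(maxlen=50, emb_size=8)
print(table.shape)
print(table[0])  # position 0: sin(0) = 0 in even dims, cos(0) = 1 in odd dims
```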

### 3.2. Transformer

`torch.nn.Transformer`: https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html

In [44]:
class seq2seqTransformer(nn.Module):
    def __init__(self, n_head, n_enc_layers, n_dec_layers,
                emb_size, src_vocab_size, trg_vocab_size,
                d_ffn, dropout=0.1):
        super().__init__()
        
        self.src_pos_embedding = PositionalEmbedding(src_vocab_size, emb_size)
        self.trg_pos_embedding = PositionalEmbedding(trg_vocab_size, emb_size)
        
        
        self.transformer = Transformer(d_model = emb_size, nhead = n_head,
                                       num_encoder_layers = n_enc_layers,
                                       num_decoder_layers = n_dec_layers,
                                       dim_feedforward = d_ffn, dropout = dropout,
                                       custom_encoder=None, custom_decoder=None,
                                       layer_norm_eps=1e-05, batch_first=True) # batch_first defaults to False
        
        self.translator = nn.Linear(emb_size, trg_vocab_size)
        
        
    def forward(self, src, trg, src_mask, trg_mask,
                src_padding_mask, trg_padding_mask,
                memory_key_padding_mask):

        src_emb = self.src_pos_embedding(src)
        trg_emb = self.trg_pos_embedding(trg)

        output = self.transformer(src_emb, trg_emb, src_mask, trg_mask, None,
                                  src_padding_mask, trg_padding_mask, memory_key_padding_mask)
        
        return self.translator(output)
    
    
    ## used at test time (for the greedy decoder)
    def encode(self, src, src_mask):
        return self.transformer.encoder(self.src_pos_embedding(src), src_mask)
    
    def decode(self, trg, memory, trg_mask):
        return self.transformer.decoder(self.trg_pos_embedding(trg), memory, trg_mask) # memory: encoder output

### 3.3. Masking

N=batch_size, S= source length, T= target length, E=embedding dim

- src: `(S, N, E)`, `(N, S, E)` if batch_first.
- tgt: `(T, N, E)`, `(N, T, E)` if batch_first.
- src_mask: `(S, S)`.
- tgt_mask: `(T, T)`.
- memory_mask: `(T, S)`.
- src_key_padding_mask: `(N, S)`.
- tgt_key_padding_mask: `(N, T)`.
- memory_key_padding_mask: `(N, S)`.
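A quick way to check these shapes is to run a tiny `nn.Transformer` on random inputs. The sizes below are arbitrary toy values, and boolean masks are used throughout (True means "do not attend"), which is one of the mask formats `nn.Transformer` accepts.

```python
import torch
import torch.nn as nn

N, S, T, E = 2, 5, 4, 16  # batch size, source length, target length, embedding dim

model = nn.Transformer(d_model=E, nhead=2, num_encoder_layers=1, num_decoder_layers=1,
                       dim_feedforward=32, dropout=0.0, batch_first=True)

src = torch.randn(N, S, E)  # (N, S, E) because batch_first=True
tgt = torch.randn(N, T, E)  # (N, T, E)

# (T, T) causal mask: True above the diagonal blocks attention to future positions
tgt_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# (N, S) and (N, T) padding masks: True marks padded positions
src_key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
src_key_padding_mask[1, -2:] = True  # pretend the last two source tokens of sample 1 are padding
tgt_key_padding_mask = torch.zeros(N, T, dtype=torch.bool)
tgt_key_padding_mask[1, -1] = True

out = model(src, tgt, tgt_mask=tgt_mask,
            src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            memory_key_padding_mask=src_key_padding_mask)
print(out.shape)  # (N, T, E)
```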

`torch.triu`

In [26]:
(torch.triu(torch.ones((10,10), device=DEVICE)) == 1).transpose(0, 1)

tensor([[ True, False, False, False, False, False, False, False, False, False],
        [ True,  True, False, False, False, False, False, False, False, False],
        [ True,  True,  True, False, False, False, False, False, False, False],
        [ True,  True,  True,  True, False, False, False, False, False, False],
        [ True,  True,  True,  True,  True, False, False, False, False, False],
        [ True,  True,  True,  True,  True,  True, False, False, False, False],
        [ True,  True,  True,  True,  True,  True,  True, False, False, False],
        [ True,  True,  True,  True,  True,  True,  True,  True, False, False],
        [ True,  True,  True,  True,  True,  True,  True,  True,  True, False],
        [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True]],
       device='cuda:0')

In [27]:
mask = (torch.triu(torch.ones((10, 10), device=DEVICE)) == 1).transpose(0, 1)
mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
mask

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')

As shown above, every position after the one being predicted is masked with -inf.
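The reason -inf works as a mask is that it is added to the attention scores before the softmax, where exp(-inf) = 0. A small sketch with toy scores that are all equal, so the resulting weights are easy to read:

```python
import torch

sz = 4
allowed = torch.tril(torch.ones(sz, sz, dtype=torch.bool))            # True on and below the diagonal
mask = torch.zeros(sz, sz).masked_fill(~allowed, float("-inf"))      # 0 where allowed, -inf elsewhere

scores = torch.zeros(sz, sz)                     # toy attention scores, all equal
weights = torch.softmax(scores + mask, dim=-1)   # masked positions get weight 0
print(weights)
```

Row `i` spreads its attention uniformly over positions `0..i` and assigns exactly zero weight to later positions.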

In [28]:
batch_ex[0]

tensor([[   2,  177,   15,  ...,    1,    1,    1],
        [   2,  228,    6,  ...,    1,    1,    1],
        [   2,  806,   38,  ...,    1,    1,    1],
        ...,
        [   2,   17, 1963,  ...,    1,    1,    1],
        [   2,    7,   53,  ...,    1,    1,    1],
        [   2,  116,   15,  ...,    1,    1,    1]])

In [29]:
batch_ex[0] == PAD_IDX

tensor([[False, False, False,  ...,  True,  True,  True],
        [False, False, False,  ...,  True,  True,  True],
        [False, False, False,  ...,  True,  True,  True],
        ...,
        [False, False, False,  ...,  True,  True,  True],
        [False, False, False,  ...,  True,  True,  True],
        [False, False, False,  ...,  True,  True,  True]])

As shown above, the padding positions of each sequence in the batch are masked.

In [45]:
## mask that prevents attending to future positions
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

## padding masks and the subsequent (causal) mask
def create_mask(src, trg):
    src_seq_len = src.shape[1]
    trg_seq_len = trg.shape[1]

    trg_mask = generate_square_subsequent_mask(trg_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX)
    trg_padding_mask = (trg == PAD_IDX)
    
    #src_padding_mask = (src == PAD_IDX).transpose(0, 1) # batch_first = False
    #trg_padding_mask = (trg == PAD_IDX).transpose(0, 1)
    return src_mask, trg_mask, src_padding_mask, trg_padding_mask

## 4. Training

In [46]:
torch.manual_seed(0) # fix the random seed

SRC_VOCAB_SIZE = len(vocab_dict[src_lang])
TRG_VOCAB_SIZE = len(vocab_dict[trg_lang])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3


## instantiate the model
transformer = seq2seqTransformer(NHEAD, NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 SRC_VOCAB_SIZE, TRG_VOCAB_SIZE, FFN_HID_DIM)

transformer = transformer.to(DEVICE)


## define the cross-entropy loss (padding positions are ignored)
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

In [47]:
from torch.utils.data import DataLoader

def train_epoch(model, optimizer):
    model.train()
    losses = 0
    train_dataloader = DataLoader(train_pairs, batch_size=BATCH_SIZE, collate_fn=custom_collate_fn)

    for src, trg in train_dataloader:
        src = src.to(DEVICE)
        trg = trg.to(DEVICE)

        trg_input = trg[:, :-1] # batch_first=True
        ## trg_input = trg[:-1, :] # batch_first=False (sent_length, batch_size)

        src_mask, trg_mask, src_padding_mask, trg_padding_mask = create_mask(src, trg_input)

        
        logits = model(src, trg_input, src_mask, trg_mask, src_padding_mask, trg_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        trg_out = trg[:, 1:]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), trg_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    return losses / len(train_dataloader)
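The slicing in `train_epoch` shifts the target by one position: the decoder is fed `trg[:, :-1]` and is trained to predict `trg[:, 1:]`, while `ignore_index=PAD_IDX` removes padded positions from the loss. A toy sketch with made-up token ids and random logits standing in for the model output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_IDX, BOS_IDX, EOS_IDX = 1, 2, 3
trg = torch.tensor([[BOS_IDX, 7, 8, EOS_IDX, PAD_IDX],
                    [BOS_IDX, 9, 5, 6, EOS_IDX]])  # (batch_size=2, seq_len=5)

trg_input = trg[:, :-1]  # what the decoder sees
trg_out = trg[:, 1:]     # what it should predict at each position
print(trg_input.tolist())
print(trg_out.tolist())

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(2, trg_out.shape[1], vocab_size)  # random stand-in for model output

loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
loss = loss_fn(logits.reshape(-1, vocab_size), trg_out.reshape(-1))

# Same loss computed by hand on the non-padding positions only
keep = trg_out.reshape(-1) != PAD_IDX
manual = F.cross_entropy(logits.reshape(-1, vocab_size)[keep], trg_out.reshape(-1)[keep])
print(torch.allclose(loss, manual))
```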

In [48]:
@torch.no_grad() # no gradients are needed for evaluation
def evaluate(model):

    model.eval()
    losses = 0

    val_dataloader = DataLoader(val_pairs, batch_size=BATCH_SIZE, collate_fn=custom_collate_fn)

    for src, trg in val_dataloader:
        src = src.to(DEVICE)
        trg = trg.to(DEVICE)

        trg_input = trg[:, :-1]

        src_mask, trg_mask, src_padding_mask, trg_padding_mask = create_mask(src, trg_input)

        logits = model(src, trg_input, src_mask, trg_mask, src_padding_mask, trg_padding_mask, src_padding_mask)

        trg_out = trg[:, 1:]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), trg_out.reshape(-1))
        losses += loss.item()

    return losses / len(val_dataloader)

In [161]:
from timeit import default_timer as timer
NUM_EPOCHS = 5

# training start!
for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer)
    end_time = timer()
    val_loss = evaluate(transformer)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))


# after training -> save the whole model to model/transformer.pth
path = "model/"
if not os.path.exists(path):
    os.mkdir(path)
torch.save(transformer, path+'transformer.pth')

Epoch: 1, Train loss: 1.471, Val loss: 1.501, Epoch time = 38.075s
Epoch: 2, Train loss: 1.326, Val loss: 1.413, Epoch time = 38.314s
Epoch: 3, Train loss: 1.206, Val loss: 1.362, Epoch time = 38.569s
Epoch: 4, Train loss: 1.100, Val loss: 1.327, Epoch time = 38.544s
Epoch: 5, Train loss: 1.008, Val loss: 1.291, Epoch time = 38.429s


## 5. Test

In [164]:
## load the saved transformer model

model = torch.load(path+'transformer.pth')

In [171]:
# greedy decoder for testing
## starting from <BOS>, feed the target tokens generated so far and append the token with the highest score
## stop once <EOS> is produced

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    
    src = src.to(DEVICE) # avoids a device-mismatch runtime error
    src_mask = src_mask.to(DEVICE) # avoids a device-mismatch runtime error

    memory = model.encode(src, src_mask) # transformer.encode
    result = torch.ones(1,1).fill_(start_symbol).type(torch.long).to(DEVICE) # initial result: a (1,1) tensor holding only BOS_IDX(2)
    
    for i in range(max_len-1):
        memory = memory.to(DEVICE) 
        trg_mask = (generate_square_subsequent_mask(result.size(0)).type(torch.bool)).to(DEVICE)
        result1 = result.transpose(0,1) ## (seq_len, 1) -> (1, seq_len); not needed when batch_first=False
        out = model.decode(result1, memory, trg_mask) # transformer.decode
        ## out = out.transpose(0, 1) ## batch_first=False
        prob = model.translator(out[:, -1]) #(1,512) -> (1,vocab_size)
        _, next_word = torch.max(prob, dim=1) # index of the largest value in the (1,vocab_size) tensor
        next_word = next_word.item() # token index
        
        result = torch.cat([result, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0) # append next_word
        
        if next_word == EOS_IDX:
            break
            
    return result
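The control flow of greedy decoding can be seen without a trained model. In the sketch below a hand-written score table stands in for the decoder; the table and `greedy_decode_sketch` are hypothetical, and unlike the Transformer, which conditions on the whole prefix, this toy model looks only at the previous token.

```python
BOS, EOS = "<bos>", "<eos>"

# Hypothetical next-token scores keyed by the previous token
toy_scores = {
    "<bos>": {"I": 0.7, "You": 0.3},
    "I": {"want": 0.6, "saw": 0.4},
    "want": {"money": 0.8, "<eos>": 0.2},
    "money": {"<eos>": 0.9, "I": 0.1},
}


def greedy_decode_sketch(scores, max_len):
    """Start from BOS and repeatedly append the highest-scoring next token."""
    result = [BOS]
    for _ in range(max_len - 1):
        candidates = scores[result[-1]]
        next_token = max(candidates, key=candidates.get)  # greedy choice
        result.append(next_token)
        if next_token == EOS:  # stop as soon as EOS is produced
            break
    return result


print(greedy_decode_sketch(toy_scores, max_len=10))  # ['<bos>', 'I', 'want', 'money', '<eos>']
print(greedy_decode_sketch(toy_scores, max_len=3))   # ['<bos>', 'I', 'want'] (cut off by max_len)
```

Greedy decoding commits to the locally best token at every step, so it can miss a sequence with a higher overall score; beam search is a common alternative.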

In [172]:
# function that generates a translation

def translate(model, src_sentence: str):
    model.eval()
    src = txt_to_tensor(src_sentence, src_lang).unsqueeze(0) # (1, seq_len): add a batch dimension
    ## src = txt_to_tensor(src_sentence, src_lang) ## batch_first=False
    num_tokens = src.shape[1]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    trg_tokens = greedy_decode(model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    # .tolist() yields plain Python ints for lookup_tokens
    return " ".join(vocab_dict[trg_lang].lookup_tokens(trg_tokens.cpu().tolist())).replace("<bos>", "").replace("<eos>", "")

### Print the translation results

In [173]:
def get_result(sent_num, pairs):
    print(f"Spanish: {pairs[sent_num][0]} \n Translation to English: {translate(model, pairs[sent_num][0])} \n Original English: {pairs[sent_num][1]}")
    print("====="*10)

In [174]:
for i in [33,37,73,77,602,1123,1998,8888]:
    get_result(i, test_pairs)

Spanish: El automóvil se detuvo. 
 Translation to English:  The car stopped .  
 Original English: The automobile stopped.
Spanish: Yo lo vi, también. 
 Translation to English:  I saw it , too .  
 Original English: I saw it, too.
Spanish: Tom dejó solos a Mary y John momentáneamente. 
 Translation to English:  Tom left Mary and John kissing .  
 Original English: Tom left Mary and John alone momentarily.
Spanish: Sos mi amiga. 
 Translation to English:  You 're my friend .  
 Original English: You're my friend.
Spanish: Deshazte del arma. 
 Translation to English:  You look gun .  
 Original English: Get rid of the gun.
Spanish: Tom no es tan malo como Mary piensa que es. 
 Translation to English:  Tom is n't as bad as Mary is .  
 Original English: Tom isn't as bad as Mary thinks he is.
Spanish: Él corrió tan rápidamente como pudo. 
 Translation to English:  He ran as fast as he could .  
 Original English: He ran as fast as he could.
Spanish: ¿Se siente usted bien hoy? 
 Translation