# Sequence to sequence learning for performing number addition

[Original Keras Code](https://keras.io/examples/nlp/addition_rnn/)

<br/>

**Author**: Yookyung Kho

**Date presented**: 2022/03/21, DSBA keras2torch Study

**Task description**: multi-class classification with a seq2seq model, trained to add two numbers that are provided as strings.

**References**:

- [Kim Ki-hyun's NMT seq2seq](https://github.com/kh-kim/simple-nmt/blob/master/simple_nmt/models/seq2seq.py)
- [wikidocs seq2seq](https://wikidocs.net/24996)
- [torch seq2seq](https://codlingual.tistory.com/91)

---


**Example**:

- Input: `"61+535"`
- Output: `"596"`

**Results**:

Two digits (reversed):

- One layer LSTM (128 HN), 5k training examples = 99% train/test accuracy in 55 epochs

Three digits (reversed):

- One layer LSTM (128 HN), 50k training examples = 99% train/test accuracy in 100 epochs

Four digits (reversed):

- One layer LSTM (128 HN), 400k training examples = 99% train/test accuracy in 20 epochs

Five digits (reversed):

- One layer LSTM (128 HN), 550k training examples = 99% train/test accuracy in 30 epochs


<img src="seq2seq.png" width="900" height="900">

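> **Points that still bother me, although I set them aside to stay close to the original Keras code**
>
> (1) No embedding is used; one-hot vectors are fed directly into the LSTM (that is, the LSTM input stays discrete instead of being made continuous by an embedding layer).
> * _The Keras implementation reportedly still performs well this way (the torch version does not train well)._
>
> (2) Padding is handled ambiguously.
> * _Nothing is done about it. As far as I know, Keras code also needs an explicit padding option, but this implementation only mentions that the one-hot vector with a 1 at index 0 will play the role of padding._
>
> (3) Decoding has no post-processing at all, such as stopping once an `<EOS>` token is produced. Because the output has at most 4 digits, the output length is fixed at 4 and the model predicts `<PAD>` itself.
> * _Since the input and output lengths are fixed (7 and 4), decoding needs `<BOS>` but probably does not need `<EOS>`..._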
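
On point (1), one detail may make the missing embedding layer less worrying: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, so the LSTM's own input projection can already behave like a learned lookup table. The pure-Python sketch below illustrates this with a made-up 4x3 weight matrix. The `weights` values and both helper names are illustrative assumptions and are not part of the notebook.

```python
def one_hot(index, num_classes):
    """Return a one-hot list with 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def vec_mat(vec, mat):
    """Multiply a row vector (length V) by a V x D matrix given as nested lists."""
    dim = len(mat[0])
    return [sum(vec[v] * mat[v][d] for v in range(len(vec))) for d in range(dim)]


# Hypothetical 4-symbol vocabulary projected to 3 dimensions.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

projected = vec_mat(one_hot(2, 4), weights)
print(projected)             # same values as weights[2]
print(projected == weights[2])
```

This does not settle whether an explicit `nn.Embedding` would train better here; it only shows that the one-hot input is not discarding anything a lookup table would provide.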

## 1. Import & Setup

In [1]:
import os
import random
import wandb
import numpy as np
from tqdm import tqdm
from timeit import default_timer as timer

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torchinfo import summary

In [2]:
# Parameters for the model and dataset.
SEED_NUM = 602
DATA_SIZE = 50000
DIGITS = 3
REVERSE = True
NUM_LAYERS = 1

NUM_EPOCHS = 100
BATCH_SIZE = 64
LEARNING_RATE = 0.001

PAD_IDX = 0

# Maximum length of input is 'int + int' (e.g., '345+678'). Maximum length of int is DIGITS.
MAXLEN = DIGITS + 1 + DIGITS

In [3]:
# Fix all random seeds for reproducibility
def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)  # type: ignore
    torch.backends.cudnn.deterministic = True  # type: ignore
    torch.backends.cudnn.benchmark = False  # type: ignore  # benchmark=True lets cuDNN choose algorithms non-deterministically

seed_everything(SEED_NUM)

In [4]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda', index=0)

## 2. Data Preprocessing

### 2.1. Generate the data

In [5]:
def generate_data(data_size, max_num_len):
    ### max_num_len = 3
    max_ques_len = max_num_len + 1 + max_num_len  # 7
    questions = []
    answers = []
    seen = set()
    
    while len(questions) < data_size:

        # (0) Lambda f: sample random digits and join them into one integer
        f = lambda: int(
            "".join(
                np.random.choice(list("0123456789"))  # sample one digit (0-9)
                for i in range(np.random.randint(1, max_num_len + 1))  # repeat 1 to max_num_len times (chosen at random)
            )
        )
        a, b = f(), f()
        ## ex) a=7, b=15

        # Skip any addition questions we've already seen.
        # Also skip any such that x+Y == Y+x (hence the sorting).
        key = tuple(sorted((a, b)))

        # If the tuple is already in `seen`, move on to the next iteration
        if key in seen:
            continue

        seen.add(key)
        ## ex) seen = {(7, 15), (33, 35), ...}
        ## ex) a=15, b=7 is also sorted into (7, 15), so it is detected as a duplicate

        # Build the question as a string
        ques = f"{a}+{b}"
        # Pad the data with spaces such that it is always MAXLEN.
        query = ques + " " * (max_ques_len - len(ques))  # pad the question string with spaces

        ans = str(a + b)  # compute the answer
        # Answers can be of maximum size max_num_len(3) + 1.
        ans += " " * (max_num_len + 1 - len(ans))  # pad the answer string with spaces
        
        questions.append(query)
        answers.append(ans)
        
    print("Total questions:", len(questions))
    
    return questions, answers

In [6]:
#(0) Lambda (anonymous) function: returns its result without a `return` keyword and has no name

a = lambda x: list(x)
a("happy") # -> ['h', 'a', 'p', 'p', 'y']

b = lambda: list("happy")
b() # -> ['h', 'a', 'p', 'p', 'y']

['h', 'a', 'p', 'p', 'y']

In [7]:
questions, answers = generate_data(DATA_SIZE, DIGITS)

print(f"questions: {questions[:5]}\nanswers: {answers[:5]}")

Total questions: 50000
questions: ['88+26  ', '9+2    ', '0+2    ', '4+1    ', '80+59  ']
answers: ['114 ', '11  ', '2   ', '5   ', '139 ']
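
The padding and de-duplication rules used by `generate_data` can be checked in isolation. The sketch below is a minimal re-statement of those two rules in plain Python; `make_example` and `dedup_key` are hypothetical helper names introduced only for this illustration.

```python
def make_example(a, b, digits=3):
    """Return (question, answer) padded with spaces the same way as generate_data."""
    max_ques_len = digits + 1 + digits   # e.g. '345+678' -> 7 characters
    max_ans_len = digits + 1             # e.g. 999+999=1998 -> 4 characters
    question = f"{a}+{b}".ljust(max_ques_len)
    answer = str(a + b).ljust(max_ans_len)
    return question, answer


def dedup_key(a, b):
    """Order-independent key, so that a+b and b+a count as the same question."""
    return tuple(sorted((a, b)))


print(make_example(88, 26))                  # ('88+26  ', '114 ')
print(dedup_key(15, 7) == dedup_key(7, 15))  # True
```

With `digits=3` every question is exactly 7 characters long and every answer exactly 4, which is what lets the rest of the notebook use fixed tensor shapes.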


### 2.2. Split the data

In [8]:
def split_data(question, answer):
    # Shuffle
    ques = np.array(question)
    ans = np.array(answer)
    
    indices = np.arange(ques.shape[0])
    np.random.shuffle(indices)
    
    question = ques[indices]
    answer = ans[indices]
    
    # Train(9)/Valid(1) Split
    split_at = len(question) - len(question) // 10
    (x_train, x_val) = question[:split_at], question[split_at:] # question
    (y_train, y_val) = answer[:split_at], answer[split_at:] # answer

    print(f"Training Data: {len(x_train)}")
    print(f"Validation Data: {len(x_val)}")
    
    return x_train, x_val, y_train, y_val

In [9]:
x_train, x_val, y_train, y_val = split_data(questions, answers)

print(x_train[:5])
print(y_train[:5])

Training Data: 45000
Validation Data: 5000
['127+16 ' '74+369 ' '438+673' '64+313 ' '52+869 ']
['143 ' '443 ' '1111' '377 ' '921 ']
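
The split keeps each question aligned with its answer because both arrays are indexed with the same shuffled permutation. A minimal sketch of the same idea without NumPy is shown below; it shuffles (question, answer) pairs instead of indices, and `split_pairs` together with its `seed` argument is an assumption made for this example.

```python
import random


def split_pairs(questions, answers, seed=0):
    """Shuffle (question, answer) pairs together, then hold out the last tenth."""
    pairs = list(zip(questions, answers))
    random.Random(seed).shuffle(pairs)
    split_at = len(pairs) - len(pairs) // 10
    return pairs[:split_at], pairs[split_at:]


# Toy data: 50 questions of the form 'i+1' with their answers.
toy_questions = [f"{i}+1" for i in range(50)]
toy_answers = [str(i + 1) for i in range(50)]

train_pairs, val_pairs = split_pairs(toy_questions, toy_answers)
print(len(train_pairs), len(val_pairs))  # 45 5
```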


### 2.3. Create dataset and dataloader

In [10]:
class CustomDataset(torch.utils.data.Dataset):
    """
    A PyTorch Dataset class to be used in a PyTorch DataLoader to create batches.
    """
    def __init__(self, questions, answers):
        self.questions = questions #np array 1d
        self.answers = answers #np array 1d
        
        # All the numbers, plus sign and space for padding.
        chars = "0123456789+ "
        self.chars = sorted(set(chars))  # (1) split into digits, sign and space; store as a sorted list
        self.chars.append('\t')  # add the BOS token ('\t')
        
        # (2) Build dictionaries that give each character a unique index for one-hot encoding
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))
        
        self.max_len = MAXLEN  # 7
        self.max_len_output = DIGITS + 1  # 4
        self.num_class = len(self.char_indices) # 13
        
        
    def txt_to_tensor(self, char_txt: str, is_reverse: bool, is_idx: bool):
        """
        char string => index tensor and/or one-hot tensor
        (the index tensor is used for the loss computation)
        """
        
        if is_reverse:
            # Reverse the question so that the padding spaces come first.
            ## Aim: shorten the path from the decoder to the relevant encoder positions,
            ## which eases the long-term dependency problem.
            ## (Sutskever et al., 2014) report a small improvement over non-reversed input.
            ## ex) '12+345 ' => ' 543+21'
            char_txt = char_txt[::-1]  # reverse the string
            
        char_tensor = torch.zeros(len(char_txt)).long()  # .long() is required: one_hot below needs an integer (int64) index tensor
        
        for i in range(len(char_txt)):
            char_tensor[i] = self.char_indices[char_txt[i]]  # fill in the character indices
            
        char_onehot = nn.functional.one_hot(char_tensor, num_classes=self.num_class)  # LongTensor (dtype=int64)
        char_onehot = char_onehot.to(torch.float32)  # nn.LSTM expects floating-point input, so convert to FloatTensor (dtype=float32)
        # |char_onehot| = (max_len(7),num_class(13)) or (max_len(4),num_class(13))
        
        if is_idx:
            return char_tensor, char_onehot
        else:
            return char_onehot
            

    def tensor_to_txt(self, char_tensor):
        """
        1-D index tensor => character string
        """
        char_array = char_tensor.cpu().detach().numpy()
        
        num_string = ""
        for i in range(len(char_array)):
            num_string += self.indices_char[char_array[i]]
            
        return num_string
                
        
    def __getitem__(self, idx: int):
        ques = self.questions[idx]
        ans = self.answers[idx]
        
        ques_onehot = self.txt_to_tensor(ques, is_reverse=REVERSE, is_idx=False)
        ans_idx, ans_onehot = self.txt_to_tensor(ans, is_reverse=False, is_idx=True)
        
        batch = {'question_onehot': ques_onehot, #2d tensor
                 'answer_idx_tensor': ans_idx, #1d tensor
                 'answer_onehot': ans_onehot} #2d tensor

        return batch


    def __len__(self):
        return len(self.questions)

    @property
    def num_classes(self):
        return self.num_class
    
    @property
    def max_ques_len(self):
        return self.max_len #7
    
    @property
    def max_ans_len(self):
        return self.max_len_output #4

In [12]:
# (1) Split into digits, sign and space, then store as a sorted list
chars = "0123456789+ "
chars = sorted(set(chars))
chars

[' ', '+', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

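Why the space character is included:

- Each question is padded with spaces up to MAXLEN (7); after reversal the spaces sit at the front, as in `'  73+37'`.
- The space later plays the role of padding.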
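
Because the space sorts before every other symbol, it ends up at index 0, which is why padded positions show up as index 0 in the tensors. The following sketch rebuilds the 13-symbol vocabulary in plain Python and encodes one reversed question; the names `encode_text` and `decode_indices` are hypothetical and only mirror what `txt_to_tensor` and `tensor_to_txt` do with tensors.

```python
CHARS = sorted(set("0123456789+ ")) + ["\t"]   # '\t' is the BOS token (index 12)
CHAR_TO_IDX = {c: i for i, c in enumerate(CHARS)}
IDX_TO_CHAR = {i: c for i, c in enumerate(CHARS)}


def encode_text(text, reverse=False):
    """Map a padded string to a list of vocabulary indices, optionally reversed."""
    if reverse:
        text = text[::-1]
    return [CHAR_TO_IDX[c] for c in text]


def decode_indices(indices):
    """Map a list of vocabulary indices back to a string."""
    return "".join(IDX_TO_CHAR[i] for i in indices)


print(CHAR_TO_IDX[" "])                       # 0
print(encode_text("127+16 ", reverse=True))   # [0, 8, 3, 1, 9, 4, 3]
print(repr(decode_indices([3, 6, 5, 0])))     # '143 '
```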

In [13]:
# (2) Build dictionaries that give each character a unique index for one-hot encoding

char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

char_indices

{' ': 0,
 '+': 1,
 '0': 2,
 '1': 3,
 '2': 4,
 '3': 5,
 '4': 6,
 '5': 7,
 '6': 8,
 '7': 9,
 '8': 10,
 '9': 11}

In [14]:
indices_char

{0: ' ',
 1: '+',
 2: '0',
 3: '1',
 4: '2',
 5: '3',
 6: '4',
 7: '5',
 8: '6',
 9: '7',
 10: '8',
 11: '9'}

In [15]:
char_indices['\t'] = 12
indices_char[12] = '\t'

In [16]:
char_indices

{' ': 0,
 '+': 1,
 '0': 2,
 '1': 3,
 '2': 4,
 '3': 5,
 '4': 6,
 '5': 7,
 '6': 8,
 '7': 9,
 '8': 10,
 '9': 11,
 '\t': 12}

In [17]:
train_dataset = CustomDataset(x_train, y_train)
valid_dataset = CustomDataset(x_val, y_val)

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE)
valid_dataloader = torch.utils.data.DataLoader(valid_dataset, batch_size=BATCH_SIZE)

In [12]:
# Example with batch size 2
batch_ex = next(iter(train_dataloader))
batch_ex

{'question_onehot': tensor([[[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
          [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
          [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
 
         [[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
          [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
          [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]]]),
 'answer_idx_tensor': tensor([[3, 6, 5, 0],
         [6, 6, 5, 0]]),
 'answer_o

In [13]:
# Example with batch size 2
print(f"question_onehot shape: {batch_ex['question_onehot'].shape}")
print(f"answer_idx_tensor: {batch_ex['answer_idx_tensor'].shape}")
print(f"answer_onehot: {batch_ex['answer_onehot'].shape}")

question_onehot shape: torch.Size([2, 7, 13])
answer_idx_tensor: torch.Size([2, 4])
answer_onehot: torch.Size([2, 4, 13])


## 3. Modeling

LSTM seq2seq model

- **LSTM Encoder**
    - hid dim=128
    - can be stacked via `num_layers` (the experiments use 1 layer, as in the Keras implementation)

- **LSTM Decoder**
    - hid dim=128
    - can be stacked via `num_layers` (the experiments use 1 layer, as in the Keras implementation)
    - training: teacher forcing
    - inference: greedy decoding

In [18]:
class Lstm_Encoder(nn.Module):
    def __init__(self, input_dim, hid_dim, num_layers):
        super().__init__()
        #self.embedding = nn.Embedding(input_dim, emb_dim) #(batch_size, seq_len, emb_dim)
        
        self.lstm = nn.LSTM(input_dim, hid_dim, num_layers) # dropout=dropout
        
    def forward(self, input_batch):
        # input_batch: (batch_size, seq_len, input_dim)
        ## the input is one-hot, so input_dim=13
        
        input_batch = torch.transpose(input_batch, 0, 1)
        # (seq_len, batch_size, input_dim) ### batch_first=False
        
        lstm_outs, (h_t, h_c) = self.lstm(input_batch)
    
        # lstm_outs: (seq_len, batch_size, hid dim * n directions)
        ## output of the last LSTM layer at every time step
        ## for a bidirectional RNN: n_directions=2
        # h_t: (num layers * n directions, batch_size, hid dim), final hidden state
        # h_c: (num layers * n directions, batch_size, hid dim), final cell state
        
        return h_t, h_c

In [19]:
class Lstm_Decoder(nn.Module):
    def __init__(self, input_dim, hid_dim, num_layers):
        super().__init__()
        ### input_dim=output_dim = 13
        self.lstm = nn.LSTM(input_dim, hid_dim, num_layers) # dropout=dropout
        self.linear = nn.Linear(hid_dim, input_dim) # output_dim = input_dim(13)
        
    def forward(self, input_one, h_t, h_c):
        """one time step"""
        # input_one: (batch_size, 1, input_dim)
        ## one-hot vector, so input_dim=13
        
        input_one = torch.transpose(input_one, 0, 1) # (1, batch_size, input_dim(13)) ### batch_first=False
        dec_lstm_outs, (h_t, h_c) = self.lstm(input_one, (h_t, h_c))
        # dec_lstm_outs: (1, batch_size, hid dim * n directions)
        # h_t: (num layers * n directions, batch_size, hid dim)
        
        output = dec_lstm_outs.squeeze(0) # (batch size, hid dim * n directions)
        pred = self.linear(output) # (batch size, output dim) # output_dim = input_dim(13)
        
        return pred, h_t, h_c

In [27]:
ex = torch.randn(2,1,3).to(device)
print(ex)
torch.argmax(ex, dim=-1)

tensor([[[-0.7554, -1.8747, -1.5196]],

        [[ 1.7377,  1.0045,  0.0661]]], device='cuda:0')


tensor([[0],
        [0]], device='cuda:0')

In [23]:
dec_input_dim = 12  # scratch value for this demo only (the model below uses 13 classes)

In [28]:
nn.functional.one_hot(torch.argmax(ex, dim=-1), num_classes=dec_input_dim)

tensor([[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],

        [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]], device='cuda:0')

In [29]:
### Re-run of the experiment with teacher forcing removed
class Lstm_seq2seq(nn.Module):
    def __init__(self, enc_input_dim, dec_input_dim, hid_dim, num_layers, device):
        ### the encoder and decoder must share the same hid_dim and num_layers
        super().__init__()
        self.encoder = Lstm_Encoder(enc_input_dim, hid_dim, num_layers)
        self.decoder = Lstm_Decoder(dec_input_dim, hid_dim, num_layers)
        self.device = device
        
    def forward(self, enc_input_batch, dec_input_batch):
        ### |enc_input_batch| = (batch_size, enc_seq_len(7), input_dim(13))
        ### |dec_input_batch| = (batch_size, seq_len(4), input_dim(13))
        
        # 1. encode
        enc_h_t, enc_h_c = self.encoder(enc_input_batch)
        
        # 2. decode (free-running here; the teacher forcing line is commented out below)
        batch_size, seq_len, dec_input_dim = dec_input_batch.shape
        
        input1 = torch.zeros(batch_size, 1, dec_input_dim).to(self.device)
        input1[:, :, dec_input_dim - 1] = 1  # build the BOS one-hot by hand (index 12)
        
        preds = []
        for t in range(seq_len):
            # NOTE: every step starts from the encoder state (enc_h_t, enc_h_c);
            # the returned dec_h_t / dec_h_c are not passed on to the next step.
            pred, dec_h_t, dec_h_c = self.decoder(input1, enc_h_t, enc_h_c)
            # |pred| = (batch size, output dim)
            
            ## *Option 1: collect the pred tensors and concatenate them
            pred = pred.unsqueeze(1)  # (batch_size, 1, output dim); the extra sequence axis is needed for the concat below
            preds += [pred]  # same as append
            
            # feed the argmax prediction back as the next one-hot input
            input1 = nn.functional.one_hot(torch.argmax(pred, dim=-1), num_classes=dec_input_dim)
            input1 = input1.to(torch.float32)
            
            ## *Option 2: fill a pre-allocated zero tensor
            #outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
            #outputs[t] = pred
            
            # teacher forcing
            #input1 = dec_input_batch[:, t, :].unsqueeze(1)
            # |input1| = (batch_size, 1, dec out dim)
            
        outputs = torch.cat(preds, dim=1)
        # |outputs| = (batch_size, length(4), output dim)
        
        return outputs
    
    # for inference (valid, test)
    def encode(self, enc_input_batch):
        ### |enc_input_batch| = (batch_size(1), enc_seq_len(7), input_dim(13))
        enc_h_t, enc_h_c = self.encoder(enc_input_batch)
        return enc_h_t, enc_h_c
    
    def decode(self, dec_input, enc_h_t, enc_h_c):
        # |dec_input| = (1, current seq len, input_dim(13)) or (batch size, current seq len, input_dim(13))
        # take the vector at the last time step
        real_input = dec_input[:,-1,:].unsqueeze(1) # |real_input| = (batch_size(1),1,input_dim(13))
        pred, _, _ = self.decoder(real_input, enc_h_t, enc_h_c)  # the updated decoder states are discarded here
        # |pred| = (batch size, output dim)
        return pred


In [30]:
model = Lstm_seq2seq(enc_input_dim=13, dec_input_dim=13, hid_dim=128, num_layers=NUM_LAYERS, device=device).to(device)

#summary(model, ((1,7,13), (1,4,13)))
summary(model)

Layer (type:depth-idx)                   Param #
Lstm_seq2seq                             --
├─Lstm_Encoder: 1-1                      --
│    └─LSTM: 2-1                         73,216
├─Lstm_Decoder: 1-2                      --
│    └─LSTM: 2-2                         73,216
│    └─Linear: 2-3                       1,677
Total params: 148,109
Trainable params: 148,109
Non-trainable params: 0

## 4. Training

In [31]:
# CE Loss
criterion = torch.nn.CrossEntropyLoss()
#criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

# Optimizer : Adam
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

In [36]:
def train_epoch(model, dataloader, optimizer, criterion, device):
    model.train()
    losses = 0
    
    for batch_idx, batch in enumerate(dataloader):
        ques_onehot = batch['question_onehot'].to(device)
        ans_idx = batch['answer_idx_tensor'].to(device) # |ans_idx| = (batch_size, length(4))
        ans_onehot = batch['answer_onehot'].to(device)
        
        outputs = model(ques_onehot, ans_onehot) # |outputs| = (batch_size, length(4), output dim)
        
        # Calculate loss, Update parameters
        optimizer.zero_grad()
        
        # Reshape for the loss computation
        ## flatten outputs (3-D tensor) to 2-D
        outputs_2d = outputs.contiguous().view(-1, outputs.size(-1)) # |outputs_2d| = (batch_size*length(4), output dim)
        ## flatten ans_idx (2-D target tensor) to 1-D
        ans_idx_1d = ans_idx.contiguous().view(-1) # |ans_idx_1d| = (batch_size*length(4))
        
        loss = criterion(outputs_2d, ans_idx_1d)
        
        loss.backward()
        optimizer.step()
        
        losses += loss.item()
        
    return losses / len(dataloader)
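
The flattening step treats every output position as an independent classification example, and the commented-out `ignore_index=PAD_IDX` option would drop padded positions from the average. The plain-Python sketch below imitates a mean-reduced cross-entropy to make that arithmetic visible. It is only an illustration with made-up logits, not a replacement for `torch.nn.CrossEntropyLoss`, and `flatten_steps` and `mean_cross_entropy` are hypothetical names.

```python
import math


def flatten_steps(batch):
    """(batch, length, ...) nested lists -> (batch*length, ...) list."""
    return [step for seq in batch for step in seq]


def mean_cross_entropy(logits_2d, targets_1d, ignore_index=None):
    """Average of -log softmax(logits)[target] over the rows that are not ignored."""
    total, count = 0.0, 0
    for row, target in zip(logits_2d, targets_1d):
        if ignore_index is not None and target == ignore_index:
            continue
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
        count += 1
    return total / count


# Toy batch: 2 sequences, 2 steps, 3 classes (all values are made up).
logits = [[[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
          [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]]
targets = [[0, 0], [1, 0]]   # index 0 plays the role of PAD_IDX here

flat_logits = flatten_steps(logits)     # 4 rows of 3 logits
flat_targets = flatten_steps(targets)   # 4 target indices
print(len(flat_logits), flat_targets)   # 4 [0, 0, 1, 0]

loss_all = mean_cross_entropy(flat_logits, flat_targets)
loss_no_pad = mean_cross_entropy(flat_logits, flat_targets, ignore_index=0)
print(loss_all, loss_no_pad)
```

In this toy batch three of the four targets are the padding index, so the two averages differ noticeably; how much this matters for the addition task is not something the notebook measures.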

In [37]:
def evaluate_epoch(model, dataloader, criterion, device):
    model.eval()
    losses = 0
    
    with torch.no_grad():  # no gradients are needed for evaluation
        for batch_idx, batch in enumerate(dataloader):
            ques_onehot = batch['question_onehot'].to(device)
            ans_idx = batch['answer_idx_tensor'].to(device) # |ans_idx| = (batch_size, length(4))
            ans_onehot = batch['answer_onehot'].to(device)
            
            outputs = model(ques_onehot, ans_onehot) # |outputs| = (batch_size, length(4), output dim)
            
            # Reshape for the loss computation
            ## flatten outputs (3-D tensor) to 2-D
            outputs_2d = outputs.contiguous().view(-1, outputs.size(-1)) # |outputs_2d| = (batch_size*length(4), output dim)
            ## flatten ans_idx (2-D target tensor) to 1-D
            ans_idx_1d = ans_idx.contiguous().view(-1) # |ans_idx_1d| = (batch_size*length(4))
            
            loss = criterion(outputs_2d, ans_idx_1d)
            
            losses += loss.item()
    
    return losses / len(dataloader)

In [38]:
### Greedy decode function for inference
def greedy_decode(ques_txt, dataset, model, device):
    ques_onehot = dataset.txt_to_tensor(ques_txt, is_reverse=True, is_idx=False)
    
    input_ques = ques_onehot.unsqueeze(0).to(device)
    
    with torch.no_grad():  # inference only, no gradients needed
        enc_h_t, enc_h_c = model.encode(input_ques)
        
        dec_input_dim = dataset.num_class  # 13
        input1 = torch.zeros(1, 1, dec_input_dim).to(device)
        input1[:, :, dec_input_dim - 1] = 1  # build the BOS one-hot by hand (index 12)
        
        for t in range(dataset.max_ans_len):
            pred = model.decode(input1, enc_h_t, enc_h_c)
            pred = pred.unsqueeze(0)  # (batch_size(1), 1, output dim)
            # NOTE: the raw logits are appended, so the next step receives logits rather than a one-hot vector
            input1 = torch.cat((input1, pred), dim=1)  # |input1| = (batch_size(1), +1, output dim); one more step per iteration
    
    result = input1[:,1:,:].squeeze(0) # (seq len, output dim) = (4,13)
    result_idx_tensor = torch.argmax(result, dim=-1) # (seq len,)
    
    pred_ans = dataset.tensor_to_txt(result_idx_tensor) # string
    
    return pred_ans

In [39]:
### Run without teacher forcing

# Wandb Settings for logging
# wandb.init(project="Seq2seq Number Addition", name=f"lstm {NUM_LAYERS} layer", entity="yookyungkho")
        
# wandb.config.update({"epochs": NUM_EPOCHS,
#                      "batch_size": BATCH_SIZE,
#                      "learning_rate" : LEARNING_RATE})
        
# wandb.watch(model, log="all")

# training start!
for epoch in range(1, NUM_EPOCHS+1):
    
    start_time = timer()
    train_loss = train_epoch(model, train_dataloader, optimizer, criterion, device)
    end_time = timer()
    val_loss = evaluate_epoch(model, valid_dataloader, criterion, device)
    
    # Add log in wandb page
    # wandb.log({'train_loss': train_loss, 'val_loss': val_loss})
    
    if epoch % 5 == 0:
        print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))
    
        for i in range(7):
            idx = np.random.randint(0, len(x_val))
            x_txt, y_txt = x_val[idx], y_val[idx]
            y_pred = greedy_decode(x_txt, valid_dataset, model, device)
            if y_txt == y_pred:
                print(f">>> Q: {x_txt} => T: {y_txt}  ☑: {y_pred}")
            else:
                print(f">>> Q: {x_txt} => T: {y_txt}  ☒: {y_pred}")

Epoch: 5, Train loss: 1.304, Val loss: 1.264, Epoch time = 7.333s
>>> Q: 96+463  => T: 559   ☒: 5   
>>> Q: 1+96    => T: 97    ☒: 96  
>>> Q: 558+3   => T: 561   ☒: 5   
>>> Q: 939+388 => T: 1327  ☒: 16  
>>> Q: 43+210  => T: 253   ☒: 2   
>>> Q: 138+540 => T: 678   ☒: 6   
>>> Q: 48+988  => T: 1036  ☒: 16  
Epoch: 10, Train loss: 0.748, Val loss: 0.719, Epoch time = 6.479s
>>> Q: 858+174 => T: 1032  ☒: 1421
>>> Q: 65+842  => T: 907   ☒: 96  
>>> Q: 328+5   => T: 333   ☒: 3   
>>> Q: 601+991 => T: 1592  ☒: 1602
>>> Q: 566+22  => T: 588   ☒: 5   
>>> Q: 382+787 => T: 1169  ☒: 17 1
>>> Q: 561+775 => T: 1336  ☒: 17  
Epoch: 15, Train loss: 0.411, Val loss: 0.439, Epoch time = 7.199s
>>> Q: 10+720  => T: 730   ☒: 76  
>>> Q: 12+325  => T: 337   ☒: 3   
>>> Q: 788+902 => T: 1690  ☒: 1611
>>> Q: 119+422 => T: 541   ☒: 5511
>>> Q: 541+38  => T: 579   ☒: 5   
>>> Q: 606+58  => T: 664   ☒: 60  
>>> Q: 64+467  => T: 531   ☒: 5   
Epoch: 20, Train loss: 0.289, Val loss: 0.320, Epoch time = 6.289

In [25]:
### Run with teacher forcing (the teacher forcing line in Lstm_seq2seq.forward has to be enabled)

# Wandb Settings for logging
# wandb.init(project="Seq2seq Number Addition", name=f"lstm {NUM_LAYERS} layer", entity="yookyungkho")
        
# wandb.config.update({"epochs": NUM_EPOCHS,
#                      "batch_size": BATCH_SIZE,
#                      "learning_rate" : LEARNING_RATE})
        
# wandb.watch(model, log="all")

# training start!
for epoch in range(1, NUM_EPOCHS+1):
    
    start_time = timer()
    train_loss = train_epoch(model, train_dataloader, optimizer, criterion, device)
    end_time = timer()
    val_loss = evaluate_epoch(model, valid_dataloader, criterion, device)
    
    # Add log in wandb page
    # wandb.log({'train_loss': train_loss, 'val_loss': val_loss})
    
    if epoch % 5 == 0:
        print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))
    
        for i in range(7):
            idx = np.random.randint(0, len(x_val))
            x_txt, y_txt = x_val[idx], y_val[idx]
            y_pred = greedy_decode(x_txt, valid_dataset, model, device)
            if y_txt == y_pred:
                print(f">>> Q: {x_txt} => T: {y_txt}  ☑: {y_pred}")
            else:
                print(f">>> Q: {x_txt} => T: {y_txt}  ☒: {y_pred}")

Epoch: 5, Train loss: 1.445, Val loss: 1.389, Epoch time = 6.388s
>>> Q: 96+463  => T: 559   ☒: 5   
>>> Q: 1+96    => T: 97    ☒: 9   
>>> Q: 558+3   => T: 561   ☒: 5   
>>> Q: 939+388 => T: 1327  ☒: 13  
>>> Q: 43+210  => T: 253   ☒: 2   
>>> Q: 138+540 => T: 678   ☒: 63  
>>> Q: 48+988  => T: 1036  ☒: 10  
Epoch: 10, Train loss: 0.731, Val loss: 0.704, Epoch time = 6.758s
>>> Q: 858+174 => T: 1032  ☒: 1441
>>> Q: 65+842  => T: 907   ☒: 938 
>>> Q: 328+5   => T: 333   ☒: 30 0
>>> Q: 601+991 => T: 1592  ☒: 1382
>>> Q: 566+22  => T: 588   ☒: 54  
>>> Q: 382+787 => T: 1169  ☒: 1482
>>> Q: 561+775 => T: 1336  ☒: 1441
Epoch: 15, Train loss: 0.382, Val loss: 0.395, Epoch time = 6.219s
>>> Q: 10+720  => T: 730   ☒: 740 
>>> Q: 12+325  => T: 337   ☒: 3940
>>> Q: 788+902 => T: 1690  ☒: 1380
>>> Q: 119+422 => T: 541   ☒: 5040
>>> Q: 541+38  => T: 579   ☒: 5404
>>> Q: 606+58  => T: 664   ☒: 64  
>>> Q: 64+467  => T: 531   ☒: 5438
Epoch: 20, Train loss: 0.245, Val loss: 0.278, Epoch time = 6.443

- Unlike the Keras results, most predictions come out as four-character numbers, and the loss does not decrease well.
- Adding an EOS token made the results even worse.
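
The results quoted at the top are accuracies, while the loops above only print the loss and a handful of samples, so a direct comparison with Keras is hard to read off the logs. A helper along the following lines could report accuracy as well. It is a sketch under the assumption that targets and predictions are the padded 4-character strings used throughout the notebook; both function names are hypothetical, and the three sample pairs are copied from the epoch 5 log of the run without teacher forcing.

```python
def exact_match_accuracy(targets, predictions):
    """Fraction of answers whose full padded string matches the target."""
    hits = sum(t == p for t, p in zip(targets, predictions))
    return hits / len(targets)


def char_accuracy(targets, predictions):
    """Fraction of individual character positions that match."""
    pairs = [(tc, pc) for t, p in zip(targets, predictions) for tc, pc in zip(t, p)]
    return sum(tc == pc for tc, pc in pairs) / len(pairs)


# (target, prediction) pairs taken from the logged samples.
sample_targets = ["559 ", "97  ", "561 "]
sample_predictions = ["5   ", "96  ", "5   "]

print(exact_match_accuracy(sample_targets, sample_predictions))  # 0.0
print(char_accuracy(sample_targets, sample_predictions))
```

Exact match is the stricter of the two and is the closer analogue of "the sum is right"; the per-character figure mainly shows how much of the score comes from correctly predicted padding.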
