# Lab 12-5 Seq to Seq
> Simple Neural Machine Translation

## Seq2Seq
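- Takes a sequence as input and produces a sequence as output
    - ex) Translation, Chatbot
    - The difference from a plain RNN is explained below.
- Why Seq2Seq is needed
    - ex) Consider the following conversation:
    ```
    A: I broke up with my partner today.
    B: Oh no... I'm so sorry...
    ---- 5 minutes later ----
    A: The weather is nice today... Nice weather just makes me feel lonelier...
    B: []
    ```
    With a plain RNN model, could B really offer appropriate comfort here? A plain RNN emits an output at every step while the input is still arriving, whereas Seq2Seq reads the whole input sequence first and only then generates the reply.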
    
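### Characteristics of Seq2Seq
#### Encoder-Decoder
- Built on RNNs and split into an Encoder part and a Decoder part
- The Encoder produces a vector, and the Decoder uses it to generate the output step by step
##### Encoder
- A sequence of vectors (one per word) is fed into the RNN, one vector per step
    - The RNN's last hidden vector is used for decoding
    - This vector summarizes the information the encoder has read
##### Decoder
- A new RNN starts from the vector obtained above
- Each step produces one output, which corresponds to one word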
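
The encoder-decoder flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the model built later in this notebook: it uses a vanilla tanh RNN cell instead of a GRU, random untrained weights, one-hot inputs, toy sizes, and hypothetical helper names (`rnn_step`, `encode`, `decode`).

```python
import math
import random

random.seed(0)

VOCAB, HIDDEN = 5, 3  # assumed toy sizes


def rand_matrix(rows, cols):
    """Random weights in [-0.5, 0.5]; untrained, for illustration only."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def one_hot(index, size=VOCAB):
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x, h, w_x, w_h):
    """One vanilla RNN step: h' = tanh(W_x x + W_h h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_x, x), matvec(w_h, h))]


def encode(token_ids, w_x, w_h):
    """Read the whole source sequence; return only the last hidden state."""
    h = [0.0] * HIDDEN
    for token in token_ids:
        h = rnn_step(one_hot(token), h, w_x, w_h)
    return h


def decode(context, w_x, w_h, w_out, bos=1, max_steps=4):
    """Start a second RNN from the context vector and emit one token per step."""
    h, token, outputs = context, bos, []
    for _ in range(max_steps):
        h = rnn_step(one_hot(token), h, w_x, w_h)
        scores = matvec(w_out, h)          # one score per vocabulary entry
        token = scores.index(max(scores))  # pick the highest-scoring token
        outputs.append(token)
    return outputs


enc_wx, enc_wh = rand_matrix(HIDDEN, VOCAB), rand_matrix(HIDDEN, HIDDEN)
dec_wx, dec_wh = rand_matrix(HIDDEN, VOCAB), rand_matrix(HIDDEN, HIDDEN)
dec_out = rand_matrix(VOCAB, HIDDEN)

context = encode([2, 3, 4], enc_wx, enc_wh)   # summary of the source sequence
generated = decode(context, dec_wx, dec_wh, dec_out)
print(len(context), generated)
```

The only thing passed from `encode` to `decode` is the single `context` vector, which is why it has to summarize the entire input.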

## Reference
- https://arxiv.org/abs/1409.3215
  Sequence to Sequence Learning with Neural Networks
---

### import

In [1]:
from __future__ import absolute_import, division, print_function
import tensorflow as tf
from matplotlib import font_manager, rc
# rc('font', family='AppleGothic') #for mac
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences

from pprint import pprint
import numpy as np
import os

print(tf.__version__)

2.3.0


### Data Pipeline : Dataset
- Translation example

In [2]:
sources = [['I', 'feel', 'hungry'],
     ['tensorflow', 'is', 'very', 'difficult'],
     ['tensorflow', 'is', 'a', 'framework', 'for', 'deep', 'learning'],
     ['tensorflow', 'is', 'very', 'fast', 'changing']]
targets = [['나는', '배가', '고프다'],
           ['텐서플로우는', '매우', '어렵다'],
           ['텐서플로우는', '딥러닝을', '위한', '프레임워크이다'],
           ['텐서플로우는', '매우', '빠르게', '변화한다']]

### Data Pipeline : Vocab Dict

In [3]:
# vocabulary for sources
s_vocab = list(set(sum(sources, [])))
s_vocab.sort()
s_vocab = ['<pad>'] + s_vocab
source2idx = {word : idx for idx, word in enumerate(s_vocab)}
idx2source = {idx : word for idx, word in enumerate(s_vocab)}

pprint(source2idx)

{'<pad>': 0,
 'I': 1,
 'a': 2,
 'changing': 3,
 'deep': 4,
 'difficult': 5,
 'fast': 6,
 'feel': 7,
 'for': 8,
 'framework': 9,
 'hungry': 10,
 'is': 11,
 'learning': 12,
 'tensorflow': 13,
 'very': 14}


In [4]:
# vocabulary for targets
t_vocab = list(set(sum(targets, [])))
t_vocab.sort()
# Target sequences need a <bos> token at the start and an <eos> token at the end
t_vocab = ['<pad>', '<bos>', '<eos>'] + t_vocab
target2idx = {word : idx for idx, word in enumerate(t_vocab)}
idx2target = {idx : word for idx, word in enumerate(t_vocab)}

pprint(target2idx)

{'<bos>': 1,
 '<eos>': 2,
 '<pad>': 0,
 '고프다': 3,
 '나는': 4,
 '딥러닝을': 5,
 '매우': 6,
 '배가': 7,
 '변화한다': 8,
 '빠르게': 9,
 '어렵다': 10,
 '위한': 11,
 '텐서플로우는': 12,
 '프레임워크이다': 13}
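
As a rough illustration of why the target vocabulary carries `<bos>` and `<eos>`, the sketch below wraps one sentence in those tokens and maps it to indices and back. It rebuilds a one-sentence toy vocabulary in the same way as `t_vocab` above, so the indices differ from the full vocabulary printed above; `encode_target` and `decode_target` are hypothetical helper names, not part of the notebook.

```python
# Toy target vocabulary built the same way as `t_vocab` above (one sentence only).
targets = [['나는', '배가', '고프다']]
t_vocab = ['<pad>', '<bos>', '<eos>'] + sorted(set(sum(targets, [])))
target2idx = {word: idx for idx, word in enumerate(t_vocab)}
idx2target = {idx: word for idx, word in enumerate(t_vocab)}


def encode_target(sentence):
    """Wrap a tokenized sentence in <bos>/<eos> and map tokens to indices."""
    return [target2idx[token] for token in ['<bos>'] + sentence + ['<eos>']]


def decode_target(indices):
    """Map indices back to tokens, dropping the special tokens."""
    special = {'<pad>', '<bos>', '<eos>'}
    return [idx2target[i] for i in indices if idx2target[i] not in special]


ids = encode_target(targets[0])
print(ids)                 # [1, 4, 5, 3, 2]
print(decode_target(ids))  # ['나는', '배가', '고프다']
```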


### Data Pipeline: Preprocess

In [5]:
def preprocess(sequences, max_len, dic, mode = 'source'):
    assert mode in ['source', 'target'], "mode must be either 'source' or 'target'."
    
    if mode == 'source':
        # preprocessing for source (encoder)
        s_input = list(map(lambda sentence : [dic.get(token) for token in sentence], sequences))
        s_len = list(map(lambda sentence : len(sentence), s_input))
        s_input = pad_sequences(sequences = s_input, maxlen = max_len, padding = 'post', truncating = 'post')
        return s_len, s_input
    
    elif mode == 'target':
        # preprocessing for target (decoder)
        # input
        t_input = list(map(lambda sentence : ['<bos>'] + sentence + ['<eos>'], sequences))
        t_input = list(map(lambda sentence : [dic.get(token) for token in sentence], t_input))
        t_len = list(map(lambda sentence : len(sentence), t_input))
        t_input = pad_sequences(sequences = t_input, maxlen = max_len, padding = 'post', truncating = 'post')
        
        # output
        t_output = list(map(lambda sentence : sentence + ['<eos>'], sequences))
        t_output = list(map(lambda sentence : [dic.get(token) for token in sentence], t_output))
        t_output = pad_sequences(sequences = t_output, maxlen = max_len, padding = 'post', truncating = 'post')
        
        return t_len, t_input, t_output

In [6]:
# preprocessing for source
s_max_len = 10
s_len, s_input = preprocess(sequences = sources,
                            max_len = s_max_len, dic = source2idx, mode = 'source')
print(s_len, s_input)

[3, 4, 7, 5] [[ 1  7 10  0  0  0  0  0  0  0]
 [13 11 14  5  0  0  0  0  0  0]
 [13 11  2  9  8  4 12  0  0  0]
 [13 11 14  6  3  0  0  0  0  0]]


In [7]:
# preprocessing for target
t_max_len = 12
t_len, t_input, t_output = preprocess(sequences = targets,
                                      max_len = t_max_len, dic = target2idx, mode = 'target')
print(t_len, t_input, t_output)

[5, 5, 6, 6] [[ 1  4  7  3  2  0  0  0  0  0  0  0]
 [ 1 12  6 10  2  0  0  0  0  0  0  0]
 [ 1 12  5 11 13  2  0  0  0  0  0  0]
 [ 1 12  6  9  8  2  0  0  0  0  0  0]] [[ 4  7  3  2  0  0  0  0  0  0  0  0]
 [12  6 10  2  0  0  0  0  0  0  0  0]
 [12  5 11 13  2  0  0  0  0  0  0  0]
 [12  6  9  8  2  0  0  0  0  0  0  0]]
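
The padding step can be pictured with a pure-Python stand-in for `pad_sequences(..., padding='post', truncating='post')`. This is a sketch for intuition only; `pad_post` is a hypothetical helper and returns plain lists rather than a NumPy array.

```python
def pad_post(sequences, max_len, pad_value=0):
    """Truncate each sequence to max_len from the end, then right-pad with pad_value."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:max_len]                                # truncating='post'
        padded.append(seq + [pad_value] * (max_len - len(seq)))  # padding='post'
    return padded


rows = pad_post([[1, 7, 10], [13, 11, 14, 5]], max_len=10)
for row in rows:
    print(row)
# [1, 7, 10, 0, 0, 0, 0, 0, 0, 0]
# [13, 11, 14, 5, 0, 0, 0, 0, 0, 0]

print(pad_post([[1, 2, 3, 4]], max_len=2))  # [[1, 2]]
```

Because `<pad>` is index 0 in both vocabularies, the padded positions can later be recognized by comparing against 0.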


### Hyper-parameter

In [8]:
# hyper-parameters
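epochs = 200
batch_size = 4
learning_rate = .005  # NOTE: defined here but never passed to the optimizer below
total_step = epochs / batch_size  # NOTE: not used anywhere in this notebook
buffer_size = 100
n_batch = buffer_size//batch_size  # 25; only used to scale the loss printed during training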
embedding_dim = 32
units = 32

# input
data = tf.data.Dataset.from_tensor_slices((s_len, s_input, t_len, t_input, t_output))
data = data.shuffle(buffer_size = buffer_size)
data = data.batch(batch_size = batch_size)
# s_mb_len, s_mb_input, t_mb_len, t_mb_input, t_mb_output = iterator.get_next()

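### GRU Function
- A helper that returns the RNN (GRU) layer used to build the Encoder and Decoder
- `glorot_uniform`: the same thing as Xavier initialization; it keeps randomly initialized weights from saturating the activations or starting out in a dead region
    - Put simply, it keeps the random initial values from being too small or too large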
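
For intuition, Glorot (Xavier) uniform initialization draws each weight from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The sketch below reproduces that rule in plain Python; `glorot_uniform` here is a hypothetical stand-alone function, not the Keras initializer itself, and the 32 x 32 shape is just an assumption matching `units = 32`.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, seed=0):
    """Sample a fan_in x fan_out matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_out)]
               for _ in range(fan_in)]
    return weights, limit


weights, limit = glorot_uniform(fan_in=32, fan_out=32)
print(round(limit, 4))  # 0.3062
```

The bound shrinks as the layer gets wider, which is what keeps the initial values from being too large or too small.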

In [9]:
def gru(units):
    # In TF 2.x, tf.keras.layers.GRU selects the cuDNN kernel automatically on a GPU
    # when its arguments are cuDNN-compatible, so no separate CuDNNGRU layer is needed.
    return tf.keras.layers.GRU(units, 
                               return_sequences=True, 
                               return_state=True, 
                               recurrent_initializer='glorot_uniform')

### Encoder & Decoder

In [10]:
# Encoder Class
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.enc_units)
        
    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)
        
        return output, state
    
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

In [11]:
# Decoder Class
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = gru(self.dec_units)
        self.fc = tf.keras.layers.Dense(vocab_size)
                
    def call(self, x, hidden, enc_output):
        # NOTE: enc_output is accepted but not used in this model (there is no attention)
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)
        
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        
        # output shape == (batch_size * 1, vocab)
        x = self.fc(output)
        
        return x, state
        
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.dec_units))

In [12]:
encoder = Encoder(len(source2idx), embedding_dim, units, batch_size)
decoder = Decoder(len(target2idx), embedding_dim, units, batch_size)

### Loss & Optimizer Function

In [13]:
def loss_function(real, pred):
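    # 1.0 for real tokens, 0.0 for <pad> (index 0), so padded positions add no loss
    mask = tf.cast(tf.not_equal(real, 0), pred.dtype)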
    loss_ = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=real, logits=pred) * mask
    
#     print("real: {}".format(real))
#     print("pred: {}".format(pred))
#     print("mask: {}".format(mask))
#     print("loss: {}".format(tf.reduce_mean(loss_)))
    
    return tf.reduce_mean(loss_)

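# creating optimizer
# NOTE: Adam() runs with its default learning rate (0.001); `learning_rate` from the hyper-parameter cell is not passed in
optimizer = tf.keras.optimizers.Adam()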

# creating check point (Object-based saving)
checkpoint_dir = './checkpoints/training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt')
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                encoder=encoder,
                                decoder=decoder)

# create writer for tensorboard
#summary_writer = tf.summary.create_file_writer(logdir=checkpoint_dir)
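
The masking idea in `loss_function` above can be checked by hand with a small pure-Python version. This is a sketch assuming `<pad>` has index 0; `masked_cross_entropy` is a hypothetical name, and, like `loss_function`, it averages over all positions, including the masked ones.

```python
import math


def masked_cross_entropy(labels, logits, pad_id=0):
    """Mean softmax cross-entropy with the loss at padded positions zeroed out."""
    losses = []
    for label, row in zip(labels, logits):
        # log-sum-exp with max subtraction for numerical stability
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        loss = log_z - row[label]              # -log softmax(row)[label]
        mask = 0.0 if label == pad_id else 1.0
        losses.append(loss * mask)
    return sum(losses) / len(losses)           # mean over every position


labels = [1, 0]                    # the second position is <pad>
logits = [[0.0] * 4, [0.0] * 4]    # uniform scores over 4 classes
print(round(masked_cross_entropy(labels, logits), 4))  # 0.6931
```

With uniform logits each unmasked position costs ln 4, and only one of the two positions counts, so the mean is ln 4 / 2.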

### Training
#### Teacher Forcing
- Example: learning *'I feel hungry'*
##### W/O Teacher Forcing
| X        | Y predict |
| -------- | --------- |
| [bos]    | a         |
| [bos], a | ?         |
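- When [bos] is given as the first input, Y predict should be I, but the model outputs a.
   Feeding that wrong prediction back in as the next input means the following steps are trained on a bad input from the start.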
##### Teacher Forcing
| X              | Y predict     |
| -------------- | ------------- |
| [bos]          | I             |
| [bos], I       | I feel        |
| [bos], I, feel | I feel hungry |
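
The alignment that teacher forcing relies on can be written out directly: at every step the decoder receives the ground-truth previous token and is expected to produce the next one. The sketch below is illustrative only; `teacher_forcing_pairs` is a hypothetical helper, and it uses the same 'I feel hungry' example as the tables above.

```python
def teacher_forcing_pairs(tokens, bos='<bos>', eos='<eos>'):
    """Return (decoder input, expected output) for every decoding step."""
    full = [bos] + tokens + [eos]
    return list(zip(full[:-1], full[1:]))


pairs = teacher_forcing_pairs(['I', 'feel', 'hungry'])
for step_input, expected in pairs:
    print(step_input, '->', expected)
# <bos> -> I
# I -> feel
# feel -> hungry
# hungry -> <eos>
```

The training loop that follows does the same thing with `t_input`: `t_input[:, t]` is both the target at step `t` and the input for step `t + 1`.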

In [14]:
for epoch in range(epochs):
    
    hidden = encoder.initialize_hidden_state()
    total_loss = 0
    
    for i, (s_len, s_input, t_len, t_input, t_output) in enumerate(data):
        loss = 0
        with tf.GradientTape() as tape:
            enc_output, enc_hidden = encoder(s_input, hidden)

            dec_hidden = enc_hidden
            
            dec_input = tf.expand_dims([target2idx['<bos>']] * batch_size, 1)
            
            # Teacher forcing: feed the ground-truth token as the next input
            for t in range(1, t_input.shape[1]):
                
                predictions, dec_hidden = decoder(dec_input, dec_hidden, enc_output)
                
                # t_input starts with <bos>, so t_input[:, t] is the token expected at this step
                loss += loss_function(t_input[:, t], predictions)
            
                dec_input = tf.expand_dims(t_input[:, t], 1)  # ground truth, not the prediction
                
        batch_loss = (loss / int(t_input.shape[1]))
        
        total_loss += batch_loss
        
        variables = encoder.variables + decoder.variables
        
        gradient = tape.gradient(loss, variables)
        
        optimizer.apply_gradients(zip(gradient, variables))
        
    if epoch % 10 == 0:
        # save a checkpoint every 10 epochs
        print('Epoch {} Loss {:.4f} Batch Loss {:.4f}'.format(epoch,
                                            total_loss / n_batch,
                                            batch_loss.numpy()))
        checkpoint.save(file_prefix = checkpoint_prefix)

Epoch 0 Loss 0.0396 Batch Loss 0.9896
Epoch 10 Loss 0.0386 Batch Loss 0.9662
Epoch 20 Loss 0.0373 Batch Loss 0.9316
Epoch 30 Loss 0.0347 Batch Loss 0.8685
Epoch 40 Loss 0.0316 Batch Loss 0.7894
Epoch 50 Loss 0.0283 Batch Loss 0.7078
Epoch 60 Loss 0.0251 Batch Loss 0.6273
Epoch 70 Loss 0.0218 Batch Loss 0.5446
Epoch 80 Loss 0.0185 Batch Loss 0.4637
Epoch 90 Loss 0.0155 Batch Loss 0.3865
Epoch 100 Loss 0.0126 Batch Loss 0.3157
Epoch 110 Loss 0.0102 Batch Loss 0.2545
Epoch 120 Loss 0.0082 Batch Loss 0.2053
Epoch 130 Loss 0.0067 Batch Loss 0.1686
Epoch 140 Loss 0.0057 Batch Loss 0.1425
Epoch 150 Loss 0.0050 Batch Loss 0.1244
Epoch 160 Loss 0.0045 Batch Loss 0.1119
Epoch 170 Loss 0.0041 Batch Loss 0.1032
Epoch 180 Loss 0.0039 Batch Loss 0.0969
Epoch 190 Loss 0.0037 Batch Loss 0.0923


In [15]:

checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x25f75ca1a90>

### Testing

In [16]:
sentence = 'I feel hungry'

In [17]:
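def prediction(sentence, encoder, decoder, inp_lang, targ_lang, max_length_inp, max_length_targ):
    # Greedy decoding: at each step, pick the most likely token and feed it back in.
    # NOTE: relies on the globals `units` and `idx2target`.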
    inputs = [inp_lang[i] for i in sentence.split(' ')]
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], maxlen=max_length_inp, padding='post')
    inputs = tf.convert_to_tensor(inputs)
        
    result = ''
    
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)
        
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang['<bos>']], 0)
    
    for t in range(max_length_targ):
        predictions, dec_hidden = decoder(dec_input, dec_hidden, enc_out)
        
        predicted_id = tf.argmax(predictions[0]).numpy()

        result += idx2target[predicted_id] + ' '

        if idx2target.get(predicted_id) == '<eos>':
            return result, sentence
        
        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)    
    
    return result, sentence
    
result, output_sentence = prediction(sentence, encoder, decoder, source2idx, target2idx, s_max_len, t_max_len)

print(sentence)
print(result)

I feel hungry
나는 배가 고프다 <eos> 
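
The decoding loop in `prediction` is a greedy search: take the arg-max token, feed it back in, and stop at `<eos>` or at the length limit. The sketch below isolates that loop from TensorFlow. It is an illustration under assumptions: `greedy_decode`, `toy_step`, and the fixed `TABLE` are hypothetical stand-ins for the trained decoder, not part of the notebook.

```python
def greedy_decode(step_fn, state, bos, eos, max_len):
    """Repeatedly feed the previous prediction back in until <eos> or max_len."""
    token, output = bos, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(scores, key=scores.get)  # arg-max over the vocabulary
        output.append(token)
        if token == eos:
            break
    return output


# Hypothetical stand-in for a trained decoder: a fixed next-token table.
TABLE = {'<bos>': '나는', '나는': '배가', '배가': '고프다', '고프다': '<eos>'}
VOCAB = ['<eos>', '나는', '배가', '고프다']


def toy_step(token, state):
    """Score the table's next token 1.0 and everything else 0.0."""
    scores = {word: (1.0 if word == TABLE[token] else 0.0) for word in VOCAB}
    return scores, state


decoded = greedy_decode(toy_step, None, '<bos>', '<eos>', max_len=12)
print(decoded)  # ['나는', '배가', '고프다', '<eos>']
```

Unlike training with teacher forcing, there is no ground truth at this stage, so an early mistake is carried into every later step.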
