## Deep Learning for Everyone #6

Last time we looked at the concept of RNNs; now let's take a quick look at how they can be applied.

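### Study outline
- A look at the Seq2Seq model
    - Encoder: reads the input sequence.
    - Decoder: takes the encoder's final hidden state and runs another RNN from it to generate the output sequence.
    - Attention: weights the encoder's hidden state at each input step, combines them, and feeds the result into each decoder hidden state to produce the final output.
- Goal: using toy code, train a model that takes a string into the encoder and produces the reversed string from the decoder.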

![seq2seq](https://www.tensorflow.org/versions/master/images/seq2seq/attention_mechanism.jpg)
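To make the Attention item in the outline concrete, here is a minimal NumPy sketch of dot-product attention for a single decoder step. It is only an illustration under simplifying assumptions: the function names (`softmax`, `dot_attention`) and the random toy inputs are made up for this example, and the learned projection layers that a real attention layer adds are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def dot_attention(decoder_state, encoder_states):
    """decoder_state: [hidden]; encoder_states: [time, hidden]."""
    scores = encoder_states @ decoder_state   # one score per input step: [time]
    weights = softmax(scores)                 # attention weights, sum to 1
    context = weights @ encoder_states        # weighted sum of encoder states: [hidden]
    return weights, context

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))      # 5 input steps, hidden size 4 (toy values)
decoder_state = rng.normal(size=4)
weights, context = dot_attention(decoder_state, encoder_states)
print(weights.shape, context.shape)           # (5,) (4,)
```

The `context` vector is what gets combined with the decoder's own hidden state at that step before the output projection.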

### 0. Data preprocessing

0. `load_data`: loads the text.
1. `extract_character_vocab`: builds a character-level vocabulary.
2. `pad_sentence_batch`: pads every sequence in a batch to the length of the longest one.
3. `get_batches`: yields batches from the data.
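The steps above can be tried on a tiny in-memory string before touching the real file. This is a simplified sketch, not the notebook's code: `build_vocab`, `encode`, and `pad_batch` are hypothetical stand-ins for the functions defined in the next cell, and the characters are sorted here (an assumption made for this example) so that the ids are reproducible.

```python
PAD, UNK, GO, EOS = 0, 1, 2, 3   # same ids as the notebook's special tokens

def build_vocab(text):
    specials = ['<PAD>', '<UNK>', '<GO>', '<EOS>']
    # Sorting is an assumption of this sketch; it makes the ids deterministic.
    chars = sorted(set(ch for line in text.split('\n') for ch in line))
    int_to_char = dict(enumerate(specials + chars))
    char_to_int = {ch: i for i, ch in int_to_char.items()}
    return int_to_char, char_to_int

def encode(line, char_to_int):
    # Characters that are not in the vocabulary fall back to UNK.
    return [char_to_int.get(ch, UNK) for ch in line]

def pad_batch(batch, pad_id=PAD):
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

toy_text = "abc\nba"
int_to_char, char_to_int = build_vocab(toy_text)
ids = [encode(line, char_to_int) for line in toy_text.split('\n')]
padded = pad_batch(ids)
print(padded)                     # [[4, 5, 6], [5, 4, 0]]
print(encode("z", char_to_int))   # [1]
```

The first four ids are reserved for the special tokens, so real characters start at id 4.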

In [1]:
import numpy as np
import tensorflow as tf
import os
import sys

# Strings have different lengths, so shorter ones are padded with 0 up to a common length.
PAD = 0
# Characters missing from the vocabulary are mapped to unknown (UNK), id 1.
UNK = 1
# GO marks the start of a sequence; it is fed to the decoder as its first input.
GO = 2
# EOS marks the end of a sequence; the model learns to emit it when generation should stop.
EOS = 3
start_token = GO
end_token = EOS


def load_data(path):
    input_file = os.path.join(path)
    with open(input_file, "r", encoding='utf-8', errors='ignore') as f:
        data = f.read()
    return data


def extract_character_vocab(data):
    special_words = ['<PAD>', '<UNK>', '<GO>', '<EOS>']
    set_words = set([character for line in data.split('\n') for character in line])
    int_to_vocab = {word_i: word for word_i, word in enumerate(special_words + list(set_words))}
    vocab_to_int = {word: word_i for word_i, word in int_to_vocab.items()}
    return int_to_vocab, vocab_to_int


def pad_sentence_batch(sentence_batch, pad_int):
    """Pad sentences with <PAD> so that each sentence of a batch has the same length"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [pad_int] * (max_sentence - len(sentence)) for sentence in sentence_batch]


def get_batches(targets, sources, batch_size, source_pad_int, target_pad_int):
    """Batch targets, sources, and the lengths of their sentences together"""
    for batch_i in range(0, len(sources) // batch_size):
        start_i = batch_i * batch_size
        sources_batch = sources[start_i:start_i + batch_size]
        targets_batch = targets[start_i:start_i + batch_size]
        pad_sources_batch = np.array(pad_sentence_batch(sources_batch, source_pad_int))
        pad_targets_batch = np.array(pad_sentence_batch(targets_batch, target_pad_int))

        # Lengths are needed for the *_length placeholders. They are measured
        # after padding, so every entry equals the padded batch length.
        pad_targets_lengths = []
        for target in pad_targets_batch:
            pad_targets_lengths.append(len(target))

        pad_source_lengths = []
        for source in pad_sources_batch:
            pad_source_lengths.append(len(source))
            
        yield pad_targets_batch, pad_sources_batch, np.array(pad_targets_lengths), np.array(pad_source_lengths)        

In [2]:
# Load the data.
source_sentences = load_data('./datasets/letters_source.txt')

In [5]:
# Build the character-to-id and id-to-character dictionaries.
source_int_to_letter, source_letter_to_int = extract_character_vocab(source_sentences)

In [8]:
# Convert each line of characters into a list of ids.
source_letter_ids = [[source_letter_to_int.get(letter, source_letter_to_int['<UNK>']) for letter in line]
                     for line in source_sentences.split('\n')]

In [19]:
len(source_int_to_letter)

30

In [20]:
len(source_letter_ids)

10000

In [14]:
config = Config()
valid_source = source_letter_ids[:config.batch_size]
valid_target = [list(reversed(i)) + [3] for i in valid_source]
(valid_targets_batch, valid_sources_batch, valid_targets_lengths, valid_sources_lengths) = \
    next(get_batches(valid_target, valid_source, config.batch_size, 0, 0))

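### 1. config
1. `rnn_size`: the size of the hidden state of the encoder and decoder.
2. `embedding_size`: set separately for the encoder and the decoder; it determines how many dimensions are used to represent each input symbol as a vector. (A larger value can represent more information but costs more computation. With a lot of data, larger tends to be better, but this is a simple example, so it is kept small.)
3. `source_vocab`: the number of words in the vocabulary. Embedding maps each word to a vector, so this value sets the number of rows of the embedding matrix. In this example it is the number of distinct characters.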
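How `source_vocab` and the embedding size fix the shape of the embedding matrix can be seen in a small NumPy sketch. The token ids below are made-up toy values, and plain array indexing stands in for the embedding lookup; this is an illustration, not the model code.

```python
import numpy as np

source_vocab = 30      # number of distinct symbols, as in Config
embedding_size = 15    # dimensions per symbol, as in Config

rng = np.random.default_rng(0)
# One row per symbol in the vocabulary.
embedding_matrix = rng.uniform(-1.0, 1.0, size=(source_vocab, embedding_size))

# A toy batch of ids with shape [batch_size, max_time_steps]; 0 is <PAD>.
token_ids = np.array([[4, 5, 6, 0],
                      [7, 8, 0, 0]])

# Indexing with an integer array picks one row per id: a 2-D input becomes 3-D.
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (2, 4, 15)
```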

In [21]:
class Config():
    def __init__(self):
        self.epochs = 15
        # Batch Size
        self.batch_size = 128
        # RNN Size
        self.rnn_size = 50
        # Embedding Size
        self.encoding_embedding_size = 15
        self.decoding_embedding_size = 15
        # Learning Rate
        self.learning_rate = 0.001
        self.source_vocab = 30
        self.max_decode_step = 10

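### 2. Model
0. `__init__`: there are two modes, `train` and `inference`. The encoder input is the same in both, but what the decoder receives differs considerably.
    - train: trains the model.
    - inference: predicts the output for a given input. (Think of it as test time.)
1. `add_placeholders`: 
    - The encoder and decoder inputs have shape `[batch_size, max_time_step]`. They are not 3-D tensors yet; they become 3-D after passing through the embedding matrix.
    - In train mode the decoder input is modified slightly. In the figure above, `<s>` (**start token**) is prepended to the decoder input, and `</s>` (**end token**) is appended to the decoder output (**the decoder input shifted by one step**). Both have to be implemented.
2. `build_encoder`: implemented as a bi-directional RNN with an LSTM in each direction. Note that the outputs and states of the two directions must be concatenated before use.
3. `build_decoder`: 
    - decoder: an LSTM. The encoder outputs pass through attention, and the result is combined with the decoder's output. This is then passed through a Dense layer to produce the final output (**logits over the vocabulary, i.e. a softmax prediction**).
    - loss: padding must not contribute to backpropagation, so a mask is built from the sequence lengths and applied when computing the cross-entropy loss.
    - optimizer: gradients in RNNs often blow up, so gradient clipping is applied to prevent this. (**If the global norm of all gradients exceeds a threshold, here 5.0, they are scaled down so that the norm equals the threshold.**)
    - inference: in **inference** mode the model has to predict, and it does so **greedily** (at each step it picks the single most probable symbol and feeds it back in as the next input).
4. `train`, `eval`: pass in the data for the encoder and decoder together with their lengths (used to account for padding) to run a training step or evaluate the loss.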
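The start/end token handling described for `add_placeholders` is easy to get wrong, so here is a NumPy sketch of just that step. `make_decoder_tensors` is a hypothetical helper written for this illustration; it mirrors the concatenations in the model code but is not part of it, and the ids are toy values.

```python
import numpy as np

GO, EOS = 2, 3   # same ids as the notebook's special tokens

def make_decoder_tensors(decoder_inputs, decoder_lengths):
    """decoder_inputs: [batch, time] int array; decoder_lengths: [batch]."""
    batch_size = decoder_inputs.shape[0]
    go_col = np.full((batch_size, 1), GO, dtype=decoder_inputs.dtype)
    eos_col = np.full((batch_size, 1), EOS, dtype=decoder_inputs.dtype)
    # What the decoder reads: GO first, then the sequence.
    inputs_train = np.concatenate([go_col, decoder_inputs], axis=1)
    # What the decoder should produce: the sequence, then EOS.
    targets_train = np.concatenate([decoder_inputs, eos_col], axis=1)
    # Both tensors are one step longer than the original input.
    lengths_train = decoder_lengths + 1
    return inputs_train, targets_train, lengths_train

decoder_inputs = np.array([[6, 5, 4],
                           [5, 4, 7]])
decoder_lengths = np.array([3, 3])
inputs_train, targets_train, lengths_train = make_decoder_tensors(
    decoder_inputs, decoder_lengths)
print(inputs_train)    # [[2 6 5 4] [2 5 4 7]]
print(targets_train)   # [[6 5 4 3] [5 4 7 3]]
```

At every time step the target is the symbol that follows the current decoder input, which is why the two tensors are the same sequence offset by one position.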

In [22]:
"""Seq2Seq + Attention model"""

import tensorflow as tf
from tensorflow.python.layers.core import Dense

class Model(object):
    def __init__(self, config, mode):
        self.config = config
        self.mode = mode

    def add_placeholders(self):
        # encoder_inputs : [batch_size, max_time_steps]
        self.encoder_inputs = tf.placeholder(
            dtype=tf.int32, shape=(None, None), name='encoder_inputs')
        # encoder_inputs_length : [batch_size]
        self.encoder_inputs_length = tf.placeholder(
            dtype=tf.int32, shape=(None,), name='encoder_inputs_length')
        # get dynamic batch_size
        self.batch_size = tf.shape(self.encoder_inputs)[0]

        # In train mode the decoder needs additional, modified inputs.
        if self.mode == 'train':
            # decoder_inputs : [batch_size, max_time_steps]
            self.decoder_inputs = tf.placeholder(
                dtype=tf.int32, shape=(None, None), name='decoder_inputs')
            # decoder_inputs_length: [batch_size]
            self.decoder_inputs_length = tf.placeholder(
                dtype=tf.int32, shape=(None,), name='decoder_inputs_length')

            decoder_start_token = tf.ones(
                shape=[self.batch_size, 1], dtype=tf.int32) * start_token
            decoder_end_token = tf.ones(
                shape=[self.batch_size, 1], dtype=tf.int32) * end_token

            # decoder_inputs_train: [batch_size , max_time_steps + 1]
            # insert _GO symbol in front of each decoder input
            self.decoder_inputs_train = tf.concat([decoder_start_token,
                                                  self.decoder_inputs], axis=1)

            # decoder_inputs_length_train: [batch_size]
            self.decoder_inputs_length_train = self.decoder_inputs_length + 1

            # decoder_targets_train: [batch_size, max_time_steps + 1]
            # insert EOS symbol at the end of each decoder input
            self.decoder_targets_train = tf.concat([self.decoder_inputs,
                                                   decoder_end_token], axis=1)

    def build_encoder(self):
        # ENCODER (bi-directional)
        encoder_embeddings = tf.get_variable(
            name='embeddings', shape=[self.config.source_vocab, self.config.encoding_embedding_size])
        encoder_embedded = tf.nn.embedding_lookup(encoder_embeddings, self.encoder_inputs)

        # Bi-RNN
        encoder_cell_fw = tf.contrib.rnn.BasicLSTMCell(self.config.rnn_size)
        encoder_cell_bw = tf.contrib.rnn.BasicLSTMCell(self.config.rnn_size)
        encoder_out, encoder_state = tf.nn.bidirectional_dynamic_rnn(
            encoder_cell_fw, encoder_cell_bw, encoder_embedded, self.encoder_inputs_length, dtype=tf.float32)
        encoder_c_state = tf.concat([encoder_state[0].c, encoder_state[1].c], -1)
        encoder_h_state = tf.concat([encoder_state[0].h, encoder_state[1].h], -1)
        self.encoder_state = tf.contrib.rnn.LSTMStateTuple(encoder_c_state, encoder_h_state)
        self.encoder_out = tf.concat(encoder_out, -1)

    def build_decoder(self):
        # ATTENTION
        decoder_cell = tf.contrib.rnn.BasicLSTMCell(self.config.rnn_size * 2)

        attention_mechanism = tf.contrib.seq2seq.LuongAttention(
            num_units=self.config.rnn_size * 2,
            memory=self.encoder_out,
            memory_sequence_length=self.encoder_inputs_length)

        # attention wrapper
        decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
            cell=decoder_cell,
            attention_mechanism=attention_mechanism,
            attention_layer_size=self.config.rnn_size * 2)

        decoder_initial_state = decoder_cell.zero_state(
            self.config.batch_size, tf.float32).clone(cell_state=self.encoder_state)

        # DECODER
        Y_vocab_size = self.config.source_vocab
        output_layer = Dense(self.config.source_vocab, name='output_projection')
        self.decoder_embedding = tf.Variable(
            tf.random_uniform([Y_vocab_size, self.config.decoding_embedding_size], -1.0, 1.0),
            name='decoder_embedding')

        if self.mode == 'train':
            decoder_embedded = tf.nn.embedding_lookup(self.decoder_embedding, self.decoder_inputs_train)
            training_helper = tf.contrib.seq2seq.TrainingHelper(
                inputs=decoder_embedded,
                sequence_length=self.decoder_inputs_length_train,
                time_major=False,
                name='training_helper')

            # basic decoder
            training_decoder = tf.contrib.seq2seq.BasicDecoder(
                cell=decoder_cell,
                helper=training_helper,
                initial_state=decoder_initial_state,
                output_layer=output_layer)

            self.max_decoder_length = tf.reduce_max(self.decoder_inputs_length_train)
            training_decoder_output = tf.contrib.seq2seq.dynamic_decode(
                decoder=training_decoder,
                impute_finished=True,
                maximum_iterations=self.max_decoder_length)[0]
            training_logits = tf.identity(
                training_decoder_output.rnn_output, name='logits')

            # LOSS
            masks = tf.sequence_mask(self.decoder_inputs_length_train, self.max_decoder_length, dtype=tf.float32, name='mask')
            self.loss = tf.contrib.seq2seq.sequence_loss(
                logits=training_logits, targets=self.decoder_targets_train, weights=masks)

            # BACKWARD
            params = tf.trainable_variables()
            gradients = tf.gradients(self.loss, params)
            clipped_gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
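            # Use the learning rate from Config instead of relying on the optimizer default.
            self.train_op = tf.train.AdamOptimizer(self.config.learning_rate).apply_gradients(
                zip(clipped_gradients, params))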

        # INFERENCE
        elif self.mode == 'inference':
            start_tokens = tf.ones([self.config.batch_size, ], dtype=tf.int32) * start_token
            end_tokens = end_token

            inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
                embedding=self.decoder_embedding, start_tokens=start_tokens, end_token=end_tokens)
            inference_decoder = tf.contrib.seq2seq.BasicDecoder(
                cell=decoder_cell,
                helper=inference_helper,
                initial_state=decoder_initial_state,
                output_layer=output_layer)
            inference_decoder_output = tf.contrib.seq2seq.dynamic_decode(
                inference_decoder, impute_finished=True, maximum_iterations=self.config.max_decode_step)[0]
            self.inference_logits = tf.identity(
                inference_decoder_output.sample_id, name='predictions')

    # TRAIN STEP
    def train(self, sess, encoder_inputs, encoder_inputs_length,
              decoder_inputs, decoder_inputs_length):
        """
        :param sess: session
        :param encoder_inputs:
        :param encoder_inputs_length:
        :param decoder_inputs:
        :param decoder_inputs_length:
        :return:
        """
        # Make sure the model was built in train mode.
        if self.mode.lower() != 'train':
            raise ValueError("train step can only be operated in train mode")
        input_feed = self.check_feeds(encoder_inputs, encoder_inputs_length,
                                      decoder_inputs, decoder_inputs_length, False)

        output_feed = [self.train_op, self.loss]
        outputs = sess.run(output_feed, input_feed)
        return outputs[1]

    # EVAL STEP
    def eval(self, sess, encoder_inputs, encoder_inputs_length,
              decoder_inputs, decoder_inputs_length):
        """
        :param sess: session
        :param encoder_inputs:
        :param encoder_inputs_length:
        :param decoder_inputs:
        :param decoder_inputs_length:
        :return:
        """
        # Make sure the model was built in train mode (the loss needs decoder inputs).
        if self.mode.lower() != 'train':
            raise ValueError("eval step can only be operated in train mode")
        input_feed = self.check_feeds(encoder_inputs, encoder_inputs_length,
                                      decoder_inputs, decoder_inputs_length, False)

        output_feed = [self.loss]
        outputs = sess.run(output_feed, input_feed)
        return outputs[0]


    def check_feeds(self, encoder_inputs, encoder_inputs_length,
                    decoder_inputs, decoder_inputs_length, decode):
        """
        Args:
          encoder_inputs: a numpy int matrix of [batch_size, max_source_time_steps]
              to feed as encoder inputs
          encoder_inputs_length: a numpy int vector of [batch_size]
              to feed as sequence lengths for each element in the given batch
          decoder_inputs: a numpy int matrix of [batch_size, max_target_time_steps]
              to feed as decoder inputs
          decoder_inputs_length: a numpy int vector of [batch_size]
              to feed as sequence lengths for each element in the given batch
          decode: a scalar boolean that indicates inference mode
        Returns:
          A feed for the model that consists of encoder_inputs, encoder_inputs_length,
          decoder_inputs, decoder_inputs_length
        """

        input_batch_size = encoder_inputs.shape[0]
        if input_batch_size != encoder_inputs_length.shape[0]:
            raise ValueError("Encoder inputs and their lengths must be equal in their "
                             "batch_size, %d != %d" % (input_batch_size, encoder_inputs_length.shape[0]))

        if not decode:
            target_batch_size = decoder_inputs.shape[0]
            if target_batch_size != input_batch_size:
                raise ValueError("Encoder inputs and Decoder inputs must be equal in their "
                                 "batch_size, %d != %d" % (input_batch_size, target_batch_size))
            if target_batch_size != decoder_inputs_length.shape[0]:
                raise ValueError("Decoder targets and their lengths must be equal in their "
                                 "batch_size, %d != %d" % (target_batch_size, decoder_inputs_length.shape[0]))

        input_feed = {}

        input_feed[self.encoder_inputs.name] = encoder_inputs
        input_feed[self.encoder_inputs_length.name] = encoder_inputs_length

        if not decode:
            input_feed[self.decoder_inputs.name] = decoder_inputs
            input_feed[self.decoder_inputs_length.name] = decoder_inputs_length

        return input_feed


    def build_model(self):
        self.add_placeholders()
        self.build_encoder()
        self.build_decoder()
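The loss and optimizer steps in `build_decoder` can be sketched in NumPy to show what the mask and the clipping actually do. This is an illustration under assumptions: both functions are hypothetical re-implementations written for this example (a weighted average of per-token cross entropy, and rescaling by the global norm), not the TensorFlow code itself.

```python
import numpy as np

def masked_sequence_loss(logits, targets, mask):
    """logits: [batch, time, vocab]; targets and mask: [batch, time]."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    batch_idx, time_idx = np.indices(targets.shape)
    token_nll = -log_probs[batch_idx, time_idx, targets]   # [batch, time]
    # Masked positions contribute nothing; average over the unmasked ones.
    return (token_nll * mask).sum() / mask.sum()

def clip_by_global_norm(grads, clip_norm):
    # The global norm treats all gradients as one long vector.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1 when the norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Toy check: the second time step is predicted badly but is masked out.
logits = np.array([[[10.0, 0.0], [10.0, 0.0]]])
targets = np.array([[0, 1]])
print(masked_sequence_loss(logits, targets, np.array([[1.0, 0.0]])))  # close to 0
print(masked_sequence_loss(logits, targets, np.array([[1.0, 1.0]])))  # about 5

clipped, norm = clip_by_global_norm([np.array([3.0, 4.0]), np.array([0.0])], 1.0)
print(norm, clipped[0])   # 5.0 [0.6 0.8]
```

Because every gradient is multiplied by the same factor, clipping by global norm shrinks the update without changing its direction.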

### Train

1. Checkpoint path: create the folders where the model and summaries are saved.
2. Build the training (`train_source`) and validation (`valid_source`) data.
3. Call all the code defined above and train.
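Step 2 can be checked on toy data first. The sketch below uses made-up id lists and a small `batch_size`; `make_targets` is a hypothetical helper that does the same thing as the list comprehensions in the next cell (reverse each source and append EOS).

```python
EOS = 3
batch_size = 2   # toy value; the notebook uses config.batch_size = 128

# Toy encoded lines standing in for source_letter_ids.
source_ids = [[4, 5, 6], [5, 4], [6, 6, 4], [4], [5, 6]]

def make_targets(sources):
    # The target is the reversed source followed by EOS.
    return [list(reversed(seq)) + [EOS] for seq in sources]

# The first batch_size lines are held out for validation, the rest are for training.
valid_source = source_ids[:batch_size]
train_source = source_ids[batch_size:]
valid_target = make_targets(valid_source)
train_target = make_targets(train_source)

print(valid_target)                      # [[6, 5, 4, 3], [4, 5, 3]]
print(len(train_source) // batch_size)   # 1 (full batches only; the remainder is dropped)
```

With the real data the same arithmetic gives (10000 - 128) // 128 = 77 batches per epoch, which matches the `/77` shown in the training log.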

In [None]:
import datetime

output_dir = './datasets/runs'

# checkpoint path
time_now = datetime.datetime.now().strftime('%Y-%m-%d %H-%M-%S')
output_dir = os.path.join(output_dir, time_now)
checkpoint_dir = os.path.join(output_dir, 'checkpoint')
summary_dir = os.path.join(output_dir, 'summaries')
checkpoint_prefix = os.path.join(checkpoint_dir, 'model')
if not os.path.exists(output_dir):
    os.makedirs(checkpoint_dir)
    os.makedirs(summary_dir)

# config
config = Config()

# Hold out the first batch for validation; targets are the reversed source plus EOS.
train_source = source_letter_ids[config.batch_size:]
train_target = [list(reversed(i)) + [EOS] for i in train_source]
valid_source = source_letter_ids[:config.batch_size]
valid_target = [list(reversed(i)) + [EOS] for i in valid_source]
(valid_targets_batch, valid_sources_batch, valid_targets_lengths, valid_sources_lengths) = \
    next(get_batches(valid_target, valid_source, config.batch_size, PAD, PAD))

display_step = 20 # Check training loss after every 20 batches

model = Model(config, mode='train')
model.build_model()

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for epoch_i in range(1, config.epochs+1):
    for batch_i, (targets_batch, sources_batch, targets_lengths, sources_lengths) in enumerate(
            get_batches(train_target, train_source, config.batch_size, 0, 0)):
        train_loss = model.train(sess,
                                 sources_batch,
                                 sources_lengths,
                                 targets_batch,
                                 targets_lengths)

        if batch_i % display_step == 0 and batch_i > 0:
            val_loss = model.eval(sess,
                                  valid_sources_batch,
                                  valid_sources_lengths,
                                  valid_targets_batch,
                                  valid_targets_lengths)
            print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}  - Validation loss: {:>6.3f}'
                  .format(epoch_i, config.epochs, batch_i,
                          len(train_source) // config.batch_size, train_loss, val_loss))
saver = tf.train.Saver()
saver.save(sess, checkpoint_prefix)
print('save model checkpoint to {}'.format(checkpoint_prefix))

Epoch   1/15 Batch   20/77 - Loss:  1.931  - Validation loss:  2.000
Epoch   1/15 Batch   40/77 - Loss:  1.877  - Validation loss:  1.833
Epoch   1/15 Batch   60/77 - Loss:  1.600  - Validation loss:  1.637
Epoch   2/15 Batch   20/77 - Loss:  1.400  - Validation loss:  1.459
Epoch   2/15 Batch   40/77 - Loss:  1.447  - Validation loss:  1.406
Epoch   2/15 Batch   60/77 - Loss:  1.332  - Validation loss:  1.369
Epoch   3/15 Batch   20/77 - Loss:  1.253  - Validation loss:  1.304
Epoch   3/15 Batch   40/77 - Loss:  1.312  - Validation loss:  1.257
Epoch   3/15 Batch   60/77 - Loss:  1.151  - Validation loss:  1.176
Epoch   4/15 Batch   20/77 - Loss:  0.929  - Validation loss:  0.971
Epoch   4/15 Batch   40/77 - Loss:  0.878  - Validation loss:  1.208
Epoch   4/15 Batch   60/77 - Loss:  0.687  - Validation loss:  0.728
Epoch   5/15 Batch   20/77 - Loss:  0.286  - Validation loss:  0.340
Epoch   5/15 Batch   40/77 - Loss:  0.202  - Validation loss:  0.209
Epoch   5/15 Batch   60/77 - Loss:

### Decode

- Set the checkpoint directory and restore the model.
- Enter the text to predict on.
- Convert the text to ids.
- Predict with the model.
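The prediction step relies on greedy decoding, which can be sketched without TensorFlow. In the example below `greedy_decode` is a hypothetical loop written for illustration, and `toy_step` is a made-up stand-in for the trained decoder that simply reads ids off the end of a list; only the control flow (start from GO, take the argmax, stop at EOS or at a step limit) reflects the real procedure.

```python
import numpy as np

GO, EOS = 2, 3   # same ids as the notebook's special tokens

def greedy_decode(step_fn, initial_state, max_steps):
    token, state, output = GO, initial_state, []
    for _ in range(max_steps):
        logits, state = step_fn(token, state)
        token = int(np.argmax(logits))   # keep only the most probable symbol
        output.append(token)
        if token == EOS:                 # stop once the end token is emitted
            break
    return output

def toy_step(token, state):
    """Hypothetical decoder step: emits the last remaining id, then EOS."""
    next_id = state[-1] if state else EOS
    logits = np.zeros(8)
    logits[next_id] = 1.0
    return logits, state[:-1]

print(greedy_decode(toy_step, [4, 5, 6], max_steps=10))  # [6, 5, 4, 3]
print(greedy_decode(toy_step, [4, 5, 6], max_steps=2))   # [6, 5]
```

The second call shows the role of the step limit (`max_decode_step` in Config): decoding ends there even if EOS has not been produced.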

In [None]:
tf.reset_default_graph()

checkpoint_dir = './datasets/runs/2017-12-21 01-23-49/checkpoint/'

# sample
source_path = './datasets/letters_source.txt'

# Convert source text -> id sequence
def source_to_seq(text):
    '''Prepare the text for the model'''
    sequence_length = 7
    return [source_letter_to_int.get(word, source_letter_to_int['<UNK>']) for word in text]\
           + [source_letter_to_int['<PAD>']]*(sequence_length-len(text))
    
# Convert id sequence -> source text
def seq_to_source(seq):
    return ''.join([source_int_to_letter[i] for i in seq])

config = Config()
sess = tf.Session()
model = Model(config, mode='inference')
model.build_model()

checkpoint_file = tf.train.latest_checkpoint(checkpoint_dir)
saver = tf.train.Saver()
saver.restore(sess, checkpoint_file)
print('model restored from {}'.format(checkpoint_file))

while True:
    text_input = input()
    text = source_to_seq(text_input)
    logits = model.inference_logits
    pred = sess.run(logits, feed_dict={model.encoder_inputs: [text]*config.batch_size,
                                       model.encoder_inputs_length: [len(text)]*config.batch_size})[1]
    print('PREDICTION: {}'.format(seq_to_source(pred)))
    print('INPUT: {}'.format(seq_to_source(text)))

INFO:tensorflow:Restoring parameters from ./datasets/runs/2017-12-21 01-23-49/checkpoint/model
model restored from ./datasets/runs/2017-12-21 01-23-49/checkpoint/model
hi
PREDICTION: ih<EOS>
INPUT: hi<PAD><PAD><PAD><PAD><PAD>
hello
PREDICTION: olleh<EOS>
INPUT: hello<PAD><PAD>
power
PREDICTION: rewop<EOS>
INPUT: power<PAD><PAD>
fuch
PREDICTION: hcuf<EOS>
INPUT: fuch<PAD><PAD><PAD>
fuck
PREDICTION: kcuf<EOS>
INPUT: fuck<PAD><PAD><PAD>
ssibal
PREDICTION: labiss<EOS>
INPUT: ssibal<PAD>
