# SEP532 Artificial Intelligence: Theory and Practice
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

## Advanced Models
### Sequence to Sequence (seq2seq)

Today we will learn a neural network architecture called sequence-to-sequence (seq2seq), which is used for language processing tasks such as machine translation, image captioning, conversational models, and text summarization.

These applications are made possible by the simple but powerful idea of the sequence-to-sequence network, in which two recurrent neural networks work together to transform one sequence into another. An encoder network condenses an input sequence into a single vector, and a decoder network unfolds that vector into a new sequence.

![Sequence to sequence model](images/seq2seq.png)

#### Encoder
- A stack of several recurrent units (LSTM or GRU cells for better performance), each of which accepts a single element of the input sequence, collects information for that element, and propagates it forward.
- In a question-answering problem, the input sequence is the collection of all words in the question. Each word is represented as $x_i$, where $i$ is the position of that word.

#### Context Vector
- This is the final hidden state produced by the encoder part of the model, that is, the state left after the last input element has been processed.
- This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions.
- It acts as the initial hidden state of the decoder part of the model.

#### Decoder
- A stack of several recurrent units, each of which predicts an output $y_t$ at time step $t$.
- Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
- In the question-answering problem, the output sequence is the collection of all words in the answer. Each word is represented as $y_i$, where $i$ is the position of that word.
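
The encoder, context vector, and decoder roles described above can be sketched with a toy recurrent update on a single number. This is only an illustration under stated assumptions: `rnn_step`, its weights, and the scalar hidden state are made up for this sketch and are not the LSTM or GRU cells used later in the notebook.

```python
import math

def rnn_step(hidden, x, w_hidden=0.5, w_input=1.0):
    """One toy recurrent update on a scalar hidden state (weights are made up)."""
    return math.tanh(w_hidden * hidden + w_input * x)

def toy_encode(inputs):
    """Fold the whole input sequence into one final hidden state."""
    hidden = 0.0
    for x in inputs:
        hidden = rnn_step(hidden, x)
    return hidden  # plays the role of the context vector

def toy_decode(context, steps):
    """Unfold the context into `steps` outputs, feeding each output back in."""
    hidden, y = context, 0.0
    outputs = []
    for _ in range(steps):
        hidden = rnn_step(hidden, y)
        y = hidden  # toy readout: the output is the hidden state itself
        outputs.append(y)
    return outputs

context = toy_encode([0.1, 0.2, 0.3])
outputs = toy_decode(context, steps=4)
print(context, outputs)
```

Note that the context depends on the order of the inputs, not only on their values, which is what lets a recurrent encoder distinguish sentences that contain the same words in a different order.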

### Chatbot

In [1]:
import tensorflow as tf
import pandas as pd

### Load and preprocess the dataset

We'll use a Korean chatbot dataset provided by https://github.com/songys/Chatbot_data. The dataset contains almost 12,000 question-and-answer pairs and is distributed as a CSV file.

In [2]:
!wget --output-document="chatbot.csv" https://raw.githubusercontent.com/songys/Chatbot_data/master/ChatbotData%20.csv

--2021-05-10 11:59:08--  https://raw.githubusercontent.com/songys/Chatbot_data/master/ChatbotData%20.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 889842 (869K) [text/plain]
Saving to: ‘chatbot.csv’


2021-05-10 11:59:08 (4.07 MB/s) - ‘chatbot.csv’ saved [889842/889842]



This dataset consists of three columns:
- `Q`: question sentence
- `A`: answer sentence
- `label`: sentiment label

Among these columns, we are going to use only `Q` and `A`. Let's load this dataset using `pandas.read_csv()`. For readability purposes, we are going to rename columns from `Q` and `A` to `question` and `answer` respectively. 

In [3]:
df = pd.read_csv('chatbot.csv', usecols=['Q', 'A']).rename(columns={'Q': 'question', 'A': 'answer'})
df.head(10)

| | question | answer |
|---|---|---|
| 0 | 12시 땡! | 하루가 또 가네요. |
| 1 | 1지망 학교 떨어졌어 | 위로해 드립니다. |
| 2 | 3박4일 놀러가고 싶다 | 여행은 언제나 좋죠. |
| 3 | 3박4일 정도 놀러가고 싶다 | 여행은 언제나 좋죠. |
| 4 | PPL 심하네 | 눈살이 찌푸려지죠. |
| 5 | SD카드 망가졌어 | 다시 새로 사는 게 마음 편해요. |
| 6 | SD카드 안돼 | 다시 새로 사는 게 마음 편해요. |
| 7 | SNS 맞팔 왜 안하지ㅠㅠ | 잘 모르고 있을 수도 있어요. |
| 8 | SNS 시간낭비인 거 아는데 매일 하는 중 | 시간을 정하고 해보세요. |
| 9 | SNS 시간낭비인데 자꾸 보게됨 | 시간을 정하고 해보세요. |


Now the dataset is loaded. To feed it into a neural network, we first need to vectorize it. To do that, we build a tokenizer that splits a sentence into tokens and assigns an index to each token.

For Korean, there are several ways to build a tokenizer; in this notebook, we will use the following two:
- A subword tokenizer
- A tokenizer based on a part-of-speech tagger
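
Before looking at either tokenizer, the basic "split, then index" idea can be sketched with plain whitespace splitting. This is a simplified illustration, not one of the two methods used in this notebook: `build_vocab` and `to_ids` are hypothetical helpers, and whitespace tokens are generally too coarse for Korean, which is why subwords and morphemes are used instead.

```python
def build_vocab(sentences):
    """Assign each distinct whitespace token an id, starting at 1 (0 is kept for padding)."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab

def to_ids(sentence, vocab):
    """Map a sentence to ids; unknown tokens are skipped in this sketch."""
    return [vocab[token] for token in sentence.split() if token in vocab]

toy_sentences = ['SD카드 망가졌어', 'SD카드 안돼']
toy_vocab = build_vocab(toy_sentences)
print(toy_vocab)                        # {'SD카드': 1, '망가졌어': 2, '안돼': 3}
print(to_ids('SD카드 안돼', toy_vocab))  # [1, 3]
```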

#### Build the subword tokenizer
![Subword](images/subword.png)

Let's build a subword tokenizer that splits the given texts into subwords and transforms the subword tokens into integer vectors. To do that, we are going to use `SubwordTextEncoder` from `tensorflow_datasets` (imported below as `tfds.deprecated.text.SubwordTextEncoder`).

In [15]:
corpus = pd.concat([df['question'], df['answer']], ignore_index=True)

In [16]:
import tensorflow_datasets as tfds
SubwordTextEncoder = tfds.deprecated.text.SubwordTextEncoder

tokenizer = SubwordTextEncoder.build_from_corpus(corpus, target_vocab_size=2 ** 13)

In [17]:
start_token = tokenizer.vocab_size
end_token = tokenizer.vocab_size + 1

def encode(text):
    return [start_token] + tokenizer.encode(text) + [end_token]

def decode(tokens):
    return tokenizer.decode(tokens)

In [18]:
number_of_words = tokenizer.vocab_size + 2
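
The reason for `vocab_size + 2` can be shown with a toy example. The sketch below assumes, as the cells above do, that the tokenizer only produces ids below `vocab_size`, so the next two integers are free to serve as start and end markers. The names `toy_vocab_size`, `toy_start`, `toy_end`, and `wrap` are hypothetical.

```python
toy_vocab_size = 5             # hypothetical; ids 0..4 are already taken
toy_start = toy_vocab_size     # first unused id
toy_end = toy_vocab_size + 1   # second unused id

def wrap(token_ids):
    """Surround a token id sequence with the start and end markers."""
    return [toy_start] + list(token_ids) + [toy_end]

wrapped = wrap([3, 1, 4])
total_ids = toy_vocab_size + 2   # size an embedding table would need
print(wrapped)     # [5, 3, 1, 4, 6]
print(total_ids)   # 7
```

Every id in a wrapped sequence stays below `total_ids`, which is why the embedding layers later need to be sized with the two extra entries.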

#### Build the tokenizer based on a part-of-speech tagger
![Part-of-speech](images/part-of-speech.png)

Let's build a tokenizer based on a part-of-speech tagger, which splits the given texts into morphemes and transforms the morpheme tokens into integer vectors. To do that, we are going to use the `konlpy` library, which provides several part-of-speech taggers.

In [8]:
!pip install konlpy



In [32]:
from konlpy.tag import Okt

okt = Okt()
def preprocess_sentence(text):
    return '<start> {} <end>'.format(' '.join(okt.morphs(text)))

corpus = pd.concat([df['question'], df['answer']], ignore_index=True)
corpus = corpus.apply(preprocess_sentence)

In [33]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
tokenizer.fit_on_texts(corpus)

In [34]:
start_token = tokenizer.word_index['<start>']
end_token = tokenizer.word_index['<end>']

def encode(text):
    return [start_token] + tokenizer.texts_to_sequences([' '.join(okt.morphs(text))])[0] + [end_token]

def decode(tokens):
    return tokenizer.sequences_to_texts([tokens])[0]

In [35]:
number_of_words = len(tokenizer.word_index) + 1 

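With a tokenizer and its `encode()` and `decode()` helpers in place, let's vectorize the dataset. Because both tokenizer sections define the same names, whichever section was run last (here, the part-of-speech tokenizer) determines which tokenizer is used from this point on.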

In [36]:
questions = [encode(question) for question in df['question']]
questions = tf.keras.preprocessing.sequence.pad_sequences(questions, padding='post')

answers = [encode(answer) for answer in df['answer']]
answers = tf.keras.preprocessing.sequence.pad_sequences(answers, padding='post')
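
What `padding='post'` does can be sketched in plain Python: every sequence is extended with zeros at the end until it matches the longest one. `pad_post` is a hypothetical stand-in for illustration and ignores the other options of `pad_sequences` (such as `maxlen` and truncation).

```python
def pad_post(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]

padded = pad_post([[2, 7, 3], [2, 3], [2, 5, 9, 8, 3]])
print(padded)
# [[2, 7, 3, 0, 0], [2, 3, 0, 0, 0], [2, 5, 9, 8, 3]]
```

The id `0` therefore marks padding, which matters later when the loss has to ignore those positions.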

Using `tf.data.Dataset`'s methods, shuffle the dataset and split it into batches.

In [37]:
batch_size = 128
number_of_dataset = questions.shape[0]

dataset = tf.data.Dataset.from_tensor_slices((questions, answers)).shuffle(number_of_dataset).batch(batch_size, drop_remainder=True)
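
The effect of shuffling followed by batching with `drop_remainder=True` can be sketched without TensorFlow. `shuffled_batches` is a hypothetical helper written for this illustration; it is not how `tf.data` is implemented.

```python
import random

def shuffled_batches(examples, batch_size, drop_remainder=True, seed=0):
    """Shuffle the examples, cut them into batches, and optionally drop a short last batch."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

toy_batches = shuffled_batches(range(10), batch_size=4)
print(len(toy_batches))                 # 2: the last 2 examples are dropped
print([len(b) for b in toy_batches])    # [4, 4]
```

Dropping the short last batch guarantees that every batch has exactly `batch_size` rows, which is convenient when other code assumes a fixed batch size.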

### Define seq2seq model

Now it is time to build the encoder and decoder models. Because TensorFlow and Keras do not provide these models out of the box, we need to define our own `tf.keras.Model` subclasses.

The `Encoder` model takes an input sequence and produces a context vector that summarizes the whole sequence. To do that, we need the following layers:

- `tf.keras.layers.Embedding`
- `tf.keras.layers.GRU`

In [38]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units, return_state=True)
        
    def call(self, encoder_input, encoder_state):
        # encoder_input = (batch_size, length)
        # encoder_state = (batch_size, units)

        # encoder_input = (batch_size, length, embedding_dim)
        encoder_input = self.embedding(encoder_input)
        
        # encoder_output = (batch_size, units)
        # encoder_state = (batch_size, units)
        encoder_output, encoder_state = self.gru(encoder_input, initial_state=encoder_state)
        
        return encoder_output, encoder_state

The `Decoder` model takes the context vector from the `Encoder` and predicts the next word given the previous words. In other words, the `Decoder` model calculates this conditional probability: $ P(\text{word}_{t + 1}|\text{context}, \text{word}_1, \text{word}_2, \dots, \text{word}_t) $
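
To see why predicting one word at a time is enough, it may help to write out the chain rule for a whole answer of length $T$. This is the standard factorization, stated here in the notation of the formula above (for $t = 0$ the condition is the context alone):

$$
P(\text{word}_1, \dots, \text{word}_T \mid \text{context}) = \prod_{t=0}^{T-1} P(\text{word}_{t+1} \mid \text{context}, \text{word}_1, \dots, \text{word}_t)
$$

Taking the negative logarithm turns the product into a sum of per-step terms:

$$
-\log P(\text{word}_1, \dots, \text{word}_T \mid \text{context}) = -\sum_{t=0}^{T-1} \log P(\text{word}_{t+1} \mid \text{context}, \text{word}_1, \dots, \text{word}_t)
$$

Each term is a cross-entropy for a single next-word prediction, so a model that is good at every individual step assigns a high probability to the whole sentence.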

To do that, we need the following layers:

- `tf.keras.layers.Embedding`
- `tf.keras.layers.GRU`
- `tf.keras.layers.Dense`

In [39]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)
        
    def call(self, decoder_input, decoder_state):
        # decoder_input = (batch_size, 1)
        # decoder_state = (batch_size, units)

        # decoder_input = (batch_size, 1, embedding_dim)
        decoder_input = self.embedding(decoder_input)

        # decoder_output = (batch_size, units)
        # decoder_state = (batch_size, units)
        decoder_output, decoder_state = self.gru(decoder_input, initial_state=decoder_state)
        
        # decoder_output = (batch_size, vocab_size)
        decoder_output = self.fc(decoder_output)
        
        return decoder_output, decoder_state
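
The `Dense` layer at the end of the decoder has no activation, so its outputs are unnormalized scores (logits) rather than probabilities. A softmax turns such scores into the conditional distribution over the next word. The sketch below uses made-up logits for a three-word vocabulary and a hand-written `softmax`, purely for illustration.

```python
import math

def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]  # shift for numerical stability
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [2.0, 1.0, 0.1]   # hypothetical scores for a 3-word vocabulary
probs = softmax(toy_logits)
print(probs)
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits selects the same word as taking the argmax of the probabilities.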

Once both the encoder and decoder are defined, we can instantiate them like normal Python classes.

In [40]:
embedding_dim = 256
units = 1024

encoder = Encoder(number_of_words, embedding_dim, units)
decoder = Decoder(number_of_words, embedding_dim, units)

### Define the loss and optimizer

Let's define the loss function and the optimizer for the seq2seq model. Because the dataset consists of sentences of various lengths that were padded with zeros, we need to take the padding into account when calculating the loss. Otherwise, the padded positions would also contribute and the loss would be larger than it should be. To do that, we create a `mask` matrix and zero out the loss at padded positions.

In [41]:
optimizer = tf.keras.optimizers.Adam()

_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
def calculate_loss(actual, predicted):
    mask = tf.math.logical_not(tf.math.equal(actual, 0))
    loss = _loss(actual, predicted)
    
    mask = tf.cast(mask, dtype=loss.dtype)
    loss *= mask
    
    return tf.reduce_mean(loss)
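
The masking logic can be checked by hand on a toy example. The sketch below mirrors the steps of `calculate_loss` in plain Python for one decoding step over a batch of four sentences; the target ids and per-token loss values are made up.

```python
def masked_mean_loss(targets, per_token_losses, pad_id=0):
    """Zero the loss wherever the target is padding, then average over all positions."""
    masked = [loss if target != pad_id else 0.0
              for target, loss in zip(targets, per_token_losses)]
    return sum(masked) / len(masked)

toy_targets = [7, 0, 4, 0]           # two real tokens, two padding positions
toy_losses = [2.0, 5.0, 1.0, 5.0]    # hypothetical per-token cross-entropy values
print(masked_mean_loss(toy_targets, toy_losses))   # 0.75
print(sum(toy_losses) / len(toy_losses))           # 3.25 without the mask
```

As in `calculate_loss`, the mean is taken over all positions, including the masked ones, so padded positions contribute zero to the sum but still count in the denominator.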

### Train seq2seq model using `tf.GradientTape`

In this notebook, rather than using `tf.keras.Model.fit()`, we will train the model more manually using `tf.GradientTape`.

In [42]:
def train_step(encoder_input, decoder_target):
    loss = 0
    with tf.GradientTape() as tape:
        # Encode the whole question, starting from an all-zero hidden state.
        encoder_state = tf.zeros((batch_size, encoder.units))
        encoder_output, encoder_state = encoder(encoder_input, encoder_state)

        # The context vector initializes the decoder; its first input is the start token.
        decoder_state = encoder_state
        decoder_input = tf.expand_dims([start_token] * batch_size, 1)

        # Teacher forcing: the ground-truth token is fed as the next decoder input.
        for step in range(1, decoder_target.shape[1]):
            predictions, decoder_state = decoder(decoder_input, decoder_state)
            loss += calculate_loss(decoder_target[:, step], predictions)

            decoder_input = tf.expand_dims(decoder_target[:, step], 1)

    # Average over the target length, used for reporting only.
    batch_loss = loss / int(decoder_target.shape[1])

    # Gradients are taken with respect to the summed loss over all steps.
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss
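
The loop inside `train_step` feeds the ground-truth token from the previous position as the decoder input at each step, a scheme commonly called teacher forcing. The sketch below lists the (decoder input, expected output) pairs that this produces for one toy target sequence; the ids and the helper `teacher_forcing_pairs` are hypothetical.

```python
def teacher_forcing_pairs(target_ids):
    """Pair each decoder input with the token it should predict next."""
    return [(target_ids[step - 1], target_ids[step])
            for step in range(1, len(target_ids))]

# Hypothetical ids: 1 = start token, 2 = end token, 0 = padding.
toy_target = [1, 15, 8, 2, 0]
pairs = teacher_forcing_pairs(toy_target)
for decoder_in, expected in pairs:
    print(decoder_in, '->', expected)
```

The last pair has padding as its expected output, which is exactly the kind of position that the masked loss is meant to ignore.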

In [43]:
from tqdm.auto import tqdm

epochs = 10

epoch_loss = tf.keras.metrics.Mean()
with tqdm(total=epochs) as epoch_progress:
    for epoch in range(epochs):
        epoch_loss.reset_states()

        with tqdm(total=number_of_dataset // batch_size) as batch_progress:
            for batch, (encoder_input, decoder_target) in enumerate(dataset):
                batch_loss = train_step(encoder_input, decoder_target)
                epoch_loss(batch_loss)
                
                if (batch % 10) == 0:
                    batch_progress.set_description(f'Epoch {epoch + 1}')
                    batch_progress.set_postfix(Batch=batch, Loss=batch_loss.numpy())
                batch_progress.update()
        
        epoch_progress.set_description(f'Epoch {epoch + 1}')
        epoch_progress.set_postfix(Loss=epoch_loss.result().numpy())
        epoch_progress.update()


### Let's generate a response for a given sentence

In [45]:
def listen(sentence):
    # Encode the sentence and pad it to the question length used in training.
    encoder_input = encode(sentence)
    encoder_input = tf.keras.preprocessing.sequence.pad_sequences([encoder_input], maxlen=questions.shape[1], padding='post')

    # Run the encoder on a batch of one sentence.
    encoder_state = tf.zeros((1, encoder.units))
    encoder_output, encoder_state = encoder(encoder_input, encoder_state)

    # The decoder starts from the context vector and the start token.
    decoder_state = encoder_state
    decoder_input = tf.expand_dims([start_token], 0)

    # Greedy decoding: feed each predicted token back in until the end token appears.
    predicted = []
    for step in range(answers.shape[1]):
        predictions, decoder_state = decoder(decoder_input, decoder_state)

        predicted_id = tf.argmax(predictions[0]).numpy()
        if predicted_id == end_token:
            break

        predicted.append(predicted_id)
        decoder_input = tf.expand_dims([predicted_id], 0)

    return decode(predicted)
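
`listen()` always picks the single highest-scoring token at each step, which is usually called greedy decoding. The control flow can be sketched on its own with a toy score table standing in for the decoder. Everything here (`greedy_decode`, `toy_table`, the ids) is hypothetical, and unlike the real decoder this toy carries no hidden state between steps.

```python
def greedy_decode(next_scores, start_id, end_id, max_steps):
    """Repeatedly pick the highest-scoring next token until end_id or max_steps."""
    current, generated = start_id, []
    for _ in range(max_steps):
        scores = next_scores(current)
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if predicted == end_id:
            break
        generated.append(predicted)
        current = predicted
    return generated

# Hypothetical score table: the entry for token i holds the scores for the token after i.
# ids: 0 = padding, 1 = start token, 2 = end token, 3 and 4 are ordinary tokens.
toy_table = {
    1: [0.0, 0.0, 0.1, 0.8, 0.1],
    3: [0.0, 0.0, 0.2, 0.1, 0.7],
    4: [0.0, 0.0, 0.9, 0.05, 0.05],
}
result = greedy_decode(toy_table.__getitem__, start_id=1, end_id=2, max_steps=10)
print(result)   # [3, 4]
```

The `max_steps` limit plays the same role as `answers.shape[1]` in `listen()`: it bounds the output length in case the end token is never predicted.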

In [46]:
listen('반갑습니다')

'저 도 배워 보고 싶어요 .'

In [47]:
listen('오늘 날씨가 좋네요')

'잘 전달 할 수 있을 거 예요 .'