# SEP532 Artificial Intelligence: Theory and Practice
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

## Advanced Models
### Sequence to Sequence (seq2seq)

Today we will learn a neural network architecture called sequence-to-sequence (seq2seq), which is used for language-processing tasks such as machine translation, image captioning, conversational models, and text summarization.

These applications rest on a simple but powerful idea: two recurrent neural networks work together to transform one sequence into another. An encoder network condenses the input sequence into a single vector, and a decoder network unfolds that vector into a new sequence.

![Sequence to sequence model](images/seq2seq.png)

#### Encoder
- A stack of several recurrent units (LSTM or GRU cells for better performance), each of which accepts a single element of the input sequence, collects information about that element, and propagates it forward.
- In a question-answering problem, the input sequence is the collection of all words in the question. Each word is represented as $x_i$, where $i$ is the position of that word.

#### Context Vector
- This is the final hidden state produced by the encoder part of the model, that is, the result of applying the recurrent update $h_t = f(h_{t-1}, x_t)$ to every element of the input sequence in turn.
- This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions.
- It acts as the initial hidden state of the decoder part of the model.
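
The way an encoder arrives at a context vector can be sketched without any deep learning framework. The snippet below is only an illustration under stated assumptions: it uses a plain tanh recurrence instead of an LSTM or GRU cell, random weights instead of trained ones, and the names `encode_sequence`, `w_xh`, and `w_hh` are hypothetical.

```python
import numpy as np

def encode_sequence(inputs, w_xh, w_hh):
    """Run a plain tanh RNN over `inputs` and return the final hidden state."""
    hidden = np.zeros(w_hh.shape[0])
    for x_t in inputs:  # one input element per time step
        hidden = np.tanh(w_xh @ x_t + w_hh @ hidden)
    return hidden  # the final hidden state plays the role of the context vector

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
w_xh = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights (random, untrained)
w_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights (random, untrained)

sequence = rng.normal(size=(5, input_dim))  # 5 time steps of 4-dimensional inputs
context = encode_sequence(sequence, w_xh, w_hh)
print(context.shape)  # (3,)
```

Whatever the length of the input sequence, the context vector always has `hidden_dim` entries, which is why it can be handed to the decoder as a fixed-size initial state.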

#### Decoder
- A stack of several recurrent units, each of which predicts an output $y_t$ at time step $t$.
- Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
- In a question-answering problem, the output sequence is the collection of all words in the answer. Each word is represented as $y_i$, where $i$ is the position of that word.
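
The decoder bullets above can be illustrated with a small greedy decoding loop. This is a rough sketch, not the model built later in this notebook: it assumes a plain tanh recurrence, random untrained weights, and hypothetical names (`greedy_decode`, `start_id`, `end_id`), and at every step it simply picks the most probable token.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of scores into probabilities that sum to 1."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def greedy_decode(context, embeddings, w_xh, w_hh, w_hy, start_id, end_id, max_len):
    """Unfold `context` into token ids, one token per time step."""
    hidden = context   # the context vector is the decoder's initial hidden state
    token = start_id   # decoding starts from a special start token
    output = []
    for _ in range(max_len):
        hidden = np.tanh(w_xh @ embeddings[token] + w_hh @ hidden)
        probs = softmax(w_hy @ hidden)   # distribution over the vocabulary
        token = int(np.argmax(probs))    # greedy choice: most probable token
        if token == end_id:
            break
        output.append(token)
    return output

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 6, 4, 3
embeddings = rng.normal(size=(vocab_size, embed_dim))
w_xh = rng.normal(size=(hidden_dim, embed_dim))
w_hh = rng.normal(size=(hidden_dim, hidden_dim))
w_hy = rng.normal(size=(vocab_size, hidden_dim))
context = np.zeros(hidden_dim)  # stand-in for an encoder's final hidden state

tokens = greedy_decode(context, embeddings, w_xh, w_hh, w_hy,
                       start_id=0, end_id=1, max_len=10)
print(tokens)
```

With untrained weights the generated ids are meaningless. The point is only the control flow: each step consumes the previous token and hidden state and produces a new token and hidden state.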

### Chatbot

In [None]:
import tensorflow as tf
import pandas as pd

### Load and preprocess the dataset

We'll use a Korean chatbot dataset provided by https://github.com/songys/Chatbot_data. This dataset contains almost 12,000 question-answer pairs and is distributed in CSV format.

In [None]:
!wget --output-document="chatbot.csv" https://raw.githubusercontent.com/songys/Chatbot_data/master/ChatbotData.csv

This dataset consists of three columns:
- `Q`: question sentence
- `A`: answer sentence
- `label`: sentiment label

Among these columns, we are going to use only `Q` and `A`. Let's load this dataset using `pandas.read_csv()`. For readability, we will rename the `Q` and `A` columns to `question` and `answer` respectively.

In [None]:
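# Load only the Q and A columns, then rename them for readability.
df = pd.read_csv('chatbot.csv', usecols=['Q', 'A']).rename(columns={'Q': 'question', 'A': 'answer'})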
df.head(10)

Now the dataset is loaded. To feed it into a neural network, we have to vectorize it. For that, we need a tokenizer that splits a sentence into tokens and assigns an integer index to each token.

There are several ways to build a tokenizer for Korean. In this notebook, we will use the following two methods:
- A subword tokenizer
- A tokenizer based on part-of-speech tagger
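
Before turning to these two methods, the general idea of "split into tokens, then assign an index to each token" can be sketched in a few lines. This toy version splits on whitespace, which is only a stand-in for the subword and morpheme tokenizers used below, and the helper names `build_vocab` and `to_ids` are hypothetical.

```python
def build_vocab(sentences):
    """Map every distinct whitespace-separated token to an integer index."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # index 0 is left free for padding
    return vocab

def to_ids(sentence, vocab):
    """Replace each token of `sentence` by its index in `vocab`."""
    return [vocab[token] for token in sentence.split()]

sentences = ["nice to meet you", "nice weather today"]
vocab = build_vocab(sentences)
print(to_ids("nice weather", vocab))  # [1, 5]
```

A real tokenizer also has to deal with tokens that never appeared in the corpus, which is one motivation for working with subwords or morphemes instead of whole words.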

#### Build the subword tokenizer
![Subword](images/subword.png)

Let's build a subword tokenizer that splits the given texts into subwords and transforms the subword tokens into integer vectors. To do that, we are going to use `SubwordTextEncoder`.

In [None]:
corpus = pd.concat([df['question'], df['answer']], ignore_index=True)

In [None]:
import tensorflow_datasets as tfds
SubwordTextEncoder = tfds.deprecated.text.SubwordTextEncoder

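# Learn a subword vocabulary from the corpus (the target size 2**13 is an assumed value).
tokenizer = SubwordTextEncoder.build_from_corpus(corpus, target_vocab_size=2**13)

In [None]:
# Reserve two ids just past the subword vocabulary for the start and end markers.
start_token = tokenizer.vocab_size
end_token = tokenizer.vocab_size + 1

def encode(text):
    # Wrap the subword ids with the start and end markers.
    return [start_token] + tokenizer.encode(text) + [end_token]

def decode(tokens):
    # Drop padding (0) and the marker ids, which the tokenizer itself does not know.
    return tokenizer.decode([token for token in tokens if 0 < token < tokenizer.vocab_size])

In [None]:
# Subword vocabulary plus the start and end markers.
number_of_words = tokenizer.vocab_size + 2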

#### Build the tokenizer based on part-of-speech tagger
![Part-of-speech](images/part-of-speech.png)

Let's build a tokenizer based on a part-of-speech tagger that splits the given texts into morphemes and transforms the morpheme tokens into integer vectors. To do that, we are going to use the `konlpy` library, which provides several part-of-speech taggers.

In [None]:
!pip install konlpy

In [None]:
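from konlpy.tag import Okt

okt = Okt()
def preprocess_sentence(text):
    # Split the sentence into morphemes and rejoin them with spaces.
    return ' '.join(okt.morphs(text))

corpus = pd.concat([df['question'], df['answer']], ignore_index=True)
corpus = corpus.apply(preprocess_sentence)

In [None]:
# One possible choice (an assumption): the Keras word-level tokenizer, with filtering disabled.
tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', lower=False)
tokenizer.fit_on_texts(corpus)

In [None]:
# Index 0 is reserved for padding, so the markers go right after the last word index.
start_token = len(tokenizer.word_index) + 1
end_token = len(tokenizer.word_index) + 2

def encode(text):
    ids = tokenizer.texts_to_sequences([preprocess_sentence(text)])[0]
    return [start_token] + ids + [end_token]

def decode(tokens):
    # Keep only ids that correspond to real morphemes.
    return ' '.join(tokenizer.index_word[token] for token in tokens if token in tokenizer.index_word)

In [None]:
# Padding id 0, the morphemes, and the start and end markers.
number_of_words = len(tokenizer.word_index) + 3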

Now that the tokenizer and the `encode()` and `decode()` helpers are built, let's vectorize the dataset.

In [None]:
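# Encode every sentence, then pad with 0 at the end to a common length.
questions = [encode(text) for text in df['question']]
questions = tf.keras.preprocessing.sequence.pad_sequences(questions, padding='post')

answers = [encode(text) for text in df['answer']]
answers = tf.keras.preprocessing.sequence.pad_sequences(answers, padding='post')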
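
Encoded sentences have different lengths, so they generally have to be padded to a common length before they can be stacked into a single array (the later use of `questions.shape[0]` presupposes such an array). A minimal sketch of post-padding follows. The helper name `pad_post` and the choice of 0 as the padding id are assumptions made for illustration.

```python
import numpy as np

def pad_post(sequences, pad_id=0):
    """Right-pad every sequence with `pad_id` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

# Hypothetical encoded sentences: 7 = start marker, 8 = end marker.
encoded = [[7, 3, 8], [7, 5, 2, 4, 8], [7, 8]]
padded = pad_post(encoded)
print(padded.shape)  # (3, 5)
```

Padding at the end rather than the beginning keeps the start marker in column 0 of every row, which is convenient when the decoder is fed one column at a time.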

Using the methods of `tf.data.Dataset`, shuffle the dataset and split it into batches.

In [None]:
batch_size = 128
number_of_dataset = questions.shape[0]

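dataset = tf.data.Dataset.from_tensor_slices((questions, answers))
# drop_remainder=True keeps every batch at exactly batch_size examples.
dataset = dataset.shuffle(number_of_dataset).batch(batch_size, drop_remainder=True)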

### Define seq2seq model

Now it is time to build the encoder and decoder models. Because these models are not provided by TensorFlow and Keras by default, we need to define them manually by subclassing `tf.keras.Model`.

The `Encoder` model takes an input vector and produces a context vector that summarizes the whole input sequence. To do that, we need the following layers:

- `tf.keras.layers.Embedding`
- `tf.keras.layers.GRU`

In [None]:
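class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_state=True makes the GRU return (last output, final state).
        self.gru = tf.keras.layers.GRU(units, return_state=True)

    def call(self, encoder_input, encoder_state):
        # encoder_input = (batch_size, length)
        # encoder_state = (batch_size, units)

        # encoder_input = (batch_size, length, embedding_dim)
        encoder_input = self.embedding(encoder_input)

        # encoder_output = (batch_size, units)
        # encoder_state = (batch_size, units)
        encoder_output, encoder_state = self.gru(encoder_input, initial_state=encoder_state)

        return encoder_output, encoder_state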

The `Decoder` model takes the context vector from the `Encoder` and predicts the next word given the previous words. In other words, the `Decoder` model calculates this conditional probability: $P(\text{word}_{t + 1} \mid \text{context}, \text{word}_1, \text{word}_2, \dots, \text{word}_t)$
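
Multiplying these per-step probabilities gives the probability of a whole answer. As a sketch, with $T$ denoting the number of words in the answer (a symbol introduced here for illustration), the chain rule of probability gives:

$$
P(\text{word}_1, \dots, \text{word}_T \mid \text{context}) = \prod_{t=0}^{T-1} P(\text{word}_{t + 1} \mid \text{context}, \text{word}_1, \dots, \text{word}_t)
$$

Minimizing the cross-entropy at every time step is therefore the same as maximizing the log of this product for the answers in the training data.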

To do that, we need the following layers:

- `tf.keras.layers.Embedding`
- `tf.keras.layers.GRU`
- `tf.keras.layers.Dense`

In [None]:
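class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_state=True)
        # No activation: the layer outputs logits, matching from_logits=True in the loss.
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, decoder_input, decoder_state):
        # decoder_input = (batch_size, 1)
        # decoder_state = (batch_size, units)

        # decoder_input = (batch_size, 1, embedding_dim)
        decoder_input = self.embedding(decoder_input)

        # decoder_output = (batch_size, units)
        # decoder_state = (batch_size, units)
        decoder_output, decoder_state = self.gru(decoder_input, initial_state=decoder_state)

        # decoder_output = (batch_size, vocab_size)
        decoder_output = self.dense(decoder_output)

        return decoder_output, decoder_state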

Once both the encoder and decoder are defined, we can instantiate them like normal Python classes.

In [None]:
embedding_dim = 256
units = 1024

encoder = Encoder(number_of_words, embedding_dim, units)
decoder = Decoder(number_of_words, embedding_dim, units)

### Define the loss and optimizer

Let's define the loss function and the optimizer for the seq2seq model. Because the dataset consists of sentences of various lengths, shorter sentences contain padding positions that carry no information, and we need to take this into account when calculating the loss. Otherwise, the loss will be much greater than expected. To do that, we create a `mask` matrix and discard the unnecessary values.
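
The masking idea can be sketched with NumPy alone. The snippet below is an illustration rather than the loss used in this notebook: it assumes that padding positions carry the id 0, it averages over the real tokens only (one of several reasonable normalizations), and the name `masked_cross_entropy` is hypothetical.

```python
import numpy as np

def masked_cross_entropy(targets, logits, pad_id=0):
    """Mean cross-entropy over the non-padding positions only."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the target token at every position.
    nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    # 1.0 for real tokens, 0.0 for padding.
    mask = (targets != pad_id).astype(nll.dtype)
    return (nll * mask).sum() / mask.sum()

targets = np.array([[2, 1, 0, 0]])  # two real tokens followed by two padding positions
logits = np.zeros((1, 4, 3))        # a uniform prediction over a 3-token vocabulary
loss = masked_cross_entropy(targets, logits)
print(loss)  # log(3), about 1.0986
```

Because of the mask, whatever the model predicts at the padding positions has no effect on the result.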

In [None]:
optimizer = tf.keras.optimizers.Adam()

_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
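def calculate_loss(actual, predicted):
    # Padding positions are assumed to have id 0; zero out their loss.
    mask = tf.cast(tf.math.not_equal(actual, 0), predicted.dtype)
    return tf.reduce_mean(_loss(actual, predicted) * mask)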

### Train seq2seq model using `tf.GradientTape`

In this notebook, rather than using `tf.keras.Model.fit()`, we will write the training loop manually using `tf.GradientTape`.

In [None]:
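def train_step(encoder_input, decoder_target):
    loss = 0
    with tf.GradientTape() as tape:
        # Encode the questions, starting from an all-zero state.
        encoder_state = tf.zeros((encoder_input.shape[0], units))
        _, encoder_state = encoder(encoder_input, encoder_state)

        decoder_state = encoder_state
        # Teacher forcing (an assumed choice): feed the true token at step t, predict step t + 1.
        for t in range(decoder_target.shape[1] - 1):
            decoder_input = tf.expand_dims(decoder_target[:, t], 1)
            predictions, decoder_state = decoder(decoder_input, decoder_state)
            loss += calculate_loss(decoder_target[:, t + 1], predictions)

    batch_loss = loss / int(decoder_target.shape[1])

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

    return batch_loss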
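
One common way to organize the body of such a training step, assumed here rather than prescribed by this notebook, is teacher forcing: at every time step the decoder receives the true previous token and is asked to predict the next one. The alignment between decoder inputs and expected outputs can be sketched as follows. The helper name `teacher_forcing_pairs` and the token ids are hypothetical.

```python
def teacher_forcing_pairs(target):
    """Pair each decoder input token with the token it should predict next."""
    return list(zip(target[:-1], target[1:]))

# Hypothetical encoded answer: 1 = start marker, 15 and 27 = words, 2 = end marker.
target = [1, 15, 27, 2]
for step, (decoder_input, expected) in enumerate(teacher_forcing_pairs(target)):
    print(step, decoder_input, expected)
```

A target of length $n$ therefore yields $n - 1$ prediction steps: the start marker is only ever an input, and the end marker is only ever an expected output.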

In [None]:
from tqdm.auto import tqdm

epochs = 10

epoch_loss = tf.keras.metrics.Mean()
with tqdm(total=epochs) as epoch_progress:
    for epoch in range(epochs):
        epoch_loss.reset_states()

        with tqdm(total=number_of_dataset // batch_size) as batch_progress:
            for batch, (encoder_input, decoder_target) in enumerate(dataset):
                batch_loss = train_step(encoder_input, decoder_target)
                epoch_loss(batch_loss)
                
                if (batch % 10) == 0:
                    batch_progress.set_description(f'Epoch {epoch + 1}')
                    batch_progress.set_postfix(Batch=batch, Loss=batch_loss.numpy())
                batch_progress.update()
        
        epoch_progress.set_description(f'Epoch {epoch + 1}')
        epoch_progress.set_postfix(Loss=epoch_loss.result().numpy())
        epoch_progress.update()

### Let's generate a response for a given sentence

In [None]:
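def listen(sentence):
    encoder_input = encode(sentence)
    encoder_input = tf.constant([encoder_input])  # add the batch axis: (1, length)

    encoder_state = tf.zeros((1, units))
    encoder_output, encoder_state = encoder(encoder_input, encoder_state)

    decoder_state = encoder_state
    decoder_input = tf.constant([[start_token]])

    predicted = []
    for step in range(answers.shape[1]):
        predictions, decoder_state = decoder(decoder_input, decoder_state)

        # Greedy choice: take the most probable token at every step.
        predicted_id = int(tf.argmax(predictions[0]).numpy())
        if predicted_id == end_token:
            break

        predicted.append(predicted_id)
        decoder_input = tf.constant([[predicted_id]])

    return decode(predicted)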

In [None]:
listen('반갑습니다')  # "Nice to meet you"

In [None]:
listen('오늘 날씨가 좋네요')  # "The weather is nice today"