# SEP532 Artificial Intelligence: Theory and Practice
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

### Text classification with an RNN

Let's build a text classification model with an RNN on the IMDB dataset for sentiment analysis.

In [None]:
import numpy as np

import tensorflow_datasets as tfds
import tensorflow as tf

#### Setup input pipeline

The IMDB large movie review dataset is a binary classification dataset—all the reviews have either a positive or negative sentiment.

Download the dataset using [`TFDS`](https://www.tensorflow.org/datasets).
- Args of [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load):
    - `name`: str, the registered name of the DatasetBuilder.
    - `with_info`: bool, if True, tfds.load will return the tuple (tf.data.Dataset, tfds.core.DatasetInfo) containing the info associated with the builder.
    - `as_supervised`: bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.

In [None]:
dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

train_dataset.element_spec

Initially this returns a dataset of (text, label) pairs:

In [None]:
for example, label in train_dataset.take(1):
    print('text: ', example.numpy())
    print('label: ', label.numpy())

Next shuffle the data for training and create batches of these (text, label) pairs:

In [None]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

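train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)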

In [None]:
for example, label in train_dataset.take(1):
    print('texts: ', example.numpy()[:3])
    print()
    print('labels: ', label.numpy()[:3])
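
To make the shuffle-then-batch step more concrete, here is a minimal plain-Python sketch of a buffered shuffle followed by batching. It is an illustration only, not how `tf.data` is implemented; `shuffle_with_buffer`, `batch`, and the toy `pairs` list are hypothetical names introduced for this example.

```python
import random

def shuffle_with_buffer(items, buffer_size, seed=0):
    """Yield items in random order using a fixed-size buffer (toy sketch)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Buffer is full: emit one random element to make room.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current  # final, possibly smaller, batch

# Toy (text, label) pairs standing in for the real dataset.
pairs = [(f"review {i}", i % 2) for i in range(10)]
batches = list(batch(shuffle_with_buffer(pairs, buffer_size=4), batch_size=4))
for b in batches:
    print(b)
```

In this sketch, a buffer smaller than the dataset only mixes elements that are close together in the input, so the larger the buffer, the closer the result is to a uniform shuffle.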

#### Create the text encoder
The raw text loaded by [`tfds`](https://www.tensorflow.org/datasets/api_docs/python/tfds) needs to be processed before it can be used in a model. The simplest way to process text for training is using the [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer. This layer has many capabilities, but this practice sticks to the default behavior.

Create the layer, and pass the dataset's text to the layer's `.adapt()` method:

In [None]:
VOCAB_SIZE = 1000
encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))

The `.adapt()` method sets the layer's vocabulary. Here are the first 20 tokens. After the padding and unknown tokens they're sorted by frequency:

In [None]:
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

Once the vocabulary is set, the layer can encode text into indices. The tensors of indices are 0-padded to the longest sequence in the batch (unless you set a fixed `output_sequence_length`):

In [None]:
encoded_example = encoder(example)[:3].numpy()
encoded_example

With the default settings, the process is not completely reversible. There are two main reasons for that:
1. The default value for `TextVectorization`'s `standardize` argument is "`lower_and_strip_punctuation`".
2. The limited vocabulary size and lack of character-based fallback result in some unknown tokens.

In [None]:
for n in range(3):
    print("Original: ", example[n].numpy())
    print("Round-trip: ", " ".join(vocab[encoded_example[n]]))
    print()
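
The standardization, frequency-sorted vocabulary, 0-padding, and lossy round trip described above can be sketched in plain Python. This is a rough imitation of the default behavior, not the `TextVectorization` implementation; `standardize`, `build_vocab`, `encode`, and the toy texts are hypothetical names and data used only for illustration.

```python
import re
import string
from collections import Counter

PAD, UNK = "", "[UNK]"  # index 0 = padding, index 1 = unknown token

def standardize(text):
    """Lowercase and strip punctuation, imitating 'lower_and_strip_punctuation'."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocab(texts, max_tokens):
    """Reserve two slots for PAD and UNK, then keep the most frequent words."""
    counts = Counter(word for t in texts for word in standardize(t).split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return [PAD, UNK] + most_common

def encode(texts, vocab):
    """Map words to indices and 0-pad to the longest sequence in the batch."""
    index = {w: i for i, w in enumerate(vocab)}
    rows = [[index.get(w, 1) for w in standardize(t).split()] for t in texts]
    width = max(len(r) for r in rows)
    return [r + [0] * (width - len(r)) for r in rows]

toy_texts = ["The movie was great!", "The movie was bad, the plot was thin."]
toy_vocab = build_vocab(toy_texts, max_tokens=5)
toy_encoded = encode(toy_texts, toy_vocab)
round_trip = [" ".join(toy_vocab[i] for i in row) for row in toy_encoded]

print(toy_vocab)
print(toy_encoded)
print(round_trip)
```

With only five vocabulary slots, rare words such as "great" or "plot" fall back to `[UNK]`, and the original casing and punctuation are gone, which is why the round trip cannot recover the input.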

#### Create the model
![A drawing of the information flow in the model](images/bidirectional.png)

Above is a diagram of the model. 

1. This model can be built as a `tf.keras.Sequential`.
2. The first layer is the `encoder`, which converts the text to a sequence of token indices.
3. After the encoder is an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors. This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a `tf.keras.layers.Dense` layer.
4. A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep. The `tf.keras.layers.Bidirectional` wrapper can also be used with an RNN layer. This propagates the input forward and backwards through the RNN layer and then concatenates the final output. 
    - The main advantage of a bidirectional RNN is that the signal from the beginning of the input doesn't need to be processed all the way through every timestep to affect the output.  
    - The main disadvantage of a bidirectional RNN is that you can't efficiently stream predictions as words are being added to the end.
5. After the RNN has converted the sequence to a single vector, the two `layers.Dense` layers do some final processing and convert this vector representation to a single logit as the classification output.
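
The claim in step 3, that an index lookup is equivalent to multiplying a one-hot vector by a weight matrix, can be checked with a small plain-Python sketch. The table values and the function names `embed_by_lookup` and `embed_by_one_hot` are made up for this illustration.

```python
# Toy embedding table: one 3-dimensional vector per vocabulary index.
embedding_table = [
    [0.0, 0.0, 0.0],   # index 0: padding
    [0.1, 0.2, 0.3],   # index 1: unknown token
    [0.5, -0.4, 0.9],  # index 2
    [-0.7, 0.8, 0.6],  # index 3
]

def embed_by_lookup(indices):
    """Pick one row per index: cost is independent of vocabulary size."""
    return [embedding_table[i] for i in indices]

def embed_by_one_hot(indices):
    """Build a one-hot vector per index and multiply it by the table."""
    vocab_size = len(embedding_table)
    dim = len(embedding_table[0])
    vectors = []
    for i in indices:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab_size)]
        vectors.append([
            sum(one_hot[j] * embedding_table[j][d] for j in range(vocab_size))
            for d in range(dim)
        ])
    return vectors

sequence = [2, 3, 1, 0]
same = embed_by_lookup(sequence) == embed_by_one_hot(sequence)
print(same)
```

Both functions return the same vectors, but the one-hot version does work proportional to the vocabulary size for every token, while the lookup reads a single row.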

The code to implement this is below:

In [None]:
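# Layer sizes (64) and mask_zero=True are assumed values, not prescribed by the text above.
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1),  # single logit
])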
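
As a rough illustration of what the bidirectional wrapper in step 4 does, the sketch below runs a toy scalar tanh RNN once forward and once over the reversed input, then concatenates the two final outputs. It is a simplification under stated assumptions: Keras wraps an LSTM with vector-valued states, and `simple_rnn`, `bidirectional_last`, and the weight values here are hypothetical.

```python
import math

def simple_rnn(xs, w_x, w_h):
    """Run a scalar tanh RNN over xs and return the output at every timestep."""
    h, outputs = 0.0, []
    for x in xs:
        # The previous output h feeds into the next timestep.
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
    return outputs

def bidirectional_last(xs, fwd_weights, bwd_weights):
    """Concatenate the final forward output with the final backward output."""
    forward_last = simple_rnn(xs, *fwd_weights)[-1]
    backward_last = simple_rnn(xs[::-1], *bwd_weights)[-1]
    return [forward_last, backward_last]

inputs = [1.0, -0.5, 0.25, 2.0]
features = bidirectional_last(inputs, fwd_weights=(0.8, 0.5), bwd_weights=(0.6, 0.4))
print(features)
```

The backward pass sees the first input element last, so that element reaches the output after a single step instead of passing through every timestep.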

Compile the Keras model to configure the training process:

In [None]:
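# The model outputs a logit, so use from_logits=True and a 0.0 accuracy threshold; 'adam' is an assumed optimizer.
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), optimizer='adam',
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0, name='accuracy')])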

#### Train the model

In [None]:
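# epochs=10 is an assumed value; validation_data is needed for the 'val_' curves plotted below.
history = model.fit(train_dataset, epochs=10, validation_data=test_dataset)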

In [None]:
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history['val_'+metric], '')
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_'+metric])

plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

Run a prediction on a new sentence. The model outputs a logit, so a prediction >= `0.0` is positive; otherwise it is negative.

In [None]:
sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(tf.constant([sample_text]))
print(predictions)
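
The `0.0` threshold follows from the output being a logit: applying a sigmoid to it gives a probability, and a logit of `0.0` corresponds to a probability of exactly 0.5. A minimal sketch, where `sigmoid` and `label_from_logit` are hypothetical helper names and the logit values are made up:

```python
import math

def sigmoid(logit):
    """Convert a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def label_from_logit(logit):
    """Threshold the raw logit at 0.0, which equals thresholding the probability at 0.5."""
    return "positive" if logit >= 0.0 else "negative"

for logit in (-2.0, 0.0, 1.5):
    print(logit, round(sigmoid(logit), 3), label_from_logit(logit))
```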

#### Stack two or more LSTM layers

Keras recurrent layers have two available modes that are controlled by the `return_sequences` constructor argument:

* If `False`, it returns only the last output for each input sequence (a 2D tensor of shape `(batch_size, output_features)`). This is the default, used in the previous model.
* If `True`, the full sequence of successive outputs for each timestep is returned (a 3D tensor of shape `(batch_size, timesteps, output_features)`).
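
The two output shapes can be illustrated with a toy RNN written in plain Python with nested lists. This is a sketch under simplifying assumptions (fixed constant weights, a tanh cell instead of an LSTM); `rnn_layer` and `shape` are hypothetical helpers, not Keras functions.

```python
import math

def rnn_layer(batch, units, return_sequences=False):
    """Toy tanh RNN over a batch of sequences of feature vectors.

    batch: list (batch_size) of lists (timesteps) of lists (features).
    Weights are fixed constants so the example is deterministic.
    """
    results = []
    for sequence in batch:
        h = [0.0] * units
        outputs = []
        for x in sequence:
            drive = 0.1 * sum(x)  # same input weight for every feature
            h = [math.tanh(drive + 0.5 * h[u] + 0.01 * u) for u in range(units)]
            outputs.append(h)
        # Keep every timestep, or only the last one.
        results.append(outputs if return_sequences else outputs[-1])
    return results

def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

inputs = [[[1.0, 0.0, 2.0]] * 5 for _ in range(4)]  # batch=4, timesteps=5, features=3
full = rnn_layer(inputs, units=2, return_sequences=True)
last = rnn_layer(inputs, units=2)
stacked = rnn_layer(full, units=3)  # a second layer consumes the 3D output

print(shape(inputs), shape(full), shape(last), shape(stacked))
```

Because `full` keeps the timestep axis, it has the same layout as the input and can feed a second recurrent layer, whereas `last` has lost that axis.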

Here is what the flow of information looks like with `return_sequences=True`:

![layered_bidirectional](images/layered-bidirectional.png)

The interesting thing about using an RNN with `return_sequences=True` is that the output still has three axes, like the input, so it can be passed to another RNN layer, like this:

In [None]:
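# Layer sizes (64, 32) and mask_zero=True are assumed values; the first LSTM returns full sequences for the second.
model_stacked = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1),  # single logit
])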

In [None]:
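# The model outputs a logit, so use from_logits=True and a 0.0 accuracy threshold; 'adam' is an assumed optimizer.
model_stacked.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), optimizer='adam',
                      metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0, name='accuracy')])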

In [None]:
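# epochs=10 is an assumed value; validation_data is needed for the 'val_' curves plotted below.
history = model_stacked.fit(train_dataset, epochs=10, validation_data=test_dataset)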

In [None]:
test_loss, test_acc = model_stacked.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

In [None]:
sample_text = ('The movie was not good. The animation and the graphics '
               'were terrible. I would not recommend this movie.')
predictions = model_stacked.predict(tf.constant([sample_text]))
print(predictions)

In [None]:
plt.figure(figsize=(16, 6))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')