# CS492 Special Topics in Computer Science <AI Industry and Smart Energy>
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

### 5-3. Text classification with an RNN

Let's build a text classification model with an RNN on the IMDB dataset for sentiment analysis.

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow_datasets as tfds
import tensorflow as tf

#### Setup input pipeline

The IMDB large movie review dataset is a binary classification dataset: every review has either a positive or a negative sentiment.

Download the dataset using [`TFDS`](https://www.tensorflow.org/datasets). The dataset comes with an inbuilt subword tokenizer.
- Args of [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load):
    - `name`: str, the registered name of the DatasetBuilder.
    - `with_info`: bool, if True, tfds.load will return the tuple (tf.data.Dataset, tfds.core.DatasetInfo) containing the info associated with the builder.
    - `as_supervised`: bool, if True, the returned tf.data.Dataset will have a 2-tuple structure (input, label) according to builder.info.supervised_keys. If False, the default, the returned tf.data.Dataset will have a dictionary with all the features.

In [None]:
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)

train_dataset, test_dataset = dataset['train'], dataset['test']

Because this is a subword tokenizer, it can encode any string, including words that are not in its vocabulary.
- Methods of [`tfds.features.text.SubwordTextEncoder`](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/SubwordTextEncoder#encode):
    - [`encode`](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/SubwordTextEncoder#encode): Encodes text into a list of integers.
    - [`decode`](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/SubwordTextEncoder#decode): Decodes a list of integers into text.

In [None]:
# The subword encoder ships with the dataset's 'text' feature.
tokenizer = info.features['text'].encoder

print('Vocabulary size: {}'.format(tokenizer.vocab_size))

In [None]:
sample_string = 'TensorFlow is cool.'

# Encode the sample string to integers
tokenized_string = tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

# Decode the encoded integers back to the string
original_string = tokenizer.decode(tokenized_string)
print('The original string: {}'.format(original_string))


assert original_string == sample_string

The tokenizer encodes the string by breaking it into subwords if the word is not in its dictionary.

In [None]:
for ts in tokenized_string:
    print ('{} ----> {}'.format(ts, tokenizer.decode([ts])))
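To make the idea of subword splitting concrete, here is a minimal toy sketch. It is not the algorithm used by `SubwordTextEncoder`; it simply splits a string greedily into the longest pieces found in a small vocabulary. The names `toy_subword_encode`, `toy_subword_decode`, and `toy_vocab` are hypothetical and exist only for this illustration.

```python
def toy_subword_encode(text, vocab):
    """Greedily split `text` into the longest pieces present in `vocab`."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError('no vocabulary entry covers {!r}'.format(text[pos]))
    return ids


def toy_subword_decode(ids, vocab):
    """Map ids back to their pieces and join them."""
    inverse = {index: piece for piece, index in vocab.items()}
    return ''.join(inverse[i] for i in ids)


# Hypothetical vocabulary: 'TensorFlow' is not an entry, so it is split into pieces.
toy_vocab = {'Tensor': 1, 'Fl': 2, 'ow': 3, ' is ': 4, 'cool': 5, '.': 6}

toy_ids = toy_subword_encode('TensorFlow is cool.', toy_vocab)
print(toy_ids)
print(toy_subword_decode(toy_ids, toy_vocab))
```

Because every piece maps back to its exact text, decoding the ids reproduces the original string, which is the same round-trip property the `assert` in the notebook checks for the real tokenizer.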

In [None]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

[`tf.data.Dataset.padded_batch`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#padded_batch): Combines consecutive elements of this dataset into padded batches.

In [None]:
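# Shuffle the training examples, then pad each batch to its longest review.
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, padded_shapes=([None], []))

test_dataset = test_dataset.padded_batch(BATCH_SIZE, padded_shapes=([None], []))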
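What padded batching does can be sketched in plain Python. This is a toy illustration under simplifying assumptions (lists instead of tensors, labels omitted), not how `tf.data` implements it; `toy_padded_batch` is a hypothetical helper.

```python
def toy_padded_batch(sequences, batch_size, pad_value=0):
    """Group consecutive sequences and pad each group to its longest member."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in chunk)
        # Append pad_value to the shorter sequences so the batch is rectangular.
        batches.append([seq + [pad_value] * (longest - len(seq)) for seq in chunk])
    return batches


toy_batches = toy_padded_batch([[5, 3], [7], [1, 2, 3, 4], [9, 9]], batch_size=2)
for batch in toy_batches:
    print(batch)
```

Note that the padding length is decided per batch, so different batches can have different sequence lengths.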

#### Create the model
Build a `tf.keras.Sequential` model and start with an _embedding layer_. An _embedding layer_ stores one vector per word. When called, **it converts the sequences of word indices to sequences of vectors**. **These vectors are trainable**. After training (on enough data), words with similar meanings often have similar vectors.
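Conceptually, an embedding layer is a lookup table indexed by token id. The sketch below mimics that lookup with plain Python lists; it is a toy illustration with untrained random vectors, and `make_toy_embedding` and `toy_embedding_lookup` are hypothetical names, not Keras functions.

```python
import random


def make_toy_embedding(vocab_size, dim, seed=0):
    """Create one small random vector per token id (the 'trainable' table)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def toy_embedding_lookup(table, index_batch):
    """Replace every token id in a batch of sequences with its vector."""
    return [[table[i] for i in sequence] for sequence in index_batch]


table = make_toy_embedding(vocab_size=10, dim=4)
vectors = toy_embedding_lookup(table, [[1, 2, 1], [3, 0, 0]])

# The batch of id sequences (2 x 3) becomes a batch of vector sequences (2 x 3 x 4).
print(len(vectors), len(vectors[0]), len(vectors[0][0]))
```

The same id always maps to the same vector, and training adjusts the entries of the table.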

A recurrent neural network (RNN) processes sequence input by iterating through the elements. At each timestep, the RNN feeds its output from the previous timestep back in as part of its input for the current one.
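The recurrence can be sketched with a single-unit cell in plain Python. This is a toy simple-RNN cell with hand-picked weights (`w_x`, `w_h`, and `bias` are assumptions for illustration), not an LSTM and not the Keras implementation.

```python
import math


def toy_rnn(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit recurrent cell over a sequence of numbers.

    Returns the list of hidden states, one per timestep.
    """
    state = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The new state depends on the current input and the previous state.
        state = math.tanh(w_x * x + w_h * state + bias)
        states.append(state)
    return states


forward_states = toy_rnn([1.0, 0.0, -1.0])
reversed_states = toy_rnn([-1.0, 0.0, 1.0])
print(forward_states)
print(reversed_states)
```

Feeding the same values in a different order gives a different final state, which is why an RNN can capture word order in a review.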


[`tf.keras.layers.LSTM`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/LSTM): A Long Short-Term Memory layer. Its `units` argument sets the dimensionality of the output (the number of hidden units).

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, Dense, LSTM

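# One possible architecture; the layer sizes (64) are assumed, not prescribed.
lstm_model = Sequential([
    Embedding(tokenizer.vocab_size, 64),  # token ids -> trainable vectors
    LSTM(64),                             # returns only the last output
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')        # probability of a positive review
])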

Compile the Keras model to configure the training process:

In [None]:
lstm_model.compile(loss='binary_crossentropy',
                   optimizer='adam',
                   metrics=['accuracy'])

#### Train the model

In [None]:
lstm_history = lstm_model.fit(train_dataset,
                              epochs=10,
                              validation_data=test_dataset)

In [None]:
from google.colab import drive

drive.mount('/gdrive')

In [None]:
import os

gdrive_root = '/gdrive/My Drive'
print('In gdrive :', os.listdir(gdrive_root))

notebook_dir = os.path.join(gdrive_root, 'Colab Notebooks')
print('In Colab Notebooks :', os.listdir(notebook_dir))

In [None]:
lstm_model = tf.keras.models.load_model(os.path.join(notebook_dir, 'lstm_model_10ep.h5'))

In [None]:
test_loss, test_acc = lstm_model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

The [`tf.keras.layers.Bidirectional`](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/Bidirectional) wrapper can also be used with an RNN layer. This propagates the **input forward and backwards through the RNN layer and then concatenates the output**. This helps the RNN to learn long-range dependencies.
- Example of a Bidirectional LSTM model
<img src="https://www.i2tutorials.com/wp-content/uploads/2019/05/Deep-Dive-into-Bidirectional-LSTM-i2tutorials.jpg">
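The forward/backward idea can be sketched with a toy one-unit recurrent cell in plain Python. This is an illustration only: the cell is a simple RNN rather than an LSTM, the weights are hand-picked assumptions, and `run_cell` and `toy_bidirectional` are hypothetical names.

```python
import math


def run_cell(inputs, w_x, w_h):
    """Run a one-unit recurrent cell and return the state at every timestep."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_x * x + w_h * state)
        states.append(state)
    return states


def toy_bidirectional(inputs, fwd=(0.5, 0.8), bwd=(0.3, 0.6)):
    """Read the sequence in both directions and concatenate the final outputs."""
    forward_last = run_cell(inputs, *fwd)[-1]
    # The backward cell has its own weights and reads the reversed sequence.
    backward_last = run_cell(list(reversed(inputs)), *bwd)[-1]
    return [forward_last, backward_last]


bi_output = toy_bidirectional([1.0, 0.0, -1.0])
print(bi_output)
```

Because the two outputs are concatenated, the wrapped layer's output is twice as wide as that of the layer it wraps.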

In [None]:
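# One possible architecture; the layer sizes (64) are assumed, not prescribed.
blstm_model = Sequential([
    Embedding(tokenizer.vocab_size, 64),
    Bidirectional(LSTM(64)),  # forward and backward outputs are concatenated
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])

blstm_model.compile(loss='binary_crossentropy',
                    optimizer='adam',
                    metrics=['accuracy'])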

In [None]:
blstm_history = blstm_model.fit(train_dataset, 
                                epochs=10,
                                validation_data=test_dataset)

In [None]:
blstm_model = tf.keras.models.load_model(os.path.join(notebook_dir, 'blstm_model_10ep.h5'))

In [None]:
test_loss, test_acc = blstm_model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

The above **model does not mask the padding applied to the sequences**. This can skew the results if we train on padded sequences and test on un-padded sequences. Ideally the model would learn to ignore the padding, but as you can see below, padding does have a small effect on the output.
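Why unmasked padding changes the output can be sketched with a toy one-unit recurrent cell. This is an illustration with hand-picked weights, not the Keras implementation; `toy_rnn_last_state` and `PAD_ID` are hypothetical names. In Keras, masking is typically switched on with the `mask_zero=True` argument of the `Embedding` layer.

```python
import math

PAD_ID = 0  # assumed padding id for this toy example


def toy_rnn_last_state(token_ids, mask_padding):
    """Return the final state of a one-unit recurrent cell over `token_ids`."""
    state = 0.0
    for token in token_ids:
        if mask_padding and token == PAD_ID:
            continue  # masked step: carry the previous state forward unchanged
        state = math.tanh(0.1 * token + 0.8 * state)
    return state


tokens = [3, 1, 2]
padded = tokens + [PAD_ID] * 4

unmasked_plain = toy_rnn_last_state(tokens, mask_padding=False)
unmasked_padded = toy_rnn_last_state(padded, mask_padding=False)
masked_padded = toy_rnn_last_state(padded, mask_padding=True)

print(unmasked_plain, unmasked_padded, masked_padded)
```

Without masking, every padded step still updates the state, so the padded and un-padded inputs end in different states. With masking, the padded input gives exactly the same result as the un-padded one.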

If the prediction is >= 0.5, the review is classified as positive; otherwise it is classified as negative.
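As a small sketch of this decision rule, the hypothetical helper below turns a predicted probability into a label. The probabilities are made-up example values, not model outputs.

```python
def sentiment_label(probability, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label."""
    return 'positive' if probability >= threshold else 'negative'


# Made-up probabilities; a prediction of exactly 0.5 counts as positive.
labels = [sentiment_label(p) for p in (0.91, 0.5, 0.12)]
print(labels)
```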

In [None]:
def pad_to_size(vec, size):
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec

In [None]:
def sample_predict(sentence, pad):
    # Encode the input sentence into subword ids.
    tokenized_sample_pred_text = tokenizer.encode(sentence)

    if pad:
        tokenized_sample_pred_text = pad_to_size(tokenized_sample_pred_text, 64)

    # Add a batch dimension of size 1 before predicting.
    predictions = blstm_model.predict(tf.expand_dims(tokenized_sample_pred_text, 0))

    return predictions

In [None]:
# predict on a sample text without padding.

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print (predictions)

In [None]:
# predict on a sample text with padding

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print (predictions)

In [None]:
import matplotlib.pyplot as plt

def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()

In [None]:
plot_graphs(blstm_history, 'accuracy')

#### Stack two or more LSTM layers
Keras recurrent layers have two available modes that are controlled by the **_return_sequences_** constructor argument:
- Return either the full sequences of successive outputs for each timestep (a 3D tensor of shape `(batch_size, timesteps, output_features)`).
- Return only the last output for each input sequence (a 2D tensor of shape `(batch_size, output_features)`).
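The two modes can be sketched with a toy one-unit recurrent layer in plain Python. This is an illustration of the output shapes only, with hand-picked weights; `toy_recurrent_layer` is a hypothetical name and not a Keras layer.

```python
import math


def toy_recurrent_layer(batch, return_sequences=False):
    """Toy recurrent layer with a single output feature per timestep."""
    outputs = []
    for sequence in batch:
        state, states = 0.0, []
        for x in sequence:
            state = math.tanh(0.5 * x + 0.8 * state)
            states.append([state])  # one output feature per timestep
        # Either keep every timestep or only the final one.
        outputs.append(states if return_sequences else states[-1])
    return outputs


batch = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # batch_size=2, timesteps=3
full = toy_recurrent_layer(batch, return_sequences=True)
last = toy_recurrent_layer(batch, return_sequences=False)

print(len(full), len(full[0]), len(full[0][0]))  # (batch_size, timesteps, output_features)
print(len(last), len(last[0]))                   # (batch_size, output_features)
```

A recurrent layer expects a sequence as input, so when stacking recurrent layers, every layer except the last one needs `return_sequences=True`.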

In [None]:
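# One possible stacked architecture; the layer sizes are assumed, not prescribed.
blstm_2_model = Sequential([
    Embedding(tokenizer.vocab_size, 64),
    Bidirectional(LSTM(64, return_sequences=True)),  # pass full sequences on
    Bidirectional(LSTM(32)),                         # last output only
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])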

In [None]:
blstm_2_model.compile(loss='binary_crossentropy',
                     optimizer='adam',
                     metrics=['accuracy'])

In [None]:
blstm_2_history = blstm_2_model.fit(train_dataset, epochs=10,
                           validation_data=test_dataset)

In [None]:
blstm_2_model = tf.keras.models.load_model(os.path.join(notebook_dir, 'blstm_2_model_10ep.h5'))

In [None]:
test_loss, test_acc = blstm_2_model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

In [None]:
plot_graphs(blstm_2_history, 'accuracy')

In [None]:
plot_graphs(blstm_2_history, 'loss')