<a href="https://colab.research.google.com/github/nyp-sit/iti107/blob/main/session-6/text_classification_rnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification using RNN
In this lab exercise, we will use an LSTM (a variant of the RNN) to train a model that classifies a piece of text as expressing positive or negative sentiment.

## Setup

In [None]:
import os
import shutil
import tensorflow as tf
import tensorflow.keras as keras
from datetime import datetime

### Download the IMDb Dataset
You will use the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/). You will train a sentiment classifier model on this dataset.

In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)

Take a look at the `train/` directory. It has `pos` and `neg` folders with movie reviews labelled as positive and negative respectively. You will use reviews from `pos` and `neg` folders to train a binary classification model.

In [None]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

The `train` directory also contains an `unsup` folder of unlabelled reviews. Remove it before creating the training dataset; otherwise it would be picked up as a third class.

In [None]:
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

Next, create a `tf.data.Dataset` using `tf.keras.preprocessing.text_dataset_from_directory`. You can read more about this utility in the [API documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory).

Use the `train` directory to create both the training and validation datasets, holding out 20% of the data for validation. Note that we use a smaller batch size of 128 here: the model is now more complex and uses a significant amount of memory, leaving little room for a larger batch size.

***Important note***

When using the `validation_split` and `subset` arguments, make sure to either specify a random seed or pass `shuffle=False`, so that the validation and training splits do not overlap.
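To see why the seed matters, here is a minimal pure-Python sketch of a seeded split. It only imitates the idea (shuffle once, then slice) and is not the Keras implementation; `split_indices` is a hypothetical helper introduced for illustration.

```python
import random


def split_indices(n_samples, validation_split, subset, seed):
    """Hypothetical helper: shuffle indices with a seed, then return one subset."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * validation_split)
    if subset == "training":
        return indices[: n_samples - n_val]
    return indices[n_samples - n_val:]


# Same seed for both calls: the two subsets are complementary slices
# of the same shuffled order, so they cannot overlap.
train_idx = split_indices(100, 0.2, "training", seed=123)
val_idx = split_indices(100, 0.2, "validation", seed=123)
overlap_same_seed = set(train_idx) & set(val_idx)

# Different seeds: each call shuffles differently, so some samples
# end up in both the training and the validation subset.
val_other = split_indices(100, 0.2, "validation", seed=456)
overlap_diff_seed = set(train_idx) & set(val_other)

print(len(train_idx), len(val_idx), len(overlap_same_seed))
```

With overlapping splits, validation accuracy would be measured partly on reviews the model has already trained on, which makes it look better than it really is.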

In [None]:
batch_size = 128
seed = 123
train_ds = keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2, 
    subset='training', seed=seed)
val_ds = keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2, 
    subset='validation', seed=seed)
test_ds = keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test', 
    batch_size=batch_size)

Take a look at a few movie reviews and their labels `(1: positive, 0: negative)` from the train dataset.


In [None]:
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(label_batch[i].numpy(), text_batch.numpy()[i])

In [None]:
train_ds = train_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)

## Text preprocessing

Next, define the dataset preprocessing steps required for your sentiment classification model. Initialize a TextVectorization layer with the desired parameters to vectorize movie reviews. 

The TextVectorization layer is a text tokenizer that breaks the text up into words and maps each word to an integer index (it is similar to the Keras Tokenizer, but implemented as a layer). You can read more about the TextVectorization layer [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization).
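As a rough picture of what the layer does with its default settings, here is a pure-Python sketch. It is an approximation under stated assumptions, not the layer's actual code: the helper names (`tokenize`, `build_vocab`, `vectorize`) are hypothetical, and the punctuation handling only approximates the layer's default standardization. It does follow the layer's convention of reserving index 0 for padding and index 1 for out-of-vocabulary words.

```python
import string
from collections import Counter

PAD_ID = 0  # reserved for padding
OOV_ID = 1  # reserved for out-of-vocabulary words


def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_vocab(texts, max_tokens):
    """Keep the most frequent words, leaving room for the two reserved ids."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    kept = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: idx + 2 for idx, word in enumerate(kept)}


def vectorize(text, vocab, sequence_length):
    """Map words to ids, then truncate or pad to a fixed length."""
    ids = [vocab.get(tok, OOV_ID) for tok in tokenize(text)][:sequence_length]
    return ids + [PAD_ID] * (sequence_length - len(ids))


reviews = ["The movie was great!", "The movie was terrible."]
vocab = build_vocab(reviews, max_tokens=6)
encoded = vectorize("The movie was awful", vocab, sequence_length=6)
print(encoded)  # [2, 3, 4, 1, 0, 0]
```

The unseen word "awful" maps to the out-of-vocabulary id, and the short review is padded with zeros up to the fixed length.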

There are two ways to use TextVectorization: 
1. as part of the `tf.data` pipeline
2. as part of the model (i.e. as a layer in the model)

If you are training on a GPU, option 1 is the better choice: text vectorization does not run on the GPU, so placing it in the `tf.data` pipeline lets the CPU preprocess the next batch asynchronously while the GPU runs the model on the current batch. This leads to better training throughput.

However, if you are exporting the model for inference in a production environment, you want to package the TextVectorization layer as part of the model, so that the model is self-contained and no separate preprocessing code has to be deployed.

In the code below, we will use option 1 for pre-processing text.

In [None]:
# Vocabulary size and number of words in a sequence.
VOCAB_SIZE = 10000
MAX_SEQUENCE_LENGTH = 200

# Use the text vectorization layer to normalize, split, and map strings to
# integers.
# Set output_sequence_length because the samples are not all the same length.

vectorize_layer = keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE, 
    output_sequence_length=MAX_SEQUENCE_LENGTH
)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)

In [None]:
print(len(vectorize_layer.get_vocabulary()))

In [None]:
int_train_ds = train_ds.map(
    lambda x, y: (vectorize_layer(x), y),
    num_parallel_calls=4)

int_val_ds = val_ds.map(
    lambda x, y: (vectorize_layer(x), y),
    num_parallel_calls=4)

int_test_ds = test_ds.map(
    lambda x, y: (vectorize_layer(x), y),
    num_parallel_calls=4)

## Create a classification model

<img src="https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/it3103/bidirectionalRNN.png"/>

Above is a diagram of the model. 

1. This model can be built as a `tf.keras.Sequential`.

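2. In the diagram, the first layer is the vectorization layer, which converts the text to a sequence of token indices. In this lab that step runs in the `tf.data` pipeline instead, so the model defined below takes token indices as its input.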

3. After the vectorization layer is an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.

  This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a `tf.keras.layers.Dense` layer.

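4. We use masking to hide the padded positions (index 0), so that the recurrent layer skips them instead of treating the padding as real input. In the code below this is enabled with `mask_zero=True` on the embedding layer rather than with a separate masking layer.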

5. A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.

  The `tf.keras.layers.Bidirectional` wrapper can also be used with an RNN layer. It runs the input through the RNN layer both forwards and backwards, and then concatenates the final outputs of the two directions.

  * The main advantage of a bidirectional RNN is that the signal from the beginning of the input doesn't need to be processed all the way through every timestep to affect the output.  

  * The main disadvantage of a bidirectional RNN is that you can't efficiently stream predictions as words are being added to the end.

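6. After the RNN has converted the sequence to a single vector, the two `layers.Dense` layers do some final processing and convert this vector representation to a single classification output. Because the last layer uses a sigmoid activation, that output is a probability between 0 and 1 rather than a raw logit.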
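The claim in step 3, that an index lookup gives the same result as passing a one-hot vector through a dense layer, can be checked with a small pure-Python sketch. The matrix values are made up for illustration and `one_hot` and `vector_times_matrix` are hypothetical helpers; the last line shows the kind of mask that zero-masking implies for padded positions.

```python
def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Multiply a length-V vector by a V x D matrix, giving a length-D vector."""
    n_cols = len(matrix[0])
    return [
        sum(vector[row] * matrix[row][col] for row in range(len(matrix)))
        for col in range(n_cols)
    ]


# A tiny 4-word vocabulary with 2-dimensional embeddings (made-up values).
embedding_matrix = [
    [0.0, 0.0],   # id 0: padding
    [0.1, -0.2],  # id 1
    [0.5, 0.3],   # id 2
    [-0.4, 0.9],  # id 3
]
token_ids = [2, 3, 0]

# Embedding lookup: simply pick the row for each token id.
by_lookup = [embedding_matrix[i] for i in token_ids]

# Equivalent dense computation: one-hot vector times the same matrix.
vocab_size = len(embedding_matrix)
by_one_hot = [
    vector_times_matrix(one_hot(i, vocab_size), embedding_matrix)
    for i in token_ids
]

# Zero-masking: positions holding the padding id are flagged as False.
mask = [i != 0 for i in token_ids]
print(by_lookup == by_one_hot, mask)  # True [True, True, False]
```

The lookup touches one row per token, while the one-hot route multiplies by every row of the matrix, which is why the lookup is much cheaper for a 10,000-word vocabulary.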


In [None]:
EMBEDDING_DIM=128

inputs = keras.Input(shape=(MAX_SEQUENCE_LENGTH,))

embedded = keras.layers.Embedding(input_dim=VOCAB_SIZE, 
                                  output_dim=EMBEDDING_DIM, 
                                  mask_zero=True)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(32))(embedded)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.Dense(32)(x)
x = keras.layers.Dropout(0.5)(x)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs, outputs)

## Compile and train the model

We will use the `ModelCheckpoint` callback to save the checkpoint with the best validation accuracy.

In [None]:
def save_best_model(checkpoint_path): 

    model_checkpoint_callback = keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_path,
        save_weights_only=True,
        monitor='val_accuracy',
        mode='max',
        save_best_only=True)
    
    return model_checkpoint_callback
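The effect of `save_best_only=True` with `mode='max'` can be pictured with a small pure-Python sketch. `epochs_that_save` is a hypothetical helper and the accuracy values are made up; it only illustrates the selection rule, not the callback's internals.

```python
def epochs_that_save(val_accuracies):
    """Hypothetical helper: list the epochs at which a 'best only' rule saves."""
    best = float("-inf")
    saved = []
    for epoch, accuracy in enumerate(val_accuracies):
        if accuracy > best:  # mode='max': a higher value counts as better
            best = accuracy
            saved.append(epoch)
    return saved


history = [0.80, 0.84, 0.83, 0.86]  # made-up validation accuracies
saved_epochs = epochs_that_save(history)
print(saved_epochs)  # [0, 1, 3]
```

Epoch 2 is skipped because its accuracy is lower than the best seen so far, so the checkpoint on disk always holds the best weights up to that point.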

Compile and train the model using the `Adam` optimizer and `BinaryCrossentropy` loss. 

In [None]:
model.compile(optimizer='adam',
              loss=keras.losses.BinaryCrossentropy(from_logits=False),
              metrics=['accuracy'])
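The `from_logits=False` setting matches the model's sigmoid output: the loss expects probabilities, not raw scores. The pure-Python sketch below shows, for a single example, that binary cross-entropy computed from a probability agrees with the version computed from the corresponding logit. The function names are hypothetical, and real implementations typically add numerical safeguards that are omitted here.

```python
import math


def sigmoid(z):
    """Squash a raw score (logit) into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(y_true, p):
    """Binary cross-entropy when the model outputs a probability."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def bce_from_logit(y_true, z):
    """The same loss computed directly from the logit."""
    return max(z, 0.0) - z * y_true + math.log1p(math.exp(-abs(z)))


logit = 2.0
prob = sigmoid(logit)
loss_from_prob = bce_from_probability(1, prob)
loss_from_logit = bce_from_logit(1, logit)
print(round(loss_from_prob, 4), round(loss_from_logit, 4))
```

If the last layer had no activation, the loss would need `from_logits=True` instead; mixing the two conventions gives a loss that no longer matches the model's output.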


In [None]:
model.fit(
    int_train_ds, 
    validation_data=int_val_ds,
    epochs=3, 
    callbacks=[save_best_model('best_checkpoint_1_bilstm')])

The model reaches a validation accuracy of around 85% after 3 epochs of training.

Note: Your results may differ slightly, because the weights (including those of the embedding layer) are randomly initialized before training.


Let's evaluate the model on our test dataset.

In [None]:
model.load_weights("best_checkpoint_1_bilstm")
model.evaluate(int_test_ds)

## Prepare Model for Deployment 

As mentioned earlier, it is better to package the TextVectorization layer as part of the model for ease of deployment, so that we can run the raw text directly through the model during inference.

In the code below, we declare an `Input` layer that takes in a string (`shape=(1,)`), add the TextVectorization layer, and then attach both to our previously trained model.

In [None]:
inputs = keras.Input(shape=(1,), dtype="string")
processed_inputs = vectorize_layer(inputs)
outputs = model(processed_inputs)
inference_model = keras.Model(inputs, outputs)


Let's go ahead and save our model. 

In [None]:
inference_model.save('sentiment_model')

Now let us put our model to use! We will first load our saved model.


In [None]:
loaded_model = keras.models.load_model('sentiment_model')

In [None]:
loaded_model.summary()

Run the following cell and type in your own text at the prompt:

In [None]:
text = input("Write your review here:")

In [None]:
text = tf.convert_to_tensor(text)
text = tf.expand_dims(text, axis=0)
pred = loaded_model(text)[0]
print(pred)
if pred >= 0.5: 
    print('positive sentiment')
else:
    print('negative sentiment')

## Stack two or more LSTM layers

Keras recurrent layers have two available modes that are controlled by the `return_sequences` constructor argument:

* If `False`, the layer returns only the last output for each input sequence (a 2D tensor of shape `(batch_size, output_features)`). This is the default, used in the previous model.

* If `True`, the layer returns the full sequence of successive outputs, one per timestep (a 3D tensor of shape `(batch_size, timesteps, output_features)`).
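The two modes can be illustrated with a toy recurrent layer in pure Python. This is a deliberately simplified sketch: `toy_rnn` is a hypothetical function that uses a single scalar state instead of vectors and a plain `tanh` recurrence instead of an LSTM cell, and the weights are made up.

```python
import math


def toy_rnn(sequence, w_in, w_rec, return_sequences=False):
    """Toy recurrent layer with a scalar state (an illustration, not an LSTM)."""
    state = 0.0
    outputs = []
    for x in sequence:
        # The new state depends on the current input and the previous state.
        state = math.tanh(w_in * x + w_rec * state)
        outputs.append(state)
    return outputs if return_sequences else outputs[-1]


inputs = [0.5, -1.0, 0.25, 2.0]

# return_sequences=True: one output per timestep.
all_steps = toy_rnn(inputs, w_in=0.8, w_rec=0.5, return_sequences=True)

# Default: only the output of the last timestep.
last_only = toy_rnn(inputs, w_in=0.8, w_rec=0.5)

# Stacking: the second layer consumes the full sequence from the first.
stacked = toy_rnn(all_steps, w_in=1.2, w_rec=-0.3)

print(len(all_steps), all_steps[-1] == last_only)  # 4 True
```

This is why every recurrent layer except the last one in a stack needs `return_sequences=True`: the next recurrent layer expects one input per timestep, not a single vector.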

Here is what the flow of information looks like with `return_sequences=True`:

<img src="https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/resources/it3103/layered_bidirectional.png"/>

In [None]:
inputs = keras.Input(shape=(MAX_SEQUENCE_LENGTH,))
embedded = keras.layers.Embedding(input_dim=VOCAB_SIZE, 
                                  output_dim=EMBEDDING_DIM,
                                  mask_zero=True,
                                  name='embedding')(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(64,  return_sequences=True))(embedded)
x = keras.layers.Bidirectional(keras.layers.LSTM(32))(x)
x = keras.layers.Dropout(0.4)(x)
x = keras.layers.Dense(64, activation='relu')(x)
x = keras.layers.Dropout(0.4)(x)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs, outputs)

In [None]:
model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=False),
              optimizer=keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [None]:
model.fit(int_train_ds, 
          epochs=5,
          validation_data=int_val_ds,
          callbacks=[save_best_model('best_checkpoint_1_stackedlstm')]
)

In [None]:
model.load_weights("best_checkpoint_1_stackedlstm")
model.evaluate(int_test_ds)

As before, we prepare the model for deployment by adding the vectorization layer in front of it.

In [None]:
inputs = keras.Input(shape=(1,), dtype="string")
processed_inputs = vectorize_layer(inputs)
outputs = model(processed_inputs)
inference_model = keras.Model(inputs, outputs)
inference_model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=False),
              optimizer=keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [None]:
inference_model.evaluate(test_ds)

In [None]:
sample_text = "I kind of dozed off halfway through the show."
# sample_text = "This movie has to be the worst I have seen. "
text = tf.convert_to_tensor(sample_text)
text = tf.expand_dims(text, axis=0)
pred = inference_model(text, training=False)[0]
if pred >= 0.5: 
    print(f'positive sentiment: {pred}')
else:
    print(f'negative sentiment: {pred}')

## Exercises

For the first model, experiment with any of the following to see whether the validation accuracy gets better or worse.

1. Increase/decrease vocabulary size 
2. Increase/decrease Embedding dimensions 
3. Use uni-directional LSTM instead of bidirectional LSTM.

Provide a plausible explanation for your observations.
