## Introduction

The IMDb reviews dataset consists of 50,000 movie reviews (25,000 for training, 25,000 for testing), each labeled with a binary target indicating whether the review is negative (0) or positive (1).

Reference: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron

## Data Preprocessing

In [1]:
import tensorflow as tf
from tensorflow import keras

We will start by loading the original IMDb reviews, as text, using TensorFlow Datasets:

In [2]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples
test_size = info.splits["test"].num_examples

In [3]:
train_size, test_size

(25000, 25000)

Next we will define a preprocessing function:

In [4]:
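def preprocess(X_batch, y_batch):
  # Keep only the first 300 characters of each review
  X_batch = tf.strings.substr(X_batch, 0, 300)
  # Replace HTML line breaks (<br>, <br/>, <br />) with spaces
  X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
  # Replace everything except letters and apostrophes with spaces
  X_batch = tf.strings.regex_replace(X_batch, rb"[^a-zA-Z']", b" ")
  # Split on whitespace; the result is a ragged tensor
  X_batch = tf.strings.split(X_batch)
  # Pad to a dense tensor so every review in the batch has the same length
  return X_batch.to_tensor(default_value=b"<pad>"), y_batch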

The function above performs the following steps:
1. Truncates the reviews, keeping only the first 300 characters. This speeds up training and should not hurt performance much, since the sentiment of a review is usually apparent within its first sentence or two.
2. Replaces HTML line-break tags with spaces.
3. Replaces any characters other than letters and apostrophes with spaces.
4. Splits the reviews on whitespace, which returns a ragged tensor.
5. Converts the ragged tensor to a dense tensor, padding all reviews with the padding token so that they all have the same length.
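
To make these steps concrete, here is a minimal pure-Python sketch of the same pipeline applied to a toy batch. The helper name preprocess_one and the sample reviews are hypothetical and only serve as an illustration; the notebook itself uses the TensorFlow version above. One difference to keep in mind: tf.strings.substr counts bytes by default, whereas Python slicing counts characters.

```python
import re

def preprocess_one(review, max_chars=300, pad_to=None, pad_token="<pad>"):
    """Pure-Python analogue of preprocess() for one review (illustrative only)."""
    review = review[:max_chars]                    # truncate
    review = re.sub(r"<br\s*/?>", " ", review)     # replace <br /> tags with spaces
    review = re.sub(r"[^a-zA-Z']", " ", review)    # keep letters and apostrophes only
    words = review.split()                         # split on whitespace
    if pad_to is not None:                         # pad to a common length
        words += [pad_token] * (pad_to - len(words))
    return words

# Hypothetical toy batch, not taken from the dataset
batch = ["Great movie!<br />Loved it.", "Don't bother."]

# First pass: tokenize (this corresponds to the ragged tensor)
tokenized = [preprocess_one(r) for r in batch]
longest = max(len(words) for words in tokenized)

# Second pass: pad every review to the longest one (the dense tensor)
padded = [preprocess_one(r, pad_to=longest) for r in batch]
print(padded)
# [['Great', 'movie', 'Loved', 'it'], ["Don't", 'bother', '<pad>', '<pad>']]
```

Padding is done per batch, so the padded length depends on the longest review in that particular batch.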

Now we will construct the vocabulary, by going through the whole training set once, applying the preprocess() function, and using a Counter to count the number of occurrences of each word:

In [5]:
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
  for review in X_batch:
    vocabulary.update(list(review.numpy()))

Let's look at the three most common words:

In [6]:
vocabulary.most_common(3)

[(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]

In [7]:
len(vocabulary)

53893

We will keep only the 10,000 most common words and drop the rest:

In [8]:
vocab_size = 10000
truncated_vocabulary = [word for word, count in vocabulary.most_common(vocab_size)]
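
The counting and truncation steps can be tried out on a small scale. The sketch below uses hypothetical, already-tokenized toy batches (the names toy_batches, toy_vocabulary and toy_vocab_size are made up for this illustration) and keeps only the three most frequent tokens.

```python
from collections import Counter

# Hypothetical toy data: two batches of padded, tokenized reviews
toy_batches = [
    [["the", "movie", "was", "good"], ["the", "end", "<pad>", "<pad>"]],
    [["the", "movie", "<pad>", "<pad>"]],
]

# Count every token once, batch by batch
toy_vocabulary = Counter()
for batch in toy_batches:
    for review in batch:
        toy_vocabulary.update(review)

# Keep only the most frequent tokens, most frequent first
toy_vocab_size = 3
toy_truncated = [word for word, count in toy_vocabulary.most_common(toy_vocab_size)]
print(toy_truncated)  # ['<pad>', 'the', 'movie']
```

As in the real vocabulary, the padding token ends up first because it is counted along with the words.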

Now we need to add a preprocessing step that replaces each word with its ID (i.e., its index in the truncated vocabulary). We will create a lookup table for this, using 1,000 out-of-vocabulary (OOV) buckets:

In [9]:
words = tf.constant(truncated_vocabulary)
words_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, words_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

We can use this table to look up the IDs of a few words:

In [10]:
table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10053]])>
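
The first three words are in the vocabulary, so their IDs are below 10,000. The last word is not, so it is hashed into one of the 1,000 OOV buckets and receives an ID of 10,000 or more. The following sketch mimics that behaviour in plain Python. It is only an approximation: make_lookup is a hypothetical helper, and zlib.crc32 is a stand-in for the hash function that StaticVocabularyTable uses internally, so the bucket numbers will not match TensorFlow's.

```python
import zlib

def make_lookup(vocab, num_oov_buckets):
    """Return a word -> ID function with hashed out-of-vocabulary buckets."""
    word_to_id = {word: i for i, word in enumerate(vocab)}

    def lookup(word):
        if word in word_to_id:
            return word_to_id[word]          # known word: its index in the vocabulary
        # Unknown word: hash it into a bucket placed after the known IDs.
        # crc32 is an assumption here, not the hash TensorFlow uses.
        bucket = zlib.crc32(word.encode("utf-8")) % num_oov_buckets
        return len(vocab) + bucket

    return lookup

toy_lookup = make_lookup(["<pad>", "the", "movie"], num_oov_buckets=5)
ids = [toy_lookup(w) for w in ["the", "movie", "faaaaaantastic"]]
print(ids[:2])  # [1, 2]
print(ids[2])   # some ID between 3 and 7 (an OOV bucket)
```

Two different unknown words may land in the same bucket; using more buckets reduces such collisions at the cost of a larger embedding matrix.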

Next we will define a function to encode the preprocessed words:

In [11]:
def encode_words(X_batch, y_batch):
  return table.lookup(X_batch), y_batch

Now we are ready to create the final training set:

In [12]:
train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

## Model Building

In [13]:
embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size, mask_zero=True, input_shape=[None]), 
    keras.layers.GRU(128, return_sequences=True), 
    keras.layers.GRU(128), 
    keras.layers.Dense(1, activation="sigmoid")])

1. The first layer is an Embedding layer, which converts word IDs into embeddings. Its output is a 3D tensor of shape [batch_size, time_steps, embed_size].
2. It is followed by two GRU layers. The first returns the full output sequence, and the second returns only the output of the last time step.
3. The output layer is a single neuron using the sigmoid activation function to output the estimated probability that the review expresses a positive sentiment regarding the movie.
4. We set mask_zero=True in the Embedding layer so that the following layers ignore the padding tokens. This works because the padding token is the most frequent entry in the vocabulary, so its ID is 0.
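
The effect of masking can be sketched without Keras. In the toy example below, compute_mask marks every non-zero ID as a real token, and toy_recurrent_last_state is a hypothetical stand-in for a recurrent layer (a simple decayed running sum, not a GRU) that skips masked time steps. Both functions and the input values are assumptions made for illustration.

```python
def compute_mask(id_batch, pad_id=0):
    """True where the token is real, False where it is padding."""
    return [[token_id != pad_id for token_id in review] for review in id_batch]

def toy_recurrent_last_state(values, mask):
    """Toy recurrence: update the state only on unmasked time steps."""
    state = 0.0
    for value, keep in zip(values, mask):
        if keep:
            state = 0.5 * state + value
    return state

ids = [[22, 12, 11, 0, 0]]                 # one review followed by two padding IDs
mask = compute_mask(ids)
print(mask)  # [[True, True, True, False, False]]

padded_inputs = [1.0, 2.0, 3.0, 9.0, 9.0]  # per-step inputs; the last two belong to padding
unpadded_inputs = [1.0, 2.0, 3.0]

with_mask = toy_recurrent_last_state(padded_inputs, mask[0])
without_padding = toy_recurrent_last_state(unpadded_inputs, [True, True, True])
print(with_mask, without_padding)  # 4.25 4.25
```

With the mask applied, the final state is the same as if the review had never been padded, which is the behaviour we want from the real model.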

In [14]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, epochs=20)

We can also reuse pretrained embeddings. In this case we will use the nnlm-en-dim50 sentence embedding module from the TensorFlow Hub repository:

In [15]:
import tensorflow_hub as hub

model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1", 
                   dtype=tf.string, input_shape=[], output_shape=[50]), 
    keras.layers.Dense(128, activation="relu"), 
    keras.layers.Dense(1, activation="sigmoid")])

This particular module takes strings as input and encodes each one as a single 50-dimensional vector. It computes the mean of all the word embeddings, and the result is the sentence embedding. Note that hub.KerasLayer is not trainable by default, so the pretrained weights stay frozen during training unless trainable=True is set.
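
The averaging idea can be illustrated with a tiny hand-made embedding table. The sketch below is not the module's actual implementation: the function sentence_embedding, the two-dimensional toy vectors, and the choice to map unknown words to a zero vector are all assumptions made for this example.

```python
def sentence_embedding(sentence, word_vectors, dim):
    """Average the word vectors of a sentence (unknown words count as zeros)."""
    words = sentence.split()
    vectors = [word_vectors.get(word, [0.0] * dim) for word in words]
    if not vectors:                       # empty sentence: return a zero vector
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical 2-dimensional embeddings
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}

emb = sentence_embedding("good movie", toy_vectors, dim=2)
print(emb)  # [0.5, 0.5]
```

Because the result has a fixed size regardless of sentence length, it can be fed directly into a Dense layer, as in the model above.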

In [16]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

Next we can just load the IMDb reviews dataset without preprocessing it, and directly train the model:

In [17]:
# Reload the raw text dataset; the hub layer consumes strings directly
datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples
batch_size = 32
train_set = datasets["train"].batch(batch_size).prefetch(1)

In [18]:
history = model.fit(train_set, epochs=10)

Although this model is less accurate than the previous one, it takes much less time to train per epoch. We could also still use more complex layers (e.g., LSTM and GRU layers) to try to improve its accuracy.