# IMDB Movie Review Sentiment Classification
__

**Author:** Amritesh Kumar, Neuraldemy  
**Course:** Natural Language Processing  
**Notebook No:** 02.5   
**Website:** https://neuraldemy.com/  
__

Readers are expected to have gone through the theory discussed in our free NLP tutorial and the previous notebooks.


In [5]:
import os
import keras
import tensorflow as tf
import numpy as np
from keras import layers

**Dataset Name:** ACLImdb Movie Review Sentiment Analysis Dataset

**Dataset Description:** The ACLImdb dataset (also known as the Large Movie Review Dataset) is a standard benchmark for binary sentiment analysis in natural language processing (NLP). It consists of movie reviews collected from the IMDb website, each labeled as expressing a positive or a negative sentiment towards the movie. The labeled data is split into 25,000 training reviews and 25,000 test reviews; the training folder also contains unlabeled reviews (`unsup`), which we do not use here.

### Loading The Raw Dataset

In [26]:
# Download and extract the dataset using these commands
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz && tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  5431k      0  0:00:15  0:00:15 --:--:-- 6909k


In [28]:
!ls aclImdb/train

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


In [29]:
# remove the unlabeled reviews so that only the pos/ and neg/ class folders remain
!rm -r aclImdb/train/unsup

In [30]:
# create the training and validation sets with Keras (80/20 split of aclImdb/train)
batch_size = 32
raw_train_ds, raw_val_ds = keras.utils.text_dataset_from_directory("aclImdb/train",
                                                      batch_size = batch_size,
                                                      validation_split = 0.2,
                                                      subset = "both",
                                                      seed = 42)  # fixed seed: reproducible shuffle and split

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Using 5000 files for validation.


In [31]:
# create test set
raw_test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size = batch_size)

Found 25000 files belonging to 2 classes.


In [23]:
print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

Number of batches in raw_train_ds: 1875
Number of batches in raw_val_ds: 469
Number of batches in raw_test_ds: 782
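`cardinality()` counts batches rather than reviews: it is the number of files divided by the batch size, rounded up because the last batch may be smaller. A minimal sketch of that arithmetic, using the file counts reported by `text_dataset_from_directory` (the helper name `num_batches` is ours and not part of Keras):

```python
import math

def num_batches(num_files, batch_size=32):
    """Number of batches when the last, possibly smaller, batch is kept."""
    return math.ceil(num_files / batch_size)

print(num_batches(25_000))  # test set: 782
print(num_batches(20_000))  # 80% training split: 625
print(num_batches(5_000))   # 20% validation split: 157
print(num_batches(60_000))  # 1875 batches would correspond to 60,000 files
```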


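Each batch holds up to 32 (review, label) pairs. Note that the train and validation counts above come from an earlier run of this cell, made before the `unsup` folder was removed; with 20,000 training and 5,000 validation files the datasets have 625 and 157 batches, which matches the 625 steps per epoch in the training log. We can inspect one batch this way.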

In [37]:
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(2):
        print("text:")
        print(text_batch.numpy()[i])
        print()  # blank line between the review and its label
        print("Label:")
        print(label_batch.numpy()[i])

text:
b'Great movie - especially the music - Etta James - "At Last". This speaks volumes when you have finally found that special someone.'

Label:
0
text:
b"I am shocked. Shocked and dismayed that the 428 of you IMDB users who voted before me have not given this film a rating of higher than 7. 7?!?? - that's a C!. If I could give FOBH a 20, I'd gladly do it. This film ranks high atop the pantheon of modern comedy, alongside Half Baked and Mallrats, as one of the most hilarious films of all time. If you know _anything_ about rap music - YOU MUST SEE THIS!! If you know nothing about rap music - learn something!, and then see this! Comparisons to 'Spinal Tap' fail to appreciate the inspired genius of this unique film. If you liked Bob Roberts, you'll love this. Watch it and vote it a 10!"

Label:
1


### Preparing The Data - Normalization

We now lowercase the text, replace the HTML line-break tags (`<br />`) with spaces, and strip punctuation.

In [56]:
import string
import re

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )
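To see what this standardization does to a single review, here is a plain-Python counterpart that can be run without TensorFlow. It is only a sketch: `standardize_py` and the sample sentence are illustrative, and Python's `re` module stands in for the regex engine used by `tf.strings.regex_replace`.

```python
import re
import string

def standardize_py(text):
    """Plain-Python counterpart of custom_standardization for one string."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    # remove every ASCII punctuation character
    return re.sub(f"[{re.escape(string.punctuation)}]", "", stripped_html)

sample = "This movie was GREAT!<br />Loved it."
print(standardize_py(sample))  # this movie was great loved it
```

Because apostrophes count as punctuation, contractions such as "don't" become "dont" after this step.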

In [57]:
max_features = 20000   # vocabulary size: keep only the most frequent tokens
embedding_dim = 128    # size of each learned word vector
sequence_length = 500  # pad or truncate every review to 500 tokens

vectorizer_layer = layers.TextVectorization(standardize = custom_standardization,
                                            max_tokens = max_features,
                                            output_mode = "int",
                                            output_sequence_length = sequence_length)
text_ds = raw_train_ds.map(lambda x, y: x)
vectorizer_layer.adapt(text_ds)
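Conceptually, `adapt` builds a frequency-ranked vocabulary, and the layer then maps each token to its integer index and pads or truncates to a fixed length. The pure-Python sketch below imitates that behaviour on a toy corpus. It is a simplification under stated assumptions: `build_vocab` and `vectorize` are hypothetical helpers, index 0 is reserved for padding and index 1 for out-of-vocabulary tokens (as in `TextVectorization` with `output_mode = "int"`), and tokens are split on whitespace.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to ids, reserving 0 (padding) and 1 (OOV)."""
    counts = Counter(tok for text in texts for tok in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: idx + 2 for idx, word in enumerate(most_common)}

def vectorize(text, vocab, sequence_length):
    """Convert text to a fixed-length list of token ids."""
    ids = [vocab.get(tok, 1) for tok in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["good movie", "bad movie", "good good plot"]  # toy data
vocab = build_vocab(corpus, max_tokens=4)               # room for 2 real words

print(vocab)                                    # {'good': 2, 'movie': 3}
print(vectorize("good boring movie", vocab, 5)) # [2, 1, 3, 0, 0]
```

Fitting the vocabulary on the training text only, as the cell above does, keeps validation and test words from leaking into the vocabulary.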

In [62]:
# vectorize the data: map each raw review to a sequence of integer token ids
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)  # shape (batch,) -> (batch, 1)
    return vectorizer_layer(text), label

train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

In [63]:
# cache the vectorized data and prefetch batches so input preparation overlaps with training
train_ds = train_ds.cache().prefetch(buffer_size = 10)
val_ds = val_ds.cache().prefetch(buffer_size = 10)
test_ds = test_ds.cache().prefetch(buffer_size = 10)

### Create First Model: Baseline

In [67]:
from keras.callbacks import EarlyStopping, ReduceLROnPlateau

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
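# Halve the learning rate when val_loss has not improved for 2 epochs.
# Note: the keyword is `factor`; the run logged below passed `fraction`, which was not applied,
# so its log shows the default 10x reduction instead.
reduce_lr_on_plateu = ReduceLROnPlateau(monitor = "val_loss", factor = 0.5, patience = 2, verbose = 1)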

inputs = keras.Input(shape = (None,), dtype = "int64")
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)
x = layers.Conv1D(128, 7, padding = 'valid', activation = "relu", strides = 3)(x)
x = layers.Conv1D(128, 7, padding = "valid", activation = "relu", strides = 3)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation = "relu")(x)
x = layers.Dropout(0.5)(x)

outputs = layers.Dense(1, activation = "sigmoid", name = "predictions")(x)

model = keras.Model(inputs, outputs)
model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])

print("Model Summary:")
model.summary()

print("Training The Model Now:")

model.fit(train_ds, validation_data=val_ds, epochs= 20, callbacks = [early_stopping, reduce_lr_on_plateu])

Model Summary:


Training The Model Now:
Epoch 1/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 53s 82ms/step - accuracy: 0.5706 - loss: 0.6404 - val_accuracy: 0.8696 - val_loss: 0.3255 - learning_rate: 0.0010
Epoch 2/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 82s 83ms/step - accuracy: 0.8793 - loss: 0.2926 - val_accuracy: 0.8832 - val_loss: 0.2962 - learning_rate: 0.0010
Epoch 3/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 52s 83ms/step - accuracy: 0.9445 - loss: 0.1512 - val_accuracy: 0.8672 - val_loss: 0.4487 - learning_rate: 0.0010
Epoch 4/20
Epoch 4: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
625/625 ━━━━━━━━━━━━━━━━━━━━ 52s 82ms/step - accuracy: 0.9695 - loss: 0.0830 - val_accuracy: 0.8364 - val_loss: 0.6670 - learning_rate: 0.0010
Epoch 5/20
625/625 ━━━━━

<keras.src.callbacks.history.History at 0x7b04dc7b3970>
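As a rough check on the architecture, the sequence length after each convolution and the number of trainable parameters can be worked out by hand. The sketch below assumes the standard formulas for a `Conv1D` layer with `padding = "valid"` and for dense layers with a bias term; the helper names are ours.

```python
def conv1d_valid_length(length, kernel_size, strides):
    """Output length of a Conv1D layer with padding='valid'."""
    return (length - kernel_size) // strides + 1

def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def dense_params(in_units, units):
    """Weight matrix plus one bias per unit."""
    return in_units * units + units

sequence_length, max_features, embedding_dim = 500, 20000, 128

after_conv1 = conv1d_valid_length(sequence_length, kernel_size=7, strides=3)
after_conv2 = conv1d_valid_length(after_conv1, kernel_size=7, strides=3)
print(after_conv1, after_conv2)  # 165 53

total_params = (
    max_features * embedding_dim   # Embedding
    + conv1d_params(128, 128, 7)   # first Conv1D
    + conv1d_params(128, 128, 7)   # second Conv1D
    + dense_params(128, 128)       # hidden Dense
    + dense_params(128, 1)         # sigmoid output
)
print(total_params)  # 2806273
```

The two strided convolutions shrink each 500-token review to 53 positions before `GlobalMaxPooling1D` reduces it to a single 128-dimensional vector. Most of the parameters sit in the embedding table.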

In [70]:
early_stopping.best

0.2962287962436676
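This value is the validation loss of epoch 2, the lowest one in the log. Both callbacks rely on the same "patience" idea: track the best monitored value and count epochs without improvement. The following is a simplified sketch of that logic (it ignores `min_delta` and `cooldown`, and `plateau_events` is a hypothetical helper, not a Keras function), applied to the four validation losses visible in the log above.

```python
def plateau_events(val_losses, patience):
    """Return the best loss and the 1-based epochs where patience ran out."""
    best = float("inf")
    wait = 0
    events = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # new best value: reset the counter
        else:
            wait += 1
            if wait >= patience:
                events.append(epoch)
                wait = 0
    return best, events

logged_val_losses = [0.3255, 0.2962, 0.4487, 0.6670]  # epochs 1-4 from the log

print(plateau_events(logged_val_losses, patience=2))  # (0.2962, [4])
print(plateau_events(logged_val_losses, patience=5))  # (0.2962, [])
```

With a patience of 2 the learning rate is reduced at epoch 4, as in the log, while early stopping with a patience of 5 has not yet triggered after four epochs. Since `restore_best_weights=True`, the model evaluated afterwards uses the weights from the best epoch.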

In [71]:
model.evaluate(test_ds)

782/782 ━━━━━━━━━━━━━━━━━━━━ 15s 19ms/step - accuracy: 0.8701 - loss: 0.3177


[0.31076136231422424, 0.8747199773788452]
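`model.evaluate` returns the values in the order `[loss, accuracy]`, so the baseline reaches a test accuracy of about 87.5% with a binary cross-entropy loss of about 0.311. To make the two numbers concrete, here is a small pure-Python sketch of how they are computed from sigmoid outputs. The labels and probabilities are made up for illustration, and the clipping constant `eps` is an assumption chosen to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]          # hypothetical labels
y_prob = [0.9, 0.2, 0.6, 0.7]  # hypothetical sigmoid outputs

loss = binary_crossentropy(y_true, y_prob)
acc = binary_accuracy(y_true, y_prob)
print([loss, acc])
```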