# IMDB Movie Review Sentiment Classification

**Dataset Name:** ACLImdb Movie Review Sentiment Analysis Dataset

**Dataset Description:** The ACLImdb dataset is a standard benchmark for binary sentiment analysis in natural language processing (NLP). It consists of movie reviews collected from the IMDb website, and each review is labeled as expressing either positive or negative sentiment towards the movie. The labeled data is split into 25,000 training reviews and 25,000 test reviews.

In [35]:
import os
import keras
import tensorflow as tf
import numpy as np
from keras import layers

### Loading The Raw Dataset

In [3]:
# Download and extract the dataset using these commands
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz && tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  12.3M      0  0:00:06  0:00:06 --:--:-- 13.0M


In [4]:
!ls aclImdb/train

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


In [5]:
# remove the unlabeled reviews, which are not needed for supervised training
!rm -r aclImdb/train/unsup

In [6]:
# Create the training and validation sets from the labeled training folder
batch_size = 32
raw_train_ds, raw_val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size = batch_size,
    validation_split = 0.2,
    subset = "both",
    seed = 42,  # a fixed seed keeps the shuffled training and validation subsets from overlapping
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Using 5000 files for validation.


In [7]:
# create test set
raw_test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size = batch_size)

Found 25000 files belonging to 2 classes.


In [8]:
print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

Number of batches in raw_train_ds: 625
Number of batches in raw_val_ds: 157
Number of batches in raw_test_ds: 782
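
The batch counts printed above follow from the split sizes and the batch size: each dataset yields `ceil(n / 32)` batches, so the final validation and test batches are only partially filled. A quick check in plain Python (the helper name `num_batches` is introduced here only for illustration):

```python
import math

def num_batches(num_examples: int, batch_size: int) -> int:
    """Number of batches when the final, partially filled batch is kept."""
    return math.ceil(num_examples / batch_size)

batch_size = 32
for name, n in [("train", 20000), ("val", 5000), ("test", 25000)]:
    # n % batch_size is 0 when the last batch is full, so fall back to batch_size
    last = n % batch_size or batch_size
    print(f"{name}: {num_batches(n, batch_size)} batches, last batch holds {last} examples")
```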


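With a batch size of 32, the 20,000 training reviews fill 625 batches, and each batch is a pair of tensors: the review texts and their labels. We can inspect the first two examples of one batch this way.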

In [9]:
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(2):
        print("text:")
        print(text_batch.numpy()[i])
        print()
        print("Label:")
        print(label_batch.numpy()[i])

text:
b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'

Label:
0
text:
b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get int

### Preparing The Data - Normalization

Next we normalize the raw text: lowercase it, replace the `<br />` HTML tags with spaces, and strip punctuation.

In [10]:
import string
import re

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )
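
To see what this standardization does to a review, here is a plain-Python mirror of the same three steps, applied to the tail of the first review printed earlier. It is only a sketch: `standardize_py` is a hypothetical helper using `str` and `re` instead of `tf.strings`, so it runs without TensorFlow, and it assumes the two regex engines treat this character class identically.

```python
import re
import string

def standardize_py(text: str) -> str:
    """Plain-Python counterpart of custom_standardization (illustrative only)."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    # Same character class as in the notebook: every ASCII punctuation mark
    return re.sub(f"[{re.escape(string.punctuation)}]", "", stripped_html)

sample = "How bizarre is that?<br /><br />*1/2 (out of four)"
cleaned = standardize_py(sample)
print(repr(cleaned))
```

Note that stripping punctuation also turns the rating `*1/2` into the token `12`, and the two `<br />` tags leave a double space that the later whitespace split ignores.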

In [11]:
max_features = 20000   # vocabulary size: keep the 20,000 most frequent tokens
embedding_dim = 128    # size of each token embedding vector
sequence_length = 500  # pad or truncate every review to 500 tokens

vectorizer_layer = layers.TextVectorization(standardize = custom_standardization,
                                                 max_tokens = max_features,
                                                 output_mode = "int",
                                                 output_sequence_length = sequence_length,)
text_ds = raw_train_ds.map(lambda x, y: x)
vectorizer_layer.adapt(text_ds)
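
Conceptually, `adapt` builds a frequency-ranked vocabulary, and calling the layer maps each token to its index and then pads or truncates to `output_sequence_length`. The toy version below sketches that idea in plain Python. It assumes the layer's default `"int"` mode conventions (index 0 for padding, index 1 for out-of-vocabulary tokens); `build_vocab` and `vectorize` are hypothetical helpers, not Keras functions.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # reserved indices, as assumed in the lead-in

def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to indices 2, 3, ... (toy stand-in for adapt)."""
    counts = Counter(tok for text in texts for tok in text.split())
    kept = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(kept)}

def vectorize(text, vocab, sequence_length):
    """Convert tokens to ids, truncate, then right-pad with PAD_ID."""
    ids = [vocab.get(tok, OOV_ID) for tok in text.split()][:sequence_length]
    return ids + [PAD_ID] * (sequence_length - len(ids))

corpus = ["good good good movie movie", "bad"]
vocab = build_vocab(corpus, max_tokens=4)   # room for 2 real tokens
print(vocab)
print(vectorize("good bad movie film", vocab, sequence_length=6))
```

Here `bad` falls outside the 4-entry vocabulary and `film` was never seen, so both map to the out-of-vocabulary index.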

In [12]:
# vectorize the data
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorizer_layer(text), label

train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

In [13]:
# cache the vectorized data after the first pass and prefetch batches so the GPU is not left waiting on input
train_ds = train_ds.cache().prefetch(buffer_size = 10)
val_ds = val_ds.cache().prefetch(buffer_size = 10)
test_ds = test_ds.cache().prefetch(buffer_size = 10)

### Create First Model: Baseline

In [14]:
from keras.callbacks import EarlyStopping, ReduceLROnPlateau

early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
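# Scale the learning rate by `factor` when val_loss has not improved for 2 epochs;
# 0.1 reproduces the 1e-3 -> 1e-4 reduction recorded in the training log.
reduce_lr_on_plateu = ReduceLROnPlateau(monitor = "val_loss", factor = 0.1, patience = 2, verbose = 1)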

inputs = keras.Input(shape = (None,), dtype = "int64")
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)
x = layers.Conv1D(128, 7, padding = 'valid', activation = "relu", strides = 3)(x)
x = layers.Conv1D(128, 7, padding = "valid", activation = "relu", strides = 3)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation = "relu")(x)
x = layers.Dropout(0.5)(x)

outputs = layers.Dense(1, activation = "sigmoid", name = "predictions")(x)

model = keras.Model(inputs, outputs)
model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])

print("Model Summary:")
model.summary()

print("Training The Model Now:")

model.fit(train_ds, validation_data=val_ds, epochs= 20, callbacks = [early_stopping, reduce_lr_on_plateu])

Model Summary:


Training The Model Now:
Epoch 1/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 62s 95ms/step - accuracy: 0.5810 - loss: 0.6296 - val_accuracy: 0.8688 - val_loss: 0.3046 - learning_rate: 0.0010
Epoch 2/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 57s 92ms/step - accuracy: 0.8863 - loss: 0.2840 - val_accuracy: 0.8840 - val_loss: 0.2940 - learning_rate: 0.0010
Epoch 3/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 82s 91ms/step - accuracy: 0.9476 - loss: 0.1453 - val_accuracy: 0.8704 - val_loss: 0.4182 - learning_rate: 0.0010
Epoch 4/20
Epoch 4: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
625/625 ━━━━━━━━━━━━━━━━━━━━ 58s 93ms/step - accuracy: 0.9758 - loss: 0.0711 - val_accuracy: 0.8786 - val_loss: 0.4601 - learning_rate: 0.0010
Epoch 5/20
625/625 ━━━━━

<keras.src.callbacks.history.History at 0x7cc4672a2a40>
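
Since the summary table is not reproduced above, the layer sizes of the baseline can be worked out by hand. The sketch below assumes inputs padded to `sequence_length = 500` and the usual parameter layouts (weights plus one bias per output unit or filter); the helper names are illustrative only.

```python
def conv1d_out_len(length: int, kernel: int, stride: int) -> int:
    """Output length of a Conv1D layer with padding='valid'."""
    return (length - kernel) // stride + 1

def conv1d_params(in_channels: int, filters: int, kernel: int) -> int:
    return kernel * in_channels * filters + filters  # weights + biases

def dense_params(in_units: int, units: int) -> int:
    return in_units * units + units  # weights + biases

max_features, embedding_dim, sequence_length = 20000, 128, 500

len1 = conv1d_out_len(sequence_length, kernel=7, stride=3)  # after first Conv1D
len2 = conv1d_out_len(len1, kernel=7, stride=3)             # after second Conv1D

params = {
    "embedding": max_features * embedding_dim,
    "conv1d_1": conv1d_params(embedding_dim, 128, 7),
    "conv1d_2": conv1d_params(128, 128, 7),
    "dense": dense_params(128, 128),
    "predictions": dense_params(128, 1),
}
total = sum(params.values())

print(f"sequence length: {sequence_length} -> {len1} -> {len2}")
for name, count in params.items():
    print(f"{name:<12} {count:>10,}")
print(f"{'total':<12} {total:>10,}")
```

The embedding table accounts for most of the parameters, which is one reason the dropout layer sits directly after it.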

In [15]:
early_stopping.best

0.2940349280834198
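
The value of `early_stopping.best` is the epoch 2 validation loss, the lowest in the log. The same log also shows how the plateau callback behaves: after two epochs without improvement (epochs 3 and 4) the learning rate drops from 1e-3 to 1e-4. The simulation below is a simplified sketch of that bookkeeping, not the Keras implementation; it assumes a reduction factor of 0.1 (what the log shows), no cooldown, and uses the rounded `val_loss` values from the log.

```python
def simulate_plateau(val_losses, lr, factor, patience, min_delta=1e-4):
    """Track the best val_loss and the learning rate in effect after each epoch."""
    best, wait, lrs = float("inf"), 0, []
    for loss in val_losses:
        if loss < best - min_delta:   # improvement: remember it and reset the counter
            best, wait = loss, 0
        else:                         # no improvement: count towards patience
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        lrs.append(lr)
    return best, lrs

val_losses = [0.3046, 0.2940, 0.4182, 0.4601]  # epochs 1-4 of the baseline run
best, lrs = simulate_plateau(val_losses, lr=1e-3, factor=0.1, patience=2)
print("best val_loss:", best)
print("learning rate after each epoch:", lrs)
```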

In [16]:
model.evaluate(test_ds)

782/782 ━━━━━━━━━━━━━━━━━━━━ 17s 22ms/step - accuracy: 0.8748 - loss: 0.3157


[0.31300050020217896, 0.8762800097465515]

In [17]:
model.save("model1.keras")  # native Keras format; a ".h5" extension selects the legacy HDF5 format

### Using Bidirectional LSTM

In [19]:
inputs = keras.Input(shape = (None,), dtype = "int64")
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Bidirectional(layers.LSTM(32, return_sequences = True))(x)
x = layers.GlobalMaxPool1D()(x)
x = layers.Dense(20, activation = "relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation = "sigmoid", name = "predictions")(x)

model = keras.Model(inputs, outputs)
model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])

print("Model Summary:")
model.summary()

print("Training The Model Now:")

model.fit(train_ds, validation_data=val_ds, epochs= 20, callbacks = [early_stopping, reduce_lr_on_plateu])

Model Summary:


Training The Model Now:
Epoch 1/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 195s 305ms/step - accuracy: 0.6830 - loss: 0.5569 - val_accuracy: 0.8870 - val_loss: 0.2803 - learning_rate: 0.0010
Epoch 2/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 191s 306ms/step - accuracy: 0.9085 - loss: 0.2514 - val_accuracy: 0.8886 - val_loss: 0.2968 - learning_rate: 0.0010
Epoch 3/20
Epoch 3: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
625/625 ━━━━━━━━━━━━━━━━━━━━ 191s 306ms/step - accuracy: 0.9535 - loss: 0.1501 - val_accuracy: 0.8860 - val_loss: 0.3317 - learning_rate: 0.0010
Epoch 4/20
625/625 ━━━━━━━━━━━━━━━━━━━━ 190s 305ms/step - accuracy: 0.9777 - loss: 0.0788 - val_accuracy: 0.8906 - val_loss: 0.3641 - learning_rate: 1.0000e-04
Epoch 5/20
625/625

<keras.src.callbacks.history.History at 0x7cc434cd6ad0>
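
As with the baseline, the summary table is missing here, so the parameter counts of the bidirectional model can be derived by hand. This sketch assumes the standard LSTM layout (four gates, each with an input kernel, a recurrent kernel, and a bias) and that `Bidirectional` concatenates the forward and backward outputs into 64 features per time step; the helper names are illustrative.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (input_dim * units + units * units + units)

def dense_params(in_units: int, units: int) -> int:
    return in_units * units + units

max_features, embedding_dim, lstm_units = 20000, 128, 32

params = {
    "embedding": max_features * embedding_dim,
    "bidirectional": 2 * lstm_params(embedding_dim, lstm_units),  # forward + backward
    "dense": dense_params(2 * lstm_units, 20),                    # 64 pooled features in
    "predictions": dense_params(20, 1),
}
total = sum(params.values())

for name, count in params.items():
    print(f"{name:<14} {count:>10,}")
print(f"{'total':<14} {total:>10,}")
```

Under these assumptions the recurrent model is slightly smaller than the convolutional baseline, even though each training step in the log takes roughly three times as long.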

In [20]:
early_stopping.best

0.2803375720977783

In [21]:
model.evaluate(test_ds)

782/782 ━━━━━━━━━━━━━━━━━━━━ 53s 68ms/step - accuracy: 0.8759 - loss: 0.2968


[0.290151983499527, 0.8795599937438965]
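
`model.evaluate` returns `[loss, accuracy]`, so the two test results can be put side by side. The snippet below only does arithmetic on the numbers reported in this notebook, converting each accuracy into a count of correctly classified reviews out of the 25,000 test examples.

```python
test_size = 25000

# [loss, accuracy] pairs copied from the two model.evaluate(test_ds) outputs
results = {
    "Conv1D baseline": (0.31300050020217896, 0.8762800097465515),
    "Bidirectional LSTM": (0.290151983499527, 0.8795599937438965),
}

correct = {name: round(acc * test_size) for name, (_, acc) in results.items()}
for name, (loss, acc) in results.items():
    print(f"{name:<20} loss={loss:.4f}  accuracy={acc:.4f}  correct={correct[name]}")

gap = correct["Bidirectional LSTM"] - correct["Conv1D baseline"]
print(f"The bidirectional LSTM classifies {gap} more test reviews correctly.")
```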