___

<center><h1>IMDB Classification Movie Reviews - DL</h1></center>

___

<center><h2>DSM0150 - Neural Networks</h2></center><br>
<center><strong>Teacher:</strong> Tim Blackwell</center>

___
<p></p>
<center style="color: #AA6373; font-weight: 400;"><strong>Presented by:</strong></center>
<center style="color: #AA6373; font-weight: 400;">Jorge Forero L.</center>
<center style="color: #AA6373; font-weight: 400;">Student Number: 240323983</center>
<center style="color: #AA6373; font-weight: 400;">Student Portal Username: JEFL1</center>
<center>March 2025</center>
<p></p>
___

## 1. Introduction & Problem Statement
<p></p>
This project aims to classify movie reviews in the IMDB dataset as positive or negative using a fully connected (Dense) neural network architecture constrained by Dropout layers, following the universal workflow in Deep Learning with Python (Part 1). By systematically analyzing textual data, tuning hyperparameters, and applying targeted regularization, we seek to identify the primary factors affecting sentiment classification accuracy and mitigate overfitting. Specifically, we investigate how different configurations of layers, units, dropout rates, optimizers, and batch sizes influence model performance, and we examine the extent to which early stopping and dropout help stabilize validation accuracy.

Through this process, we develop a robust predictive model that addresses three core questions:

- how architectural choices and parameter tuning impact accuracy
- how dropout-based regularization reduces overfitting
- how these findings can guide best practices for sentiment analysis using only Dense and Dropout layers.

The complete analysis will be version-controlled and hosted on GitHub for easy access and collaboration. You can view and contribute to the project at the following URL: https://github.com/jforeroluque/DSM150_NeuralNetworks_CW1.

<p></p>


### Aims and Objectives
<p></p>
The primary aim of this project is to develop a high-performing sentiment classification model on the IMDB dataset, leveraging fully connected (Dense) layers and Dropout for regularization. We seek to accurately distinguish between positive and negative reviews while minimizing overfitting and maintaining model interpretability.
<p></p>

#### Objectives
<p></p>

1. Experiment with varying numbers of Dense layers, hidden units, dropout rates, and optimizers to strike an optimal balance between performance, generalizability, and training efficiency.
2. Identify both obvious and hidden patterns in the data, such as the distribution of review lengths, using visualization to reveal insights relevant to sentiment classification.
3. Examine how Dropout (and possibly early stopping or other forms of regularization) can mitigate overfitting, ensuring that the model generalizes well to unseen data.
4. Compare the tuned model’s accuracy and loss metrics to a common-sense baseline (50% accuracy) and simpler architectures, evaluating the real impact of regularization and hyperparameter tuning.
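
The 50% baseline in objective 4 follows from the class balance of the dataset: always predicting the most frequent class cannot do better than that class's share of the labels. A minimal sketch of this idea, using a toy balanced label list rather than the real data (the helper name `majority_baseline_accuracy` is hypothetical and not part of the notebook):

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels standing in for a balanced sentiment dataset (illustrative only)
toy_labels = [0, 1] * 50
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Applied to the training labels of a balanced dataset, the same function would return 0.5, which is the reference point any trained model has to beat.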

<p></p>

#### Ethical Considerations
<p></p>

**Transparency and Reproducibility**

All steps of data preprocessing, model design, and hyperparameter tuning have been documented to ensure that others can replicate and validate the findings. Code and plots are presented in a notebook format for clarity and reproducibility.

In this coursework we aim to improve on the results obtained in the previous one, where the hyperparameter search produced the following top 5 configurations:

- Layers=3, Units=32, Dropout=0.5, Optimizer=adam, BatchSize=1024 -> val_acc=0.8910
- Layers=2, Units=32, Dropout=0.2, Optimizer=rmsprop, BatchSize=1024 -> val_acc=0.8907
- Layers=3, Units=16, Dropout=0.5, Optimizer=adam, BatchSize=512 -> val_acc=0.8904
- Layers=2, Units=16, Dropout=0.2, Optimizer=adam, BatchSize=1024 -> val_acc=0.8892
- Layers=2, Units=16, Dropout=0.5, Optimizer=rmsprop, BatchSize=512 -> val_acc=0.8891

30/30 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9733 - loss: 0.0830 - val_accuracy: 0.8851 - val_loss: 0.5093


In [None]:
# Common Modules

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.datasets import imdb


## 2. Data Understanding [1]

In this phase, we collect, describe, and explore the IMDB movie review dataset to gain insights into its structure and primary attributes. The dataset includes 50,000 reviews, split evenly into training and test sets (25,000 each). Each review is encoded as a sequence of integer word indices, where each index is the rank of the word by frequency in the dataset. For our project, we keep only the 10,000 most frequent words to reduce sparsity and maintain manageable vector sizes. This IMDB data supports a quantitative evaluation of our hypotheses on how neural network architectures (specifically Dense and Dropout layers) can effectively classify sentiment, revealing which hyperparameters (such as the number of units, dropout rate, or optimizer) are most influential in achieving strong generalization performance.
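
To make the integer encoding concrete, the sketch below decodes a toy review back into words. It assumes the Keras convention that indices 0, 1 and 2 are reserved (padding, start of sequence, unknown word), so every word rank is shifted by 3. The `toy_word_index` dictionary and the `decode_review` helper are hypothetical stand-ins; in the notebook the real mapping would come from `imdb.get_word_index()`.

```python
# Toy word index (hypothetical); the real one comes from imdb.get_word_index()
toy_word_index = {"the": 1, "film": 2, "was": 3, "great": 4}

# Indices 0, 1 and 2 are reserved, so word ranks are shifted by 3 in the data
INDEX_OFFSET = 3


def decode_review(encoded, word_index):
    """Map encoded indices back to words; reserved or unknown ids become '?'."""
    reverse_index = {rank: word for word, rank in word_index.items()}
    return " ".join(reverse_index.get(i - INDEX_OFFSET, "?") for i in encoded)


encoded_review = [1, 4, 5, 6, 7, 2]  # start token, four words, unknown token
decoded = decode_review(encoded_review, toy_word_index)
print(decoded)
```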

### Limitations and Constraints of the Data

While the IMDB dataset is a valuable resource for exploring sentiment classification, it presents several limitations and constraints that may influence our analysis and model performance:

1. Although we limit the vocabulary to the top 10,000 words for practical modeling, this approach can exclude less frequent but potentially significant words. As a result, some nuances in language usage may be lost, potentially leading to an oversimplified representation of the full sentiment space.
2. Converting each review into a multi-hot vector of word occurrences does not preserve word order or context. This simplification may limit the model’s ability to leverage sequence-based nuances (e.g., negation words or sarcasm), though this trade-off is acceptable under the constraints of using only Dense layers.
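
The loss of word order described in point 2 can be seen with a small example. This is a sketch with made-up token ids and a tiny vocabulary of 10, not the real IMDB encoding; `multi_hot` is a hypothetical helper name.

```python
import numpy as np


def multi_hot(sequence, dimension):
    """Return a vector with 1.0 at every index that occurs in the sequence."""
    vector = np.zeros(dimension, dtype="float32")
    vector[sequence] = 1.0
    return vector


# Hypothetical token ids: 5 = "not", 6 = "good", 7 = "bad", 8 = "but"
review_a = [5, 6, 8, 7]  # "not good but bad"
review_b = [5, 7, 8, 6]  # "not bad but good"

vec_a = multi_hot(review_a, dimension=10)
vec_b = multi_hot(review_b, dimension=10)

# Both reviews contain the same words, so their encodings are identical
same_encoding = bool(np.array_equal(vec_a, vec_b))
print(same_encoding)
```

Two reviews with opposite meaning map to the same input vector, which is exactly the information a Dense-only model cannot recover.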

### Data Load

In [None]:
# Load the data, keeping only the 10,000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000) # Reference [2] page 95 DLWP

print(f"Training samples: {len(x_train)}")
print(f"Test samples: {len(x_test)}")

### Exploratory Data Analysis

**Sequence Length Analysis**

Each review is a list of integers (word indices). We can look at how many words each review contains to understand the distribution of sequence lengths.

In [None]:
review_lengths = [len(sequence) for sequence in x_train]

print(f"Minimum review length: {np.min(review_lengths)}")
print(f"Maximum review length: {np.max(review_lengths)}")
print(f"Average review length: {np.mean(review_lengths):.2f}")

plt.figure(figsize=(10,4))
plt.hist(review_lengths, bins=50, color='blue')
plt.title("Distribution of Review Lengths (Number of Words)")
plt.xlabel("Review Length")
plt.ylabel("Frequency")
plt.show()

#### Data Preprocessing

For the recurrent model in the next section, the integer-encoded reviews are kept as sequences and padded or truncated to a fixed length of 500 tokens. (The Dense models use multi-hot vectors of shape (10000,) instead.) Our approach is the following:

In [None]:
maxlen = 500
x_train_padded = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test_padded = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
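
The effect of padding can be illustrated without Keras. The sketch below mimics what `pad_sequences` does with its default settings (zeros are added at the front, and sequences that are too long lose their first tokens). The helper `pad_or_truncate` is a hypothetical pure-Python stand-in, not the library function itself.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Left-pad short sequences; keep only the last `maxlen` tokens of long ones."""
    if len(sequence) >= maxlen:
        return list(sequence[-maxlen:])
    return [value] * (maxlen - len(sequence)) + list(sequence)


short_review = [11, 12, 13]
long_review = [21, 22, 23, 24, 25, 26, 27]

padded_short = pad_or_truncate(short_review, maxlen=5)
padded_long = pad_or_truncate(long_review, maxlen=5)
print(padded_short)
print(padded_long)
```

With `maxlen=500`, reviews longer than 500 tokens are cut to their final 500 tokens, so the opening of a long review is not seen by the model.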

## 3. Recurrent Neural Network Model

Here we build a model that uses an Embedding layer to learn a dense representation for words, followed by bidirectional LSTM layers that capture sequence dependencies from both directions.

In [None]:
# RNN model architecture
model_rnn = keras.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128, input_length=maxlen),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True, dropout=0.5, recurrent_dropout=0.5)),
    layers.Bidirectional(layers.LSTM(32, dropout=0.2, recurrent_dropout=0.5)),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

In [None]:
model_rnn.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

In [None]:
# Display model architecture
model_rnn.summary()
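
As a cross-check on the summary, the number of trainable parameters can be worked out by hand. The sketch below uses the standard parameter-count formulas for Embedding, LSTM and Dense layers and assumes default layer settings (biases enabled, bidirectional outputs concatenated); the helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


embedding = embedding_params(10000, 128)
bilstm_1 = 2 * lstm_params(128, 64)      # forward and backward LSTM(64)
bilstm_2 = 2 * lstm_params(2 * 64, 32)   # input is the concatenated 2 x 64 output
output = dense_params(2 * 32, 1)         # input is the concatenated 2 x 32 output
total = embedding + bilstm_1 + bilstm_2 + output

print(f"Embedding: {embedding:,}")
print(f"BiLSTM 1:  {bilstm_1:,}")
print(f"BiLSTM 2:  {bilstm_2:,}")
print(f"Dense:     {output:,}")
print(f"Total:     {total:,}")
```

Most of the parameters sit in the Embedding layer, so the embedding dimension and vocabulary size have the largest effect on model size.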

In [None]:
# Split the padded sequences into training & validation sets
x_val_rnn = x_train_padded[:10000]
partial_x_train_rnn = x_train_padded[10000:]
y_val_rnn = y_train[:10000]
partial_y_train_rnn = y_train[10000:]

# Model training with callbacks

early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True) # Reference [2] page 97 DLWP

history_rnn = model_rnn.fit(
    partial_x_train_rnn,
    partial_y_train_rnn,
    epochs=10,              
    batch_size=1024,
    validation_data=(x_val_rnn, y_val_rnn),
    callbacks=[early_stopping]
)

In [None]:
# Evaluating the model on test data

test_loss, test_acc = model_rnn.evaluate(x_test_padded, y_test)
print("Test accuracy:", test_acc)

#### Plotting Training and Validation Metrics

In [None]:
history_dict = history_rnn.history
epochs = range(1, len(history_dict['loss']) + 1)

plt.figure(figsize=(12, 4))

In [None]:
# Plot loss
plt.subplot(1, 2, 1)
plt.plot(epochs, history_dict['loss'], 'bo-', label='Training Loss')
plt.plot(epochs, history_dict['val_loss'], 'ro-', label='Validation Loss')
plt.title('Training & Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()


In [None]:
# Plot accuracy
plt.subplot(1, 2, 2)
plt.plot(epochs, history_dict['accuracy'], 'bo-', label='Training Accuracy')
plt.plot(epochs, history_dict['val_accuracy'], 'ro-', label='Validation Accuracy')
plt.title('Training & Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()

**Signs of Overfitting**

We can see that the training loss continuously decreases, approaching zero, whereas the validation loss bottoms out around epochs 4-5 and then starts climbing. This is a classic sign that the network is "memorizing" the training data and losing its ability to generalize.

The training accuracy also keeps rising, surpassing 95% and eventually approaching 100%. However, the validation accuracy peaks around epochs 4-5 (roughly 88% to 90%) and then starts to decline gradually. This gap between training and validation performance widens with further training, again indicating overfitting.

**Remedies**

One straightforward approach is to monitor validation loss (or validation accuracy) and stop training as soon as it ceases to improve, in this case around epoch 4 or 5. This prevents the model from overfitting further.
Another approach is to use regularization techniques, such as L1 or L2 regularization, which penalize large weights, or dropout, which randomly "drops" units during training.
We can also consider reducing the number of layers or the number of hidden units, which can help the model generalize better by limiting its capacity to overfit.
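
The first remedy, stopping once validation loss stops improving, can be sketched in plain Python. This is a simplified illustration of the patience logic behind an early stopping callback, not the Keras implementation; the function name and the validation losses are hypothetical values chosen to resemble the curve described above.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of losses."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses: improving until epoch 4, then rising
toy_val_losses = [0.52, 0.38, 0.31, 0.29, 0.30, 0.34, 0.41]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=2)
print(f"Stopped after epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

Restoring the weights from the best epoch matters here: training continues for `patience` extra epochs, and without restoration the model would keep the slightly overfitted weights from the last epoch.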


## 4. Bigger Model Development

To see the effect of model capacity, we created a bigger model by increasing the number of layers and units so the model could learn training data faster:

In [None]:
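# Multi-hot encode the raw integer sequences so they match the (10000,) Dense input
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

x_train_vec = vectorize_sequences(x_train)
y_train_vec = np.asarray(y_train).astype("float32")

# Hold out the first 10,000 samples for validation
x_val = x_train_vec[:10000]
partial_x_train = x_train_vec[10000:]
y_val = y_train_vec[:10000]
partial_y_train = y_train_vec[10000:]

model_big = keras.Sequential([ # Reference [4]
    layers.Dense(64, activation='relu', input_shape=(10000,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

model_big.compile(
    optimizer='rmsprop',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

history_big = model_big.fit(
    partial_x_train,
    partial_y_train,
    epochs=20,
    batch_size=512,
    validation_data=(x_val, y_val)
)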

In [None]:
# Extracting history of the bigger model
history_dict = history_big.history
training_accuracy = history_dict['accuracy']
validation_accuracy = history_dict['val_accuracy']
training_loss = history_dict['loss']
validation_loss = history_dict['val_loss']

epochs_range = range(1, len(training_accuracy) + 1)

plt.figure(figsize=(12,4))

In [None]:
# Plot loss
plt.subplot(1,2,1)
plt.plot(epochs_range, training_loss, 'bo-', label='Training Loss')
plt.plot(epochs_range, validation_loss, 'ro-', label='Validation Loss')
plt.title('Training & Validation Loss (Bigger Model)')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

In [None]:
# Plot accuracy
plt.subplot(1,2,2)
plt.plot(epochs_range, training_accuracy, 'bo-', label='Training Accuracy')
plt.plot(epochs_range, validation_accuracy, 'ro-', label='Validation Accuracy')
plt.title('Training & Validation Accuracy (Bigger Model)')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()

**Signs of Overfitting**

With this bigger model we can see that the training loss again drops almost to zero by around epoch 10 and remains extremely low afterward. This indicates the model has learned to fit the training data very well. On the other hand, the validation loss initially goes down, around epochs 1-3, but starts to climb from about epoch 4 onward, eventually reaching values above 0.7. This divergence is a classic sign of overfitting, because the model keeps fitting the training set more closely but fails to generalize well to new, unseen data (the validation set).

The training accuracy shoots up quickly and plateaus around 98-100%, showing that the model again has almost "memorized" the training data. The validation accuracy improves initially (around epochs 1-3), then hovers around 85% range and even slightly decreases with more epochs. This widening gap between training accuracy (near-perfect) and validation accuracy (stalled in the mid-80s) further confirms overfitting.


## 4. Regularization Experiments

We will add Dropout layers between the Dense layers to mitigate the overfitting seen in the previous experiments.
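
Before building the model, it may help to see what a Dropout layer does to its input. The sketch below implements inverted dropout in NumPy: during training a random fraction `rate` of the activations is zeroed and the survivors are scaled by `1 / (1 - rate)`, while at inference the input passes through unchanged. The function name `inverted_dropout` is hypothetical, and this is an illustration of the technique rather than the Keras layer itself.

```python
import numpy as np


def inverted_dropout(activations, rate, rng, training=True):
    """Zero a random fraction of activations and rescale the survivors."""
    if not training or rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(seed=0)
activations = np.ones((4, 8), dtype="float32")

dropped = inverted_dropout(activations, rate=0.5, rng=rng)
unchanged = inverted_dropout(activations, rate=0.5, rng=rng, training=False)

# During training each value is either dropped (0.0) or scaled up (2.0)
print(np.unique(dropped))
# At inference the activations are passed through untouched
print(np.array_equal(unchanged, activations))
```

The rescaling keeps the expected activation the same in training and inference, which is why no adjustment is needed when the model is evaluated.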

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input

model_dropout = Sequential([
    Input(shape=(10000,)),           
    Dense(16, activation='relu'),   
    Dropout(0.5),
    Dense(16, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

In [None]:
model_dropout.compile(
    optimizer='rmsprop',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

history_dropout = model_dropout.fit(
    partial_x_train,
    partial_y_train,
    epochs=20,
    batch_size=512,
    validation_data=(x_val, y_val)
)

In [None]:
# Extracting history
history_dict = history_dropout.history
training_accuracy = history_dict['accuracy']
validation_accuracy = history_dict['val_accuracy']
training_loss = history_dict['loss']
validation_loss = history_dict['val_loss']

epochs_range = range(1, len(training_accuracy) + 1)

plt.figure(figsize=(12,4))

In [None]:
# Plot loss
plt.subplot(1,2,1)
plt.plot(epochs_range, training_loss, 'bo-', label='Training Loss')
plt.plot(epochs_range, validation_loss, 'ro-', label='Validation Loss')
plt.title('Training & Validation Loss (Dropout Model)')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

In [None]:
# Plot accuracy
plt.subplot(1,2,2)
plt.plot(epochs_range, training_accuracy, 'bo-', label='Training Accuracy')
plt.plot(epochs_range, validation_accuracy, 'ro-', label='Validation Accuracy')
plt.title('Training & Validation Accuracy (Dropout Model)')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()

## 5. Investigation of Hyperparameter Settings [5]

We can try to find the best hyperparameters manually with this Dropout model. This approach systematically tests different combinations of:
- Number of hidden units (16, 32, 64)
- Number of Dense layers (e.g., 2 or 3 hidden layers)
- Dropout rates (0.2, 0.5)
- Optimizers (rmsprop, adam)
- Batch sizes (128, 512, 1024)
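
The size of this search space is worth computing before training anything. The sketch below enumerates the grid listed above with `itertools.product`; the `grid` dictionary is an illustrative restatement of the list, not a variable from the notebook.

```python
from itertools import product

# Hyperparameter values taken from the list above
grid = {
    "layers": [2, 3],
    "units": [16, 32, 64],
    "dropout": [0.2, 0.5],
    "optimizer": ["rmsprop", "adam"],
    "batch_size": [128, 512, 1024],
}

# One dictionary per combination, in the same order as nested for-loops
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(f"Number of configurations: {len(combinations)}")
print(f"First configuration: {combinations[0]}")
```

The grid contains 2 x 3 x 2 x 2 x 3 = 72 configurations, so every extra epoch per configuration adds 72 epochs of training to the search.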


In [None]:
# Vocabulary size, which is also the length of each multi-hot input vector
num_words = 10000

# Start by defining a function to build a model with variable hyperparameters
def build_model(
    num_layers=2,      # number of hidden layers
    num_units=16,      # number of units per hidden layer
    dropout_rate=0.5,  # dropout rate
    optimizer='adam'   # optimizer: 'rmsprop' or 'adam'
):
    model = keras.Sequential()
    
    # Input + first hidden layer
    model.add(layers.Dense(num_units, activation='relu', input_shape=(num_words,)))
    model.add(layers.Dropout(dropout_rate))
    
    # Additional hidden layers
    for _ in range(num_layers - 1):
        model.add(layers.Dense(num_units, activation='relu'))
        model.add(layers.Dropout(dropout_rate))
    
    # Output layer
    model.add(layers.Dense(1, activation='sigmoid'))
    
    model.compile(
        optimizer=optimizer,
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model

In [None]:
# Then we define the hyperparameter grids
layers_list   = [2, 3]             
units_list    = [16, 32, 64]       
dropout_list  = [0.2, 0.5]         
optimizers    = ['rmsprop', 'adam']
batch_sizes   = [128, 512, 1024]

In [None]:
# We’ll store results here
results = []

In [None]:
# Loop over all combinations
for n_layers in layers_list:
    for units in units_list:
        for dropout_rate in dropout_list:
            for opt in optimizers:
                for bsz in batch_sizes:
                    
                    # Build a fresh model
                    model = build_model(
                        num_layers=n_layers,
                        num_units=units,
                        dropout_rate=dropout_rate,
                        optimizer=opt
                    )
                    
                    # Train briefly (just to compare)
                    history = model.fit(
                        partial_x_train,
                        partial_y_train,
                        epochs=7,           # short training for comparison
                        batch_size=bsz,
                        validation_data=(x_val, y_val),
                        verbose=0          # turn off training logs
                    )
                    
                    # Get final validation accuracy
                    final_val_acc = history.history['val_accuracy'][-1]
                    
                    # Store the hyperparams + result
                    results.append({
                        'layers': n_layers,
                        'units': units,
                        'dropout': dropout_rate,
                        'optimizer': opt,
                        'batch_size': bsz,
                        'val_acc': final_val_acc
                    })


In [None]:
# Sort results by validation accuracy (descending) for easy viewing
results_sorted = sorted(results, key=lambda x: x['val_acc'], reverse=True)

In [None]:
# Print the top 5 configurations
print("Top 5 configurations:")
for r in results_sorted[:5]:
    print(
        f"Layers={r['layers']}, "
        f"Units={r['units']}, "
        f"Dropout={r['dropout']}, "
        f"Optimizer={r['optimizer']}, "
        f"BatchSize={r['batch_size']} "
        f"-> val_acc={r['val_acc']:.4f}"
    )
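
Beyond ranking individual runs, averaging the validation accuracy per hyperparameter value gives a rough view of which settings matter. The sketch below does this for the optimizer, using only the five configurations reported in the introduction of this notebook; with so few rows the numbers are purely illustrative, and the helper `mean_val_acc_by` is a hypothetical name. In practice the same function could be applied to the full `results` list.

```python
from collections import defaultdict

# The five configurations reported in the introduction of this notebook
top_results = [
    {"layers": 3, "units": 32, "dropout": 0.5, "optimizer": "adam", "batch_size": 1024, "val_acc": 0.8910},
    {"layers": 2, "units": 32, "dropout": 0.2, "optimizer": "rmsprop", "batch_size": 1024, "val_acc": 0.8907},
    {"layers": 3, "units": 16, "dropout": 0.5, "optimizer": "adam", "batch_size": 512, "val_acc": 0.8904},
    {"layers": 2, "units": 16, "dropout": 0.2, "optimizer": "adam", "batch_size": 1024, "val_acc": 0.8892},
    {"layers": 2, "units": 16, "dropout": 0.5, "optimizer": "rmsprop", "batch_size": 512, "val_acc": 0.8891},
]


def mean_val_acc_by(results, key):
    """Average validation accuracy for each distinct value of one hyperparameter."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result[key]].append(result["val_acc"])
    return {value: sum(accs) / len(accs) for value, accs in grouped.items()}


by_optimizer = mean_val_acc_by(top_results, "optimizer")
for name, acc in sorted(by_optimizer.items()):
    print(f"{name}: {acc:.4f}")
```

The differences between these top runs are a few hundredths of a percentage point, so they are likely within run-to-run noise; a comparison over all configurations, ideally with repeated runs, would be needed before drawing conclusions.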

## 6. Recurrent Model CW2

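For this project, we replace the Dense-only network with a recurrent model: an Embedding layer followed by two bidirectional LSTM layers, trained with an early stopping callback.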

In [None]:
# ----------------------------
# 1. Data Loading and Preprocessing
# ----------------------------

num_words = 10000
maxlen = 500  # maximum sequence length used for padding

In [None]:
# Load and pad sequences
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=num_words)
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

In [None]:
# ----------------------------
# 2. Build the Recurrent Model
# ----------------------------
model = keras.Sequential([
    # Embedding layer converts integer indices into dense vectors
    layers.Embedding(input_dim=num_words, output_dim=128),
    
    # Bidirectional LSTM layers capture sequence information from both directions
    layers.Bidirectional(layers.LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)),
    layers.Bidirectional(layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2)),
    
    # Dropout for regularization
    layers.Dropout(0.5),
    
    # Output layer with sigmoid activation for binary classification
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()


In [None]:
# ----------------------------
# 3. Create Training and Validation Splits
# ----------------------------
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

In [None]:
# ----------------------------
# 4. Advanced Best Practice: EarlyStopping Callback
# ----------------------------
# The EarlyStopping callback will monitor the validation loss and stop training
# if it doesn't improve for 3 consecutive epochs, and it will restore the best model weights.
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

In [None]:
# ----------------------------
# 5. Train the Model with the Callback
# ----------------------------
history = model.fit(
    partial_x_train,
    partial_y_train,
    epochs=20,
    batch_size=128,
    validation_data=(x_val, y_val),
    callbacks=[early_stopping]
)


In [None]:
# ----------------------------
# 6. Evaluate the Model on Test Data
# ----------------------------
test_loss, test_acc = model.evaluate(x_test, y_test)
print("Test accuracy:", test_acc)

In [None]:
# ----------------------------
# 7. Plot Training and Validation Metrics
# ----------------------------
history_dict = history.history
epochs = range(1, len(history_dict['loss']) + 1)

plt.figure(figsize=(12, 5))

In [None]:
# Plot Loss
plt.subplot(1, 2, 1)
plt.plot(epochs, history_dict['loss'], 'bo-', label='Training Loss')
plt.plot(epochs, history_dict['val_loss'], 'ro-', label='Validation Loss')
plt.title('Training & Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

In [None]:
# Plot Accuracy
plt.subplot(1, 2, 2)
plt.plot(epochs, history_dict['accuracy'], 'bo-', label='Training Accuracy')
plt.plot(epochs, history_dict['val_accuracy'], 'ro-', label='Validation Accuracy')
plt.title('Training & Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.tight_layout()
plt.show()