# Movie Review Sentiment Prediction Project

- Date: 6.5.2025
- Chosen Corpus: Stanford Sentiment Treebank (SST-2)


### Corpus information

- Description of the chosen corpus: The Stanford Sentiment Treebank contains movie review sentences from Rotten Tomatoes. The reviews are annotated with binary sentiment labels: positive and negative.
- Paper and other published materials related to the corpus: Socher et al., 2013: Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
- State-of-the-art performance (best published results) on this corpus: T5-11B and MT-DNN-SMART, both with 97.5% accuracy. (Papers With Code)

---

## 1. Setup

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from datasets import load_dataset
import optuna
import random, numpy as np
import pandas as pd



---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [2]:
dataset = load_dataset("glue", "sst2")

### 2.2. Preprocessing

In [3]:

SEED = 2025
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
g = torch.Generator()
g.manual_seed(SEED)
texts = dataset["train"]["sentence"][:8000]
labels = dataset["train"]["label"][:8000]

X_train_texts, X_test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=SEED
)

X_val_texts = dataset["validation"]["sentence"][:800]
y_val = dataset["validation"]["label"][:800]

vectorizer = CountVectorizer(binary=True, max_features=10000)
X_train_vec = vectorizer.fit_transform(X_train_texts)
X_test_vec = vectorizer.transform(X_test_texts)
X_val_vec = vectorizer.transform(X_val_texts)

X_train_tensor = torch.tensor(X_train_vec.toarray(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
X_test_tensor = torch.tensor(X_test_vec.toarray(), dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)
X_val_tensor = torch.tensor(X_val_vec.toarray(), dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.long)

train_loader = DataLoader(TensorDataset(X_train_tensor, y_train_tensor), batch_size=32, shuffle=True, generator=g)
val_loader = DataLoader(TensorDataset(X_val_tensor, y_val_tensor), batch_size=32, generator=g)
test_loader = DataLoader(TensorDataset(X_test_tensor, y_test_tensor), batch_size=32, generator=g)
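
What `CountVectorizer(binary=True)` does to each sentence can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not the library's implementation: the helper names (`tokenize`, `fit_vocabulary`, `binary_bow`) are hypothetical, the regular expression mirrors the library's default token pattern, and the `max_features` cap is left out.

```python
import re

# Same shape as CountVectorizer's default token pattern: two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_RE.findall(text.lower())


def fit_vocabulary(texts):
    """Map each token seen in the training texts to a column index (sorted order)."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def binary_bow(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary token occurs in the text at all."""
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        idx = vocabulary.get(tok)
        if idx is not None:  # tokens unseen during fitting are dropped
            vector[idx] = 1
    return vector


# Toy sentences, not taken from the corpus.
toy_train = ["a great great movie", "a dull movie"]
toy_vocab = fit_vocabulary(toy_train)
toy_vector = binary_bow("great great unseen movie", toy_vocab)
print(toy_vocab)   # → {'dull': 0, 'great': 1, 'movie': 2}
print(toy_vector)  # → [0, 1, 1]
```

Repeated words count once and unseen words vanish, which is why the out-of-domain headlines in section 5 lose much of their content before the model sees them.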

---

## 3. Machine learning model

### 3.1. Model training

In [4]:
# Two hidden layers with different activations (tanh, then ReLU) add non-linearity.
# Dropout after each hidden layer reduces overfitting.

class DeepBoWClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim=256, hidden_dim2=128, output_dim=2, dropout=0.3):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.tanh1 = nn.Tanh()
        self.dropout1 = nn.Dropout(dropout)

        self.fc2 = nn.Linear(hidden_dim, hidden_dim2)
        self.relu2 = nn.ReLU()
        self.dropout2 = nn.Dropout(dropout)

        self.output = nn.Linear(hidden_dim2, output_dim)

    def forward(self, x):
        x = self.tanh1(self.fc1(x))
        x = self.dropout1(x)
        x = self.relu2(self.fc2(x))
        x = self.dropout2(x)
        return self.output(x)
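
To get a feeling for how "light" this network is, its trainable parameters can be counted by hand. This is a rough sketch with hypothetical helper names; it assumes the vocabulary reaches the `max_features=10000` cap, which the notebook does not confirm, and uses the default layer sizes from the class definition.

```python
def linear_params(n_in, n_out):
    """Parameters of one fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def count_params(input_dim, hidden_dim=256, hidden_dim2=128, output_dim=2):
    """Total trainable parameters of the three Linear layers in the classifier."""
    return (
        linear_params(input_dim, hidden_dim)
        + linear_params(hidden_dim, hidden_dim2)
        + linear_params(hidden_dim2, output_dim)
    )


# Assumption: input_dim equals the max_features cap of 10000.
total = count_params(input_dim=10000)
first_layer = linear_params(10000, 256)
print(total)        # → 2593410
print(first_layer)  # → 2560256
```

Tanh, ReLU and dropout add no parameters, so under this assumption almost all of the weights sit in the first layer, which maps the bag-of-words vector to the first hidden representation.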

### 3.2 Hyperparameter optimization

In [5]:

device = torch.device("cpu")

sampler = optuna.samplers.TPESampler(seed=SEED)
study = optuna.create_study(direction="maximize", sampler=sampler)


def objective(trial):
    # hyperparameters
    hidden_dim = trial.suggest_int("hidden_dim", 64, 512)
    hidden_dim2 = trial.suggest_int("hidden_dim2", 32, 256)
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)

    model = DeepBoWClassifier(
        input_dim=X_train_tensor.shape[1],
        hidden_dim=hidden_dim,
        hidden_dim2=hidden_dim2,
        dropout=dropout
    ).to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Training
    epochs = 3
    for epoch in range(epochs):
        model.train()
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()

    # Validation
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for X_val_batch, y_val_batch in val_loader:
            X_val_batch, y_val_batch = X_val_batch.to(device), y_val_batch.to(device)
            preds = torch.argmax(model(X_val_batch), dim=1)
            correct += (preds == y_val_batch).sum().item()
            total += y_val_batch.size(0)

    accuracy = correct / total
    return accuracy


# Run the search

study.optimize(objective, n_trials=20)

print("Best trial:")
print(f"  Accuracy: {study.best_value:.4f}")
print(f"  Params: {study.best_params}")

[I 2025-08-29 11:45:09,752] A new study created in memory with name: no-name-f83354a3-16d1-480a-a23f-5ce3064463ab
[I 2025-08-29 11:45:17,167] Trial 0 finished with value: 0.7725 and parameters: {'hidden_dim': 124, 'hidden_dim2': 231, 'dropout': 0.47304225595460103, 'lr': 0.0007782808205273017}. Best is trial 0 with value: 0.7725.
[I 2025-08-29 11:45:22,180] Trial 1 finished with value: 0.75125 and parameters: {'hidden_dim': 238, 'hidden_dim2': 89, 'dropout': 0.3629470341884151, 'lr': 0.0009665712540273478}. Best is trial 0 with value: 0.7725.
[I 2025-08-29 11:45:32,601] Trial 2 finished with value: 0.7625 and parameters: {'hidden_dim': 496, 'hidden_dim2': 212, 'dropout': 0.28208211408382994, 'lr': 0.004000517487488018}. Best is trial 0 with value: 0.7725.
[I 2025-08-29 11:45:34,733] Trial 3 finished with value: 0.76875 and parameters: {'hidden_dim': 82, 'hidden_dim2': 205, 'dropout': 0.10126844670211771, 'lr': 0.00038514014014473015}. Best is trial 0 with value: 0.7725.
[... remaining trial logs truncated ...]

Best trial:
  Accuracy: 0.7875
  Params: {'hidden_dim': 409, 'hidden_dim2': 146, 'dropout': 0.37794498762948087, 'lr': 0.00010236603042832742}
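
The learning rate is searched with `log=True`, which means the range from 1e-4 to 1e-2 is treated on a logarithmic scale. The sketch below only illustrates that idea with the standard library; it is not Optuna's TPE sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(2025)  # seed reused from the notebook for repeatability
samples = [sample_log_uniform(1e-4, 1e-2, rng) for _ in range(10_000)]

# With log-uniform draws about half the samples fall below the geometric
# midpoint 1e-3; with plain uniform draws only about 9% would.
share_below_midpoint = sum(s < 1e-3 for s in samples) / len(samples)
print(f"{share_below_midpoint:.2f}")
```

A log scale gives small learning rates such as the best trial's value near 1e-4 a fair chance of being tried, instead of concentrating the search near 1e-2.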


### 3.3. Evaluation on test set

In [6]:
# Extract the best hyperparameters
best_params = study.best_params

# Define the best model
best_model = DeepBoWClassifier(
    input_dim=X_train_tensor.shape[1],
    hidden_dim=best_params["hidden_dim"],
    hidden_dim2=best_params["hidden_dim2"],
    dropout=best_params["dropout"]
).to(device)

# Loss function
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(best_model.parameters(), lr=best_params["lr"])

# Training: 5 epochs here, whereas each search trial above used 3
epochs = 5
for epoch in range(epochs):
    best_model.train()
    total_loss = 0  # sum of per-batch losses over the epoch (not an average)
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        optimizer.zero_grad()
        outputs = best_model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}/{epochs} - Loss: {total_loss:.4f}")

# Evaluation with the test data
def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # evaluation only: skipping gradient tracking saves memory and time
        for X, y in loader:
            X, y = X.to(device), y.to(device)
            preds = torch.argmax(model(X), dim=1)
            correct += (preds == y).sum().item()
            total += y.size(0)
    return correct / total

test_accuracy = evaluate(best_model, test_loader)
print(f" The Best model's accuracy: {test_accuracy:.4f}")

Epoch 1/5 - Loss: 152.8010
Epoch 2/5 - Loss: 112.8779
Epoch 3/5 - Loss: 64.9765
Epoch 4/5 - Loss: 42.4331
Epoch 5/5 - Loss: 29.4864
 The Best model's accuracy: 0.8337


---

## 4. Results and summary

### 4.1 Corpus insights

The SST-2 dataset is a binary sentiment classification task based on movie reviews. The sentences vary in length, typically between 5 and 40 words, and contain everyday language as well as film-specific terminology. The full training set is large (about 67,000 examples), so I subsampled the first 8,000 and held out 10% of them (800 examples) as a test set. The official test set has no public labels, which is why a separate test split was needed. For validation I used the first 800 sentences of the official validation set.
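
The length range quoted above was not computed in the notebook. A minimal sketch of how it could be checked is given here; `length_summary` is a hypothetical helper, word counts are based on whitespace splitting, and the sentences are toy stand-ins rather than corpus data.

```python
from statistics import mean, median


def length_summary(sentences):
    """Whitespace-token counts: minimum, median, mean, and maximum."""
    lengths = [len(s.split()) for s in sentences]
    return {
        "min": min(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
    }


# Toy stand-ins; in the notebook one would pass X_train_texts or X_val_texts.
toy_sentences = [
    "a charming and often affecting journey",
    "too slow",
    "the plot never quite comes together in the final act",
]
summary = length_summary(toy_sentences)
print(summary)
```

Running it separately on the training and validation texts would show whether both splits really share the same length profile.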

### 4.2 Results

I ran 20 Optuna trials to optimize the hyperparameters (hidden_dim, hidden_dim2, dropout and learning rate). The best values found were hidden_dim = 409, hidden_dim2 = 146, dropout = 0.378 and learning_rate = 1.024×10⁻⁴. This gave a validation accuracy of 78.75% after 3 training epochs. I then retrained a model with these values for 5 epochs and evaluated it on the test set, where it reached 83.37% accuracy.
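
Because the test set holds only 800 examples, the reported accuracy carries some sampling noise. The sketch below estimates it with a normal-approximation interval; this is an added illustration, not part of the original analysis, and `accuracy_interval` is a hypothetical helper.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n examples."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * std_err, accuracy + z * std_err


# 800 held-out examples (10% of the 8000 subsample), accuracy from the test run.
low, high = accuracy_interval(0.8337, 800)
print(f"{low:.4f} to {high:.4f}")
```

The interval is roughly 2.6 percentage points on either side of the measured value, so differences of a point or two between runs should not be over-interpreted.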

### 4.3 Relation to state of the art

The best performing models on SST-2 reach 95% accuracy or more, with the best published result at 97.5% (see the corpus information above). Compared to that, there is clearly room for improvement, but considering how simple and light the model is, the results are reasonable. It at least provides a solid baseline.


---

## 5. Bonus Task

### 5.1. Annotating out-of-domain documents

I chose 50 news headlines on various topics. Each headline was annotated according to whether its topic was positive or negative. I tried to choose headlines for which such labeling was possible at all.

### 5.2 Conversion into dataset

In [7]:
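# bonus_annotations.csv is written by the cell in section 5.5, so run that cell first.
bonus_df = pd.read_csv("bonus_annotations.csv")
texts_bonus = bonus_df["sentence"].tolist()
labels_bonus = bonus_df["label"].tolist()

# Reuse the vocabulary fitted on SST-2; words unseen in training are ignored.
X_bonus_vec = vectorizer.transform(texts_bonus)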
X_bonus_tensor = torch.tensor(X_bonus_vec.toarray(), dtype=torch.float32)
y_bonus_tensor = torch.tensor(labels_bonus, dtype=torch.long)

bonus_dataset = TensorDataset(X_bonus_tensor, y_bonus_tensor)
bonus_loader  = DataLoader(bonus_dataset, batch_size=32, shuffle=False)

### 5.3. Model evaluation on out-of-domain test set

In [8]:
bonus_acc = evaluate(best_model, bonus_loader)
print(f"Out‐of‐domain accuracy: {bonus_acc:.4f}")

Out‐of‐domain accuracy: 0.6200


### 5.4 Bonus task results

The model's accuracy on the news headlines was 62%, which is not far from a coin toss. This is understandable, since the sentiment of many headlines depends on context rather than on individual words. For example, "massive" usually signals a positive review in the movie domain, but a headline can read something like "massive invasion to ... led to ...". News is also harder to label, since headlines are usually written from a neutral point of view.
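
A coin toss is one reference point; another is the accuracy of always predicting the most frequent label. The sketch below computes that baseline from the label counts in the annotated list in section 5.5 (29 negative, 21 positive). `majority_baseline` is a hypothetical helper and not part of the original notebook.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Label counts taken from the 50 annotated headlines: 29 negative, 21 positive.
headline_labels = [0] * 29 + [1] * 21
baseline = majority_baseline(headline_labels)
print(f"{baseline:.2f}")  # → 0.58
```

Against this baseline, 62% corresponds to 31 correct headlines out of 50, only two more than always answering "negative".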

### 5.5. Annotated data

In [9]:
data = [
    ("‘Desperate, traumatised people’: Gaza faces wave of looting, theft and violence", 0),
    ("Pakistan decries ‘act of war’ as it retaliates against India missile attack", 0),
    ("Weather tracker: Deadly storms in India and huge hailstones in Paris", 0),
    ("Simone Inzaghi hails Inter for beating ‘best two sides in Europe’ on way to final", 1),
    ("Huma Bhabha review – ‘Giacometti is a foil to her flamboyance. She is today’s Picasso’", 1),
    ("‘It’ll be solemn, enshrining his ashes’: statue of Lemmy to be unveiled in his home town of Stoke-on-Trent", 1),
    ("Russian drone strike caused tens of millions worth of damage to Chornobyl", 0),
    ("Mirrors, caddies and skinny shelves: 12 space-saving tricks to make small rooms feel bigger", 1),
    ("US and China to start talks over trade war this week", 1),
    ("'It's hard to watch' - Solskjaer discusses Man Utd woes", 0),
    ("Ronaldo's son gets first Portugal Under-15s call-up", 1),
    ("EU plans to end Russian gas imports by end of 2027", 1),
    ("Russia accuses Ukraine of drone attack on Moscow days before WW2 parade", 0),
    ("Home of Ukrainian Eurovision contestant destroyed", 0),
    ("Smokey Robinson accused of sexual assault by four women", 0),
    ("Hours before possible ceasefire begins, Russia and Ukraine launch attacks with two killed in Kyiv", 0),
    ("OpenAI says non-profit will remain in control after backlash", 1),
    ("The people refusing to use AI", 0),
    ("Trump criticised after posting AI image of himself as Pope", 0),
    ("'It doesn't stick to the rules': The reason Sinners has become a true box-office sensation", 1),
    ("India Strikes Pakistan but Is Said to Have Lost Jets", 0),
    ("Welcome to Reno, the Mighty Mecca of All-You-Can-Eat Sushi", 1),
    ("Live Updates: Conclave to Elect New Pope Is Set to Begin", 1),
    ("Gazans Despair After Israel Announces More Displacement", 0),
    ("A Half-Ton Spacecraft Lost by the Soviets in 1972 Is Coming Home", 1),
    ("A Mother and Father Were Deported. What Happened to Their Toddler?", 0),
    ("The New York Nonprofit Where Generations of Artists Got Their Start", 1),
    ("National African American Museum Faces Uncertainty Without Its Leader", 0),
    ("‘Ragtime’ Is Returning to Broadway", 1),
    ("‘Forbidden Games’: A War Orphan’s Sweet, Ultimately Shattering Story", 0),
    ("India strikes Pakistan to avenge a terrorist attack", 0),
    ("America may be just weeks away from a mighty economic shock", 0),
    ("AI models could help negotiators secure peace deals", 1),
    ("Trump’s Ukraine ceasefire is slipping away", 0),
    ("Chinese military exercises foreshadow a blockade of Taiwan", 0),
    ("American tariffs are starting to hammer Chinese exporters", 0),
    ("Russian inflation is too high. Does that matter?", 0),
    ("Three charts show that America’s imports are booming", 1),
    ("How a dollar crisis would unfold", 0),
    ("The success of Ivory Coast is Africa’s best-kept secret", 1),
    ("For media companies, news is becoming a toxic asset", 0),
    ("How Donald Trump might steal Christmas", 0),
    ("A new way to recycle plastic is here", 1),
    ("Trump’s tariffs will pummel Vietnam", 0),
    ("America is at risk of a Trumpian economic slowdown", 0),
    ("Narendra Modi is struggling to boost Indian growth", 0),
    ("Economic bright spots are getting harder to find in Thailand", 0),
    ("Snooker targets Brisbane 2032 Olympics to capitalise on Zhao world championship win", 1),
    ("‘It means everything’: how Union Berlin Women completed epic journey to the top", 1),
    ("Norwegian fan trades five kilos of fish for ticket to Bodø/Glimt v Tottenham", 1),
]
df = pd.DataFrame(data, columns=["sentence", "label"])
df.to_csv("bonus_annotations.csv", index=False)