<a href="https://colab.research.google.com/github/IAT-ExploringAI-2025/NLP-Week1-Text-Classification/blob/main/movie_review_sentiment_classifier_with_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Hello, I'm **Wesley**, nice to meet you! 👋

I was just reading the IMDb reviews of [*The Super Mario Bros. Movie*](https://www.imdb.com/title/tt6718170/) and thought: why don't we make a **Sentiment Classifier** to categorize movie reviews? **WARNING: Spoilers ahead.**


Here we will apply [transfer learning](https://en.wikipedia.org/wiki/Transfer_learning) to BERT [(blog)](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) [(paper)](https://arxiv.org/abs/1810.04805v2), specifically the smaller `distilbert-base-uncased` checkpoint, and fine-tune it on an IMDb dataset to make a sentiment classifier for movie reviews.

# Setup Python Libraries (pip)

In [1]:
#install some Python packages with pip

!pip install numpy torch datasets transformers evaluate --quiet

In [2]:
# let's check the version we are using

!pip freeze | grep -E '^numpy|^torch|^datasets|^transformers|^evaluate'

datasets==4.0.0
evaluate==0.4.6
numpy==2.0.2
torch==2.8.0+cu126
torchao==0.10.0
torchaudio==2.8.0+cu126
torchdata==0.11.0
torchsummary==1.5.1
torchtune==0.6.1
torchvision==0.23.0+cu126
transformers==4.56.2


# Create IMDB Dataset for Fine-tuning BERT

## Let's load the IMDB Dataset

In [3]:
from datasets import load_dataset

# let's load the imdb dataset from huggingface
# source: (https://huggingface.co/datasets/imdb)

raw_dataset = load_dataset('imdb')
raw_dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## Let's create the train, validation, test sets

In [4]:
# get train and validation set

dataset = raw_dataset['train'].train_test_split(test_size=0.2, seed=42, shuffle=True)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

In [5]:
# copy the held-out split to a 'val' key so it can serve as our validation set

dataset['val'] = dataset['test']
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

In [6]:
# copy test set from raw_dataset

dataset['test'] = raw_dataset['test']
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})
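To make the split sizes concrete, here is a minimal pure-Python sketch of a seeded 80/20 shuffle split. The helper `toy_train_val_split` is hypothetical and only mimics the idea; it will not reproduce the exact rows that `train_test_split` selects, because the `datasets` library uses its own random generator.

```python
import random


def toy_train_val_split(n_rows, val_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then carve off a validation slice."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable
    n_val = round(n_rows * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = toy_train_val_split(25_000)
print(len(train_idx), len(val_idx))
```

Fixing the seed is what lets us rerun the notebook and land on the same train and validation rows each time.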

## We start by tokenizing our dataset with BERT's Fast Tokenizer

In [7]:
# let's import the pretrained fast tokenizer from huggingface
# source: (https://huggingface.co/distilbert-base-uncased)

from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [8]:
# tokenize the text in batches: truncate to the model's maximum length (512 tokens)
# and pad each batch to its longest sequence

def tokenization(example):
    return tokenizer(example['text'], truncation=True, padding=True)

tokenized_dataset = dataset.map(tokenization, batched=True, remove_columns=['text'])
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    val: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
})
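If you are wondering what `input_ids` and `attention_mask` look like after truncation and padding, here is a toy sketch. It splits on whitespace and uses a made-up vocabulary (`TOY_VOCAB` and `toy_encode` are hypothetical), so it is not the WordPiece algorithm the real tokenizer runs. Only the special-token ids are taken from the tokenizer printout above.

```python
# Toy illustration only: whitespace "tokens" and a made-up vocabulary,
# not the WordPiece vocabulary that the real tokenizer uses.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102  # special-token ids from the printout above
TOY_VOCAB = {"great": 1000, "movie": 1001, "boring": 1002, "plot": 1003}  # hypothetical ids


def toy_encode(texts, max_length=6):
    """Encode texts, truncate to max_length, then pad to the longest sequence."""
    encoded = []
    for text in texts:
        ids = [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]
        ids = ids[: max_length - 2]  # truncation: leave room for [CLS] and [SEP]
        encoded.append([CLS_ID] + ids + [SEP_ID])

    longest = max(len(ids) for ids in encoded)
    input_ids, attention_mask = [], []
    for ids in encoded:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)             # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_encode(["great movie", "boring plot and a boring movie overall"])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The short review gets `[PAD]` ids and zeros in its attention mask, while the long one is cut down to fit `max_length`.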

# Setup Training Metrics (Accuracy, F1)

In [9]:
import evaluate
import numpy as np

# we setup the training to evaluate the accuracy and f1 scores

accuracy_metric = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels)
    return {**accuracy, **f1}
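To see what these two metrics measure, here is a hand-rolled sketch in plain Python. The logits and labels are made up, the helper `accuracy_and_f1` is hypothetical, and I am assuming F1 is computed for the positive class (label 1), which as far as I know is the default for the `f1` metric.

```python
def argmax(row):
    """Index of the largest value in a list (the predicted class id)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy plus binary F1 for the positive class, computed by hand."""
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    accuracy = sum(p == y for p, y in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


toy_logits = [[2.0, -1.0], [0.1, 0.9], [0.3, 0.2], [-0.5, 1.5]]  # made-up scores
toy_labels = [0, 1, 1, 0]
metrics = accuracy_and_f1(toy_logits, toy_labels)
print(metrics)
```

Accuracy counts every correct prediction, while F1 only looks at how well we find the positive reviews, so the two can disagree.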

# Setup Training Configurations

In [25]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# get the DistilBERT model with a sequence classification head for sentiment analysis
# source: (https://huggingface.co/distilbert-base-uncased)
checkpoint = 'distilbert-base-uncased'
num_labels = 2
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
label2id = {'NEGATIVE': 0, 'POSITIVE': 1}
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels, id2label=id2label, label2id=label2id)

# setup custom training arguments
# 1. store training checkpoints in the 'results' output directory
# 2. fine-tune for just 1 epoch
# 3,4. use 16 as the train and eval batch size to speed things up
# 5. evaluate on the validation set and save a checkpoint every 500 steps
# 6. load the best checkpoint (lowest validation loss) at the end of training
# 7. disable wandb and other logging integrations (replaces the deprecated WANDB_DISABLED variable)
training_args = TrainingArguments(
    seed=42,
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_strategy='steps',
    eval_steps=500,
    save_strategy='steps',
    save_steps=500,
    load_best_model_at_end=True,
    report_to='none',
)

# setup trainer with custom metrics (accuracy, f1)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['val'],
    compute_metrics=compute_metrics,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Evaluate UnFine-Tuned BERT on Test Set for a Baseline Metric


In [None]:
# let's first evaluate unfine-tuned model with test set

trainer.evaluate(tokenized_dataset['test'])

{'eval_loss': 0.6972731947898865,
 'eval_model_preparation_time': 0.0028,
 'eval_accuracy': 0.4988,
 'eval_f1': 0.6642550911039657,
 'eval_runtime': 389.6548,
 'eval_samples_per_second': 64.159,
 'eval_steps_per_second': 4.011}

Without fine-tuning BERT, our model currently has around **50% Accuracy (eval_accuracy)** and **66% F1 (eval_f1)**. That is no better than random guessing, since the test dataset has around 50% positive and 50% negative reviews. 😕
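One plausible reading of the evaluation output above (`eval_accuracy` of 0.4988 next to `eval_f1` of about 0.664) is that the untrained classification head labels almost every review as positive. I have not inspected the predictions, so treat this as a guess. The sketch below works out the metrics for a classifier that always answers POSITIVE, assuming the test split is exactly 12,500 positive and 12,500 negative reviews.

```python
# Assumption: a perfectly balanced test split of 25,000 reviews.
n_pos, n_neg = 12_500, 12_500

# A constant classifier that predicts POSITIVE for every review.
tp, fp, fn = n_pos, n_neg, 0

accuracy = tp / (n_pos + n_neg)   # only the positive half is correct
precision = tp / (tp + fp)        # half of the predicted positives are right
recall = tp / (tp + fn)           # every true positive is found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} f1={f1:.4f}")
```

This lands very close to the numbers the untrained model reports, which is why an F1 in the sixties is not a sign of skill here.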


Let's make it better with transfer learning! 🦾

# Fine-Tune BERT with IMDb Dataset

In [None]:
# let's fine-tune BERT with the IMDb dataset

trainer.train()

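| Step | Training Loss | Validation Loss | Model Preparation Time | Accuracy | F1 |
|------|---------------|-----------------|------------------------|----------|----------|
| 500  | 0.3424        | 0.285967        | 0.0028                 | 0.9016   | 0.900081 |
| 1000 | 0.2415        | 0.229587        | 0.0028                 | 0.9172   | 0.918343 |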


TrainOutput(global_step=1250, training_loss=0.27448213500976565, metrics={'train_runtime': 1203.8007, 'train_samples_per_second': 16.614, 'train_steps_per_second': 1.038, 'total_flos': 2649347973120000.0, 'train_loss': 0.27448213500976565, 'epoch': 1.0})
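The `global_step=1250` in the output follows from the split size and batch size. Here is the arithmetic as a quick sketch, assuming a single device and no gradient accumulation (neither is configured in this notebook).

```python
import math

train_examples = 20_000  # rows in our train split
batch_size = 16          # per_device_train_batch_size
epochs = 1               # matches 'epoch': 1.0 in the TrainOutput
eval_every = 500         # evaluation interval in steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(total_steps, eval_points)
```

That is also why the training log has exactly two evaluation rows: training ends before a third evaluation at step 1500 would happen.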

In [None]:
# let's see how well it did in the test set

trainer.evaluate(tokenized_dataset['test'])

{'eval_loss': 0.21517841517925262,
 'eval_model_preparation_time': 0.0028,
 'eval_accuracy': 0.92296,
 'eval_f1': 0.9246596776717259,
 'eval_runtime': 393.8639,
 'eval_samples_per_second': 63.474,
 'eval_steps_per_second': 3.968,
 'epoch': 1.0}

**WOAH!** We got **92% Accuracy (eval_accuracy)** and **92% F1 (eval_f1)** with just **1 epoch**! 🤯

# Try out some examples!

In [None]:
from transformers import pipeline
import torch

# use the first GPU if one is available, otherwise fall back to the CPU (-1)
device = 0 if torch.cuda.is_available() else -1

# create pipeline for sentiment classifier with custom model and tokenizer
sentiment_classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=tokenizer, device=device)

In [None]:
# let's see how our model classifies a good review
# this is from 'justinvitelli' (https://www.imdb.com/review/rw8972952)

review = """
First off this movie is for kids and fans of Nintendo and the Mario franchise.
I still think an adult who isnt a fan could still enjoy it but this movie is so
full of fan service that it will have you smiling the whole time.
The voice acting I was skeptical but they all work and work well too.
Jack Black is the star here. I love how they kept the story simple like all of the games.
Truly felt like a video game on screen.
This movie felt like a beautifully animated amusement park ride.
The audio in the movie was amazing too.
The sounds and the score with reimagined iconic music was perfect.
Some of the songs in the movie felt unnecessary but they worked.
I think they should've bumped the run time to 105-120 min.
90 min felt too short as it goes by quick.
I havent had this much wholesome fun at the movies in a long time.
If youre a fan you HAVE to see it.
"""
sentiment_classifier(review)

[{'label': 'POSITIVE', 'score': 0.9938808679580688}]

That is **99% POSITIVE**! *justinvitelli* loves the movie!
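Where does that score come from? As far as I know, for a two-label model the pipeline applies a softmax to the model's two logits and reports the larger probability. Here is a small sketch with made-up logits (not the real ones for this review).

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}  # same mapping we gave the model
toy_logits = [-2.5, 2.6]                   # made-up [NEGATIVE, POSITIVE] scores

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {'label': id2label[best], 'score': probs[best]}
print(result)
```

A gap of only a few points between the two logits is already enough to push the score above 99%.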

In [None]:
# let's see how our model classifies a bad review
# this is from 'industriousbug16' (https://www.imdb.com/review/rw8998214)

review = """
Flat, visual noise.
Fundamentally incurious. Potentially injurious.
The mystique generated by the characters in the games is here raked over and presented
haphazardly by hacks.
A hobbled attempt to explain a long and random evolution of characters who were never meant
to be narratised fails.
Doing it well is near impossible when you insist on EVERY LITTLE BIT OF LORE,
from the last forty years being shoehorned into 90 minutes.
Makes little sense, shamelessly leans on member berries to stimulate older viewers but offers
nothing else.
I feel sad for the animators who did a sterling job, but to no end as this movie has no soul.
"""
sentiment_classifier(review)

[{'label': 'NEGATIVE', 'score': 0.9951890707015991}]

That is **99% NEGATIVE**! *industriousbug16* must really hate the movie.

# Resources

### If you would like to use this model without running the entire notebook, try it on my [HuggingFace](https://huggingface.co/wesleyacheng/movie-review-sentiment-classifier-with-bert).

### If you would like to get this on GitHub, here's my [repo](https://github.com/wesleyacheng/movie-review-sentiment-classifier-with-bert).