Lecture: AI I - Advanced 

Previous:
[**Chapter 4.2.1: Transformer with GPT2**](../02_nlp/01_gpt2.ipynb)

---

# Chapter 4.2.2: Text Classification & Sentiment Analysis

In the Transformers excursion, you learned the inner workings of these models—how embeddings turn words into vectors, how attention lets every token attend to every other token, and how decoder-only models like GPT-2 generate text one token at a time. You even controlled GPT-2's output with temperature and sampling parameters.

Now it's time to put a Transformer to work on a concrete task: text classification. Specifically, we'll train a model to decide whether a movie review is positive or negative—a task called sentiment analysis. This is one of the most fundamental NLP tasks and a perfect entry point into fine-tuning pre-trained language models.

Instead of training a Transformer from scratch (which would require billions of examples and weeks of GPU time), we'll use a pre-trained model called DistilBERT and adapt it to our specific task. This approach—called fine-tuning—is how virtually all modern NLP applications are built.

**Text classification** assigns a label to a piece of text. The label could be a sentiment (positive/negative), a topic (sports/politics/tech), a language, or anything else. What makes it a classification problem rather than a generation problem is that the output is one of a fixed set of categories—not free-form text.

**Sentiment analysis** is classification applied to opinions. Given a movie review like "The acting was brilliant but the plot made no sense", the model should decide: is the overall sentiment positive or negative? This requires more than keyword matching. The model needs to understand sarcasm, weigh conflicting opinions, and grasp the overall tone—exactly the kind of contextual understanding that Transformer models excel at.
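
To see why keyword matching falls short, here is a minimal sketch of a keyword-counting baseline. The word lists and the function name `keyword_sentiment` are hypothetical, chosen only for this illustration. The classifier labels the mixed review above as positive because it never "reads" the phrase "made no sense", and it is fooled by a simple negation.

```python
# Hypothetical word lists, chosen only for this illustration.
POSITIVE_WORDS = {"brilliant", "great", "wonderful", "enjoyable"}
NEGATIVE_WORDS = {"boring", "terrible", "awful", "dull"}


def keyword_sentiment(review: str) -> str:
    """Label a review by counting positive and negative keywords."""
    words = [w.strip(".,!?\"'").lower() for w in review.split()]
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    if positives > negatives:
        return "POSITIVE"
    if negatives > positives:
        return "NEGATIVE"
    return "UNDECIDED"


# The criticism "made no sense" contains no listed keyword, so it is ignored.
mixed = keyword_sentiment("The acting was brilliant but the plot made no sense")

# Negation is invisible to a bag of keywords: "not great" still counts as praise.
negated = keyword_sentiment("This movie was not great at all")

print(mixed, negated)
```

Both calls return `POSITIVE`, although the second review is clearly negative. A contextual model can, in principle, take word order and negation into account.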

## Why DistilBERT?

BERT (Bidirectional Encoder Representations from Transformers) was a breakthrough: an encoder-only Transformer pre-trained on massive text corpora that could be fine-tuned for downstream tasks with remarkably little task-specific data. But BERT is large—110 million parameters for the base model—and slow at inference.

DistilBERT is a distilled (compressed) version of BERT. It was created using knowledge distillation: a smaller "student" model was trained to mimic the behavior of the larger "teacher" model (BERT). The student is still trained on text, but in addition to predicting masked words it learns to match BERT's full output distributions, which are a richer training signal than hard labels alone.
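
The idea of learning from a teacher's outputs can be sketched as a loss that compares two probability distributions. The snippet below is a simplified illustration, not DistilBERT's actual training code: it covers only a soft-target term, and the temperature of 2.0 and all logit values are assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))


teacher = [4.0, 1.0, 0.5]        # hypothetical teacher logits over three candidate tokens
good_student = [3.8, 1.1, 0.4]   # close to the teacher
poor_student = [0.5, 4.0, 1.0]   # prefers a different token

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
print(f"good student: {loss_good:.4f}, poor student: {loss_poor:.4f}")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it pushes the student toward the teacher's behavior.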

The result: DistilBERT retains about 97% of BERT's performance on most tasks while being 40% smaller and 60% faster. For a course environment where GPU time matters, it's the ideal choice.

We use `distilbert-base-cased`, the cased variant, meaning it distinguishes between "The" and "the". Casing carries information (for example, it separates a proper noun such as "Apple" from the common noun "apple"), so preserving it can help the model understand meaning.

## The IMDB Movie Reviews Dataset

The IMDB dataset is hosted on the Hugging Face Hub and provides a clean, balanced train/test split for supervised learning:

| Split | Samples | Labels |
|-------|---------|--------|
| train | 25,000  | 2 (1 = positive, 0 = negative) |
| test  | 25,000  | 2 (1 = positive, 0 = negative) |

Each review is a plain-text string of variable length. Reviews are drawn from the polar ends of the rating scale — reviews with a score ≤ 4 are labelled negative, and those with a score ≥ 7 are labelled positive. Reviews scoring 5 or 6 are excluded entirely, which keeps the classification task clean and well-separated.

The dataset is perfectly balanced: 12,500 positive and 12,500 negative reviews per split. Average review length is roughly 1,300 characters, though the longest reviews exceed 10,000 characters. This variation in length matters when choosing a truncation strategy during tokenization.
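
Statistics like these can be checked with a few lines of standard-library Python. The helper `summarize_reviews` below is a hypothetical name introduced for illustration, and it is run on toy data so that the snippet stays self-contained.

```python
from collections import Counter
from statistics import mean, median


def summarize_reviews(texts, labels):
    """Return character-length statistics and the label distribution."""
    lengths = [len(t) for t in texts]
    return {
        "mean_chars": mean(lengths),
        "median_chars": median(lengths),
        "max_chars": max(lengths),
        "label_counts": dict(Counter(labels)),
    }


# Toy stand-in data. With the real dataset (loaded in the next section) one
# would pass dataset["train"]["text"] and dataset["train"]["label"] instead.
toy_texts = ["Great film.", "Dull.", "A long, rambling review " * 10]
toy_labels = [1, 0, 0]

summary = summarize_reviews(toy_texts, toy_labels)
print(summary)
```

A large gap between the median and the maximum length is the signal to look for: it indicates that a small number of very long reviews will be affected by truncation.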

## Loading the Data

We load the dataset using the Hugging Face `datasets` library:

In [1]:
from datasets import load_dataset

dataset = load_dataset("imdb")

print("length of training set:", len(dataset["train"]))
print("length of test set:", len(dataset["test"]))
print("example from training set:", dataset["train"][0])

length of training set: 25000
length of test set: 25000
example from training set: {'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, thi

## Tokenization

DistilBERT doesn't work with raw text—it needs token IDs. The tokenizer converts text into a sequence of integer IDs that map to DistilBERT's vocabulary. It also handles special tokens and padding.

DistilBERT uses WordPiece tokenization: words are split into subwords that exist in the model's vocabulary. Common words stay whole; rare words are broken into familiar pieces. The word "unhappiness" might become `["un", "##happiness"]`, where the `##` prefix indicates a continuation of the previous word.

Two special tokens are always added:

- `[CLS]` at the start: its final hidden state is used as the sentence-level representation (this is what we'll feed into our classification head)
- `[SEP]` at the end: marks the boundary of the input
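
The splitting and the special tokens described above can be sketched with a toy greedy longest-match-first tokenizer. This is a simplified illustration and not the tokenizer shipped with DistilBERT: the vocabulary is a hypothetical handful of entries, and the text is split on whitespace only.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Split on whitespace, apply WordPiece, and wrap the result in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical toy vocabulary; the real DistilBERT vocabulary is far larger.
toy_vocab = {"un", "##happiness", "##happi", "the", "movie", "caused"}

tokens = encode("the movie caused unhappiness", toy_vocab)
print(tokens)
```

With this toy vocabulary, "unhappiness" is split into `un` and `##happiness` because the longest matching piece is tried first at every position.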

### Max Length and Truncation

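DistilBERT has a maximum sequence length of 512 tokens, including the special tokens. Reviews longer than this must be truncated, which here means keeping the beginning of the review and discarding the rest. From our length analysis above, most reviews fit within this limit, but some don't.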
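
What truncation does to an over-long review can be sketched as follows. This is an illustration under assumptions rather than the library's implementation: it assumes that tokens are dropped from the end of the review, and the IDs 101 and 102 are stand-ins for `[CLS]` and `[SEP]`.

```python
MAX_LENGTH = 512  # DistilBERT's maximum sequence length, special tokens included


def truncate_with_special_tokens(content_ids, cls_id, sep_id, max_length=MAX_LENGTH):
    """Keep the leading tokens so that [CLS] + content + [SEP] fits in max_length."""
    budget = max_length - 2  # two positions are reserved for [CLS] and [SEP]
    return [cls_id] + content_ids[:budget] + [sep_id]


# Hypothetical token IDs for a review that is 700 tokens long.
long_review = list(range(1000, 1700))
encoded = truncate_with_special_tokens(long_review, cls_id=101, sep_id=102)

print(len(encoded), encoded[:3], encoded[-3:])
```

Only the first 510 content tokens survive. For sentiment analysis this is usually acceptable, but a verdict stated only in the final sentences of a very long review would be lost.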

In [5]:
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')

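def preprocess_function(examples):
    # With batched=True, examples["text"] is a list of reviews, so slicing it
    # would select reviews rather than characters. truncation=True already caps
    # each review at the model's 512-token limit.
    return tokenizer(examples["text"], truncation=True)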

tokenized_imdb = dataset.map(preprocess_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
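
`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding the whole dataset up front. The sketch below illustrates that idea in plain Python. The function `pad_batch` is a hypothetical name, and the padding ID of 0 and the token IDs are assumptions made for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two hypothetical tokenized reviews of different lengths.
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch)
```

Because the padded length depends only on the current batch, batches of short reviews stay short, which saves computation compared with padding everything to 512 tokens.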


In [6]:
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
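
To make the metric concrete, here is a small pure-Python sketch of the same computation on made-up numbers. The model outputs one score (logit) per class, the highest score determines the predicted class, and accuracy is the fraction of predictions that match the labels. The function name `argmax_accuracy` and all values are hypothetical.

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per example and compare with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews; columns are (negative, positive).
toy_logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.7], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

result = argmax_accuracy(toy_logits, toy_labels)
print(result)
```

Three of the four toy predictions are correct, so the accuracy is 0.75. Accuracy is a reasonable headline metric here because the dataset is balanced.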

## Training

Fine-tuning pre-trained language models requires a much smaller learning rate than training from scratch. The pre-trained weights already encode rich linguistic knowledge—large gradient updates would destroy that knowledge. A learning rate of `2e-5` to `5e-5` is standard for BERT-family fine-tuning.
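
As a rough worked example, the snippet below estimates how many optimization steps a run takes and how the learning rate would evolve. It rests on assumptions that are not stated in this chapter: a batch size of 8 on a single device, and a learning rate that decays linearly to zero without warmup. The peak rate of `2e-5` and the two epochs match the training cell.

```python
import math


def linear_decay_lr(step, total_steps, peak_lr=2e-5):
    """Learning rate that falls linearly from peak_lr to zero, with no warmup."""
    return peak_lr * max(0.0, (total_steps - step) / total_steps)


train_examples = 25_000
batch_size = 8   # assumption: 8 examples per step on a single device
epochs = 2

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

lr_start = linear_decay_lr(0, total_steps)
lr_halfway = linear_decay_lr(total_steps // 2, total_steps)
lr_end = linear_decay_lr(total_steps, total_steps)

print(total_steps, lr_start, lr_halfway, lr_end)
```

Under these assumptions the run takes 6,250 steps, and every one of them uses a learning rate of at most `2e-5`, which keeps each update to the pre-trained weights small.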

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-cased", num_labels=2, id2label=id2label, label2id=label2id
)

training_args = TrainingArguments(
    output_dir="./data/02_sentiment",
    learning_rate=2e-5,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
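
Once training has finished, the model's raw outputs still need to be turned into a readable label. The sketch below shows that last step in isolation, using the same `id2label` mapping as above. The function name `logits_to_prediction` and the logit values are hypothetical, and no trained model is involved.

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def logits_to_prediction(logits):
    """Apply softmax to one row of logits and return (label, probability)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits for one review; columns are (negative, positive).
label, confidence = logits_to_prediction([-1.3, 2.4])
print(label, round(confidence, 3))
```

The softmax probability is often read as a confidence score, but it should be treated with care: fine-tuned classifiers can be overconfident on inputs unlike their training data.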

---

Next: [**Chapter 4.2.3: Named Entity Recognition**](../02_nlp/03_ner.ipynb)