Lecture: AI I - Advanced 

Previous:
[**Chapter 4.2.1: Transformer with GPT2**](../02_nlp/01_gpt2.ipynb)

---

# Chapter 4.2.2: Text Classification & Sentiment Analysis

Text classification is one of the most fundamental tasks in Natural Language Processing (NLP). It assigns a predefined category or label to a piece of text — whether that text is a single sentence, a paragraph, or an entire document. Applications range from spam detection and topic labelling to sentiment analysis and intent recognition.

In this chapter, we focus on sentiment analysis as our primary example: the task of determining whether a piece of text expresses a positive, negative, or neutral opinion. We use the well-known IMDB Movie Reviews dataset, which contains 50,000 labelled movie reviews, each marked as either positive or negative (there is no neutral class). This makes it a standard benchmark for learning and experimenting with text classification pipelines.

We will walk through the full pipeline: raw data loading and preprocessing, a classical machine learning baseline using TF-IDF and Logistic Regression, and a modern deep-learning approach that fine-tunes a pre-trained DistilBERT model (a smaller, distilled variant of BERT) with HuggingFace Transformers.

## The IMDB Movie Reviews Dataset

The IMDB dataset is hosted on HuggingFace and provides a clean, balanced split for supervised learning:

| Split | Samples | Labels |
|-------|---------|--------|
| train | 25,000  | 2 (1 = positive, 0 = negative) |
| test  | 25,000  | 2 (1 = positive, 0 = negative) |

Each review is a plain-text string of variable length. Reviews are drawn from the polar ends of IMDB's 10-point rating scale: reviews with a score ≤ 4 are labelled negative, and those with a score ≥ 7 are labelled positive. Reviews scoring 5 or 6 are excluded entirely, which keeps the two classes well separated.
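The labelling rule described above can be written down as a small function. This is only an illustrative sketch: `rating_to_label` is a hypothetical helper, and the HuggingFace version of the dataset already ships with the final 0/1 labels rather than the raw star ratings.

```python
from typing import Optional


def rating_to_label(rating: int) -> Optional[int]:
    """Map a 1-10 rating to a sentiment label (hypothetical helper).

    Returns 0 (negative) for ratings <= 4, 1 (positive) for ratings >= 7,
    and None for the neutral ratings 5 and 6, which the dataset excludes.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    return None  # 5 or 6: not part of the dataset


ratings = [1, 4, 5, 6, 7, 10]
labels = [rating_to_label(r) for r in ratings]
print(labels)  # [0, 0, None, None, 1, 1]
```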

The dataset is perfectly balanced, with 12,500 positive and 12,500 negative reviews per split. The average review is roughly 1,300 characters long, while the longest run to more than 13,000 characters. This variation matters when choosing a tokenisation strategy, because transformer models such as BERT accept only a fixed maximum number of tokens (typically 512), so long reviews must be truncated or split.
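Class balance and review length are easy to check yourself. The sketch below uses a hypothetical helper, `summarise_split`, and a four-review toy list as a stand-in for the real data so that it runs without a download.

```python
from collections import Counter
from statistics import mean


def summarise_split(texts, labels):
    """Return label counts and character-length statistics (hypothetical helper)."""
    lengths = [len(t) for t in texts]
    return {
        "label_counts": dict(Counter(labels)),
        "mean_chars": mean(lengths),
        "max_chars": max(lengths),
    }


# Toy stand-in for one split; these reviews are made up for illustration.
toy_texts = ["Great film!", "Dull and far too long.", "Loved it", "Awful"]
toy_labels = [1, 0, 1, 0]

stats = summarise_split(toy_texts, toy_labels)
print(stats)
```

Assuming the dataset has been loaded with `load_dataset("imdb")`, the same function should work on a real split, for example `summarise_split(dataset["train"]["text"], dataset["train"]["label"])`.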

## Loading the Data

We load the dataset using the HuggingFace `datasets` library:

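```python
from datasets import load_dataset

# Download the IMDB dataset from the HuggingFace Hub (or reuse the local cache)
dataset = load_dataset("imdb")

# Inspect the split sizes and one raw example
print("length of training set:", len(dataset["train"]))
print("length of test set:", len(dataset["test"]))
print("example from training set:", dataset["train"][0])
```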

Output (truncated):

```
length of training set: 25000
length of test set: 25000
example from training set: {'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, thi
```
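The example review contains literal `<br />` tags left over from the original web pages. A minimal preprocessing sketch, assuming we simply want to replace these tags with spaces and collapse repeated whitespace, could look as follows. The helper name `clean_review` is hypothetical, and whether this cleaning step helps at all depends on the model used downstream.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
_WHITESPACE = re.compile(r"\s+")


def clean_review(text: str) -> str:
    """Replace HTML line breaks with spaces and normalise whitespace."""
    text = _BR_TAG.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


sample = "I really had to see this for myself.<br /><br />The plot is centered"
print(clean_review(sample))
# I really had to see this for myself. The plot is centered
```

One way to apply this to every review would be `dataset.map(lambda ex: {"text": clean_review(ex["text"])})`, which overwrites the `text` column with the cleaned string.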

## Fine-Tuning a Pre-Trained DistilBERT Model



---

Next: [**Chapter 4.2.3: Named Entity Recognition**](../02_nlp/03_ner.ipynb)