<a href="https://colab.research.google.com/github/lisabecker/nlp-fundamentals/blob/main/0404_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 👍 or 👎? Classifying Movie Reviews with Transformers

In this exercise, we will fine-tune the transformer model `distilbert-base-uncased` using 🤗 Hugging Face on the [IMDB movie review dataset](https://huggingface.co/datasets/imdb) to classify whether a movie review is positive or negative. At the end, we will feed the trained model our own reviews of recent movies to check whether its predictions match our general sentiment about each movie.

## 1. Set up the environment

In [None]:
!pip install -q transformers==4.36.2 datasets==2.16.1 torch==2.1.2

## 2. Load the dataset

🤗 Hugging Face provides the `datasets` library, which hosts most of the popular open-source NLP datasets and also lets users upload and share their own. For this exercise, we will use the `IMDB Movie Reviews` dataset, which is popular for sentiment analysis tasks.

In [1]:
from datasets import load_dataset

dataset = load_dataset("imdb")

## 3. Inspect the dataset

Let us inspect the downloaded dataset. It contains two fields: `text` and `label`. The `text` field holds the review and the `label` field holds its classification. `label` takes one of two values: `0` indicates a *negative* review and `1` a *positive* one.

In [None]:
dataset["train"]

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [None]:
# Some examples
dataset["train"][0:2]

{'text': ['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b
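
The sample above shows that the raw reviews contain HTML line-break tags (`<br />`). The notebook feeds the text to the tokenizer unchanged, which works, but if you prefer cleaner input, an optional cleaning step could look like the sketch below. The helper name `strip_br_tags` and the shortened `sample` string are illustrative assumptions, not part of the original notebook.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_br_tags(text):
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    return re.sub(r"\s+", " ", BR_TAG.sub(" ", text)).strip()


# Shortened excerpt in the style of the first training review (hypothetical sample)
sample = "I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student."
cleaned = strip_br_tags(sample)
print(cleaned)
```

With 🤗 `datasets`, such a function would typically be applied through `dataset.map` before tokenization.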

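## 4. Process the dataset

Transformers operate on tokens rather than raw characters. Hence, to train a model, we must set up a utility function that tokenizes the given text. A tokenizer splits text into subword tokens and maps each token to an integer ID from the model's vocabulary. Here, `padding="max_length"` pads every review to the model's maximum input length and `truncation=True` cuts longer reviews down to that length.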

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(samples):
    # Pad or truncate every review to the model's maximum input length
    return tokenizer(samples["text"], padding="max_length", truncation=True, return_tensors='pt')

# This step will take a few minutes
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

In [None]:
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 50000
    })
})
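
To make `input_ids`, `attention_mask`, padding, and truncation concrete, here is a toy sketch. It assumes a made-up six-entry vocabulary and splits on whitespace. The real DistilBERT tokenizer instead uses WordPiece subwords, adds special tokens, and has a far larger vocabulary, so treat the names `toy_vocab` and `toy_encode` as purely illustrative.

```python
# Toy illustration only: a whitespace tokenizer with a made-up vocabulary.
PAD_ID, UNK_ID = 0, 1
toy_vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID, "this": 2, "movie": 3, "was": 4, "great": 5}


def toy_encode(text, max_length):
    """Map words to IDs, then truncate or pad to exactly max_length."""
    ids = [toy_vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]                      # analogous to truncation=True
    mask = [1] * len(ids)                       # 1 marks a real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * n_pad,    # analogous to padding="max_length"
        "attention_mask": mask + [0] * n_pad,   # 0 marks padding the model should ignore
    }


short_enc = toy_encode("This movie was great", max_length=6)
long_enc = toy_encode("This movie was great and this movie was long", max_length=6)
print(short_enc)
print(long_enc)
```

The short review is padded with two `[PAD]` IDs and a matching `0` in the attention mask, while the long review is cut to six tokens, with unknown words mapped to `[UNK]`.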

## 5. Create data loaders

Data loaders create an iterator that yields the data in batches during the training loop. They can also handle other tasks such as shuffling, type conversion, and collation.

In [None]:
import torch
from torch.utils.data import DataLoader

train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(10000)) # smaller sample for quick training
test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))


train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16, num_workers=2)
eval_dataloader = DataLoader(test_dataset, batch_size=16, num_workers=2)
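
The core of what a data loader does can be sketched in plain Python. This is a simplified stand-in, not PyTorch's implementation: the hypothetical `toy_batches` generator only shuffles and slices, whereas `DataLoader` also collates samples into tensors and can load them in parallel worker processes.

```python
import random


def toy_batches(samples, batch_size, shuffle=False, seed=None):
    """Yield consecutive lists of up to batch_size samples."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)   # new order of indices, samples untouched
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


toy_samples = list(range(10))
batches = list(toy_batches(toy_samples, batch_size=4))
shuffled = list(toy_batches(toy_samples, batch_size=4, shuffle=True, seed=42))
print(batches)    # the last batch is smaller because 10 is not a multiple of 4
print(shuffled)
```

Every sample appears exactly once per pass in both cases; shuffling only changes which samples end up in the same batch.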

## 6. Load pretrained model and set training parameters

We use the pretrained `distilbert-base-uncased` model, a smaller and faster version of BERT, which is a popular transformer model.

**AutoModelForSequenceClassification** customizes this model for the task of sequence classification (here, classifying movie reviews into positive or negative).

**Optimizer Setup:** AdamW is an optimization algorithm used to minimize the loss during training. The learning rate `lr=5e-5` controls how large each weight update is in response to the computed gradients.

**Training Parameters:** We set the number of epochs (`num_epochs`) to 3, meaning the entire training dataset is passed through the model three times. `num_training_steps` is the total number of training steps, calculated by multiplying the number of epochs by the number of batches in the training data.

**Learning Rate Scheduler:** The scheduler adjusts the learning rate over the course of training. Here we use a linear schedule with no warmup steps, so the learning rate decays linearly from `5e-5` to zero.

In [None]:
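from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from transformers import get_scheduler

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
# weight_decay=0.0 keeps weight decay off, as in the default of the deprecated transformers.AdamW
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)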
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)


lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
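
The arithmetic behind `num_training_steps` and the linear schedule can be checked by hand. The sketch below assumes the 10,000-review training subset and batch size of 16 from the data loader step. The `linear_lr` function is an illustrative reimplementation of a linear decay with optional warmup, written to mirror what `get_scheduler("linear", ...)` is expected to compute, not a call into the library itself.

```python
import math

num_samples = 10_000      # size of the training subset selected earlier
batch_size = 16
num_epochs = 3
base_lr = 5e-5

steps_per_epoch = math.ceil(num_samples / batch_size)   # batches per epoch
num_training_steps = num_epochs * steps_per_epoch       # total optimizer updates


def linear_lr(step, warmup_steps=0):
    """Learning rate after `step` optimizer updates under a linear schedule."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - warmup_steps))


print(steps_per_epoch, num_training_steps)
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step))
```

This gives 625 batches per epoch and 1875 steps in total. The learning rate starts at `5e-5`, is roughly halved midway, and reaches zero at the final step.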


## 7. Run training loop

**Setting Up the Device:** Training runs on a GPU if one is available (much faster), and otherwise falls back to the CPU.

**Training Process:** For each epoch, the training data is passed through the model in batches. Because we pass `labels`, the model's output includes the loss (cross-entropy for this classification task), which measures how far the predictions are from the actual labels.

**Backpropagation and Optimization:** The `loss.backward()` call performs backpropagation to calculate gradients, and `optimizer.step()` updates the model's weights based on these gradients. The learning rate is updated at each step using `lr_scheduler.step()`, and `optimizer.zero_grad()` clears the old gradients so that only the gradients of the next batch are used in the next step.

**Progress Tracking:** A progress bar is displayed using tqdm, showing the number of steps completed out of the total.
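
Why clearing gradients matters can be seen in a toy, pure-Python sketch. It assumes a single weight with loss `(value - target)^2` and mimics PyTorch's behavior of adding each new gradient to the stored one. `ToyParam`, `backward`, `step`, and `train` are hypothetical names for illustration, not PyTorch APIs.

```python
class ToyParam:
    """A single weight whose gradient accumulates across backward calls."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def backward(param, target):
    # Loss is (value - target)^2, so d(loss)/d(value) = 2 * (value - target).
    # The new gradient is ADDED to whatever is already stored.
    param.grad += 2 * (param.value - target)


def step(param, lr):
    # Plain gradient descent update
    param.value -= lr * param.grad


def train(zero_grad, steps=3, lr=0.1, target=1.0):
    w = ToyParam(0.0)
    for _ in range(steps):
        backward(w, target)
        step(w, lr)
        if zero_grad:
            w.grad = 0.0   # plays the role of optimizer.zero_grad()
    return w.value


print(train(zero_grad=True))    # moves steadily toward the target of 1.0
print(train(zero_grad=False))   # stale gradients pile up and overshoot the target
```

After three steps, the run that clears gradients reaches about 0.488, while the run that does not reaches about 1.008 and has already overshot the target.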

In [None]:
from tqdm.auto import tqdm

# Use GPU if available - you can change it under "Runtime" -> "Change runtime type"
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

# Number of steps after which to log training stats
log_steps = 50

training_step_count = 0
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"], labels=batch["label"])
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

        training_step_count += 1
        if training_step_count % log_steps == 0:
            print(f"Training loss at step {training_step_count}: {loss.item()}")






  0%|          | 0/1875 [00:00<?, ?it/s]

Training loss at step 50: 0.3360562026500702
Training loss at step 100: 0.3613983392715454
Training loss at step 150: 0.19602687656879425
Training loss at step 200: 0.1519453078508377
Training loss at step 250: 0.23754222691059113
Training loss at step 300: 0.19000516831874847
Training loss at step 350: 0.5081885457038879
Training loss at step 400: 0.18347708880901337
Training loss at step 450: 0.45937660336494446
Training loss at step 500: 0.5301209092140198
Training loss at step 550: 0.20125079154968262
Training loss at step 600: 0.131150022149086
Training loss at step 650: 0.35261279344558716
Training loss at step 700: 0.1822538673877716
Training loss at step 750: 0.2403530329465866
Training loss at step 800: 0.31179091334342957
Training loss at step 850: 0.31338393688201904
Training loss at step 900: 0.32099589705467224
Training loss at step 950: 0.12113271653652191
Training loss at step 1000: 0.21208441257476807
Training loss at step 1050: 0.08348963409662247
Training loss at step

In [None]:
# Save the fine-tuned weights and config to a local directory
model.save_pretrained("distilbert-uncased-imdb-finetune")

## 8. Evaluate model on test set

**Model Evaluation Mode:** `model.eval()` switches the model to evaluation mode, which disables training-only behavior such as dropout.

**Generating Predictions:** The model generates predictions for the test set. `torch.no_grad()` tells PyTorch not to track gradients, which are not needed during evaluation; skipping them saves memory and time.

**Calculating Accuracy:** Predictions are compared with actual labels (references) to calculate the model's accuracy. This is the proportion of predictions the model got right, giving a straightforward metric to evaluate the model's performance.

In [None]:
import numpy as np

model.eval()
predictions = []
references = []

for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"], labels=batch["label"])

    logits = outputs.logits
    predictions.extend(torch.argmax(logits, dim=-1).tolist())
    references.extend(batch["label"].tolist())

accuracy = np.sum(np.array(predictions) == np.array(references)) / len(predictions)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9180
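
How logits turn into labels and an accuracy score can be traced on a few made-up numbers. The sketch below uses plain Python lists in place of tensors. The `toy_logits`, `toy_references`, and the `id2label` mapping (following the `0` = negative, `1` = positive convention of the dataset) are illustrative assumptions, not model output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "negative", 1: "positive"}

# Made-up logits for three reviews: one [negative_score, positive_score] pair each
toy_logits = [[2.0, -1.0], [-0.5, 1.5], [0.3, 0.1]]
toy_references = [0, 1, 1]

# argmax over each row: the index of the larger score is the predicted class
toy_predictions = [max(range(len(row)), key=row.__getitem__) for row in toy_logits]
toy_accuracy = sum(p == r for p, r in zip(toy_predictions, toy_references)) / len(toy_references)

for row, pred in zip(toy_logits, toy_predictions):
    print(id2label[pred], round(softmax(row)[pred], 3))
print(f"Accuracy: {toy_accuracy:.4f}")
```

The third review is misclassified because its negative score is slightly higher, so two of three predictions match and the accuracy is about 0.6667. Softmax is only needed to report a confidence; the argmax of the logits alone decides the label.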
