<a href="https://colab.research.google.com/github/lisabecker/nlp-fundamentals/blob/main/0404_finetuning_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 4 Use Case - 👍 or 👎? Classifying Movie Reviews with Transformers

In this exercise, we will finetune the transformer model `distilbert` on the [IMDB movie review dataset](https://huggingface.co/datasets/imdb) to classify whether a movie review is positive or negative. At the end, we will feed the trained model our own reviews of recent movies to check whether its predictions match our general sentiment about each movie.

What is `distilbert`? It's a "distilled" version of BERT. Its main benefits include:

- **Reduced Size**: DistilBERT has 40% fewer parameters than BERT-base, making it much lighter and faster for training and inference.
- **Preserved Performance**: Despite its reduced size, DistilBERT retains up to 97% of BERT's performance on various benchmark NLP tasks.
- **Flexibility and Efficiency**: Its smaller size makes it more suitable for real-world applications, including on mobile devices or other environments with limited computational resources.

## 1. Set up the environment

In [None]:
# Installing all packages will take a few minutes
!pip install --q transformers==4.36.2 datasets==2.16.1 torch==2.1.2
print("\nPackages installed.")

## 2. Load the dataset

Hugging Face 🤗 provides a `datasets` library that not only hosts most of the popular open-source NLP datasets but also lets users upload and share their own. For this exercise, we will use the `IMDB Movie Reviews` dataset, which is popular for sentiment analysis tasks.

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")
print("IMDB dataset downloaded.")

IMDB dataset downloaded.


## 3. Inspect the dataset

Let's inspect the downloaded dataset. Each sample has two fields: `text` and `label`. The `text` field contains the review, and the `label` field is `0` for a *negative* review or `1` for a *positive* one. The `unsupervised` split is unlabeled and is not used for training or evaluation in this exercise.

In [None]:
# Let's inspect the dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## 4. Process the dataset

Transformers operate on tokens rather than raw characters. Hence, to train a model, we must first tokenize the text. The tokenizer splits each text into tokens (words or subword pieces) and replaces each token with a number (its `input_id`). With `padding="max_length"` and `truncation=True`, every sample is padded or truncated to the maximum length the model can handle (512 tokens for DistilBERT). The `attention_mask` marks which positions hold real tokens and which hold padding.
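To make padding and truncation concrete, here is a minimal pure-Python sketch. The tiny vocabulary, the `toy_encode` helper, and `max_length=8` are made up for illustration; the real tokenizer uses a subword vocabulary of roughly 30,000 entries and is far more sophisticated. The special-token IDs are assumed to mirror those of the BERT uncased vocabulary.

```python
# Toy illustration only: a hypothetical vocabulary, not the real DistilBERT vocab
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "great": 5, "movie": 6, "boring": 7}

def toy_encode(text, max_length=8):
    """Map whitespace-separated words to IDs, then truncate or pad to max_length."""
    words = text.lower().split()
    # Truncate, leaving room for the [CLS] and [SEP] special tokens
    words = words[: max_length - 2]
    ids = [TOY_VOCAB["[CLS]"]]
    ids += [TOY_VOCAB.get(w, TOY_VOCAB["[UNK]"]) for w in words]
    ids.append(TOY_VOCAB["[SEP]"])
    # 1 marks real tokens, 0 marks padding that the model should ignore
    attention_mask = [1] * len(ids)
    num_pad = max_length - len(ids)
    ids += [TOY_VOCAB["[PAD]"]] * num_pad
    attention_mask += [0] * num_pad
    return {"input_ids": ids, "attention_mask": attention_mask}

encoded = toy_encode("Great movie")
print(encoded["input_ids"])       # [101, 5, 6, 102, 0, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Every encoded sample has the same length, which is what allows samples to be stacked into fixed-size batch tensors later on.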

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(samples):
    return tokenizer(samples["text"], padding="max_length", truncation=True, return_tensors='pt')

# Tokenizing the entire dataset will take a few minutes
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# Use the set_format() function to set the dataset format to be compatible with PyTorch:
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

print("Dataset tokenized.")

Dataset tokenized.


In [None]:
# What the dataset looks like after tokenization
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [None]:
# A tokenized sample
print("Label:", tokenized_datasets["train"][0]["label"])
print("Input IDs:", tokenized_datasets["train"][0]["input_ids"])
print("Attention Mask:", tokenized_datasets["train"][0]["attention_mask"])

Label: tensor(0)
Input IDs: tensor([  101,  1045, 12524,  1045,  2572,  8025,  1011,  3756,  2013,  2026,
         2678,  3573,  2138,  1997,  2035,  1996,  6704,  2008,  5129,  2009,
         2043,  2009,  2001,  2034,  2207,  1999,  3476,  1012,  1045,  2036,
         2657,  2008,  2012,  2034,  2009,  2001,  8243,  2011,  1057,  1012,
         1055,  1012,  8205,  2065,  2009,  2412,  2699,  2000,  4607,  2023,
         2406,  1010,  3568,  2108,  1037,  5470,  1997,  3152,  2641,  1000,
         6801,  1000,  1045,  2428,  2018,  2000,  2156,  2023,  2005,  2870,
         1012,  1026,  7987,  1013,  1028,  1026,  7987,  1013,  1028,  1996,
         5436,  2003,  8857,  2105,  1037,  2402,  4467,  3689,  3076,  2315,
        14229,  2040,  4122,  2000,  4553,  2673,  2016,  2064,  2055,  2166,
         1012,  1999,  3327,  2016,  4122,  2000,  3579,  2014,  3086,  2015,
         2000,  2437,  2070,  4066,  1997,  4516,  2006,  2054,  1996,  2779,
        25430, 14728,  2245,  205

## 5. Create data loaders

A data loader wraps a dataset in an iterator that yields batches of samples during the training loop. It can also shuffle the data, collate individual samples into batch tensors, and load data in parallel worker processes.
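As a rough mental model of batching, the sketch below yields batches from a plain Python list. The `toy_batches` helper is hypothetical and only illustrates the idea; it is not how PyTorch's `DataLoader` is implemented.

```python
import random

def toy_batches(samples, batch_size, shuffle=False, seed=0):
    """Yield lists of at most batch_size samples, optionally in shuffled order."""
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

samples = list(range(10))  # stand-in for 10 tokenized reviews
batches = list(toy_batches(samples, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The last batch is smaller because 10 is not divisible by 4. With `shuffle=True`, the same samples are visited once per pass but in a different order.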

In [None]:
import torch
from torch.utils.data import DataLoader

# Randomly pick a subset of samples from our dataset for quicker training
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(10000))
test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# Create an "iterator" object to load data in batches
# A batch is a subset of the data that the model uses for training at once
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16, num_workers=2)
eval_dataloader = DataLoader(test_dataset, batch_size=16, num_workers=2)

In [None]:
print(train_dataloader.dataset)

Dataset({
    features: ['label', 'input_ids', 'attention_mask'],
    num_rows: 10000
})


## 6. Load pretrained model and set training parameters

We use the pretrained `distilbert-base-uncased` model, which is a smaller, faster version of BERT, a popular transformer model.

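**AutoModelForSequenceClassification** adds a classification head on top of this model for the task of sequence classification (here, classifying movie reviews as positive or negative). The weights of this head are newly initialized, which is why the model must be trained before it can make useful predictions.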


**Optimizer Setup:** AdamW is the optimization algorithm used to minimize the loss during training. The learning rate `lr=5e-5` controls how far the model's weights move in response to the gradients at each update.

**Training Parameters:** We set the number of epochs (`num_epochs`) to 3, meaning the entire training set is passed through the model three times. `num_training_steps` is the total number of weight updates, calculated by multiplying the number of epochs by the number of batches per epoch.

**Learning Rate Scheduler:** The scheduler adjusts the learning rate over the course of training. Here we use a linear schedule with no warmup steps, so the learning rate decays linearly from its initial value to zero.
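The arithmetic behind these settings can be sketched in a few lines. The values below are taken from this notebook (10,000 training samples, batch size 16, 3 epochs, `lr=5e-5`); the `linear_lr` helper is a hypothetical stand-in that reflects how a linear decay with zero warmup is commonly defined, not the library's actual code.

```python
import math

num_samples = 10_000   # size of the training subset used in this notebook
batch_size = 16
num_epochs = 3
base_lr = 5e-5

# One weight update per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(num_samples / batch_size)
num_training_steps = num_epochs * steps_per_epoch

def linear_lr(step, total_steps=num_training_steps, lr=base_lr):
    """Learning rate after `step` updates for linear decay with zero warmup."""
    return lr * max(0.0, (total_steps - step) / total_steps)

print(steps_per_epoch, num_training_steps)  # 625 1875
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step))
```

The learning rate starts at `5e-5`, is roughly halved midway through training, and reaches zero at the final step.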

In [None]:
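# transformers.AdamW is deprecated in favor of the PyTorch implementation
# (note: torch's default weight_decay is 0.01, whereas transformers.AdamW used 0.0)
from torch.optim import AdamW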
from transformers import get_scheduler
from transformers import AutoModelForSequenceClassification

# Load pretrained model and set training parameters
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Set parameters for our training schedule
optimizer = AdamW(model.parameters(), lr=5e-5)

# How many times our model will see the entire training set
num_epochs = 3

# Total number of weight updates (one per batch, across all epochs)
num_training_steps = num_epochs * len(train_dataloader)

# Create a schedule that linearly decays the learning rate during training
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Inspect the model details
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

<img src="https://github.com/lisabecker/nlp-fundamentals/blob/main/graphics/transformer_architecture.png?raw=true" width="40%">

## 7. Run the training loop

**Setting Up the Device:** The model training will be performed on a GPU if available (faster) or on a CPU if a GPU is not available.

**Training Process:** For each epoch, the training data is passed through the model in batches. Because we pass the labels along with the inputs, the model's output includes the loss, which measures how far the predictions are from the true labels.

**Backpropagation and Optimization:** The `loss.backward()` call performs backpropagation to compute the gradients, and `optimizer.step()` updates the model's weights based on these gradients. `lr_scheduler.step()` then updates the learning rate, and `optimizer.zero_grad()` clears the gradients so that they do not accumulate into the next step.
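The order of these calls can be illustrated with a deliberately tiny example. The sketch below fits a single weight with plain gradient descent and a hand-written gradient; it is a simplified stand-in (no AdamW, no scheduler, no PyTorch), and all names in it are made up for illustration.

```python
# Toy problem: fit y = w * x to data generated with w = 2
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0      # the single "weight" of our toy model
lr = 0.05    # fixed learning rate (no scheduler in this sketch)
grad = 0.0   # stored gradient, playing the role of a parameter's .grad

losses = []
for step in range(100):
    # Forward pass: mean squared error over the batch
    loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
    losses.append(loss)

    # "Backward pass": d(loss)/dw is added to the stored gradient
    grad += sum(2 * (w * x - y) * x for x, y in data) / len(data)

    # Optimizer step: move the weight against the gradient
    w -= lr * grad

    # Zero the gradient so the next step does not reuse it
    grad = 0.0

print(round(w, 4), losses[0], losses[-1])
```

If the gradient were not reset, each step would add the new gradient to the old ones, and the updates would no longer reflect only the current batch.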

**Progress Tracking:** A progress bar is displayed using tqdm, showing the number of steps completed out of the total.

In [None]:
from tqdm.auto import tqdm

# Use GPU if available - you can change it under "Runtime" -> "Change runtime type"
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

# Number of steps after which to log training stats
log_steps = 50
training_step_count = 0

model.train()
for epoch in range(num_epochs):
    print(f"\nEpoch {epoch}")
    for batch in train_dataloader:
        # The training data is passed through the model
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"], labels=batch["label"])

        # Perform backpropagation to calculate the gradients
        loss = outputs.loss
        loss.backward()

        # Update the weights and the learning rate, then reset the gradients
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

        # Log the loss of the current batch every log_steps steps
        training_step_count += 1
        if training_step_count % log_steps == 0:
            print(f"Training loss at step {training_step_count}: {loss.item()}")

# Save finetuned model
model.save_pretrained("distilbert-uncased-imdb-finetune")
print("\nModel finetuned and saved to disk.")


Epoch 0
Training loss at step 50: 0.29767802357673645
Training loss at step 100: 0.174493670463562
Training loss at step 150: 0.11936938762664795
Training loss at step 200: 0.2581367790699005
Training loss at step 250: 0.3692319989204407
Training loss at step 300: 0.27673476934432983
Training loss at step 350: 0.5609225630760193
Training loss at step 400: 0.4167221486568451
Training loss at step 450: 0.3462904393672943
Training loss at step 500: 0.3550032377243042
Training loss at step 550: 0.2580909729003906
Training loss at step 600: 0.11381731182336807

Epoch 1
Training loss at step 650: 0.024932971224188805
Training loss at step 700: 0.13669981062412262
Training loss at step 750: 0.06672060489654541
Training loss at step 800: 0.09782260656356812
Training loss at step 850: 0.03975396230816841
Training loss at step 900: 0.2670494318008423
Training loss at step 950: 0.052920956164598465
Training loss at step 1000: 0.0629207193851471
Training loss at step 1050: 0.12790869176387787
Tra

## 8. Evaluate model on test set

Let's first test the model on a new review.

In [None]:
# Map the label ID to the actual class for better readability
label_names = dataset["train"].features["label"].names
def get_label_name(label_id):
    return label_names[label_id]

In [None]:
sample = {"text": "I really enjoyed this one! It had a lot of funny moments."}

model.eval()  # Disable dropout so the prediction is deterministic
tokenized_sample = tokenize_function(sample).to(device)
with torch.no_grad():  # No gradients are needed for inference
    output = model(input_ids=tokenized_sample["input_ids"], attention_mask=tokenized_sample["attention_mask"])
prediction = torch.argmax(output.logits, dim=-1).item()

print(f"<{sample['text']}> is a {get_label_name(prediction)} review.")

<I really enjoyed this one! It had a lot of funny moments.> is a pos review.


**Model Evaluation Mode:** `model.eval()` sets the model to evaluation mode, which disables training-only behavior such as dropout.

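**Generating Predictions:** The model generates predictions for the test set. `torch.no_grad()` tells PyTorch not to track gradients, which are not needed during evaluation; this saves memory and computation. The predicted class for each review is the index of its largest logit.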
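As a small illustration of how logits become a prediction, the sketch below applies a softmax and an argmax to one pair of made-up logits in plain Python. The numbers are invented; `label_names` follows the `neg`/`pos` label order of the IMDB dataset.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

label_names = ["neg", "pos"]   # label 0 is negative, label 1 is positive
logits = [-1.2, 2.3]           # made-up model output for a single review

probs = softmax(logits)
predicted_id = max(range(len(logits)), key=lambda i: logits[i])  # argmax
print(label_names[predicted_id], round(probs[predicted_id], 3))
```

The argmax alone is enough to pick the class; the softmax is only needed if you also want a confidence score.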

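**Calculating Metrics:** Predictions are compared with the actual labels (references) to calculate the model's accuracy, which is the proportion of predictions the model got right. We also report precision (the share of predicted positives that are truly positive), recall (the share of true positives that the model found), and the F1 score (the harmonic mean of precision and recall).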
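These metrics can be computed by hand on a small example. The sketch below uses made-up labels and a hypothetical `binary_metrics` helper; it is meant to show the definitions, while the notebook itself relies on scikit-learn.

```python
def binary_metrics(references, predictions, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for r, p in pairs if r == positive and p == positive)
    fp = sum(1 for r, p in pairs if r != positive and p == positive)
    fn = sum(1 for r, p in pairs if r == positive and p != positive)
    accuracy = sum(1 for r, p in pairs if r == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

references  = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up true labels
predictions = [1, 1, 0, 0, 1, 0, 1, 0]  # made-up model predictions

accuracy, precision, recall, f1 = binary_metrics(references, predictions)
print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```

Here the model makes one false positive and one false negative out of eight samples, so all four metrics happen to equal 0.75; on real data they usually differ.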

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_model(model, dataloader, device):
    model.eval()  # Set the model to evaluation mode

    predictions = []
    references = []

    with torch.no_grad():
        # Predict the labels for all samples in all batches
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])

            logits = outputs.logits
            batch_predictions = torch.argmax(logits, dim=-1).tolist()
            batch_references = batch["label"].tolist()

            # Add the predictions and true labels
            predictions.extend(batch_predictions)
            references.extend(batch_references)

    # Calculate metrics
    accuracy = accuracy_score(references, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(references, predictions, average='binary')

    return accuracy, precision, recall, f1

# Evaluate the model
accuracy, precision, recall, f1 = evaluate_model(model, eval_dataloader, device)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Accuracy: 0.9130
Precision: 0.8955
Recall: 0.9303
F1 Score: 0.9126
