# Exercise: Create a BERT sentiment classifier

In this exercise, you will create a sentiment classifier with DistilBERT, a smaller and faster distilled variant of BERT, using the [Hugging Face Transformers](https://huggingface.co/transformers/) library.

You will use the [IMDB movie review dataset](https://huggingface.co/datasets/imdb) to train and evaluate your model. The IMDB dataset contains movie reviews that are labeled as either positive or negative. 

In [15]:
# Install the required version of datasets in case you have an older version
# You will need to choose "Kernel > Restart Kernel" from the menu after executing this cell
! pip install -q "datasets==2.15.0"



In [16]:
# Import load_dataset from the datasets library
from datasets import load_dataset 

In [17]:
# Load the train and test splits of the IMDB dataset
splits = ['train', 'test']
ds = {split: dataset for split, dataset in zip(splits, load_dataset('imdb', split=splits))}

In [18]:
ds['test']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [7]:
# Thin out the dataset so this example runs faster
for split in splits:
    ds[split] = ds[split].select(range(1000))

In [8]:
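# Reload the full train and test splits. Note that this replaces the
# 1,000-example subsets selected above.
splits = ['train', 'test']
loaded_splits = load_dataset('imdb', split=splits)

In [9]:
# Map each split name to its dataset
ds = {}
for split, dataset in zip(splits, loaded_splits):
    ds[split] = dataset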

## Pre-process datasets

Now we are going to process our datasets by converting all the text into tokens for our models. You may ask, why isn't the text converted already? Well, different models may use different tokenizers, so by converting at train time we retain more flexibility.
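To make the point about tokenizers concrete, here is a toy sketch of two tokenizers turning the same text into different ID sequences. Both functions and both vocabularies are hypothetical and much simpler than the WordPiece tokenizer that DistilBERT actually uses.

```python
# Toy illustration only: these tokenizers and vocabularies are hypothetical
# stand-ins, not part of the Transformers library.

def word_tokenize(text, vocab, unk_id=0):
    """Map each lowercased, whitespace-separated word to an integer ID."""
    return [vocab.get(word, unk_id) for word in text.lower().split()]

def char_tokenize(text, vocab, unk_id=0):
    """Map each lowercased character (spaces included) to an integer ID."""
    return [vocab.get(ch, unk_id) for ch in text.lower()]

word_vocab = {"great": 1, "movie": 2, "terrible": 3}
char_vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ", start=1)}

text = "Great movie"
word_ids = word_tokenize(text, word_vocab)
char_ids = char_tokenize(text, char_vocab)

print(word_ids)       # [1, 2]
print(len(char_ids))  # 11
```

The same two-word review becomes 2 IDs under one scheme and 11 under the other, so IDs produced for one model would be meaningless to a model trained with a different tokenizer.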

In [None]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")



In [20]:
def preprocess_function(examples):
    """Preprocess the imdb dataset by returning tokenized examples"""
    return tokenizer(examples['text'], padding='max_length', truncation=True)

tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)

# Check that we tokenized the examples properly
# assert tokenized_ds["train"][0]["input_ids"][:5] == [101, 2045, 2003, 2053, 7189]

# Show the first example of the tokenized training set
print(tokenized_ds["train"][0]["input_ids"])   
    

[101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 3836, 1010, 19846

## Load and set up the model

In [21]:
from transformers import AutoModelForSequenceClassification

In [22]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
# Freeze all the parameters of the base model
for param in model.base_model.parameters():
    param.requires_grad = False

model.classifier

Linear(in_features=768, out_features=2, bias=True)
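Freezing `model.base_model` leaves only the layers outside the base model trainable: the `pre_classifier` and `classifier` layers that the warning above reports as newly initialized. The arithmetic below is a rough sketch of how many parameters that is. The `classifier` shape comes from the output above. The `pre_classifier` shape (768 to 768) is an assumption, since it is not printed here, and `linear_param_count` is a hypothetical helper.

```python
def linear_param_count(in_features, out_features, bias=True):
    """Number of parameters in a fully connected layer: weights plus biases."""
    return in_features * out_features + (out_features if bias else 0)

# Shape taken from the printed classifier: Linear(768 -> 2)
classifier_params = linear_param_count(768, 2)

# Assumed shape for pre_classifier: Linear(768 -> 768)
pre_classifier_params = linear_param_count(768, 768)

trainable_params = classifier_params + pre_classifier_params

print(classifier_params)      # 1538
print(pre_classifier_params)  # 590592
print(trainable_params)       # 592130
```

On a loaded model, the same figure could be checked by summing `p.numel()` over the parameters whose `requires_grad` is `True`.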

In [24]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

## Let's train it!

Now it's time to train our model. We'll use the `Trainer` class from the 🤗 Transformers library to do this. The `Trainer` class provides a high-level API that abstracts away a lot of the training loop.

First we'll define a function to compute our accuracy metric, then we'll create the `Trainer`.

Let's take this opportunity to learn about the `DataCollator`. According to the Hugging Face documentation:

> Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset.

> To be able to build batches, data collators may apply some processing (like padding).
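As a rough illustration of the padding step described in the quote, the sketch below pads every sequence in a batch to the length of the longest one and builds a matching attention mask. `pad_batch` is a hypothetical helper written in plain Python, not the `DataCollatorWithPadding` implementation, and the token IDs are illustrative.

```python
def pad_batch(features, pad_token_id=0):
    """Pad each 'input_ids' list to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = max_len - len(ids)
        # Real tokens get mask value 1, padding positions get 0
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch

# Two sequences of different lengths (illustrative IDs)
features = [
    {"input_ids": [101, 2023, 102]},
    {"input_ids": [101, 2023, 3185, 2001, 102]},
]

batch = pad_batch(features)
print(batch["input_ids"])       # [[101, 2023, 102, 0, 0], [101, 2023, 3185, 2001, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

In this notebook the preprocessing step already pads with `padding='max_length'`, so the collator has little padding left to do. Dynamic padding of this kind matters most when examples are tokenized without padding.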

In [25]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

In [26]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}
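A quick way to sanity-check the metric function is to call it on a few hand-made logits. The sketch below repeats the function so that it runs on its own. The logits and labels are made-up values chosen so that three of the four predictions are correct.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from (logits, labels), as in the cell above."""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

# Made-up logits for four examples, two classes each
logits = np.array([[2.0, 0.1], [0.3, 1.2], [0.5, 0.4], [0.1, 0.9]])
labels = np.array([0, 1, 1, 1])

# argmax gives predictions [0, 1, 0, 1]; the third one is wrong
result = compute_metrics((logits, labels))
print(result["accuracy"])
```

The accuracy here is 0.75: `np.argmax` picks the larger logit in each row, and the comparison with `labels` yields a boolean array whose mean is the fraction of correct predictions.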

In [27]:
# The Hugging Face Trainer class handles the training and evaluation loops in PyTorch for us.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis",
        learning_rate=2e-3,
        # Reduce the batch size if you are running out of memory
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)

In [28]:
trainer.train()


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 0.5609, 'learning_rate': 0.00184, 'epoch': 0.08}
{'loss': 0.4935, 'learning_rate': 0.00168, 'epoch': 0.16}
{'loss': 0.4787, 'learning_rate': 0.00152, 'epoch': 0.24}
{'loss': 0.4449, 'learning_rate': 0.00136, 'epoch': 0.32}
{'loss': 0.4171, 'learning_rate': 0.0012, 'epoch': 0.4}
{'loss': 0.4407, 'learning_rate': 0.0010400000000000001, 'epoch': 0.48}
{'loss': 0.4138, 'learning_rate': 0.00088, 'epoch': 0.56}
{'loss': 0.4258, 'learning_rate': 0.0007199999999999999, 'epoch': 0.64}
{'loss': 0.405, 'learning_rate': 0.0005600000000000001, 'epoch': 0.72}
{'loss': 0.3879, 'learning_rate': 0.0004, 'epoch': 0.8}
{'loss': 0.384, 'learning_rate': 0.00024, 'epoch': 0.88}
{'loss': 0.3827, 'learning_rate': 8e-05, 'epoch': 0.96}



{'eval_loss': 0.3422572612762451, 'eval_accuracy': 0.86068, 'eval_runtime': 631.9942, 'eval_samples_per_second': 39.557, 'eval_steps_per_second': 9.889, 'epoch': 1.0}
{'train_runtime': 1402.2465, 'train_samples_per_second': 17.829, 'train_steps_per_second': 4.457, 'train_loss': 0.43380997192382814, 'epoch': 1.0}


TrainOutput(global_step=6250, training_loss=0.43380997192382814, metrics={'train_runtime': 1402.2465, 'train_samples_per_second': 17.829, 'train_steps_per_second': 4.457, 'train_loss': 0.43380997192382814, 'epoch': 1.0})
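The logged learning rates fall steadily from 0.00184 to 8e-05, which is consistent with a linear decay to zero over the 6,250 training steps. The sketch below reproduces those values under a few assumptions: a single device, no gradient accumulation, no warmup, and a log entry every 500 steps. `linear_decay_lr` is a hypothetical helper, not a Transformers function.

```python
def linear_decay_lr(step, total_steps, base_lr):
    """Learning rate that decays linearly from base_lr to zero."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

num_examples = 25_000  # size of the IMDB training split
batch_size = 4         # per_device_train_batch_size
base_lr = 2e-3         # learning_rate passed to TrainingArguments

total_steps = num_examples // batch_size
print(total_steps)  # 6250

# Compare with the logged values at epochs 0.08, 0.48 and 0.96
for step in (500, 3000, 6000):
    epoch = step / total_steps
    print(step, round(epoch, 2), linear_decay_lr(step, total_steps, base_lr))
```

Under these assumptions, step 500 gives roughly 0.00184 and step 6000 roughly 8e-05, matching the first and last log lines above.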

## Evaluate the model
Evaluating the model is as simple as calling the `evaluate` method on the `trainer` object. This runs the model on the evaluation dataset (here, the test split) and computes the metrics defined in the `compute_metrics` function.