<h1 style="font-size:300%;">AI4EduRes'2023: <br />Fine-Tuning RoBERTa for Downstream Tasks</h1>
<p>This notebook details how to use RoBERTa with HuggingFace.</p>

# 📽️ IMDb Review Sentiment Classification

## Platform Check
Ensure we're on an ARM environment. 

NOTE: If you are not on an ARM environment, update `params.device` to `torch.device('cuda' if torch.cuda.is_available() else 'cpu')`.

In [1]:
import platform

if platform.platform() == 'macOS-13.0-arm64-i386-64bit':
    print(f"We're Armed: {platform.platform()}")
else:
    print(f"WARNING! NOT ARMED: {platform.platform()}")

We're Armed: macOS-13.0-arm64-i386-64bit
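
The exact-string comparison above matches only one macOS release. A more portable check, sketched below, compares the operating system name and CPU architecture separately. It assumes that a native Python build on an Apple-silicon Mac reports `Darwin` and `arm64`; the helper name `is_apple_silicon` is hypothetical and not part of `params` or `utils`.

```python
import platform

def is_apple_silicon(system=None, machine=None):
    """Return True when the OS is macOS and the CPU architecture is arm64.

    Both arguments default to the values reported by the running interpreter;
    they are exposed only so the check can be exercised on any machine.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"

print("Apple silicon:", is_apple_silicon())
```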


## Imports & Settings
We begin by importing the necessary packages. Two imports to note are params and utils:
<ul>

**`params.py`** : Contains parameters shared between all three notebooks.

**`utils.py`** : Contains helper functions for visualizations in this notebook.

</ul>

In [2]:
import params
from utils import *

import evaluate
import numpy as np
from datasets import load_dataset
from transformers import DataCollatorWithPadding

from transformers import RobertaForSequenceClassification, TrainingArguments, Trainer

# suppress model warning
from transformers import logging
logging.set_verbosity_error()

Next, we adjust any parameters that may be unique to this notebook.

In [3]:
# Adjust parameters if necessary
params.num_labels = 2
params.output_dir = "models/imdb_hf"
params.max_length = 256

## Load Data

### IMDb

For this notebook, I've prepared a train/validation/test split of the IMDb movie review dataset, mapped the string sentiment labels to binary integers, and renamed the columns to "text" and "label". I've saved the result as a HuggingFace dataset to demonstrate the HuggingFace framework. Here, I load it from the `jahjinx/IMDb_movie_reviews` repository on the HuggingFace Hub.

HuggingFace also offers a large selection of other datasets that can be loaded from their remote repository with `load_dataset` from the `datasets` package.

In [4]:
imdb = load_dataset("jahjinx/IMDb_movie_reviews")

Using custom data configuration jahjinx--IMDb_movie_reviews-d7ed51e8fa5a21e7
Found cached dataset csv (/Users/jarradjinx/.cache/huggingface/datasets/jahjinx___csv/jahjinx--IMDb_movie_reviews-d7ed51e8fa5a21e7/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)


  0%|          | 0/3 [00:00<?, ?it/s]

In [5]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 36000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 4000
    })
})

In [6]:
# View Example
imdb["train"][0]

{'text': 'Beautifully photographed and ably acted, generally, but the writing is very slipshod. There are scenes of such unbelievability that there is no joy in the watching. The fact that the young lover has a twin brother, for instance, is so contrived that I groaned out loud. And the "emotion-light bulb connection" seems gimmicky, too.<br /><br />I don\'t know, though. If you have a few glasses of wine and feel like relaxing with something pretty to look at with a few flaccid comedic scenes, this is a pretty good movie. No major effort on the part of the viewer required. But Italian film, especially Italian comedy, is usually much, much better than this.',
 'label': 0}

## Preprocess
Below, we prepare our input text sequences to be accepted by the model. This involves tokenization and encoding of our sequences.

<ul>

<b>Tokenization</b> :  Splitting strings into word or sub-word token strings <br />
<b>Encoding</b> : Converting those token strings into integers<br />

</ul>
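
To make the two steps concrete, here is a toy sketch that uses only the standard library. The vocabulary and its IDs are invented for illustration, and the greedy longest-match splitting is a simplification, not how RoBERTa's tokenizer works internally.

```python
# Toy vocabulary; the IDs are invented for illustration and are NOT RoBERTa's.
toy_vocab = {"<unk>": 0, "fine": 1, "-": 2, "tun": 3, "ing": 4, "works": 5}

def toy_tokenize(text):
    """Tokenization: split each whitespace-separated word into vocabulary pieces."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # try the longest remaining substring first
            for end in range(len(word), start, -1):
                if word[start:end] in toy_vocab:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                # no vocabulary piece starts at this character
                tokens.append("<unk>")
                start += 1
    return tokens

def toy_encode(tokens):
    """Encoding: map each token string to its integer ID."""
    return [toy_vocab[t] for t in tokens]

tokens = toy_tokenize("Fine-tuning works")
ids = toy_encode(tokens)
print(tokens)
print(ids)
```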

### Understanding Tokenization & Encoding

Before tokenizing and encoding our training data, we briefly explore RoBERTa's tokenization process. RoBERTa uses a type of Byte-Pair Encoding that not only creates subword units using bytes rather than unicode characters, but also allows it to learn a modestly-sized subword vocabulary without introducing any “unknown” tokens. You can learn more about Byte-Pair Encoding here: https://huggingface.co/docs/transformers/tokenizer_summary. <br />
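
The sketch below illustrates the two ideas in that paragraph with a toy example. Starting from bytes means every string can be represented with at most 256 base symbols, so no "unknown" token is needed, and a single merge step shows how frequent adjacent symbols are fused into a larger subword. The three-word corpus and the helper names are hypothetical; a real vocabulary is learned from many thousands of merges over a large corpus.

```python
from collections import Counter

def to_byte_symbols(text):
    """Represent text as a list of single-byte symbols (UTF-8)."""
    return [bytes([b]) for b in text.encode("utf-8")]

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return a most common one."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace each occurrence of `pair` in `seq` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Any string, including a non-word, decomposes into known byte symbols.
print(to_byte_symbols("sldkj"))

# One merge step over a tiny corpus.
corpus = [to_byte_symbols(w) for w in ["low", "lower", "lowest"]]
pair = most_frequent_pair(corpus)
corpus = [merge_pair(seq, pair) for seq in corpus]
print(pair, corpus)
```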


Most text preprocessing, beyond tokenization and encoding, is largely unnecessary for RoBERTa-based models. For example, the removal of stop words, a common text preprocessing technique, is unnecessary as RoBERTa's dictionary not only includes these words, but the training corpus and pre-training process include these words as well. Casing is also accounted for, meaning RoBERTa will process and infer from casing in input text.<br />

We can see examples of sub-word tokenization, distinction between vocabulary words of different casing, special tokens, and how characters such as spaces are represented below.<br />

Our first example shows how RoBERTa encodes the same word with different casing. Our second example illustrates how it handles non-words, spaces, and attention masks.

In [7]:
# View Encoding of Title-Case "Hello"
sequence_1 = "Hello"
encoding_1 = params.tokenizer.encode_plus(
                        sequence_1,
                        add_special_tokens = True,
                        return_attention_mask = True,
                        return_tensors = 'pt'
                   )


token_id_1 = encoding_1['input_ids']
attention_masks_1 = encoding_1['attention_mask']

# View Encoding of Lower-Case "hello"
sequence_2 = "hello"
encoding_2 = params.tokenizer.encode_plus(
                        sequence_2,
                        add_special_tokens = True,
                        return_attention_mask = True,
                        return_tensors = 'pt'
                   )


token_id_2 = encoding_2['input_ids']
attention_masks_2 = encoding_2['attention_mask']

# View Encoding of Mixed-Case "heLlo"
sequence_3 = "heLlo"
encoding_3 = params.tokenizer.encode_plus(
                        sequence_3,
                        add_special_tokens = True,
                        return_attention_mask = True,
                        return_tensors = 'pt'
                   )


token_id_3 = encoding_3['input_ids']
attention_masks_3 = encoding_3['attention_mask']

print_sentence_encoding(token_id_1, attention_masks_1)
print_sentence_encoding(token_id_2, attention_masks_2)
print_sentence_encoding(token_id_3, attention_masks_3)


╒══════════╤═════════════╤══════════════════╕
│ Tokens   │   Token IDs │   Attention Mask │
╞══════════╪═════════════╪══════════════════╡
│ <s>      │           0 │                1 │
├──────────┼─────────────┼──────────────────┤
│ Hello    │       31414 │                1 │
├──────────┼─────────────┼──────────────────┤
│ </s>     │           2 │                1 │
╘══════════╧═════════════╧══════════════════╛
╒══════════╤═════════════╤══════════════════╕
│ Tokens   │   Token IDs │   Attention Mask │
╞══════════╪═════════════╪══════════════════╡
│ <s>      │           0 │                1 │
├──────────┼─────────────┼──────────────────┤
│ hello    │       42891 │                1 │
├──────────┼─────────────┼──────────────────┤
│ </s>     │           2 │                1 │
╘══════════╧═════════════╧══════════════════╛
╒══════════╤═════════════╤══════════════════╕
│ Tokens   │   Token IDs │   Attention Mask │
╞══════════╪═════════════╪══════════════════╡
│ <s>      │           0 │        

In [8]:
sequence = "Please reorganize these letters:   the sldkj ug"
test = params.tokenizer.encode_plus(
                        sequence,
                        add_special_tokens = True,
                        max_length = 20,
                        padding='max_length',
                        truncation=True,
                        return_attention_mask = True,
                        return_tensors = 'pt'
                   )


token_id = test['input_ids']
attention_masks = test['attention_mask']

print_sentence_encoding(token_id, attention_masks)

╒══════════╤═════════════╤══════════════════╕
│ Tokens   │   Token IDs │   Attention Mask │
╞══════════╪═════════════╪══════════════════╡
│ <s>      │           0 │                1 │
├──────────┼─────────────┼──────────────────┤
│ Please   │        6715 │                1 │
├──────────┼─────────────┼──────────────────┤
│ Ġreorgan │       22208 │                1 │
├──────────┼─────────────┼──────────────────┤
│ ize      │        2072 │                1 │
├──────────┼─────────────┼──────────────────┤
│ Ġthese   │         209 │                1 │
├──────────┼─────────────┼──────────────────┤
│ Ġletters │        5430 │                1 │
├──────────┼─────────────┼──────────────────┤
│ :        │          35 │                1 │
├──────────┼─────────────┼──────────────────┤
│ Ġ        │        1437 │                1 │
├──────────┼─────────────┼──────────────────┤
│ Ġ        │        1437 │                1 │
├──────────┼─────────────┼──────────────────┤
│ Ġthe     │           5 │        

### Tokenizing Training Data

The preprocessing function below will tokenize and encode our strings according to RoBERTa's pre-defined tokenization and encoding dictionary. When used in conjunction with HuggingFace Datasets and the mapping function, the result is the addition of token/input_ids and attention mask columns to our Dataset:

<ul>

<b>token/input ids</b> : A list of integers that represent our tokenized and encoded string<br />
<b>attention masks</b> : A list of 1's and 0's mapped to each token id. 1 marks an id the model should attend to and 0 marks an id the model should ignore (padding tokens, for example).

</ul>
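
As a small illustration of how the two columns relate, the sketch below derives an attention mask from a padded list of input ids. It assumes RoBERTa's padding id of 1 and reuses the ids for `<s>`, `Hello`, and `</s>` shown earlier. Comparing against the padding id is a simplification for illustration; the helper name is hypothetical.

```python
PAD_ID = 1  # RoBERTa's pad_token_id

def attention_mask_for(input_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding tokens."""
    return [0 if tok == pad_id else 1 for tok in input_ids]

# "<s> Hello </s>" followed by two padding tokens
input_ids = [0, 31414, 2, PAD_ID, PAD_ID]
mask = attention_mask_for(input_ids)
print(mask)
```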

It is important to note that our preprocessing function's tokenizer only truncates sequences longer than `max_length`; it does not pad. Sequences shorter than `max_length` still need to be padded, which the tokenizer can also do. However, padding at this point would pad every sequence to `max_length`, which is unnecessary and computationally inefficient, as explained below.
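
A minimal sketch of truncation, assuming the special tokens `<s>` (id 0) and `</s>` (id 2) count toward `max_length`, so only `max_length - 2` content tokens survive. The function name is hypothetical and this is a simplified stand-in for the tokenizer's behavior, not its implementation.

```python
BOS_ID, EOS_ID = 0, 2  # ids of <s> and </s>

def truncate_with_specials(content_ids, max_length):
    """Keep at most `max_length` ids in total, reserving two slots for <s> and </s>."""
    kept = content_ids[: max_length - 2]
    return [BOS_ID] + kept + [EOS_ID]

long_review = list(range(10, 20))           # ten content ids
short_review = [100, 657, 42]               # three content ids

print(truncate_with_specials(long_review, max_length=6))   # truncated to 6 ids
print(truncate_with_specials(short_review, max_length=6))  # left at 5 ids, no padding added
```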

In [9]:
def preprocess_function(examples):
    return params.tokenizer(examples["text"], 
                            max_length = params.max_length, # 256 in this example
                            truncation=True) # truncate sequences longer than max_length to max_length

In [None]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)

# removing unneeded columns avoids warning in training loop
tokenized_imdb = tokenized_imdb.remove_columns("text")

In [11]:
tokenized_imdb

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 36000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 4000
    })
})

In [12]:
# view and decode a single sequence
print("Length:", len(tokenized_imdb['train'][0]['input_ids']))
print(tokenized_imdb['train'][0]['input_ids'])
print(params.tokenizer.decode(tokenized_imdb['train'][0]['input_ids']))

Length: 152
[0, 29287, 34717, 16372, 8, 4091, 352, 8337, 6, 3489, 6, 53, 5, 2410, 16, 182, 9215, 1193, 1630, 4, 345, 32, 5422, 9, 215, 49856, 4484, 14, 89, 16, 117, 5823, 11, 5, 2494, 4, 20, 754, 14, 5, 664, 16095, 34, 10, 9544, 2138, 6, 13, 4327, 6, 16, 98, 8541, 36040, 14, 38, 11491, 22597, 66, 7337, 4, 178, 5, 22, 991, 19187, 12, 6991, 32384, 2748, 113, 1302, 40585, 14963, 6, 350, 49069, 3809, 1589, 49007, 3809, 48709, 100, 218, 75, 216, 6, 600, 4, 318, 47, 33, 10, 367, 11121, 9, 3984, 8, 619, 101, 19448, 19, 402, 1256, 7, 356, 23, 19, 10, 367, 2342, 7904, 808, 29045, 5422, 6, 42, 16, 10, 1256, 205, 1569, 4, 440, 538, 1351, 15, 5, 233, 9, 5, 18754, 1552, 4, 125, 3108, 822, 6, 941, 3108, 5313, 6, 16, 2333, 203, 6, 203, 357, 87, 42, 4, 2]
<s>Beautifully photographed and ably acted, generally, but the writing is very slipshod. There are scenes of such unbelievability that there is no joy in the watching. The fact that the young lover has a twin brother, for instance, is so contrived th

## Define Data Collator

By creating a collator that collects and combines portions of our dataset, we can refine the way data is handed to the model. Specifically, `DataCollatorWithPadding` takes each batch of `params.batch_size` examples, pads every sequence in the batch to the length of the longest sequence in that batch, and converts the batch to tensors before handing it to the model. This process is called dynamic padding.

Dynamic padding makes forward passes more efficient because sequences are often shorter than `max_length`. If we instead padded our sequences during tokenization, every sequence given to the model would be `max_length` tokens long.

<p style="text-align: center;"><img src="presentation_resources/collate.png" width="750" align="center"><figcaption style="text-align: center;">Bednarski, M. (2022). Understand collate_fn in PyTorch [Blog post]. Retrieved from https://plainenglish.io/blog/understanding-collate-fn-in-pytorch-f9d1742647d3</figcaption></p>
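
The idea of dynamic padding can be sketched without any library. The toy collate function below pads a batch to its own longest sequence and compares the number of token positions with padding everything to a fixed `max_length` of 256. The function name and the two-example batch are hypothetical; the padding id of 1 is RoBERTa's.

```python
PAD_ID = 1  # RoBERTa's pad_token_id

def collate_dynamic(batch, pad_id=PAD_ID):
    """Pad every sequence in `batch` to the length of the longest one in that batch."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[0, 100, 4157, 42, 1569, 328, 2], [0, 31414, 2]]
out = collate_dynamic(batch)

max_length = 256
dynamic_cells = sum(len(row) for row in out["input_ids"])  # positions after dynamic padding
fixed_cells = len(batch) * max_length                      # positions if padded to max_length
print(dynamic_cells, "vs", fixed_cells)
```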

In [13]:
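# padding='longest' pads each batch to its longest sequence (dynamic padding);
# padding='max_length' would pad every batch to params.max_length instead
data_collator = DataCollatorWithPadding(tokenizer=params.tokenizer,
                                        padding='longest')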

## Define Training Metrics

If you would like to track metrics other than loss during training, specifically during the validation loop, you can define a metrics function as shown below. This function must pass predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate your chosen metrics.

This function will be given to the Trainer below.
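
For intuition about what such a function computes, here is a dependency-free sketch that turns logits into predictions with an argmax and then derives accuracy and binary F1, treating label 1 as the positive class. The logits and labels are made-up toy values, and the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(logits, labels, positive=1):
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    accuracy = correct / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
toy_labels = [0, 1, 1, 1]
acc, f1 = accuracy_and_f1(toy_logits, toy_labels)
print(acc, f1)
```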

In [13]:
def compute_metrics(eval_pred):
    metric1 = evaluate.load("accuracy")
    metric2 = evaluate.load("f1")
    
    # get prediction logits and true labels
    logits, labels = eval_pred
    
    # get predictions from logits
    predictions = np.argmax(logits, axis=-1)
    
    # pass predictions and true labels to metric functions
    accuracy = metric1.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = metric2.compute(predictions=predictions, references=labels)["f1"]
    
    return {"accuracy": accuracy, "f1": f1}

## Load & View Model

In [14]:
# Load the RobertaForSequenceClassification model
model = RobertaForSequenceClassification.from_pretrained('roberta-base',
                                                         num_labels = params.num_labels,
                                                         output_attentions = False,
                                                         output_hidden_states = False,
                                                         )

# view the model summary by passing dummy data of compatible shape
from torchinfo import summary
summary(model, input_size=(1, 512), dtypes=['torch.IntTensor'])

Layer (type:depth-idx)                                       Output Shape              Param #
RobertaForSequenceClassification                             [1, 2]                    --
├─RobertaModel: 1-1                                          [1, 512, 768]             --
│    └─RobertaEmbeddings: 2-1                                [1, 512, 768]             --
│    │    └─Embedding: 3-1                                   [1, 512, 768]             38,603,520
│    │    └─Embedding: 3-2                                   [1, 512, 768]             768
│    │    └─Embedding: 3-3                                   [1, 512, 768]             394,752
│    │    └─LayerNorm: 3-4                                   [1, 512, 768]             1,536
│    │    └─Dropout: 3-5                                     [1, 512, 768]             --
│    └─RobertaEncoder: 2-2                                   [1, 512, 768]             --
│    │    └─ModuleList: 3-6                                  --               

## Train

Before training, we define all of our training hyperparameters via [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` while all other arguments have default values set. Here, we define many of our own parameters.

We then instantiate the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) with our model, training arguments, dataset splits, tokenizer, collator and metrics function.

Finally, we start fine-tuning our model by calling [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train).
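
The Trainer names each saved checkpoint folder after the global optimizer step at which it was written (`checkpoint-<step>`), so it helps to know how many steps one epoch takes. The sketch below does that arithmetic for our 36,000 training rows. The batch sizes are hypothetical examples, not the value stored in `params.batch_size`, and the calculation assumes a single device with no gradient accumulation.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps in one epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

train_rows = 36_000
for batch_size in (8, 16, 32):  # hypothetical batch sizes
    print(batch_size, steps_per_epoch(train_rows, batch_size))
```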

In [15]:
training_args = TrainingArguments(
    disable_tqdm=False, # show training progress
    output_dir=params.output_dir, # "models/imdb_hf"
    learning_rate=params.learning_rate,
    optim="adamw_torch",
    per_device_train_batch_size=params.batch_size,
    per_device_eval_batch_size=params.batch_size,
    num_train_epochs=params.epochs,
    evaluation_strategy="epoch", # eval at each epoch
    logging_strategy="epoch", # show info each epoch  
    save_strategy="epoch", # save at each epoch
    load_best_model_at_end=True,
    use_mps_device=True, # use MPS, remove if using GPU
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["validation"],
    tokenizer=params.tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
# fit the model to our training data - fine-tuning
trainer.train()

<p style="text-align: left;"><img src="presentation_resources/hf_training.png"  align="left"></p>

## Inference

Once a model is fine-tuned, we can load the model, its tokenizer, pre-process new input and generate predictions.

### Imports for Inference


In [16]:
import torch
from tqdm import tqdm
from transformers import RobertaTokenizer
from transformers import TextClassificationPipeline
from transformers import AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

### Infer Manually w/HuggingFace
First, load our selected fine-tuned model and its tokenizer.

In [17]:
PATH = 'models/imdb_hf/checkpoint-4500'
model = AutoModelForSequenceClassification.from_pretrained(PATH, local_files_only=True)
tokenizer = RobertaTokenizer.from_pretrained(PATH, local_files_only=True)

loading configuration file models/imdb_hf/checkpoint-4500/config.json
Model config RobertaConfig {
  "_name_or_path": "models/imdb_hf/checkpoint-4500",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file models/imdb_hf/checkpoint-4500/pytorch_model.bin
All model checkpoint weights were used when initializing RobertaForSequenceClas

Next, we tokenize our test input.

In [18]:
# input = tokenizer("I hate this movie!", return_tensors="pt")
inputs = tokenizer(["I hate this movie!", "I love this movie!"], return_tensors="pt")

print(inputs)

{'input_ids': tensor([[   0,  100, 4157,   42, 1569,  328,    2],
        [   0,  100,  657,   42, 1569,  328,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1]])}


Forward those inputs through the model.

In [19]:
with torch.no_grad():
    logits = model(**inputs).logits

logits

tensor([[ 3.3273, -3.2752],
        [-3.2370,  3.5497]])

The classes are mapped by index. The model returns raw logits rather than probabilities; because softmax preserves ordering, the index with the highest logit is also the index with the highest probability, so it corresponds to our predicted label. In this case, 0 = "Negative (Sentiment)" and 1 = "Positive (Sentiment)".
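
To make the relationship between logits and probabilities concrete, the sketch below applies a softmax to the logits printed above using only the standard library. The `softmax` helper is written here for illustration; in practice `torch.softmax` does the same job on tensors.

```python
import math

def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [[3.3273, -3.2752], [-3.2370, 3.5497]]
predicted = []
for row in example_logits:
    probs = softmax(row)
    # argmax over probabilities picks the same index as argmax over logits
    predicted.append(max(range(len(probs)), key=probs.__getitem__))
    print([round(p, 4) for p in probs])

print(predicted)
```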

In [20]:
# loop through logits
for i, v in enumerate(logits):
    # get index of largest value
    predicted_class_id = v.argmax().item()
    # get & decode input, match with predicted_class_id
    print(tokenizer.decode(inputs['input_ids'][i]), predicted_class_id)

<s>I hate this movie!</s> 0
<s>I love this movie!</s> 1


### Infer with HuggingFace Pipeline

HuggingFace's pipeline method allows us to streamline the inference process further.

In [21]:
PATH = 'models/imdb_hf/checkpoint-4500'
model = AutoModelForSequenceClassification.from_pretrained(PATH, local_files_only=True)
tokenizer = RobertaTokenizer.from_pretrained(PATH, local_files_only=True)

# define pipeline
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, top_k=2, max_length=512, truncation=True)


In [22]:
pipe("i hate this movie")

[[{'label': 'LABEL_0', 'score': 0.9987990856170654},
  {'label': 'LABEL_1', 'score': 0.001200946164317429}]]
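
The pipeline reports generic names such as `LABEL_0`. A small, hypothetical helper like the one below can map the highest-scoring entry to a readable sentiment, assuming the 0 = negative, 1 = positive convention used in this notebook. It operates on a copy of the output shown above, so it runs without the model.

```python
LABEL_TO_SENTIMENT = {"LABEL_0": "Negative", "LABEL_1": "Positive"}

def top_prediction(scores):
    """Return (sentiment, score) for the highest-scoring entry of one pipeline result."""
    best = max(scores, key=lambda entry: entry["score"])
    return LABEL_TO_SENTIMENT[best["label"]], best["score"]

# one result as returned for "i hate this movie"
sample = [{'label': 'LABEL_0', 'score': 0.9987990856170654},
          {'label': 'LABEL_1', 'score': 0.001200946164317429}]

sentiment, score = top_prediction(sample)
print(sentiment, round(score, 4))
```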

### Run Whole Test Set & Evaluate
We will use the HuggingFace pipeline with a loop in order to generate predictions for our entire test set.

### Load Test Data


In [23]:
# Load Test Data
from datasets import load_from_disk

imdb = load_from_disk("data/inter_IMDB_sentiment/IMDb.hf")

imdb_test = imdb['test']

imdb_test

Dataset({
    features: ['text', 'label'],
    num_rows: 10000
})

In [24]:
# get sequences
test_input = imdb_test['text']

test_output = []

# pipe sequences to tokenizer -> model
with tqdm(test_input, unit="test") as prog:
    for step, test in enumerate(prog):
        prog.set_description(f"Test {step+1}")
        # append results to test_output list
        test_output.append(pipe(test)[0])

Test 10000: 100%|██████████| 10000/10000 [21:30<00:00,  7.75test/s]
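
The loop above sends one review at a time. Pipelines also accept a list of strings, so one option worth trying is to score the test set in chunks. The sketch below shows only the chunking step with a hypothetical `chunked` helper and made-up reviews; whether it speeds things up depends on hardware, and the shape of the returned list should be checked before reusing the parsing code that follows.

```python
def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

reviews = ["great film", "terrible plot", "fine acting", "too long", "loved it"]

for chunk in chunked(reviews, 2):
    # with the pipeline defined earlier, each chunk could be scored in one call,
    # for example: test_output.extend(pipe(chunk))
    print(len(chunk), chunk)
```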


In [25]:
print("Test Output Slice:")
test_output[:5]

Test Output Slice:


[[{'label': 'LABEL_0', 'score': 0.9918249845504761},
  {'label': 'LABEL_1', 'score': 0.008175036869943142}],
 [{'label': 'LABEL_0', 'score': 0.9990043044090271},
  {'label': 'LABEL_1', 'score': 0.0009957626461982727}],
 [{'label': 'LABEL_0', 'score': 0.9987474679946899},
  {'label': 'LABEL_1', 'score': 0.0012525408528745174}],
 [{'label': 'LABEL_0', 'score': 0.9980292916297913},
  {'label': 'LABEL_1', 'score': 0.001970750279724598}],
 [{'label': 'LABEL_1', 'score': 0.9984378218650818},
  {'label': 'LABEL_0', 'score': 0.0015621976926922798}]]

In [26]:
# parse target predictions to new list
predictions = []

for i in test_output:
    predictions.append(i[0]['label'])
    
print(len(predictions))
print(predictions[:10])

10000
['LABEL_0', 'LABEL_0', 'LABEL_0', 'LABEL_0', 'LABEL_1', 'LABEL_0', 'LABEL_0', 'LABEL_0', 'LABEL_0', 'LABEL_0']


In [27]:
# remove "LABEL_" and cast as int
for i, v in enumerate(predictions):
    predictions[i] = int(v.replace("LABEL_",""))

print(predictions[:10])

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]


In [28]:
# get accuracy and F1
acc = accuracy_score(imdb_test['label'], predictions)
f1 = f1_score(imdb_test['label'], predictions)

print("Accuracy: ", acc)
print("F1: ", f1)

Accuracy:  0.9506
F1:  0.951011503371678
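
Precision and recall, which were imported alongside accuracy and F1, can be read off the same predictions. The sketch below computes them from confusion-matrix counts on made-up labels, using only the standard library; `precision_score` and `recall_score` from scikit-learn report the same quantities for the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

# toy labels, not results from the model
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print("Precision:", precision)
print("Recall:", recall)
```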
