<h1 style="font-size:300%;">AI4EduRes'2023: <br />Fine-Tuning RoBERTa for Downstream Tasks</h1>
<p>This notebook shows how to fine-tune RoBERTa for a downstream classification task with HuggingFace.</p>

# 📽️ IMDb Review Sentiment Classification

## Platform Check
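Confirm that we're on an ARM (Apple Silicon) environment, since training below runs on the MPS device.

NOTE: If you are not on an ARM environment, update `params.device` to `torch.device('cuda' if torch.cuda.is_available() else 'cpu')` and remove `use_mps_device=True` from the training arguments.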

In [1]:
import platform

# compare the machine type rather than the full platform string, which changes with every OS update
if platform.machine() == 'arm64':
    print(f"We're Armed: {platform.platform()}")
else:
    print(f"WARNING! NOT ARMED: {platform.platform()}")

We're Armed: macOS-13.0-arm64-i386-64bit


## Imports & Settings

In [2]:
import params

import evaluate
import numpy as np
from datasets import load_from_disk
from transformers import DataCollatorWithPadding

from transformers import RobertaForSequenceClassification, TrainingArguments, Trainer

# suppress model warning
from transformers import logging
logging.set_verbosity_error()

In [3]:
# Adjust parameters if necessary
params.num_labels = 2
params.output_dir = "models/imdb_hf"
params.max_length = 256

## Load Data

### IMDb

For this notebook, I've prepared a train/validation/test split of the IMDb movie review dataset, mapped the string sentiment labels to binary integers, and renamed the columns to "text" and "label". The result is saved as a HuggingFace dataset object to demonstrate the HuggingFace framework. Here, I load the dataset from a local directory.

HuggingFace also offers a large selection of datasets that can be loaded from their remote repository using `from datasets import load_dataset`.

In [4]:
imdb = load_from_disk("data/inter_IMDB_sentiment/IMDb.hf")



In [5]:
imdb

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 36000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 10000
    })
})

In [6]:
# View Example
imdb["train"][0]

{'text': 'Beautifully photographed and ably acted, generally, but the writing is very slipshod. There are scenes of such unbelievability that there is no joy in the watching. The fact that the young lover has a twin brother, for instance, is so contrived that I groaned out loud. And the "emotion-light bulb connection" seems gimmicky, too.<br /><br />I don\'t know, though. If you have a few glasses of wine and feel like relaxing with something pretty to look at with a few flaccid comedic scenes, this is a pretty good movie. No major effort on the part of the viewer required. But Italian film, especially Italian comedy, is usually much, much better than this.',
 'label': 0}

## Preprocess
Below, we prepare our input text sequences to be accepted by the model. This involves tokenization and encoding of our sequences.

<b>Tokenization</b> :  Splitting strings into word or sub-word token strings <br />
<b>Encoding</b> : Converting those token strings into integers<br />

The preprocessing function will tokenize and encode our strings according to RoBERTa's pre-defined tokenization and encoding dictionary. When used in conjunction with HuggingFace Datasets and the mapping function, the result is the addition of token/input_ids and attention mask columns to our Dataset:

<b>token/input ids</b> : A list of integers that represent our tokenized and encoded string<br />
<b>attention masks</b> : A list of 1's and 0's, one per token id. A 1 marks an id the model should attend to and a 0 marks an id the model should ignore (padding tokens, for example).
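To make these terms concrete, here is a toy sketch in plain Python. It is not RoBERTa's tokenizer: the whitespace splitting and the tiny `TOY_VOCAB` are made up for illustration (only the special-token ids 0, 1, 2 and 3 follow RoBERTa's convention), and `toy_encode` is a hypothetical helper.

```python
# Toy illustration only: whitespace tokenization with a made-up vocabulary,
# NOT RoBERTa's byte-level BPE tokenizer.
TOY_VOCAB = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
             "i": 4, "love": 5, "hate": 6, "this": 7, "movie": 8}


def toy_encode(text, max_length):
    """Tokenize, encode and truncate one string; return ids and attention mask."""
    tokens = text.lower().split()  # tokenization: string -> token strings
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["<unk>"]) for tok in tokens]  # encoding
    ids = ids[: max_length - 2]  # truncation, leaving room for <s> and </s>
    ids = [TOY_VOCAB["<s>"]] + ids + [TOY_VOCAB["</s>"]]
    # no padding happens here, so every position is a real token
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


encoded = toy_encode("I love this movie", max_length=5)
print(encoded)
```

With `max_length=5`, the last word is cut off so that the start and end markers still fit, which mirrors what `truncation=True` does in the real tokenizer.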

Note that the tokenizer in our preprocessing function only truncates sequences longer than `max_length`. Sequences shorter than `max_length` still need to be padded, which the tokenizer could also do. However, padding at this point would pad every sequence to `max_length`, which is unnecessary and computationally inefficient, as explained in the data collator section below.

In [7]:
def preprocess_function(examples):
    return params.tokenizer(examples["text"], 
                            max_length = params.max_length, # 256 in this example
                            truncation=True) # truncate sequences longer than max_length to max_length

In [8]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)

# removing unneeded columns avoids warning in training loop
tokenized_imdb = tokenized_imdb.remove_columns("text")

In [9]:
tokenized_imdb

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 36000
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 10000
    })
})

## Define Data Collator

A data collator controls how individual examples are combined into a batch before they reach the model. Batches of a pre-defined size (`params.batch_size`) are drawn from the dataset, and `DataCollatorWithPadding` pads all sequences in a batch to the length of the longest sequence in that batch, then converts the batch to tensors. This process is called dynamic padding.

Dynamic padding makes the forward pass more efficient because most batches end up shorter than `max_length`. If we instead padded during tokenization, every sequence given to the model would be `max_length` tokens long.
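The saving can be sketched in plain Python. This is a simplified stand-in for what a padding collator does, not the library's implementation; `pad_batch` is a hypothetical helper, and the pad id of 1 is RoBERTa's `pad_token_id`.

```python
PAD_ID = 1        # RoBERTa's pad_token_id
MAX_LENGTH = 256  # same value as params.max_length in this notebook


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest sequence in this batch."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# two already-tokenized sequences of different lengths
batch = [[0, 100, 657, 42, 1569, 328, 2], [0, 100, 2]]
padded = pad_batch(batch)

dynamic_positions = sum(len(row) for row in padded["input_ids"])
static_positions = len(batch) * MAX_LENGTH
print(padded["attention_mask"])
print(f"dynamic: {dynamic_positions} positions, static: {static_positions} positions")
```

For this small batch, dynamic padding feeds the model 14 token positions where padding to `max_length` would feed it 512.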

In [10]:
# padding='longest' pads each batch to its longest sequence (dynamic padding)
data_collator = DataCollatorWithPadding(tokenizer=params.tokenizer,
                                        padding='longest')

## Define Training Metrics

If you would like to track metrics other than loss during training, particularly on the validation set, define a metrics function as shown below. This function must pass predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate your chosen metrics.

This function will be given to the Trainer below.

In [11]:
# load the metrics once rather than on every evaluation call
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # the predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
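For readers who want to see what these two metrics measure, here is a minimal plain-Python sketch of accuracy and binary F1 computed from logits. It is an illustration, not the `evaluate` implementation; `accuracy_and_f1` and the sample logits are made up for the example.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy and binary F1 (for the `positive` class) from raw logits."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# made-up logits for four reviews and their true labels
sample_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.3], [-0.2, 0.4]]
sample_labels = [0, 1, 1, 1]
metrics = accuracy_and_f1(sample_logits, sample_labels)
print(metrics)
```

Here three of four predictions are correct, and one positive review is missed, so precision is perfect while recall is not.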

## Load & View Model

In [12]:
# Load the RobertaForSequenceClassification model
model = RobertaForSequenceClassification.from_pretrained('roberta-base',
                                                         num_labels = params.num_labels,
                                                         output_attentions = False,
                                                         output_hidden_states = False,
                                                         )

# view the model summary by passing dummy data of compatible shape
from torchinfo import summary
summary(model, input_size=(1, 512), dtypes=['torch.IntTensor'])

Layer (type:depth-idx)                                       Output Shape              Param #
RobertaForSequenceClassification                             [1, 2]                    --
├─RobertaModel: 1-1                                          [1, 512, 768]             --
│    └─RobertaEmbeddings: 2-1                                [1, 512, 768]             --
│    │    └─Embedding: 3-1                                   [1, 512, 768]             38,603,520
│    │    └─Embedding: 3-2                                   [1, 512, 768]             768
│    │    └─Embedding: 3-3                                   [1, 512, 768]             394,752
│    │    └─LayerNorm: 3-4                                   [1, 512, 768]             1,536
│    │    └─Dropout: 3-5                                     [1, 512, 768]             --
│    └─RobertaEncoder: 2-2                                   [1, 512, 768]             --
│    │    └─Mo

## Train

Before training, we define all of our training hyperparameters via [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir` while all other arguments have default values set. Here, we define many of our own parameters.

We then instantiate the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) with our model, training arguments, dataset splits, tokenizer, collator and metrics function.

Finally, we start fine-tuning our model by calling [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train).

In [13]:
training_args = TrainingArguments(
    disable_tqdm=False, # show training progress
    output_dir=params.output_dir, # "models/imdb_hf"
    learning_rate=params.learning_rate,
    optim="adamw_torch",
    per_device_train_batch_size=params.batch_size,
    per_device_eval_batch_size=params.batch_size,
    num_train_epochs=params.epochs,
    evaluation_strategy="epoch", # eval at each epoch
    logging_strategy="epoch", # show info each epoch  
    save_strategy="epoch", # save at each epoch
    load_best_model_at_end=True, # reload the best checkpoint (lowest validation loss by default) when training ends
    use_mps_device=True, # use MPS, remove if using GPU
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["validation"],
    tokenizer=params.tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
# fit the model to our training data - fine-tuning
trainer.train()
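With `save_strategy="epoch"`, the Trainer writes one `checkpoint-<global_step>` directory per epoch under `output_dir`. The sketch below shows how those step numbers arise. It assumes a single device, no gradient accumulation, and a batch size of 16; that batch size is hypothetical, since `params.batch_size` is not shown in this notebook, and `steps_per_epoch` is an illustrative helper.

```python
import math


def steps_per_epoch(num_rows, batch_size):
    """Optimizer steps needed to see every training example once
    (single device, no gradient accumulation, last partial batch kept)."""
    return math.ceil(num_rows / batch_size)


TRAIN_ROWS = 36_000      # size of the train split shown earlier
ASSUMED_BATCH_SIZE = 16  # hypothetical: params.batch_size is not shown here

per_epoch = steps_per_epoch(TRAIN_ROWS, ASSUMED_BATCH_SIZE)
for epoch in (1, 2, 3):
    print(f"epoch {epoch} -> checkpoint-{epoch * per_epoch}")
```

Under this assumed batch size, each epoch takes 2,250 steps. A different batch size changes the step numbers in the checkpoint names accordingly.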

<p style="text-align: left;"><img src="presentation_resources/hf_training.png"  align="left"></p>

## Inference

Once a model is fine-tuned, we can load the model, its tokenizer, pre-process new input and generate predictions.

### Imports for Inference


In [14]:
import torch
from tqdm import tqdm
from transformers import RobertaTokenizer
from transformers import TextClassificationPipeline
from transformers import AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

### Infer Manually w/HuggingFace
First, load our selected fine-tuned model and its tokenizer.

In [15]:
PATH = 'models/imdb_hf/checkpoint-4500'
model = AutoModelForSequenceClassification.from_pretrained(PATH, local_files_only=True)
tokenizer = RobertaTokenizer.from_pretrained(PATH, local_files_only=True)

loading configuration file models/imdb_hf/checkpoint-4500/config.json
Model config RobertaConfig {
  "_name_or_path": "models/imdb_hf/checkpoint-4500",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file models/imdb_hf/checkpoint-4500/pytorch_model.bin
All model checkpoint weights were used when initializing RobertaForSequenceClas

Next, we tokenize our test input.

In [16]:
# padding=True pads the batch to its longest sequence so inputs of different lengths can share one tensor
inputs = tokenizer(["I hate this movie!", "I love this movie!"], padding=True, return_tensors="pt")

print(inputs)

{'input_ids': tensor([[   0,  100, 4157,   42, 1569,  328,    2],
        [   0,  100,  657,   42, 1569,  328,    2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1]])}


Forward those inputs through the model.

In [17]:
with torch.no_grad():
    logits = model(**inputs).logits

logits

tensor([[ 3.3273, -3.2752],
        [-3.2370,  3.5497]])

The classes are mapped by index. The model returns raw logits rather than probabilities, but because softmax preserves their ordering, the index with the highest logit corresponds to our predicted label. In this case, 0 = "Negative (Sentiment)" and 1 = "Positive (Sentiment)".

In [18]:
# loop through logits
for i, v in enumerate(logits):
    # get index of largest value
    predicted_class_id = v.argmax().item()
    # get & decode input, match with predicted_class_id
    print(tokenizer.decode(inputs['input_ids'][i]), predicted_class_id)

<s>I hate this movie!</s> 0
<s>I love this movie!</s> 1
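If probabilities are wanted instead of raw logits, a softmax can be applied afterwards. The plain-Python sketch below uses the logits printed above and checks that the predicted class is the same before and after softmax; the `softmax` helper and `ID2LABEL` mapping are written here for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits to probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


ID2LABEL = {0: "Negative", 1: "Positive"}

# logits copied from the model output above
batch_logits = [[3.3273, -3.2752], [-3.2370, 3.5497]]

predicted_ids = []
for row in batch_logits:
    probs = softmax(row)
    from_logits = max(range(len(row)), key=row.__getitem__)
    from_probs = max(range(len(probs)), key=probs.__getitem__)
    assert from_logits == from_probs  # softmax does not change the ranking
    predicted_ids.append(from_logits)
    print(ID2LABEL[from_logits], [round(p, 4) for p in probs])
```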


### Infer with HuggingFace Pipeline

HuggingFace's pipeline wraps tokenization, the forward pass, and post-processing of the scores in a single call, which streamlines the inference process further.

In [21]:
PATH = 'models/imdb_hf/checkpoint-4500'
model = AutoModelForSequenceClassification.from_pretrained(PATH, local_files_only=True)
tokenizer = RobertaTokenizer.from_pretrained(PATH, local_files_only=True)

# define pipeline
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, top_k=2, max_length=512, truncation=True)

In [22]:
pipe("i hate this movie")

[[{'label': 'LABEL_0', 'score': 0.9987990856170654},
  {'label': 'LABEL_1', 'score': 0.001200946164317429}]]
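The `LABEL_0` and `LABEL_1` names are generic, so it can help to map them back to the sentiment classes. The sketch below does this in plain Python on the output shown above; `ID2LABEL` and `top_prediction` are illustrative names, and the mapping follows the convention stated earlier (0 = negative, 1 = positive).

```python
ID2LABEL = {0: "negative", 1: "positive"}

# output copied from the pipeline call above
pipeline_output = [[{'label': 'LABEL_0', 'score': 0.9987990856170654},
                    {'label': 'LABEL_1', 'score': 0.001200946164317429}]]


def top_prediction(scores):
    """Return (class_id, readable_name, score) for the highest-scoring entry."""
    best = max(scores, key=lambda entry: entry["score"])
    class_id = int(best["label"].split("_")[-1])  # "LABEL_0" -> 0
    return class_id, ID2LABEL[class_id], best["score"]


class_id, name, score = top_prediction(pipeline_output[0])
print(class_id, name, round(score, 4))
```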

### Run Whole Test Set & Evaluate
We will call the HuggingFace pipeline in a loop to generate predictions for our entire test set.

### Load Test Data


In [None]:
# Load Test Data
imdb = load_from_disk("data/inter_IMDB_sentiment/IMDb.hf")

imdb_test = imdb['test']

imdb_test

In [None]:
# get sequences
test_input = imdb_test['text']

test_output = []

# pipe sequences to tokenizer -> model
with tqdm(test_input, unit="test") as prog:
    for step, test in enumerate(prog):
        prog.set_description(f"Test {step+1}")
        # append results to test_output list
        test_output.append(pipe(test)[0])

In [None]:
print("Test Output Slice:")
test_output[:5]

In [None]:
# parse target predictions to new list
predictions = []

for i in test_output:
    predictions.append(i[0]['label'])
    
print(len(predictions))
print(predictions[:10])

In [None]:
# remove "LABEL_" and cast as int
for i, v in enumerate(predictions):
    predictions[i] = int(v.replace("LABEL_",""))

print(predictions[:10])

In [None]:
# get accuracy and F1
acc = accuracy_score(imdb_test['label'], predictions)
f1 = f1_score(imdb_test['label'], predictions)

print("Accuracy: ", acc)
print("F1: ", f1)