# Training LLMs - Fine Tuning Language Model on Semantic Tasks

In this assignment we are going to fine-tune an off-the-shelf pre-trained language model to recognize semantic similarity. To do this we will use the MRPC (Microsoft Research Paraphrase Corpus) task from the GLUE benchmark, a dataset of sentence pairs labeled by whether the two sentences are paraphrases of each other.

The goal for this assignment is to take an off-the-shelf language model that is already pre-trained and fine-tune it on a semantic similarity task. The language model that we are going to use is the base `roberta` model, which is larger than the original base `bert` model. `roberta` also made other changes to improve on `bert`, such as dynamic masking, the removal of the next sentence prediction task, and a byte-level BPE tokenizer with a larger vocabulary.

This assignment will walk you through the steps needed to accomplish this task. You will be asked to fill in various code blocks as we progress through the notebook. Follow the comments to see which code blocks you need to complete.

In [None]:
# let's first install the various libraries that we'll
# need for this assignment
!pip install -q peft datasets evaluate


In [None]:
! pip install transformers[torch]

In [None]:
# we are going to import
# the various methods and classes that we will use
# throughout the notebook.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)
from peft import (
    get_peft_config,
    get_peft_model,
    LoraConfig,
    TaskType
    )
from datasets import load_dataset
import evaluate
import torch
import numpy as np

# we are going to pull in the RoBERTa model
# which is a modification of BERT
model_name_or_path = "roberta-base"
# this is the specific GLUE task that
# we will pass to load_dataset.
# its data lets us study
# semantic similarity between sentence pairs.
task = "mrpc" # microsoft research paraphrase corpus

In [None]:
# let's load in the glue dataset with the MRPC task
dataset = load_dataset("glue", task)

In [None]:
# let's view the data set as a whole
dataset

In [None]:
# look at a few records of the train dataset.
# label indicates whether the two sentences are
# paraphrases of each other

## code here


As is typical when fine-tuning language models, we need a function that computes evaluation metrics during training. For this we are going to use the metric that is defined for this task in the GLUE benchmark.

For more information on the GLUE metric and datasets, please view this link: [GLUE](https://huggingface.co/spaces/evaluate-metric/glue).
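For MRPC, the GLUE metric reports accuracy and F1. As a minimal, library-free sketch of what those two numbers mean, the hypothetical helper `accuracy_and_f1` below computes them by hand; it is an illustration only, not the implementation used by `evaluate`.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Compute accuracy and binary F1 by hand (hypothetical helper, for illustration)."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)

    # counts for the positive ("equivalent") class
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# one wrong prediction and one correct prediction
scores = accuracy_and_f1(predictions=[1, 1], references=[0, 1])
print(scores)
```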

In [None]:
# at this point we are going to load in the metric
# that we should be using when evaluating the MRPC dataset.
# we will use this as part of computing metrics
metric = evaluate.load("glue", task)

In [None]:
# let's visualize what this metric
# looks like.
metric

In [None]:
# we can see an example here
# comparing references and predictions.
# this metric is how we will measure
# model quality on the validation set during
# training (it is separate from the loss)
references = [0, 1]
predictions = [1, 1]
results = metric.compute(predictions=predictions, references=references)
print(results)

In [None]:
# go ahead and write a compute_metrics
# function that will take an eval_pred
# object and return the metric calculation
# of the predictions vs the labels.

## code here


In [None]:
# load in the requisite tokenizer for the RoBERTa model
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="right")
if getattr(tokenizer, "pad_token_id") is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# write a function that will take in a batch (or example)
# and tokenize both the first sentence and second sentence
# make sure to truncate the text and don't worry about the max length for now.

## code here


In [None]:
# take your just written tokenize function
# and tokenize the entire dataset that we
# pulled in at the beginning

## code here
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["idx", "sentence1", "sentence2"],
)

# afterwards, rename the "label" feature as "labels"

## code here


In [None]:
# view the first example of
# your tokenized dataset to see what it looks
# like.
tokenized_datasets['train'][0]

In [None]:
# take the input ids from any example
# that you choose and run them through the following code.
# what do you notice?
example_input_ids = tokenized_datasets['train'][0]['input_ids']
tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(example_input_ids))

In [None]:
# let's now make a DataCollator object that will dynamically pad
# our inputs using the tokenizer.
# ideally we want to pad to the longest sequence in each batch.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="longest")

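To make "dynamic padding" concrete, here is a minimal, library-free sketch of padding a batch to its own longest sequence. The helper `pad_to_longest` and the token ids are hypothetical, and `pad_token_id=1` is an assumption about the tokenizer; this is not the `DataCollatorWithPadding` implementation.

```python
def pad_to_longest(batch_input_ids, pad_token_id=1):
    """Right-pad every sequence to the longest length found in this batch only."""
    longest = max(len(ids) for ids in batch_input_ids)
    padded, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        padded.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": attention_mask}


# made-up token ids of different lengths
batch = pad_to_longest([[0, 713, 2], [0, 713, 16, 10, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is recomputed per batch, a batch of short sentences is never padded out to the length of the longest sentence in the whole dataset.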
## Fine Tuning Language Model
Let's first fine-tune the full language model. We will then compare it to fine-tuning with PEFT (parameter-efficient fine-tuning) and note any major differences.

In [None]:
# pull in the RoBERTa model
# remember to use AutoModelForSequenceClassification
# class because we are going to be classifying on a known label.
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, return_dict=True)

In [None]:
## find the number of trainable parameters that this model
## will use when fine tuning

## code here


In [None]:
## using the TrainingArguments class
## choose the best hyperparameters for
## fine tuning the language model.
## NOTE: Because of the size of the RoBERTa model,
## it may help to set logging_steps to around 100.

## code here
training_args =

In [None]:
# put your model, training args, datasets,
# tokenizer, data collator, and metrics into a Trainer object
# and then begin the fine tuning process

## code here
trainer =

# train the model here
## code here


In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
import matplotlib.pyplot as plt
# here is the plot_confusion_matrix helper from before.
# let's use it to plot a confusion matrix
# of our predicted labels against the true labels
def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()

# get the predictions and test data here
y_preds = np.argmax(trainer.predict(tokenized_datasets["test"]).predictions, axis=1)
y_test = tokenized_datasets["test"]['labels']

In [None]:
# plot the confusion matrix
## code here


## Train the LoRA variation of the RoBERTa Model
In this section we are going to fine tune the language model using the LoRA configuration. It will follow a similar procedure as done before.
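LoRA freezes the original weight matrices and trains a pair of small low-rank matrices alongside each adapted one, which is why far fewer parameters are trainable. The arithmetic below is a rough sketch under stated assumptions: a hidden size of 768, 12 layers, adapters on two projection matrices per layer, and an illustrative rank of 8. None of these values are prescribed by the assignment, and any classification head that is also trained adds to the total.

```python
def lora_added_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


hidden_size = 768      # assumption: hidden width of a base-sized encoder
num_layers = 12        # assumption: number of transformer layers
adapted_per_layer = 2  # assumption: adapt two projection matrices per layer
rank = 8               # assumption: illustrative rank, not a recommendation

per_matrix = lora_added_params(hidden_size, hidden_size, rank)
total_added = per_matrix * adapted_per_layer * num_layers
full_matrix = hidden_size * hidden_size  # size of one frozen square weight matrix

print(f"LoRA parameters per adapted matrix: {per_matrix:,}")
print(f"LoRA parameters in total:           {total_added:,}")
print(f"One full weight matrix:             {full_matrix:,}")
```

Under these assumptions each adapter is a small fraction of the matrix it modifies, and doubling the rank doubles the number of adapter parameters.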

In [None]:
# let us now set up the LoRA configuration
# of the model.
# in order to do this use LoraConfig and choose an appropriate rank

## code here



In [None]:
# now create the pretrained model using the RoBERTa path
# and wrap it with the LoRA configuration.
# afterwards print out the number of trainable parameters.
# what do you notice compared with the original RoBERTa model?

## code here


In [None]:
# in a similar way as before,
# write out the training arguments that you wish
# to use for the LoRA configuration of RoBERTa.

## code here
training_args =

In [None]:
# as before, create the Trainer object
# with your training arguments, LoRA model,
# datasets, tokenizer, data collator, and compute metrics

## code here
trainer =

# train the model
## code here


In [None]:
# get the predictions on the test set
# as well as the test target labels.
# lastly plot the confusion matrix

## code here
y_preds = np.argmax(trainer.predict(tokenized_datasets["test"]).predictions, axis=1)
y_test = tokenized_datasets["test"]['labels']
labels = ["not equivalent", "equivalent"]
plot_confusion_matrix(y_preds, y_test, labels)

Now we are ready to begin testing our LoRA model on test data that we generate ourselves. We are providing two sentences and the code in order to determine semantic similarity. Afterwards, you can test some yourselves.
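The provided `get_preds` function turns the model's raw logits into percentages with a softmax. As a minimal, library-free sketch of that step (the logits here are made up for illustration and do not come from the trained model):

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # subtracting the max keeps exp() numerically stable without changing the result
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["not equivalent", "equivalent"]
example_logits = [-1.0, 2.0]  # hypothetical logits for a single sentence pair

probs = softmax(example_logits)
for name, p in zip(classes, probs):
    print(f"{name}: {p:.1%}")
```

The class with the larger logit always receives the larger probability, so taking the argmax of the logits and of the probabilities gives the same prediction.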

In [None]:
# let's look at a specific example
# and see what the trained model will do on two samples
# that are not necessarily in the training or validation data

## here is the code to take in two sentences, tokenize them,
## and generate the model's logit outputs. Lastly, we are going to generate
## prediction probabilities for each class.
def get_preds(sentence1, sentence2, classes=["not equivalent", "equivalent"]):
  inputs = tokenizer(sentence1,
                     sentence2,
                     truncation=True,
                     padding="longest",
                     return_tensors="pt").to(trainer.model.device)
  with torch.no_grad():
    outputs = trainer.model(**inputs).logits
    print(outputs)

  paraphrased_text = torch.softmax(outputs, dim=1).tolist()[0]
  for i in range(len(classes)):
      print(f"{classes[i]}: {int(round(paraphrased_text[i] * 100))}%")


## here are two sentences and we'd like to understand
## if the two sentences are equivalent
sentence1 = "Coast redwood trees are the tallest trees on the planet and can grow over 300 feet tall."
sentence2 = "The coast redwood trees, which can attain a height of over 300 feet, are the tallest trees on earth."

## run the get_preds function with these two sentences
## code here


In [None]:
## go ahead and choose two sentences and
## check if they are semantically equivalent.
## what do you notice about the sentences you chose?

