The data for this Jupyter Notebook is sourced from the following web page: [GitHub - Lora for sequence classification with Roberta-Llama-Mistral](https://github.com/mehdiir/Roberta-Llama-Mistral/blob/main/Lora-for-sequence-classification-with-Roberta-Llama-Mistral.md)


In [2]:
MAX_LEN = 512 
roberta_checkpoint = "roberta-large"
mistral_checkpoint = "mistralai/Mistral-7B-v0.1"
llama_checkpoint = "meta-llama/Llama-2-7b-hf"

## Data preperation

### Read in the training data from csv

In [3]:
import pandas as pd
import os
DATA_PATH = "../data/"
train_df=pd.read_csv(os.path.join(DATA_PATH, 'enron_labeled_curated_train.csv'))
test_df=pd.read_csv(os.path.join(DATA_PATH, 'enron_labeled_curated_test.csv'))
# dummy target column for merge test and train into one huggingface data
test_df['target'] = 0 
train_df.info()
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2236 entries, 0 to 2235
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   search_phrase    2236 non-null   object 
 1   label            2236 non-null   int64  
 2   email            2236 non-null   object 
 3   mistral_pred     2236 non-null   float64
 4   openhermes_pred  2236 non-null   float64
 5   vicuna_pred      2236 non-null   float64
 6   gemma_pred       2236 non-null   float64
dtypes: float64(4), int64(1), object(2)
memory usage: 122.4+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 559 entries, 0 to 558
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   search_phrase    559 non-null    object 
 1   label            559 non-null    int64  
 2   email            559 non-null    object 
 3   mistral_pred     559 non-null    float64
 4   openhermes_pred  559 non-null  

As the classes are not balanced, we will compute the positive and negative weights and use them for loss calculation later:

In [4]:
print(train_df.label.value_counts())
pos_weights = len(train_df) / (2 * train_df.label.value_counts()[1])
neg_weights = len(train_df) / (2 * train_df.label.value_counts()[0])
POS_WEIGHT, NEG_WEIGHT = (pos_weights, neg_weights)
print(POS_WEIGHT, NEG_WEIGHT)

label
0    1918
1     318
Name: count, dtype: int64
3.5157232704402515 0.5828988529718456


In [5]:
##Then, we compute the maximum length of the column text:
# Number of Characters
max_char=train_df['email'].str.len().max()
# Number of Words
max_words = train_df['email'].str.split().str.len().max()
print(f"The maximum number of character is {max_char}.")
print(f"The maximum number of word is {max_words}.")

The maximum number of character is 2983.
The maximum number of word is 650.


In [6]:
#Now, let's convert the dataframe into HuggingFace dataset format, 
#split it into training and validation datasets, add the test dataset and save the three files:


from datasets import Dataset
#Convert the training dataframe to HuggingFace dataset
dataset = Dataset.from_pandas(train_df)
#Split the dataset into training and validation datasets
data = dataset.train_test_split(train_size=0.8, seed=42)
#Rename the default "test" split to "validation"
data['val'] = data.pop("test")
#Convert the test dataframe to HuggingFace dataset and add it into the first dataset
data['test'] = Dataset.from_pandas(test_df)
#Save
data.save_to_disk("processed_hf")

  from .autonotebook import tqdm as notebook_tqdm
Saving the dataset (1/1 shards): 100%|██████████| 1788/1788 [00:00<00:00, 93701.70 examples/s] 
Saving the dataset (1/1 shards): 100%|██████████| 448/448 [00:00<00:00, 59909.08 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 559/559 [00:00<00:00, 128591.89 examples/s]


The data comprises a keyword, a location and the text of the tweet. For sake of simplicity, we select the text feature only as the input to the LLM.

At this stage, we prepared the train, validation, and test sets in the HuggingFace format expected by the pre-trained LLMs. The next step is to define the tokenized dataset for training using the appropriate tokenizer to transform the text feature into two Tensors of sequence of token ids and attention masks. As each model has its specific tokenizer, we will need to define three different datasets.

We start by defining the RoBERTa dataloader:

Load the tokenizer:

In [10]:
from transformers import AutoTokenizer
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta_checkpoint, add_prefix_space=True)


#Define the preprocessing function for converting one row of the dataframe:
def roberta_preprocessing_function(examples):
    return roberta_tokenizer(examples['email'], truncation=True, max_length=MAX_LEN)

#By applying the preprocessing function to the first example of our training dataset, we can see that we have inputs ids and an attention mask:
roberta_preprocessing_function(data['train'][0])

{'input_ids': [0, 520, 327, 9, 2271, 2839, 12662, 31, 68, 6232, 656, 42, 76, 7, 15655, 4276, 5214, 6, 1583, 9, 5, 138, 18, 1321, 685, 45, 95, 49, 1315, 53, 67, 7458, 5214, 620, 9, 5, 923, 9, 49, 17936, 1640, 330, 43, 3832, 2349, 4, 286, 5, 674, 12735, 5214, 1942, 6, 2271, 2839, 388, 4625, 130, 12, 41830, 15561, 9, 17936, 1640, 330, 43, 1781, 6, 8, 5, 1007, 740, 5214, 7474, 3785, 18, 24053, 480, 71, 14613, 9, 12030, 6, 1153, 15381, 6, 5457, 36617, 154, 3464, 480, 21, 10, 1081, 28857, 1571, 4, 5214, 844, 5975, 3770, 6, 217, 1557, 12274, 4, 4810, 7803, 254, 9, 886, 8, 5214, 4160, 2812, 329, 833, 9, 188, 3123, 6, 32, 6404, 11, 7, 1744, 97, 1791, 31, 5457, 42116, 12884, 4, 178, 270, 3516, 34, 2740, 22, 102, 714, 1551, 7, 27067, 5214, 3894, 82, 18, 15131, 72, 125, 168, 6530, 74, 129, 6581, 10, 385, 5214, 8395, 1827, 1114, 35, 14, 867, 4395, 75, 4649, 5, 6976, 9, 49, 308, 5044, 8539, 5214, 1790, 4, 7939, 18, 28, 699, 35, 2271, 2839, 4585, 8, 751, 9818, 9314, 54, 15005, 7, 867, 5214, 480, 217,

In [7]:
data['train'][0]

{'search_phrase': 'it appears that some Enron employees used dummy accounts and rigged valuation methodologies to create false profit-and-loss entries for the derivatives',
 'label': 0,
 'email': 'When shares of Enron plunged from $84 earlier this year to practically zero=, thousands of the company\'s employees lost not just their jobs but also mo=st of the value of their 401(k) retirement accounts. For the average employ=ee, Enron stock represented three-fifths of 401(k) assets, and the energy c=ompany\'s meltdown -- after revelations of misleading, probably fraudulent, =accounting practices -- was a personal calamity.=20Now politicians, including Democratic Sens. Barbara Boxer of California and= Jon Corzine of New Jersey, are rushing in to protect other Americans from =similar disasters. And President Bush has ordered "a policy review to prote=ct people\'s pensions." But government intervention would only introduce a d=angerous idea: that investors shouldn\'t bear the burden of their

In [11]:
#Now, let's apply the preprocessing function to the entire dataset:
col_to_delete = ['search_phrase', 'mistral_pred','openhermes_pred', 'vicuna_pred', 'gemma_pred']
#Apply the preprocessing function
roberta_tokenized_datasets = data.map(roberta_preprocessing_function, batched=False)
#Remove the undesired columns
roberta_tokenized_datasets = roberta_tokenized_datasets.remove_columns(col_to_delete)
#Rename the target to label as for HugginFace standards
roberta_tokenized_datasets = roberta_tokenized_datasets.rename_column("email", "text")
#Set to torch format
roberta_tokenized_datasets.set_format("torch")

Map: 100%|██████████| 1788/1788 [00:01<00:00, 992.98 examples/s] 
Map: 100%|██████████| 448/448 [00:00<00:00, 856.64 examples/s] 
Map: 100%|██████████| 559/559 [00:00<00:00, 998.74 examples/s] 


In [12]:
#We can have a look into our tokenized training dataset:
roberta_tokenized_datasets['train'][0]

{'label': tensor(0),
 'text': 'When shares of Enron plunged from $84 earlier this year to practically zero=, thousands of the company\'s employees lost not just their jobs but also mo=st of the value of their 401(k) retirement accounts. For the average employ=ee, Enron stock represented three-fifths of 401(k) assets, and the energy c=ompany\'s meltdown -- after revelations of misleading, probably fraudulent, =accounting practices -- was a personal calamity.=20Now politicians, including Democratic Sens. Barbara Boxer of California and= Jon Corzine of New Jersey, are rushing in to protect other Americans from =similar disasters. And President Bush has ordered "a policy review to prote=ct people\'s pensions." But government intervention would only introduce a d=angerous idea: that investors shouldn\'t bear the burden of their own decisi=ons.Let\'s be clear: Enron executives and outside auditors who lied to investors= -- including their own 21,000 employees -- bear an enormous responsibili

In [19]:
# Data collator for padding a batch of examples to the maximum length seen in the batch
from transformers import DataCollatorWithPadding
roberta_data_collator = DataCollatorWithPadding(tokenizer=roberta_tokenizer)

## RoBERTa

A. Load checkpoints for the classfication model

We load the pre-trained RoBERTa model with a sequence classification header using the HuggingFace AutoModelForSequenceClassification class:

In [13]:
from transformers import AutoModelForSequenceClassification 
# Load a pre-trained model with a sequence classification header 
roberta_model = AutoModelForSequenceClassification.from_pretrained(roberta_checkpoint, num_labels=2)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### B. LoRa setup for RoBERTa classifier

We import LoRa configuration and set some parameters for RoBERTa classifier:

TaskType: Sequence classification
r(rank): Rank for our decomposition matrices

lora_alpha: Alpha parameter to scale the learned weights. LoRA paper advises fixing alpha at 16

lora_dropout: Dropout probability of the LoRA layers

bias: Whether to add bias term to LoRa layers

In the code below, we use the values recommended by the Lora paper.

Note that in this blog, we will perform a hyperparameters tunning with wandb.


In [14]:
from peft import get_peft_model, LoraConfig, TaskType
roberta_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none",
)
roberta_model = get_peft_model(roberta_model, roberta_peft_config)
roberta_model.print_trainable_parameters()

trainable params: 2,299,908 || all params: 356,610,052 || trainable%: 0.6449363911929212


## Setup the trainer

### Evaluation Metrics

First, we define the perfomance metrics we will use to compare the three models: F1 score, recall, precision and accuracy:

In [15]:
#Define evaluation metrics
import evaluate
import numpy as np
def compute_metrics(eval_pred):
    # All metrics are already predefined in the HF `evaluate` package
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric= evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")

    logits, labels = eval_pred # eval_pred is the tuple of predictions and labels returned by the model
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # The trainer is expecting a dictionary where the keys are the metrics names and the values are the scores. 
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Custom Trainer for Weighted Loss

As mentioned at the begining of this post, we have an imbalanced distribution between positive and negative classes. To account for that, we need to train our models with a weight cross-entropy loss. The Trainer class doesn't support providing a custom loss as it expects getting the loss directly from the model's outputs.

So, we need to define our custom WeightedCELossTrainer that overrides the compute_loss method to calculate the weighted cross-entropy loss based on the model's predictions and the input labels:

In [17]:
from transformers import TrainingArguments, Trainer

class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # Get model's predictions
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([neg_weights, pos_weights], device=model.device, dtype=logits.dtype))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

## Trainer Setup

Let's set the training arguments and the trainer for the three models.

### A. RoBERTa

First important step is to move the models to the GPU device for training

In [18]:
roberta_model = roberta_model.cuda()
roberta_model.device()

#It will print the following:
device(type='cuda', index=0)

#Define the training arguments
from transformers import TrainingArguments, Trainer

lr = 1e-4
batch_size = 8
num_epochs = 5

training_args = TrainingArguments(
    output_dir="roberta-large-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type= "constant",
    warmup_ratio= 0.1,
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=False,
    gradient_checkpointing=True,
)

#Finally, we define the RoBERTa trainer by providing the model, the training arguments and the tokenized datasets:

roberta_trainer = WeightedCELossTrainer(
    model=roberta_model,
    args=training_args,
    train_dataset=roberta_tokenized_datasets['train'],
    eval_dataset=roberta_tokenized_datasets["val"],
    data_collator=roberta_data_collator,
    compute_metrics=compute_metrics
)

AssertionError: Torch not compiled with CUDA enabled