The data for this Jupyter Notebook is sourced from the following web page: [GitHub - Lora for sequence classification with Roberta-Llama-Mistral](https://github.com/mehdiir/Roberta-Llama-Mistral/blob/main/Lora-for-sequence-classification-with-Roberta-Llama-Mistral.md)


In [1]:
MAX_LEN = 512 
roberta_checkpoint = "roberta-large"
mistral_checkpoint = "mistralai/Mistral-7B-v0.1"
llama_checkpoint = "meta-llama/Llama-2-7b-hf"

## Data preparation

### Read in the training data from csv

In [2]:
import pandas as pd
import os
DATA_PATH = "../data/"
train_df=pd.read_csv(os.path.join(DATA_PATH, 'enron_labeled_curated_train.csv'))
test_df=pd.read_csv(os.path.join(DATA_PATH, 'enron_labeled_curated_test.csv'))
# Add a dummy target column so the test and train sets can be merged into one Hugging Face dataset
test_df['target'] = 0 
train_df.info()
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2236 entries, 0 to 2235
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   search_phrase    2236 non-null   object 
 1   label            2236 non-null   int64  
 2   email            2236 non-null   object 
 3   mistral_pred     2236 non-null   float64
 4   openhermes_pred  2236 non-null   float64
 5   vicuna_pred      2236 non-null   float64
 6   gemma_pred       2236 non-null   float64
dtypes: float64(4), int64(1), object(2)
memory usage: 122.4+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 559 entries, 0 to 558
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   search_phrase    559 non-null    object 
 1   label            559 non-null    int64  
 2   email            559 non-null    object 
 3   mistral_pred     559 non-null    float64
 4   openhermes_pred  559 non-null  

As the classes are not balanced, we will compute the positive and negative weights and use them for loss calculation later:

In [3]:
print(train_df.label.value_counts())
pos_weights = len(train_df) / (2 * train_df.label.value_counts()[1])
neg_weights = len(train_df) / (2 * train_df.label.value_counts()[0])
POS_WEIGHT, NEG_WEIGHT = (pos_weights, neg_weights)
print(POS_WEIGHT, NEG_WEIGHT)

label
0    1918
1     318
Name: count, dtype: int64
3.5157232704402515 0.5828988529718456
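The weights printed above follow the usual "balanced" scheme, where each class weight is the total count divided by (number of classes × class count). As a minimal sketch, the same numbers can be reproduced from the class counts alone (the counts are copied from the output above; the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
n_neg, n_pos = 1918, 318
n_total = n_neg + n_pos   # 2236 training emails
n_classes = 2

# Balanced weighting: weight_c = n_total / (n_classes * n_c)
pos_weight = n_total / (n_classes * n_pos)
neg_weight = n_total / (n_classes * n_neg)
print(pos_weight, neg_weight)

# After weighting, both classes carry the same total mass: n_c * weight_c = n_total / 2
print(n_pos * pos_weight, n_neg * neg_weight)
```

With these weights, the 318 positive emails contribute as much to the loss in aggregate as the 1918 negative ones.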


In [4]:
# Then, compute the maximum length of the email column:
# Number of Characters
max_char=train_df['email'].str.len().max()
# Number of Words
max_words = train_df['email'].str.split().str.len().max()
print(f"The maximum number of characters is {max_char}.")
print(f"The maximum number of words is {max_words}.")

The maximum number of characters is 2983.
The maximum number of words is 650.


In [5]:
#Now, let's convert the dataframe into HuggingFace dataset format, 
#split it into training and validation datasets, add the test dataset and save the three files:


from datasets import Dataset
#Convert the training dataframe to HuggingFace dataset
dataset = Dataset.from_pandas(train_df)
#Split the dataset into training and validation datasets
data = dataset.train_test_split(train_size=0.8, seed=42)
#Rename the default "test" split to "val"
data['val'] = data.pop("test")
#Convert the test dataframe to HuggingFace dataset and add it into the first dataset
data['test'] = Dataset.from_pandas(test_df)
#Save
data.save_to_disk("processed_hf")

Saving the dataset (0/1 shards):   0%|          | 0/1788 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/448 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/559 [00:00<?, ? examples/s]
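The three shard sizes in the output above (1788, 448, 559) can be checked by hand. The sketch below assumes the train split size is the floor of `train_size` times the row count, with the remainder going to validation; that rounding rule is an assumption about `train_test_split`, but it reproduces the sizes shown:

```python
import math

n_rows = 2236          # rows in train_df, from train_df.info()
train_fraction = 0.8   # train_size passed to train_test_split

# Assumption: the train split is floored and the rest becomes the validation split.
n_train = math.floor(train_fraction * n_rows)
n_val = n_rows - n_train
print(n_train, n_val)
```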

In [6]:
data['train'][0]

{'search_phrase': 'it appears that some Enron employees used dummy accounts and rigged valuation methodologies to create false profit-and-loss entries for the derivatives',
 'label': 0,
 'email': 'When shares of Enron plunged from $84 earlier this year to practically zero=, thousands of the company\'s employees lost not just their jobs but also mo=st of the value of their 401(k) retirement accounts. For the average employ=ee, Enron stock represented three-fifths of 401(k) assets, and the energy c=ompany\'s meltdown -- after revelations of misleading, probably fraudulent, =accounting practices -- was a personal calamity.=20Now politicians, including Democratic Sens. Barbara Boxer of California and= Jon Corzine of New Jersey, are rushing in to protect other Americans from =similar disasters. And President Bush has ordered "a policy review to prote=ct people\'s pensions." But government intervention would only introduce a d=angerous idea: that investors shouldn\'t bear the burden of their

In [7]:
# Load Mistral 7B Tokenizer
# Add prefix space to tokenize words into subwords
from transformers import AutoTokenizer, DataCollatorWithPadding
mistral_tokenizer = AutoTokenizer.from_pretrained(mistral_checkpoint,  add_prefix_space=True)
mistral_tokenizer.pad_token_id = mistral_tokenizer.eos_token_id
mistral_tokenizer.pad_token = mistral_tokenizer.eos_token

def mistral_preprocessing_function(examples):
    return mistral_tokenizer(examples['email'], truncation=True, max_length=MAX_LEN)

#Now, let's apply the preprocessing function to the entire dataset:
col_to_delete = ['search_phrase', 'mistral_pred','openhermes_pred', 'vicuna_pred', 'gemma_pred']
mistral_tokenized_datasets = data.map(mistral_preprocessing_function, batched=False)
mistral_tokenized_datasets = mistral_tokenized_datasets.remove_columns(col_to_delete)
mistral_tokenized_datasets = mistral_tokenized_datasets.rename_column("email", "text")
#Set to torch format
mistral_tokenized_datasets.set_format("torch")

# Data collator for padding a batch of examples to the maximum length seen in the batch
mistral_data_collator = DataCollatorWithPadding(tokenizer=mistral_tokenizer)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
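To illustrate what padding "to the maximum length seen in the batch" means, here is a minimal pure-Python sketch of dynamic padding. It is not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and right-padding is assumed (the real collator follows the tokenizer's `padding_side`):

```python
def pad_batch(sequences, pad_id):
    """Pad token-id lists on the right to the longest sequence in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids; pad_id=2 is only an illustrative value.
batch = pad_batch([[1, 523, 88], [1, 42], [1, 7, 9, 11, 13]], pad_id=2)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Because each batch is padded only to its own longest example, short emails do not pay the cost of being padded to `MAX_LEN`.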



## Mistral
A. Load checkpoints for the classification model

Let's load the pre-trained Mistral 7B model with a sequence classification head:

In [8]:
from transformers import AutoModelForSequenceClassification # Load a pre-trained model with a sequence classification head
import torch
mistral_model =  AutoModelForSequenceClassification.from_pretrained(
  pretrained_model_name_or_path=mistral_checkpoint,
  num_labels=2,
  device_map="auto"
)

#For Mistral 7B, we have to add the padding token id as it is not defined by default.

mistral_model.config.pad_token_id = mistral_model.config.eos_token_id

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some weights of MistralForSequenceClassification were not initialized from the model checkpoint at mistralai/Mistral-7B-v0.1 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
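The padding token id matters because sequence classification heads on decoder-only models classify from the last non-padding token of each row, so the model has to know which id counts as padding. A rough sketch of that lookup, assuming right-padded rows (`last_non_pad_index` is a hypothetical helper and the token ids are made up):

```python
def last_non_pad_index(input_ids, pad_id):
    """Return the index of the last token that is not padding, or -1 if none."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_id:
            return i
    return -1

PAD = 2  # illustrative pad id, not read from the real config
rows = [[1, 523, 88, PAD, PAD], [1, 42, 7, 9, 11]]
positions = [last_non_pad_index(row, PAD) for row in rows]
print(positions)
```

The classification logits for each email would then be taken at these positions rather than at the padded end of the row.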


## B. LoRA setup for Mistral 7B classifier

For the Mistral 7B model, we need to specify the target_modules (the query and value projection layers of the attention modules):


In [9]:
from peft import get_peft_model, LoraConfig, TaskType

mistral_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none", 
    target_modules=[
        "q_proj",
        "v_proj",
    ],
)

mistral_model = get_peft_model(mistral_model, mistral_peft_config)
mistral_model.print_trainable_parameters()

trainable params: 868,352 || all params: 7,111,528,448 || trainable%: 0.01221048339115074
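The trainable parameter count can be sanity-checked by hand. The sketch below assumes the published Mistral-7B-v0.1 dimensions (hidden size 4096, 32 layers, and a 1024-wide value projection from 8 key/value heads of size 128); these values are not printed in the notebook. Each LoRA adapter adds a rank × in matrix and an out × rank matrix:

```python
# Assumed Mistral-7B-v0.1 dimensions (not shown in the notebook).
hidden_size = 4096
num_layers = 32
kv_dim = 1024   # 8 key/value heads * 128 head_dim
r = 2           # LoRA rank from the LoraConfig above

def lora_params(in_features, out_features, rank):
    # A is (rank x in_features), B is (out_features x rank)
    return rank * in_features + out_features * rank

per_layer = lora_params(hidden_size, hidden_size, r) + lora_params(hidden_size, kv_dim, r)
adapter_params = per_layer * num_layers

# Values reported by print_trainable_parameters() above.
reported_trainable = 868_352
reported_total = 7_111_528_448
remainder = reported_trainable - adapter_params
trainable_pct = 100 * reported_trainable / reported_total
print(adapter_params, remainder, trainable_pct)
```

Under these assumptions the q_proj and v_proj adapters account for 851,968 parameters. The remaining 16,384 equals two 4096 × 2 matrices, which is consistent with the classification head (`score`) being counted as trainable, although exactly how PEFT accounts for it is an assumption here.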


## Setup the trainer

### Evaluation Metrics

First, we define the performance metrics we will use to compare the three models: F1 score, recall, precision and accuracy:

In [10]:
#Define evaluation metrics
import evaluate
import numpy as np
def compute_metrics(eval_pred):
    # All metrics are already predefined in the HF `evaluate` package
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric= evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")

    logits, labels = eval_pred # eval_pred is the tuple of predictions and labels returned by the model
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # The trainer is expecting a dictionary where the keys are the metrics names and the values are the scores. 
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}
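For reference, all four metrics derive from the binary confusion counts. The counts below are not stated anywhere in the notebook; they are inferred so that, on a 448-example validation split, they reproduce the epoch 10 row of the training log further down. Treat them as an illustration:

```python
# Inferred confusion counts for the positive class (label 1); an assumption, not notebook output.
tp, fp, fn, tn = 39, 25, 35, 349

precision = tp / (tp + fp)                     # of predicted positives, how many are right
recall = tp / (tp + fn)                        # of true positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(precision, 6), round(recall, 6), round(f1, 6), round(accuracy, 6))
```

Note how accuracy stays high even when recall is modest: with roughly 86% negative examples, a model can miss many positives and still score well on accuracy, which is why F1, precision and recall are tracked alongside it.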

## Custom Trainer for Weighted Loss

As noted earlier, the positive and negative classes are imbalanced. To account for that, we need to train our models with a weighted cross-entropy loss. The Trainer class does not accept a custom loss directly, because it expects to read the loss from the model's outputs.

So we define a custom WeightedCELossTrainer that overrides the compute_loss method to calculate the weighted cross-entropy loss from the model's predictions and the input labels:

In [11]:
from transformers import TrainingArguments, Trainer

class WeightedCELossTrainer(Trainer):
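    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer transformers versions pass
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):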
        labels = inputs.pop("labels")
        # Get model's predictions
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([neg_weights, pos_weights], device=model.device, dtype=logits.dtype))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
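To make the weighting concrete, here is a small pure-Python sketch of what a class-weighted cross-entropy with mean reduction computes: each example's negative log-likelihood is scaled by the weight of its true class, and the sum is divided by the sum of those weights. The logits and labels are made up, and `weighted_cross_entropy` is a hypothetical helper, not part of the trainer above:

```python
import math

def weighted_cross_entropy(logits, labels, class_weights):
    """Class-weighted cross-entropy, normalized by the sum of the applied weights."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, y in zip(logits, labels):
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]   # negative log-probability of the true class
        weighted_sum += class_weights[y] * nll
        weight_total += class_weights[y]
    return weighted_sum / weight_total

weights = [0.5828988529718456, 3.5157232704402515]   # [NEG_WEIGHT, POS_WEIGHT] from above
logits = [[2.0, -1.0], [0.5, 0.5], [-1.0, 1.5]]      # made-up model outputs
labels = [0, 1, 1]
loss = weighted_cross_entropy(logits, labels, weights)
print(loss)
```

Mistakes on positive examples are penalized about six times more heavily than mistakes on negative ones, which pushes the model away from always predicting the majority class.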

## Trainer Setup

Let's set the training arguments and the trainer for the three models.

### B. Mistral 7B

The first important step is to move the model to the GPU for training:

In [12]:
from transformers import TrainingArguments, Trainer

mistral_model = mistral_model.cuda()

lr = 1e-4
batch_size = 8
num_epochs = 10

training_args = TrainingArguments(
    output_dir="mistral-lora-token-classification",
    learning_rate=lr,
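    lr_scheduler_type= "constant",
    # Note: the "constant" scheduler applies no warmup, so warmup_ratio has no effect here;
    # "constant_with_warmup" would be needed for the warmup to apply.
    warmup_ratio= 0.1,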
    max_grad_norm= 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=True,
    gradient_checkpointing=True,
)



trainer = WeightedCELossTrainer(
    model=mistral_model,
    args=training_args,
    train_dataset=mistral_tokenized_datasets['train'],
    eval_dataset=mistral_tokenized_datasets["val"],
    data_collator=mistral_data_collator,
    compute_metrics=compute_metrics
)



In [13]:

trainer.train()



You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 2.496133 | 0.382022 | 0.459459 | 0.417178 | 0.787946 |
| 2 | No log | 2.304682 | 0.486111 | 0.472973 | 0.479452 | 0.830357 |
| 3 | 1.801400 | 2.145498 | 0.486486 | 0.486486 | 0.486486 | 0.830357 |
| 4 | 1.801400 | 1.925652 | 0.493827 | 0.540541 | 0.516129 | 0.832589 |
| 5 | 0.804200 | 2.808691 | 0.54902 | 0.378378 | 0.448 | 0.845982 |
| 6 | 0.804200 | 2.554413 | 0.566667 | 0.459459 | 0.507463 | 0.852679 |
| 7 | 0.547700 | 2.519836 | 0.566038 | 0.405405 | 0.472441 | 0.850446 |
| 8 | 0.547700 | 1.955364 | 0.531646 | 0.567568 | 0.54902 | 0.845982 |
| 9 | 0.362500 | 2.375746 | 0.610169 | 0.486486 | 0.541353 | 0.863839 |
| 10 | 0.362500 | 2.200897 | 0.609375 | 0.527027 | 0.565217 | 0.866071 |




TrainOutput(global_step=2240, training_loss=0.8206518650054931, metrics={'train_runtime': 2734.2513, 'train_samples_per_second': 6.539, 'train_steps_per_second': 0.819, 'total_flos': 3.821190092604703e+17, 'train_loss': 0.8206518650054931, 'epoch': 10.0})
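The numbers in `TrainOutput` and the pattern in the Training Loss column can be reproduced with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and the Trainer's default `logging_steps` of 500 (the notebook does not set it):

```python
import math

n_train, batch_size, num_epochs = 1788, 8, 10
logging_steps = 500    # assumed Trainer default; not set in the notebook
runtime = 2734.2513    # train_runtime in seconds, from TrainOutput

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

# At the end of each epoch, the table shows the most recent logged training loss.
last_logged = {}
for epoch in range(1, num_epochs + 1):
    step = epoch * steps_per_epoch
    last_logged[epoch] = (step // logging_steps) * logging_steps
    print(epoch, step, last_logged[epoch] or "No log")

steps_per_second = total_steps / runtime
samples_per_second = n_train * num_epochs / runtime
print(total_steps, round(steps_per_second, 3), round(samples_per_second, 3))
```

Under these assumptions, epochs 1 and 2 end before step 500, which would explain the "No log" entries, and each later pair of epochs repeats the loss from the same logging step. Setting `logging_steps` to a smaller value in `TrainingArguments` would give a training loss for every epoch.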