## Classification Head w/ multi-e5 embeddings (or bge) - training

>  Training w/ bge-m3, multi-e5-base or multi-e5-large on human-annotated data (300 samples) + synthetic data: 20k samples labeled with our ensemble model (SetFit + Mistral).

**Steps / end goal**
1. Started with our +-500 human-annotated comments (out of 200k).
2. Synthetic data generation (comments + labels) w/ Mistral OpenHermes: around 2k samples.
3. Prepare an instruction dataset in Alpaca format, before fine-tuning.  
4. Fine-tune mistral-7B (classif. / label completion), using unsloth, on train + synthetic data.  
5. More tests on the fine-tuned model. If good enough, label a few thousand unlabeled examples (fine-tuned model as a classifier, or weighted avg. w/ our few-shot SetFit baseline).
6. Extend the dataset to about 20k examples, with the fine-tuned Mistral (and/or the ensemble model w/ SetFit) doing the classification.  
7. End goal being deployment/inference performance: train a classifier on the extended dataset using bge-m3 or multi-e5 embeddings. **<- we're here**
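
Step 3 can be sketched as follows. This is a minimal illustration of the Alpaca prompt layout applied to a classification example. The instruction wording, the sample comment, and the `to_alpaca` helper are assumptions made for illustration, not the prompts used in this project.

```python
# Alpaca-style prompt template (instruction + input + response).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

# Hypothetical instruction text; the real prompt is not shown in this notebook.
INSTRUCTION = "Classify the comment into one of three classes. Answer with 0, 1 or 2."


def to_alpaca(example, with_answer=True):
    """Format a {'text', 'label'} record as a single Alpaca prompt string."""
    return ALPACA_TEMPLATE.format(
        instruction=INSTRUCTION,
        input=example["text"],
        # Leave the response empty at inference time so the model completes it.
        output=str(example["label"]) if with_answer else "",
    )


sample = {"text": "An example comment.", "label": 2}  # made-up record
prompt = to_alpaca(sample)
print(prompt)
```

Calling `to_alpaca(sample, with_answer=False)` gives the same prompt with an empty response, which is the form a fine-tuned model would be asked to complete.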

**Resources**  
- [MLabonne Repo](https://github.com/mlabonne/llm-course)  
- [Dataset Gen - Kaggle example](https://www.kaggle.com/code/phanisrikanth/generate-synthetic-essays-with-mistral-7b-instruct)  
- [Dataset Gen - blog w/ prompt examples](https://hendrik.works/blog/leveraging-underrepresented-data)  
- [Prepare dataset- /r/LocalLLaMA best practice classi](https://www.reddit.com/r/LocalLLaMA/comments/173o5dv/comment/k448ye1/?utm_source=reddit&utm_medium=web2x&context=3)  
- [Prepare dataset - using gpt3.5](https://medium.com/@kshitiz.sahay26/how-i-created-an-instruction-dataset-using-gpt-3-5-to-fine-tune-llama-2-for-news-classification-ed02fe41c81f) 
- [Prepare dataset - Predibase prompts for diverse fine-tuning tasks](https://predibase.com/lora-land)
- [Fine tune OpenHermes-2.5-Mistral-7B - including prompt template gen](https://towardsdatascience.com/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac)  
- [Fine tune - Unsloth colab example](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)
- [Fine tune - w/o unsloth](https://gathnex.medium.com/mistral-7b-fine-tuning-a-step-by-step-guide-52122cdbeca8) or [wandb](https://wandb.ai/vincenttu/finetuning_mistral7b/reports/Fine-tuning-Mistral-7B-with-W-B--Vmlldzo1NTc3MjMy) or [philschmid](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#6-deploy-the-llm-for-production)
- [Fine tune - impact of parameters S. Raschka](https://lightning.ai/pages/community/lora-insights/)
- [Embeddings - multilingual, latest comparison](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05)

In [1]:
%%capture
!pip install transformers datasets evaluate

In [2]:
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category=DeprecationWarning)

import numpy as np
from datasets import load_dataset, concatenate_datasets
from evaluate import load
from sklearn.utils.class_weight import compute_class_weight

import torch
from transformers import TrainingArguments, Trainer
from transformers import AutoTokenizer, AutoModelForSequenceClassification


In [3]:
import os
import wandb
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login

user_secrets = UserSecretsClient()
wandb_token = user_secrets.get_secret("wandb_e5")
os.environ["WANDB_PROJECT"]= "e5"
wandb.login(key=wandb_token)

hf_token = user_secrets.get_secret("hf_key")
login(token=hf_token)

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Load Datasets

Three datasets: 1. the original, human-annotated data; 2. the extended dataset w/ 20k *predicted* labels; 3. generated synthetic data + labels, which we used to fine-tune our Mistral clf.  
We ran several experiments, varying both training data size and training data composition (annotated data and/or the 20k Mistral-predicted dataset and/or Mistral-generated data).
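
The predicted labels in dataset 2 come from an ensemble of SetFit and the fine-tuned Mistral. A weighted average of the two models' class probabilities could look like the sketch below. The weight value, the probability arrays, and the `ensemble_predict` name are illustrative assumptions; the actual ensembling code is not part of this notebook.

```python
import numpy as np


def ensemble_predict(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two (n_samples, n_classes) probability arrays.

    Returns the predicted class index per sample and the averaged probabilities.
    """
    probs_a = np.asarray(probs_a, dtype=float)
    probs_b = np.asarray(probs_b, dtype=float)
    if probs_a.shape != probs_b.shape:
        raise ValueError("both models must score the same samples and classes")
    avg = weight_a * probs_a + (1.0 - weight_a) * probs_b
    return avg.argmax(axis=1), avg


# Toy probabilities for two comments and three classes (made-up numbers).
setfit_probs = [[0.6, 0.1, 0.3], [0.2, 0.5, 0.3]]
mistral_probs = [[0.2, 0.1, 0.7], [0.1, 0.7, 0.2]]

labels, avg = ensemble_predict(setfit_probs, mistral_probs, weight_a=0.4)
print(labels)
```

In this toy case the first comment is assigned class 2 even though the first model prefers class 0, because the second model carries more weight.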

In [4]:
%%capture
# Original dataset with human annotated train-eval comments ; saved on HF hub
filepath_eval_dataset = "gentilrenard/lmd_ukraine_comments"

# Optional : synthetic dataset created by Mistral Open Hermes
filepath_synthetic_data = "/kaggle/input/lmd-synthetic/lmd_synthetic_data.parquet"

# locally saved, 20k comments labeled with ensemble model (setfit + fine tuned mistral 7b)
# v1 (no suffix): predictions w/ ft mistral base 0.1; v2: predictions w/ ft mistral base 0.2
# filepath_preds = "/kaggle/input/ensemble-preds/lmd_predictions.json"
filepath_preds = "/kaggle/input/ensemble-preds-v2/lmd_predictions_v2.json"

dataset_pred = load_dataset("json", data_files=filepath_preds, split='train')

# we experimented w/ different sample sizes : none, all, 2k, 5k
# dataset_pred = dataset_pred.shuffle(seed=11).select(range(5000))
dataset_pred = dataset_pred.rename_column("pred", "label")

dataset_syn = load_dataset("parquet", data_files=filepath_synthetic_data, split='train')
dataset_eval = load_dataset(filepath_eval_dataset)

In [5]:
# Datasets structure
print(f"Human annotated dataset:\n{dataset_eval}")
print(f"Ensemble models prediction:\n{dataset_pred}")

# Concatenate human annotated and/or synthetic and/or models predicted datasets
# results show no improvement w/ synthetic data added (cf. evaluation notebook)
dataset = concatenate_datasets([dataset_eval['train'],dataset_pred]).shuffle(seed=11)

# Use a subset as evaluation for our training. We keep human annotated eval set for final benchmark.
dataset = dataset.train_test_split(test_size=0.05, seed=11)
print(f"Final dataset:\n{dataset}")

Human annotated dataset:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 323
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 139
    })
    unlabeled: Dataset({
        features: ['text', 'label'],
        num_rows: 174891
    })
})
Ensemble models prediction:
Dataset({
    features: ['text', 'label'],
    num_rows: 20000
})
Final dataset:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 19306
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1017
    })
})
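
The split sizes printed above can be checked by hand. The sketch below assumes the test size is rounded up, which is consistent with the 19306 / 1017 rows reported; the variable names are illustrative.

```python
import math

n_human_train = 323   # human-annotated train split
n_predicted = 20_000  # ensemble-labeled comments
test_size = 0.05

n_total = n_human_train + n_predicted
n_test = math.ceil(test_size * n_total)  # assumption: test size is rounded up
n_train = n_total - n_test

print(n_total, n_train, n_test)  # 20323 19306 1017
```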


### Embeddings/classification head

Several experiments conducted with :  
- embeddings : `intfloat/multilingual-e5-large`, `intfloat/multilingual-e5-base`, `intfloat/multilingual-e5-small`, `BAAI/bge-m3`  
- loss function : std loss function or weighted  
- data : human annotated and/or (mistral) synthetic and/or (mistral+setfit) models predicted, number of samples.   
- classic experiments : lr, scheduler, epochs etc.
- Evaluation metrics / benchmark : in wandb + separate notebook.

In [6]:
checkpoint = "intfloat/multilingual-e5-base"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at intfloat/multilingual-e5-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
# e5 expects a "query: " prefix on text inputs, for better performance
# cf. https://huggingface.co/intfloat/multilingual-e5-base
# Post runs/trials: no measurable impact for us, though; it may not matter for our use case/task.
#def add_query_prefix(example):
#    example["text"] = "query: " + example["text"]
#    return example

In [8]:
# could also apply dynamic padding.
# cf. https://huggingface.co/learn/nlp-course/chapter3/2?fw=pt
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
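
As the comment in this cell notes, padding every example to 512 tokens is simple but wasteful when most comments are short. Dynamic padding pads each batch only to its longest sequence. The pure-Python sketch below illustrates the idea on made-up token ids. `pad_batch` is a hypothetical helper, and `pad_id=1` is assumed to match the XLM-RoBERTa pad token; in a `transformers` training loop, `DataCollatorWithPadding` plays this role.

```python
def pad_batch(sequences, pad_id=1, max_length=None):
    """Pad token-id lists to a common length (no truncation is applied).

    With max_length=None the target is the longest sequence in the batch
    (dynamic padding); otherwise every sequence is padded to max_length.
    """
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (target - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (target - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for a batch of three short comments.
batch = [[0, 581, 2], [0, 581, 7986, 2], [0, 2]]

dynamic = pad_batch(batch)                # padded to the longest sequence (4)
fixed = pad_batch(batch, max_length=512)  # padded to 512, as in the cell above

print(len(dynamic["input_ids"][0]), len(fixed["input_ids"][0]))
```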

In [9]:
# add "query: " prefix (e5 embeddings)
#dataset = dataset.map(add_query_prefix)

# tokenize
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# optional, if we evaluate on human annotated "validation" dataset
tokenized_eval_datasets = dataset_eval.map(tokenize_function, batched=True)

## Train

We're training on 20k model-labeled samples. Our predictor (ensemble model SetFit + ft mistral) has around 80% accuracy, so we do not expect the performance to skyrocket.  

Runtime / OOM etc.:  
- `multi-e5-small`: 25 min runtime.  
- `multi-e5-base` runs fine w/ batch=8 (+-1h30 run); around 75% accuracy and weighted F1.    
- `multi-e5-large` or `bge-m3` (+-4h run) need extra settings to prevent OOM errors, cf. [HF performance params](https://huggingface.co/docs/transformers/v4.18.0/en/performance). For instance, batch_size=4 with gradient_accumulation=2 worked fine for us.  

Post-run notes on accuracy:  
- e5-base vs. e5-large or bge-m3 were close (76%-ish accuracy), maybe w/ a slight advantage for e5-large.  
- e5-small: 74% accuracy, but probably much better latency.
- Could also try to fine-tune the embedding model; probably overkill. Cf. [FlagOpen](https://github.com/FlagOpen/FlagEmbedding)
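
The batch_size=4 / gradient_accumulation=2 setting keeps the effective batch size of the `multi-e5-base` run while halving the number of samples processed per forward pass. A small sketch of the arithmetic, assuming a single GPU (the helper name is hypothetical):

```python
def effective_batch_size(per_device_batch, grad_accum_steps=1, n_devices=1):
    """Number of samples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * n_devices


base_run = effective_batch_size(per_device_batch=8)                       # multi-e5-base setting
large_run = effective_batch_size(per_device_batch=4, grad_accum_steps=2)  # large / bge-m3 setting

print(base_run, large_run)  # 8 8
```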

In [10]:
training_args = TrainingArguments(
    #gradient_checkpointing=True,
    #gradient_accumulation_steps=2,
    # fp16=False,
    learning_rate=2e-5,
    lr_scheduler_type="linear", # cosine
    warmup_ratio=0.1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    output_dir='multi-e5-base_lmd-comments_v2',
    logging_dir='./logs',
    logging_strategy="steps",
    logging_steps=100,
    eval_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    report_to="wandb",
    run_name='e5_base_v15',
)
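
With `lr_scheduler_type="linear"` and `warmup_ratio=0.1`, the learning rate ramps up linearly during the first 10% of the optimizer steps, then decays linearly to zero. The sketch below reproduces that shape in plain Python. The step count assumes a single GPU, no gradient accumulation, and the 19306-row training split shown earlier; `linear_lr` is an illustrative helper, not a library function.

```python
import math

base_lr = 2e-5
n_train, batch_size, epochs, warmup_ratio = 19306, 8, 2, 0.1

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_lr(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print(total_steps, warmup_steps)  # 4828 483
for step in (0, 100, warmup_steps, total_steps):
    print(step, f"{linear_lr(step):.2e}")
```

Under these assumptions, evaluating every 100 steps gives roughly 48 evaluation points over the run, and the first four of them fall inside the warmup phase.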

In [11]:
# Load the metrics once, rather than on every evaluation call
f1 = load("f1")
accuracy = load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Weighted F1 to account for class imbalance
    f1_score = f1.compute(predictions=predictions, references=labels, average='weighted')
    accuracy_score = accuracy.compute(predictions=predictions, references=labels)

    return {
        "f1": f1_score["f1"],
        "accuracy": accuracy_score["accuracy"],
    }
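
`average='weighted'` computes one F1 score per class and averages them with weights proportional to each class's support (its number of true samples). A minimal pure-Python sketch of that computation, on made-up labels; `weighted_f1` is an illustrative helper and not part of the `evaluate` library:

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1_cls = 2 * tp / denom if denom else 0.0
        total += n_true * f1_cls
    return total / len(y_true)


# Toy example with an under-represented class 1 (made-up labels).
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 2, 1, 2, 2, 0]

print(round(weighted_f1(y_true, y_pred), 4))  # 0.6667
```

Here the per-class F1 scores are 0.5, 1.0 and 2/3, and the supports 2, 1 and 3 give a weighted mean of 4/6.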

**"Uncomment" to run standard loss function :**

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

Use a **custom weighted loss function** instead (better results, especially on our minority class 1, the pro-Russian comments):

In [12]:
# Calculate classes weights
labels = dataset['train']['label']
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(labels), y=labels)

# Class weights to a tensor
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float)

print(class_weights_tensor)

tensor([1.0283, 3.1224, 0.5857])
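
With `class_weight='balanced'`, each class weight is `n_samples / (n_classes * class_count)`, so rarer classes get larger weights. This is why class 1 ends up with the largest weight above. A sketch of the formula on toy labels (the helper name is illustrative):

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}


# Toy labels: class 1 is the minority, class 2 the majority.
toy_labels = [0, 0, 1, 2, 2, 2]
weights = balanced_class_weights(toy_labels)
print(weights)
```

A class holding exactly one third of the samples would get a weight of 1.0 in a three-class problem.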


In [13]:
# custom loss function
class CustomTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Store class weights
        self.class_weights = class_weights

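    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) passed by newer transformers versions
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):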
        # Ensure the class_weights tensor is on the same device as model parameters
        class_weights = self.class_weights.to(model.device)

        # Call the original model forward pass
        outputs = model(**inputs)
        # Extract the logits and labels
        logits = outputs.get('logits')
        labels = inputs.get('labels')
        # Define the weighted loss
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
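
With its default mean reduction, `torch.nn.CrossEntropyLoss(weight=...)` multiplies each sample's negative log-likelihood by the weight of its true class and divides the sum by the total weight of the batch, not by the batch size. The NumPy sketch below spells out that computation on made-up logits; the function name is illustrative.

```python
import numpy as np


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-sample cross-entropy (weights indexed by true class)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    class_weights = np.asarray(class_weights, dtype=float)

    # Numerically stable log-softmax over the class dimension.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    nll = -log_probs[np.arange(len(labels)), labels]  # per-sample loss
    w = class_weights[labels]                         # weight of each true class
    return float((w * nll).sum() / w.sum())


# Class weights printed above; logits and labels are made up.
weights = [1.0283, 3.1224, 0.5857]
logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 1.5]]
labels = [1, 2]

print(weighted_cross_entropy(logits, labels, weights))
```

In this toy batch the misclassified class-1 sample dominates the loss, since its weight is more than five times that of the class-2 sample.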


In [14]:
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    # eval_dataset=tokenized_datasets["test"],
    eval_dataset=tokenized_eval_datasets["validation"],
    compute_metrics=compute_metrics,
    class_weights=class_weights_tensor,  # Pass the computed class weights here
)

In [15]:
trainer.train()
wandb.finish()

[34m[1mwandb[0m: Currently logged in as: [33mvionmatthieu[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.16.5 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.16.3
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/kaggle/working/wandb/run-20240326_141816-fzfnf6bi[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33me5_base_v15[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/vionmatthieu/e5[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/vionmatthieu/e5/runs/fzfnf6bi[0m


| Step | Training Loss | Validation Loss | F1 | Accuracy |
|---|---|---|---|---|
| 100 | 1.0794 | 1.182744 | 0.348064 | 0.446043 |
| 200 | 0.9791 | 0.982274 | 0.584518 | 0.647482 |
| 300 | 0.7729 | 0.806607 | 0.685352 | 0.690647 |
| 400 | 0.624 | 0.930566 | 0.673364 | 0.705036 |
| 500 | 0.6673 | 0.917162 | 0.672292 | 0.690647 |
| 600 | 0.6649 | 1.434585 | 0.605753 | 0.654676 |
| 700 | 0.6004 | 0.904178 | 0.671315 | 0.690647 |
| 800 | 0.552 | 0.885496 | 0.6995 | 0.71223 |
| 900 | 0.5891 | 0.663724 | 0.708928 | 0.71223 |
| 1000 | 0.5605 | 1.043104 | 0.705314 | 0.71223 |


[34m[1mwandb[0m:                                                                                
[34m[1mwandb[0m: 
[34m[1mwandb[0m: Run history:
[34m[1mwandb[0m:                  eval/accuracy ▁▅▆▇▆▆▇▇▇▆█▆▇▆▆▇▇▇▇█▇▇█▇▇▇▇▇▇▇▇▇████▇▇▇▇
[34m[1mwandb[0m:                        eval/f1 ▁▅▇▇▇▇▇▇▇▇█▆▇▆▆▇▇▇▇█▇██▇█▇▇▇▇█▇▇█████▇▇▇
[34m[1mwandb[0m:                      eval/loss ▇▅▃▄▄▄▄▁▆▄▅▇▇▆▇▇▇▃▃█▅▄▄▅▄▅▅▆▇▄▇▅▇▇█▇▅▆▆▆
[34m[1mwandb[0m:                   eval/runtime ▅▃▄▃▃▃▆▁▄▁▃▂▂▃▂▁▁▅▂▃▆▃▂▄▁▃▄▁▃▁▁▁▃█▂▁▂▂▂▁
[34m[1mwandb[0m:        eval/samples_per_second ▄▆▄▆▆▆▃█▅█▆▇▇▆▇██▄▇▆▃▅▇▄▇▆▅█▆█▇█▆▁▇█▇▇▆▇
[34m[1mwandb[0m:          eval/steps_per_second ▄▆▄▆▆▆▃█▅█▆▇▇▆▇██▄▇▆▃▅▇▄▇▆▅█▆█▇█▆▁▇█▇▇▆▇
[34m[1mwandb[0m:                    train/epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
[34m[1mwandb[0m:              train/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
[34m[1mwandb[0m:                train/grad_norm ▂▂▂▂▂▂▂▂▃▄▃▂▂▃▃▄▄▂▄▁▅▃▃▃▃▂▃▅▂▃▄▇▄▆▃▂▃▁▂█
[34m[1mwandb[0m

## Save

Locally and to the Hub

In [16]:
model_name = "multi-e5-base_lmd-comments_v2"
local_directory = "/kaggle/working/e5/"

In [17]:
# save to local directory
trainer.save_model(local_directory)

In [18]:
# push to HF hub
trainer.push_to_hub(model_name)

CommitInfo(commit_url='https://huggingface.co/gentilrenard/multi-e5-base_lmd-comments_v2/commit/e77d1eb96e53234aa52608a30548a82c280ba4f9', commit_message='multi-e5-base_lmd-comments_v2', commit_description='', oid='e77d1eb96e53234aa52608a30548a82c280ba4f9', pr_url=None, pr_revision=None, pr_num=None)
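
For the deployment goal stated at the top, the pushed checkpoint can be loaded back for inference. A minimal sketch using the `transformers` text-classification pipeline, assuming the repository is accessible: the repository id is taken from the commit URL above, while the helper name and the sample input are illustrative. Unless `id2label` was set in the model config, the returned labels will be the generic `LABEL_0`, `LABEL_1`, `LABEL_2`.

```python
def load_classifier(model_id="gentilrenard/multi-e5-base_lmd-comments_v2"):
    """Build a text-classification pipeline from a fine-tuned checkpoint.

    The import is done lazily so that defining this helper does not load
    transformers; the first call downloads the model weights.
    """
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


# Usage (downloads about 1.1 GB of weights on first call):
# classifier = load_classifier()
# print(classifier(["An example comment to classify."], truncation=True, max_length=512))
```

Passing `truncation=True, max_length=512` at call time mirrors the tokenization used for training, so long comments are cut the same way.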