## lmd - Model evaluation: SetFit (baseline) and fine-tuned Mistral

**Steps / end goal**
1. Start with our ~500 human-annotated comments (out of 200k).
2. Synthetic data generation (comments + labels) with Mistral OpenHermes: around 2k samples.
3. Prepare the instruction dataset in Alpaca format before fine-tuning.  
4. Fine-tune Mistral-7B (classification / label completion) with Unsloth on train + synthetic data.  
5. Run more tests on the fine-tuned model. If it is good enough, label unlabeled data to reach several thousand examples (fine-tuned model as a classifier, or weighted average with our few-shot SetFit baseline). **<- we're here**
6. Extend the dataset to roughly 20k examples, with the fine-tuned Mistral (and/or an ensemble with SetFit) doing the classification.  
7. The end goal is deployment/inference performance: train a classifier on the extended dataset using bge-m3 or multilingual-e5 embeddings. 
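
Step 3 can be sketched as follows. This is only an illustration, assuming the usual Alpaca fields (`instruction`, `input`, `output`); the short instruction text and the helper name `to_alpaca_record` are hypothetical and are not the ones used for the actual fine-tuning.

```python
# Hypothetical, shortened instruction; the real fine-tuning prompt is longer.
INSTRUCTION = "Classify the comment into one of 3 labels: 0, 1 or 2."


def to_alpaca_record(text: str, label: int) -> dict[str, str]:
    """Turn one labeled comment into an Alpaca-style instruction record."""
    return {
        "instruction": INSTRUCTION,
        "input": text.strip(),
        "output": str(label),  # the label is the text the model has to complete
    }


# Toy samples, made up for the example
samples = [
    {"text": "La France doit donner plus d'armes à l'ukraine", "label": 0},
    {"text": "  Quel temps fait-il à Paris ?  ", "label": 2},
]
records = [to_alpaca_record(s["text"], s["label"]) for s in samples]
print(records[0])
```

Storing the label as a string keeps the record usable for a plain text-completion objective.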

**Resources**  
- [MLabonne Repo](https://github.com/mlabonne/llm-course)  
- [Dataset Gen - Kaggle example](https://www.kaggle.com/code/phanisrikanth/generate-synthetic-essays-with-mistral-7b-instruct)  
- [Dataset Gen - blog w/ prompt examples](https://hendrik.works/blog/leveraging-underrepresented-data)  
- [Prepare dataset- /r/LocalLLaMA best practice classi](https://www.reddit.com/r/LocalLLaMA/comments/173o5dv/comment/k448ye1/?utm_source=reddit&utm_medium=web2x&context=3)  
- [Prepare dataset - using gpt3.5](https://medium.com/@kshitiz.sahay26/how-i-created-an-instruction-dataset-using-gpt-3-5-to-fine-tune-llama-2-for-news-classification-ed02fe41c81f) 
- [Prepare dataset - Predibase prompts for diverse fine-tuning tasks](https://predibase.com/lora-land)
- [Fine tune OpenHermes-2.5-Mistral-7B - including prompt template gen](https://towardsdatascience.com/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac)  
- [Fine tune - Unsloth colab example](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)
- [Fine tune - w/o unsloth](https://gathnex.medium.com/mistral-7b-fine-tuning-a-step-by-step-guide-52122cdbeca8) or [wandb](https://wandb.ai/vincenttu/finetuning_mistral7b/reports/Fine-tuning-Mistral-7B-with-W-B--Vmlldzo1NTc3MjMy) or [philschmid](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#6-deploy-the-llm-for-production)
- [Fine tune - impact of parameters S. Raschka](https://lightning.ai/pages/community/lora-insights/)

In [None]:
%%capture
# takes several minutes; comment out %%capture to see installation details
!mamba install -q cudatoolkit xformers bitsandbytes pytorch pytorch-cuda=11.8 \
                   -c pytorch -c nvidia -c xformers -c conda-forge -y
!pip install "unsloth[kaggle] @ git+https://github.com/unslothai/unsloth.git"
!pip uninstall -q datasets -y
!pip install -q datasets

!pip install "git+https://github.com/huggingface/transformers.git"

In [None]:
# should not be required, but was needed as of Feb 28, 2024
!pip install -q bitsandbytes triton xformers

In [None]:
# install setfit (few shot classifier) and evaluate from HuggingFace
!pip install -q setfit evaluate

In [None]:
import warnings
warnings.filterwarnings('ignore')
import re
import numpy as np
import pandas as pd
from time import perf_counter

from unsloth import FastLanguageModel
from setfit import SetFitModel

from datasets import load_dataset, Dataset, DatasetDict
import evaluate
from evaluate import load
from evaluate.visualization import radar_plot
from sklearn.metrics import accuracy_score, classification_report

In [None]:
import torch
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Device: {DEVICE}")
print(f"CUDA Version: {torch.version.cuda}")
print(f"Pytorch {torch.__version__}")

#### Load datasets : original data (train/test, manually labeled)

In [None]:
# load original, custom and human-annotated dataset, previously saved on HF
filepath = "gentilrenard/lmd_ukraine_comments"

# HF Datasets format
ds = load_dataset(filepath)

In [None]:
# Extract train and eval datasets from DatasetDict
train_dataset = ds['train']
eval_dataset = ds['validation']

# Define our eval column with ground truth labels
eval_labels = eval_dataset["label"]

# dataset structure
print(ds)

#### Evaluation metrics / func

In [None]:
def perf_global(predictions: list[int], references: list[int]) -> dict[str, float]:
    """
    Computes model global perf. metrics.
    
    Args:
    predictions (list[int]): The predicted labels by the model.
    references (list[int]): The true labels.
    """
    # Load metrics from evaluate
    accuracy_metric = load("accuracy")
    f1_metric = load("f1", config_name="multiclass")
    precision_metric = load("precision", config_name="multiclass")
    recall_metric = load("recall", config_name="multiclass")
    
    accuracy_result = accuracy_metric.compute(predictions=predictions, references=references)
    f1_result = f1_metric.compute(predictions=predictions, references=references, average="macro")
    precision_result = precision_metric.compute(predictions=predictions, references=references, average="macro")
    recall_result = recall_metric.compute(predictions=predictions, references=references, average="macro")
    
    return {
        "accuracy": accuracy_result["accuracy"],
        "f1": f1_result["f1"],
        "precision": precision_result["precision"],
        "recall": recall_result["recall"]
    }


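def perf_per_class(predictions: list[int], references: list[int]) -> dict[str, np.ndarray]:
    """
    Computes F1, precision and recall per class.
    With average=None, each metric returns one score per label (an array, not a float).
    """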
    f1_detail = load("f1", config_name="multiclass", average=None)
    precision_detail = load("precision", config_name="multiclass", average=None)
    recall_detail = load("recall", config_name="multiclass", average=None)
    
    f1_result = f1_detail.compute(predictions=predictions, references=references, average=None)
    precision_result = precision_detail.compute(predictions=predictions, references=references, average=None)
    recall_result = recall_detail.compute(predictions=predictions, references=references, average=None)
    
    return {
        "f1_per_class": f1_result["f1"],
        "precision_per_class": precision_result["precision"],
        "recall_per_class": recall_result["recall"],
    }
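
To make the difference between the macro scores and the per-class scores concrete, here is a small pure-Python sketch that computes both from raw counts. It is not the `evaluate` implementation; the function names and the toy labels are made up for the example, and undefined ratios are set to 0.0 by assumption.

```python
def per_class_prf(predictions, references, labels=(0, 1, 2)):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    pairs = list(zip(predictions, references))
    scores = {}
    for label in labels:
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Macro average: unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Toy labels, made up for the example
references = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
scores = per_class_prf(predictions, references)
print(scores[1])          # (precision, recall, f1) for class 1
print(macro_f1(scores))   # every class weighs the same, whatever its size
```

Because the macro average gives each class the same weight, a weak minority class pulls the global score down noticeably.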

## Eval SetFit model (baseline)

Our SetFit model was trained on top of `paraphrase-multilingual-mpnet-base-v2` with a logistic head and optimized hyperparameters. The model is saved on the [HF hub](https://huggingface.co/gentilrenard/paraphrase-multilingual-mpnet-base-v2_setfit-lemonde-french).

In [None]:
# Download Setfit model (incl. logistic head) from Hub
filepath_model = "gentilrenard/paraphrase-multilingual-mpnet-base-v2_setfit-lemonde-french"
setfit_model = SetFitModel.from_pretrained(filepath_model)

### Predict

Predict on a few (fake, simple) samples. Labels: 0 = pro_ukraine, 1 = pro_russia, 2 = off topic / don't know.

In [None]:
# Run inference
preds = setfit_model.predict(
    [
        "La Russie va gagner cette guerre, ils ont plus de ressources",
        "les journalistes sont corrompus, le traitement est partial",
        "les pauvres ukrainiens se font anéantir et subissent des crimes de guerre",
        "La France doit donner plus d'armes à l'ukraine"
    ]
)
print(preds)

Predict on full evaluation dataset.  
We're adding a perf counter to compute latency. Not the best implementation here but will give us a rough idea.

In [None]:
eval_samples = eval_dataset["text"]
start_time = perf_counter()

setfit_preds = setfit_model.predict(eval_samples, batch_size=32, as_numpy=False, use_labels=False)

setfit_latency = perf_counter() - start_time
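# average latency per sample in ms (use the dataset size instead of a hard-coded 139)
setfit_avg_latency = 1000 * (setfit_latency / len(eval_samples))
print(f"setfit_avg_latency (gpu): {setfit_avg_latency:.1f} ms")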
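
A slightly more careful way to measure latency is to time each batch separately and to run one untimed warm-up call first. The sketch below only illustrates the idea: `time_batches` and `fake_predict` are hypothetical names, and `fake_predict` is a stand-in for `setfit_model.predict` so that the code runs without a model.

```python
from statistics import mean, stdev
from time import perf_counter


def time_batches(predict, samples, batch_size=32, warmup=1):
    """Return per-sample latencies in ms, one value per timed batch."""
    # Warm-up calls are executed but not timed (first calls often include setup cost)
    for _ in range(warmup):
        predict(samples[:batch_size])

    latencies_ms = []
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        t0 = perf_counter()
        predict(batch)
        elapsed = perf_counter() - t0
        latencies_ms.append(1000 * elapsed / len(batch))
    return latencies_ms


def fake_predict(batch):
    """Stand-in for a real model, only here to make the sketch runnable."""
    return [len(text) % 3 for text in batch]


latencies = time_batches(fake_predict, ["comment"] * 139, batch_size=32)
print(f"mean: {mean(latencies):.4f} ms/sample, std: {stdev(latencies):.4f} ms")
```

Per-batch timings also give a standard deviation, which a single global timer cannot provide.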

### Evaluate

TL;DR  
Overall, SetFit gives good results (76% accuracy) with a few-shot approach on a difficult, real-life dataset ;). The train set has only around 100 labels per class.  
The SetFit model performs best on class 2 (off topic / no clear opinion on the conflict), with high precision and recall, hence a high F1 score.  
Class 0 (pro Ukraine) has good results, with balanced precision and recall.  
Class 1 (pro Russia) leaves room for improvement: the model struggles to identify all actual pro-Russian comments (low recall).  
To be honest, it was hard even for me (the annotator) to find many clear pro-Russia comments. Also, several days later, I find myself questioning whether some of the comments should have been labeled differently.

In [None]:
# overall metrics
setfit_global_metrics = perf_global(predictions=setfit_preds, references=eval_labels)
print(f"Model overall performance:\n{setfit_global_metrics}")

# detail per class
setfit_detailed_metrics = perf_per_class(predictions=setfit_preds, references=eval_labels)

print(f"Per class:\n{setfit_detailed_metrics}")

## Eval data-augmented, fine-tuned mistral

Our 4-bit quantized Mistral-7B was fine-tuned using `Unsloth` on labeled data + synthetic data generated by the Mistral-7B OpenHermes variant. LoRA adapters: alpha=16 [HF hub](https://huggingface.co/gentilrenard/Mistral-7B-lora-lmd-en), alpha=8 [HF hub](https://huggingface.co/gentilrenard/Mistral-7B-lora-lmd-en-v2)

In [None]:
lora_hf_filename = "gentilrenard/Mistral-7B-lora-lmd-en-v3" # mistral 0.2 alpha=16
# lora_hf_filename = "gentilrenard/Mistral-7B-lora-lmd-en" # mistral 0.1 alpha=16
# lora_hf_filename = "gentilrenard/Mistral-7B-lora-lmd-en-v2" # mistral 0.1 alpha=8

In [None]:
# load saved LoRA adapters from HuggingFace
# unsloth automatically patches a 4bit quantized mistral
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = lora_hf_filename,
    max_seq_length = 2048,
    dtype = None, # autodetection by unsloth
    load_in_4bit = True,
)

FastLanguageModel.for_inference(model)

### Predict

This is our crafted prompt, used for fine-tuning.

In [None]:
prompt = """You are a helpful, precise, detailed, and concise artificial intelligence assistant with a deep expertise in reading and interpreting comments about Ukraine invasion by Russia.
In February 2022, Russia's invasion of Ukraine escalated the ongoing conflict in Ukraine Dombass region since 2014 and is causing massive casualties. President Putin claimed the operation aimed to "demilitarize and denazify" Ukraine. Despite Russian territorial gains, Ukraine's resistance and counterattacks have reclaimed some areas. The international community responded with sanctions, support for Ukraine, and legal actions against Russia.
You are very intelligent and sharp, having a keen ability and nuanced enough to distinguish which side of the conflict the comment is on.
Your task is to classify a comment into one of 3 labels : 0, 1 or 2. Possible labels and their meaning:
- 0: rather in favor of Ukraine and its allies. Support sanctions against Russia or criticizes Russian policy. Ukraine will win.
- 1: in favor of Russia, even if only so slightly. Criticizes Ukraine, Western, UE or OTAN policies against Russia. Fears of a costly escalation if support is brought to Ukraine. Russia will win.
- 2: irrelevant to the conflict or does not take side.
You will be evaluated based on the following criteria: - The generated answer is best matching label for the comment. - The generated answer is always one label (0, 1 or 2).
Categorize the comment into a single comment label only:
### Comment:
{}
### Comment label:
{}"""

Predict one sample

In [None]:
comment = eval_dataset[22]['text']
print(f"Comment + label :\n{eval_dataset[22]}")
print(f"Comment :\n{comment}")

In [None]:
inputs = tokenizer(
    [
        prompt.format(
        comment, # insert comment
        "", # output - blank (instead of label) for generation
        )
    ],
    return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 10, use_cache = True)
tokenizer.batch_decode(outputs)

In [None]:
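def parse_label(decoded_outputs: list[str]):
    """
    Simple parser to extract the label from the LLM decoded output.

    Returns the last digit found right after 'Comment label:' as an int.
    If no label is found, returns the raw decoded outputs so the failure is visible.
    """
    matches = re.findall(r'Comment label:\n(\d)', decoded_outputs[0])
    return int(matches[-1]) if matches else decoded_outputs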
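
Since `parse_label` returns the raw output when no digit is found, a failed generation ends up in the list of predictions. A stricter variant could return `None` instead, so that failures can be counted or filtered before computing metrics. This is only a sketch: `parse_label_strict` is a hypothetical name, and restricting the match to the digits 0, 1 and 2 is an assumption based on the label set described in the prompt.

```python
import re
from typing import Optional

# Accept any whitespace after the colon and only the three valid labels
LABEL_PATTERN = re.compile(r"Comment label:\s*([012])")


def parse_label_strict(decoded_text: str) -> Optional[int]:
    """Return the last label found after 'Comment label:', or None if there is none."""
    matches = LABEL_PATTERN.findall(decoded_text)
    return int(matches[-1]) if matches else None


print(parse_label_strict("### Comment label:\n1</s>"))    # 1
print(parse_label_strict("### Comment label:\nunsure"))   # None
```

With this variant, the number of `None` values gives a direct count of unparsable generations.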

Predict full evaluation dataset and store parsed predicted labels.  
We're adding a perf_counter at runtime to measure model latency

In [None]:
mistral_preds = []
mistral_latencies = []

for i, sample in enumerate(eval_dataset):
    start_time = perf_counter()
    inputs = tokenizer(
        [
            prompt.format(
            sample['text'], # insert comment
            "", # output - blank (instead of label) for generation
            )
        ],
        return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 20, use_cache = True)
    decoded_outputs = tokenizer.batch_decode(outputs)
    
    # parse label from LLM answer
    predicted_label = parse_label(decoded_outputs)
    
    # store results
    mistral_preds.append(predicted_label)
    
    # latency measurement
    mistral_latency = perf_counter() - start_time
    mistral_latencies.append(mistral_latency)
    
    # look at what goes wrong/well
    #print(f"Comment {i}:\n{sample['text']}\nTrue label: {sample['label']}\nPredicted label:\n{predicted_label}")
    
# Compute run statistics
time_avg_ms = 1000 * np.mean(mistral_latencies)
time_std_ms = 1000 * np.std(mistral_latencies)

### Evaluate

In [None]:
mistral_global_metrics = perf_global(predictions=mistral_preds, references=eval_labels)
print(f"Model overall performance:\n{mistral_global_metrics}")

mistral_detailed_metrics = perf_per_class(predictions=mistral_preds, references=eval_labels)
print(f"Per class:\n{mistral_detailed_metrics}")

mistral_latency_metrics = {"avg latency":time_avg_ms, "std latency":time_std_ms}
print(mistral_latency_metrics)

## Models benchmark

Our end goal is to enrich our non-synthetic dataset (i.e. real, still unlabeled comments) with model predictions, in order to fine-tune a BERT-like model on more than the 300 original samples. We're interested in low-cost inference / deployment.
- SetFit shines at identifying off-topic comments (see above), but is quite weak at predicting class 1 (pro-Russian).
- We value precision a bit more than recall: we would rather retain fewer but more "precise" examples.
- Class 1 (pro_russian comments) is the minority class, but the most important one for our qualitative objectives.

**Models comparison : overall performance**

In [None]:
model_perf = [
    {"Accuracy":0.76, "F1":0.73, "Precision": 0.74, "Recall":0.73},
    {"Accuracy":0.81, "F1":0.80, "Precision":0.83, "Recall":0.79},
    {"Accuracy":0.79, "F1":0.78, "Precision":0.79, "Recall":0.79},
    {"Accuracy":0.80, "F1":0.80, "Precision":0.80, "Recall":0.80},
   ]
model_names = ["Setfit", "Mistral_ft_α=16", "Mistral_ft_α=8", "Mistral_0_2_ft_α=16"]
plot = radar_plot(data=model_perf, model_names=model_names)
plot.show()

**Per class : class 0 (pro_ukraine)**

In [None]:
cls_0_perf = [
    {"F1":0.71, "Precision":0.70, "Recall":0.72},
    {"F1":0.79, "Precision":0.72, "Recall":0.88},
    {"F1":0.77, "Precision":0.74, "Recall":0.8},
    {"F1":0.79, "Precision":0.79, "Recall":0.78},
   ]
model_names = ["Setfit", "Mistral_ft_α=16", "Mistral_ft_α=8", "Mistral_0_2_ft_α=16"]
plot = radar_plot(data=cls_0_perf, model_names=model_names)
plot.show()

**Per class : class 1 (pro_russian)**

In [None]:
cls_1_perf = [
    {"F1":0.57, "Precision":0.64, "Recall":0.51},
    {"F1":0.74, "Precision":0.92, "Recall":0.63},
    {"F1":0.72, "Precision":0.75, "Recall":0.68},
    {"F1":0.75, "Precision":0.76, "Recall":0.74},
   ]
model_names = ["Setfit", "Mistral_ft_α=16", "Mistral_ft_α=8", "Mistral_0_2_ft_α=16"]
plot = radar_plot(data=cls_1_perf, model_names=model_names)
plot.show()

**Per class : class 2 (off_topic)**

In [None]:
cls_2_perf = [
    {"F1":0.91, "Precision":0.87, "Recall":0.96},
    {"F1":0.87, "Precision":0.87, "Recall":0.87},
    {"F1":0.85, "Precision":0.86, "Recall":0.85},
    {"F1":0.85, "Precision":0.84, "Recall":0.87},
   ]
model_names = ["Setfit", "Mistral_ft_α=16", "Mistral_ft_α=8", "Mistral_0_2_ft_α=16"]
plot = radar_plot(data=cls_2_perf, model_names=model_names)
plot.show()

## Voting ensemble

Since we lack output probabilities, we use a simple mechanism that takes advantage of SetFit's performance at detecting class 2, and of Mistral (fine-tuned on synthetically augmented data) for labels 0 and 1.

In [None]:
# We simulate on the eval dataset. At inference, we could save time by skipping the Mistral prediction when the SetFit label == 2
voted_preds = []
for setfit_pred, mistral_pred in zip(setfit_preds, mistral_preds):
    if setfit_pred == 2:
        voted_preds.append(setfit_pred)
    else:
        voted_preds.append(mistral_pred)

In [None]:
ensemble_global_metrics = perf_global(predictions=voted_preds, references=eval_labels)
print(f"Model overall performance:\n{ensemble_global_metrics}")

ensemble_detailed_metrics = perf_per_class(predictions=voted_preds, references=eval_labels)
print(f"Per class:\n{ensemble_detailed_metrics}")

Models comparison: ft-Mistral vs. basic voting ensemble (ft-Mistral + SetFit for class 2)  
Not too convinced, at least with this basic approach. The precision gain on the class of interest (1) is good though (0.85 vs. 0.76), at the cost of recall (0.65 vs. 0.74).  
The advantage of the voting ensemble will be at label prediction time: SetFit discriminates between class 2 and (0 or 1) in about 10 ms with very good accuracy, and ft-Mistral then predicts 0 or 1 (about 900 ms latency).
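
The latency argument can be sketched as a two-stage cascade in which the slow model only runs when the fast one does not answer 2. The code below is an illustration under assumptions: `cascade_predict` and `expected_latency_ms` are hypothetical names, the two lambdas are stand-ins for SetFit and the fine-tuned Mistral, and the 40% share of class 2 comments is made up. Only the 10 ms and 900 ms figures come from the measurements above.

```python
def cascade_predict(text, fast_model, slow_model, shortcut_label=2):
    """Return (label, used_slow_model) for one comment."""
    fast_label = fast_model(text)
    if fast_label == shortcut_label:
        return fast_label, False  # trust the fast model and skip the slow one
    return slow_model(text), True


def expected_latency_ms(share_shortcut, fast_ms=10.0, slow_ms=900.0):
    """Average latency per comment when the slow model runs only on non-shortcut items."""
    return fast_ms + (1.0 - share_shortcut) * slow_ms


# Stand-in models, only here to make the sketch runnable
fast = lambda text: 2 if "météo" in text else 0
slow = lambda text: 1 if "Russie" in text else 0

print(cascade_predict("La météo est belle", fast, slow))   # (2, False)
print(cascade_predict("La Russie va gagner", fast, slow))  # (1, True)
print(expected_latency_ms(share_shortcut=0.4))             # made-up share of class 2
```

The larger the share of comments that SetFit labels as 2, the closer the average latency gets to the fast model alone.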

In [None]:
# Voting ensemble results (pasted from a previous run):
# Model overall performance:
# {'accuracy': 0.8129496402877698, 'f1': 0.7989719700477534, 'precision': 0.8217062833978256, 'recall': 0.7928747795414463}
# Per class:
# {'f1_per_class': array([0.77894737, 0.74193548, 0.87603306]), 'precision_per_class': array([0.82222222, 0.85185185, 0.79104478]), 'recall_per_class': array([0.74      , 0.65714286, 0.98148148])}

In [None]:
bench = [
    {"Accuracy":0.80, "F1":0.80, "Precision":0.80, "Recall":0.80},
    {"Accuracy":0.81, "F1":0.80, "Precision":0.82, "Recall":0.79},
]
model_names = ["Mistral_0_2_ft_α=16", "Voting ensemble"]
plot = radar_plot(data=bench, model_names=model_names)
plot.show()

In [None]:
# focus class of interest 'pro russian' (1)
bench_cls_1 = [
    {"F1":0.75, "Precision":0.76, "Recall":0.74},
    {"F1":0.74, "Precision":0.85, "Recall":0.65},
   ]
model_names = ["Mistral_0_2_ft_α=16", "Voting ensemble"]
plot = radar_plot(data=bench_cls_1, model_names=model_names)
plot.show()