Classification - from LLM fine-tuning to quantized ONNX (e5-base embeddings)



End-to-end benchmark on comment classification: SetFit vs. fine-tuned Mistral-7B vs. multi-e5-base embeddings + classification layer.

Fully reproducible guide with online notebooks, model(s), and dataset.


Note

Following my previous work on people's engagement with the Ukraine war, I decided to manually annotate approximately 400 comments (out of 180k) and train a classifier to assess the weight of pro-Russian comments. We initially experimented with a few-shot learning model (SetFit), but then found ourselves going down the rabbit hole. Update 2024/04/22: also check the Docker + FastAPI serving app here


Benchmark chart: Mistral vs. multi-e5 vs. SetFit

Learnings

Baseline model - SetFit: very good performance (and latency...) for a few-shot learning approach. After many trials, the final choice was MPNet embeddings + a logistic head, and we had to extend to 90 labeled samples per class (3 classes: 0. pro Ukraine, 1. pro Russia, 2. off topic/no opinion) to reach good accuracy. Anything lower (16, 32 samples, etc.) wasn't enough.

Mistral-7B fine-tuning: a well-crafted prompt + 2k synthetically generated samples (with OpenHermes) were enough to fine-tune a Mistral-7B to 81% accuracy, with notably better performance on our class of interest (1: pro-Russia comments). Unsurprisingly, the LLM shows its remarkable ability to capture (some of) the human subtleties.

Classifier training on an augmented dataset: I was eager to know whether we could train a more "classic" classifier on a larger portion of our initial dataset. 20k unlabeled comments were labeled using a voting ensemble (SetFit + our fine-tuned Mistral-7B) and used to train a classifier on top of multi-e5-base embeddings (vs. BGE and multi-e5-small). We tried many training-data combinations (train size, and/or the 2k synthetic samples, and/or 5-20k predicted samples added); the best performance was achieved with a weighted loss + only the 20k ensemble-predicted labels, without the 2k synthetic examples. It still performs better than our baseline on class 1.
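
For illustration, here is a minimal sketch of the agreement-based ensemble labeling described above (the real thing lives in the lmd_setfit_mistral_inference notebook; the function name and the way predictions are passed in are hypothetical):

```python
# Hypothetical sketch: keep only the unlabeled comments on which the SetFit model
# and the fine-tuned Mistral-7B agree, and use that shared prediction as the label.
import pandas as pd

def ensemble_label(comments: list[str], setfit_preds: list[int], llm_preds: list[int]) -> pd.DataFrame:
    """Two-model voting: a comment is kept only when both models agree."""
    df = pd.DataFrame({"text": comments, "setfit": setfit_preds, "llm": llm_preds})
    agreed = df[df["setfit"] == df["llm"]].copy()
    agreed["label"] = agreed["setfit"]
    return agreed[["text", "label"]]

# Usage (predictions coming from the two models on the 20k unlabeled comments):
# augmented = ensemble_label(unlabeled_comments, setfit_model.predict(unlabeled_comments), mistral_labels)
```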

Model optimization: the e5-based classifier is converted to ONNX, then optimized and quantized. We retain 98% of the base e5 model's accuracy while shrinking the model to 266Mb (instead of 1Gb) and getting a 1.9x speedup in inference latency (90ms vs. 180ms on a P100). It performs slightly worse than our fine-tuned LLM, but the latency gain is huge (90ms vs. 800ms)!

| Model | Size | Accuracy (%) | F1, class 1 (pro Russia) | Latency (ms) |
|---|---|---|---|---|
| SetFit (logistic head) | n/a | 78 | 58 | 10 |
| Fine-tuned Mistral-7B | 13-4Gb | 81 | 74 | 800 |
| Fine-tuned Llama3-8B | 13-4Gb | 80 | 77 | 800 |
| multi-e5-base | 1Gb | 79 | 70 | 180 |
| Quantized ONNX multi-e5-base | 266Mb | 78 | 69 | 90 |

TL;DR - organized notebooks

  • Notebooks should also work on Google Colab, maybe with a few adaptations for fine-tuning with Unsloth (library installs).
  • The dataset (here), the LLM LoRA adapters (here) and the final multi-e5-base ONNX model (here) are available on HuggingFace.
| Notebook | Description | Resource |
|---|---|---|
| lmd_setfit_modeling_logistic_head | Baseline model - few-shot clf using SetFit | notebook |
| lmd_mistral_synthetic_gen_testprompt | Synthetic data gen - prepare dataset - prompt tests, Mistral-7B-OpenHermes | notebook |
| lmd_mistral_synthetic_gen_run | Synthetic data gen - run (output: 2k synthetic samples) | notebook |
| lmd_mistral_synthetic_fine_tune | Fine-tuning Mistral-7B-base for classification (output: JSON label) from synthetic data, using Unsloth (QLoRA), Alpaca template | notebook |
| lmd_setfit_mistral_evaluation | Benchmark SetFit / fine-tuned Mistral (several experiments) | notebook |
| lmd_setfit_mistral_inference | Augment original dataset - voting ensemble (SetFit + fine-tuned LLM) to infer 20k unlabeled comments | notebook |
| lmd_multi-e5_train | multi-e5/bge embeddings + NN classifier on augmented dataset (several experiments) | notebook |
| e5_onnx_optimization | multi-e5 - ONNX conversion & optimization/quantization | notebook |
| lmd_e5_evaluation | Benchmark all models - focus on global accuracy, minority-class accuracy and inference latency | notebook |

Detailed guide

Baseline: few-shot learning with HuggingFace SetFit

  • Three labels: 0. support Ukraine, 1. (rather) support Russia, 2. off topic / no opinion.
  • The initial seed (annotated data) was labeled with Label Studio: around 400 samples, with 'oversampling' of the minority class to capture enough information:
  • Overall, (obvious/direct) support for Russia is rare (around 10%), and Le Monde subscribers love to digress (class 2 is the vast majority).
  • We used our previously built Faiss index / vector search to retrieve enough "pro Russia" comments among the 180k we scraped, along with random exploration.
  • We tried many optimizations of the SetFit few-shot model (not shared here): number of labels, grid search, different heads.
  • Relative to the sample size, very good performance (78% accuracy) and deployability (5ms latency), but we were not satisfied with the performance on our class of interest (pro-Russia comments).
| Notebook | Description | Resource |
|---|---|---|
| lmd_setfit_modeling_logistic_head | Baseline model - few-shot clf using SetFit | notebook |
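
A minimal sketch of the few-shot setup, using the (pre-1.0) SetFit API; the multilingual MPNet checkpoint and hyperparameters are assumptions, not necessarily the notebook's exact values:

```python
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Toy training data; the real seed has ~90 annotated samples per class
texts = ["Slava Ukraini", "La Russie a raison", "Quel temps fait-il ?"]
labels = [0, 1, 2]  # 0: pro Ukraine, 1: pro Russia, 2: off topic / no opinion
train_ds = Dataset.from_dict({"text": texts, "label": labels})

# Multilingual MPNet body (assumed checkpoint); SetFit's default head is a
# scikit-learn LogisticRegression, i.e. the "logistic head" mentioned above.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20,  # contrastive text pairs generated per sample
)
trainer.train()

preds = model.predict(["Un commentaire à classifier"])
```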

Synthetic data generation (Mistral-OpenHermes)

  • Goal: augment our initial, manually annotated seed with synthetic data so we can fine-tune an LLM to perform classification.
  • Room for improvement: a lot of effort went into prompting (context + examples + lots of tests), but we still had to discard 40% of the generated comments. Manual (random) review showed good enough results: credible comments with the right associated label.
| Notebook | Description | Resource |
|---|---|---|
| lmd_mistral_synthetic_gen_testprompt | Synthetic data gen - prepare dataset - prompt tests, Mistral-7B-OpenHermes | notebook |
| lmd_mistral_synthetic_gen_run | Synthetic data gen - run (output: 2k synthetic samples) | notebook |
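
Roughly how the generation loop can look, sketched with the plain transformers API; the OpenHermes checkpoint id, prompt wording and decoding parameters are illustrative, not the notebook's exact ones:

```python
# Hypothetical sketch: ask an instruct model for new comments of a target class,
# seeded with a few manually annotated examples of that class.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "teknium/OpenHermes-2.5-Mistral-7B"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

def generate_synthetic(label_name: str, examples: list[str], n: int = 5) -> str:
    shots = "\n".join(f"- {e}" for e in examples)
    messages = [
        {"role": "system", "content": "You write realistic French newspaper-reader comments about the war in Ukraine."},
        {"role": "user", "content": f"Here are comments labeled '{label_name}':\n{shots}\n\n"
                                    f"Write {n} new, distinct comments with the same stance, one per line."},
    ]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    output = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.8)
    # Only the newly generated tokens; the generated lines still need parsing/filtering,
    # which is where the ~40% discard rate mentioned above comes in.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```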

Mistral-7B-base fine-tuning (using Unsloth): outputs JSON with predicted class label

  • We are not using LLM embeddings with a classification layer. Instead, we fine-tune our model on annotated + synthetic data so that, given a prompt (instruction + comment), it predicts a label: {label: 0}, {label: 1} or {label: 2}.
  • Our Alpaca-like template showed good performance with Mistral-7B base v0.1 (no improvement with the more recently released v0.2).
  • The LLM as a predictor shows very good accuracy (81%) and, most importantly, performs well on our minority class.
  • Could we use it to extend our dataset? We have nearly 180k unlabeled comments that could be used to train a standard classifier!
| Notebook | Description | Resource |
|---|---|---|
| lmd_mistral_synthetic_fine_tune | Fine-tuning Mistral-7B-base for classification (output: JSON label) with synthetic data, using Unsloth (QLoRA), Alpaca template | notebook |
| lmd_setfit_mistral_evaluation | Benchmark SetFit / fine-tuned Mistral (several experiments) | notebook |
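
A condensed sketch of the Unsloth QLoRA setup and the Alpaca-style formatting that teaches the model to answer with a tiny JSON label; the model id, LoRA hyperparameters and instruction wording are assumptions:

```python
from unsloth import FastLanguageModel

# 4-bit base model loaded through Unsloth (checkpoint id assumed)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Attach LoRA adapters (QLoRA); r/alpha/target modules are illustrative values
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Alpaca-like template; the expected completion is a JSON object such as {"label": 1}
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\n"
    "Classify the stance of this reader comment on the war in Ukraine. "
    'Answer with a JSON object {{"label": 0, 1 or 2}}.\n\n'
    "### Input:\n{comment}\n\n"
    "### Response:\n{response}"
)

def format_example(comment: str, label: int) -> str:
    response = f'{{"label": {label}}}'
    return ALPACA_TEMPLATE.format(comment=comment, response=response) + tokenizer.eos_token
```

Training then typically goes through trl's SFTTrainer on the formatted texts, as in the standard Unsloth notebooks; at inference time the JSON answer is parsed back into a class id.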

Multi-e5-base embedding-based classifier - ONNX & latency optimization

  • End goal: retain enough accuracy while lowering inference time (900ms for the fine-tuned LLM).
  • Train a classifier on top of multi-e5 embeddings (bge-m3 was tested as well). Several dataset compositions were tried (synthetic data and/or LLM-predicted labels and/or the initial seed); see the sketch after this list.
  • Our accuracy is good enough, especially on the minority class, considering we trained on 20k synthetic/LLM-predicted samples.
  • The final model is converted to ONNX, then quantized & optimized -> 80ms average latency (see the Optimum sketch below).
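
A minimal sketch of that classifier: frozen multilingual-e5-base embeddings feeding a small torch head trained with a class-weighted loss. Dimensions, class weights and hyperparameters below are illustrative, not the repo's tuned values:

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("intfloat/multilingual-e5-base")  # 768-dim embeddings

def embed(texts: list[str]) -> torch.Tensor:
    # e5 models expect a prefix; "query: " is the usual choice for classification inputs
    return torch.tensor(encoder.encode([f"query: {t}" for t in texts], normalize_embeddings=True))

# Small classification head on top of the frozen encoder
head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Dropout(0.2), nn.Linear(256, 3))

# Weighted loss: up-weight the rare "pro Russia" class (index 1); weights here are made up
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 3.0, 1.0]))
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def train_step(texts: list[str], labels: list[int]) -> float:
    X, y = embed(texts), torch.tensor(labels)
    optimizer.zero_grad()
    loss = criterion(head(X), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```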
| Notebook | Description | Resource |
|---|---|---|
| lmd_setfit_mistral_inference | Augment original dataset - voting ensemble (SetFit + fine-tuned LLM) to infer 20k unlabeled comments | notebook |
| lmd_multi-e5_train | multi-e5/bge embeddings + NN classifier on augmented dataset (several experiments) | notebook |
| e5_onnx_optimization | multi-e5 - ONNX conversion & optimization/quantization | notebook |
| lmd_e5_evaluation | Benchmark all models - focus on global accuracy, minority-class accuracy and inference latency | notebook |
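
The ONNX step can be sketched with HuggingFace Optimum as below; the exact optimization level and quantization config used in the repo may differ (a CPU-oriented dynamic int8 config is shown here):

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

model_id = "intfloat/multilingual-e5-base"  # assumed; the final ONNX model is on HuggingFace
onnx_dir = "e5-base-onnx"

# 1. Export the PyTorch encoder to ONNX
ort_model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)
ort_model.save_pretrained(onnx_dir)

# 2. Graph optimizations (operator fusion, etc.)
optimizer = ORTOptimizer.from_pretrained(ort_model)
optimizer.optimize(save_dir=onnx_dir, optimization_config=OptimizationConfig(optimization_level=99))

# 3. Dynamic int8 quantization - this is where most of the size reduction comes from
quantizer = ORTQuantizer.from_pretrained(onnx_dir, file_name="model_optimized.onnx")
quantizer.quantize(
    save_dir=onnx_dir,
    quantization_config=AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False),
)
```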

Resources & links

Resources I found particularly useful regarding prompting, templates, fine-tuning, embedding performance, quantization, etc.
