## lmd multi-e5-base Classifier - ONNX optimization (for CPU)

> Latency optimization of our multi-e5-base classifier, on CPU.

**Steps / end goal**
- 1. Started with our ~500 human-annotated comments (out of 200k).
- 2. Synthetic data generation (comments + labels) with Mistral OpenHermes: around 2k samples.
- 3. Prepare the instruction dataset for fine-tuning, using the Alpaca format.  
- 4. Fine-tune Mistral-7B (classification / label completion) with unsloth, on train + synthetic data.  
- 5. Run more tests on the fine-tuned model. If it is good enough, label unlabeled data to reach several thousand examples (fine-tuned model as a classifier, or weighted average with our few-shot SetFit baseline).
- 6. Extend the dataset to around 20k examples, with the fine-tuned Mistral (and/or an ensemble with SetFit) doing the classification.  
- 7. The end goal being deployment/inference performance: train a classifier on the extended dataset using bge-m3 or multi-e5 embeddings.  
- 8. Benchmark the e5-base classifier against the fine-tuned Mistral and the SetFit baseline.
- 9. Model optimization (latency) **<- we're here**
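
As an illustration of step 3, here is a minimal sketch of how one (comment, label) pair could be rendered in the Alpaca format. The template is the standard Alpaca prompt with an input field; the instruction wording, the `to_alpaca` helper and the example label are hypothetical and are not the ones used in this project.

```python
# Standard Alpaca prompt template (variant with an "Input" section).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

# Hypothetical instruction; the real wording used for fine-tuning is not shown here.
INSTRUCTION = "Classify the following comment. Answer with the label only."


def to_alpaca(comment: str, label: str = "") -> str:
    """Format one (comment, label) pair; leave label empty at inference time."""
    return ALPACA_TEMPLATE.format(instruction=INSTRUCTION, input=comment, output=label)


# Hypothetical label "0", for illustration only.
example = to_alpaca("Putin est vraiment très méchant", label="0")
print(example)
```

At training time the label fills the response slot; at inference time the slot is left empty and the model completes it.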

**Resources**  
- [MLabonne Repo](https://github.com/mlabonne/llm-course)  
- [Dataset Gen - Kaggle example](https://www.kaggle.com/code/phanisrikanth/generate-synthetic-essays-with-mistral-7b-instruct)  
- [Dataset Gen - blog w/ prompt examples](https://hendrik.works/blog/leveraging-underrepresented-data)  
- [Prepare dataset- /r/LocalLLaMA best practice classi](https://www.reddit.com/r/LocalLLaMA/comments/173o5dv/comment/k448ye1/?utm_source=reddit&utm_medium=web2x&context=3)  
- [Prepare dataset - using gpt3.5](https://medium.com/@kshitiz.sahay26/how-i-created-an-instruction-dataset-using-gpt-3-5-to-fine-tune-llama-2-for-news-classification-ed02fe41c81f) 
- [Prepare dataset - Predibase prompts for diverse fine-tuning tasks](https://predibase.com/lora-land)
- [Fine tune OpenHermes-2.5-Mistral-7B - including prompt template gen](https://towardsdatascience.com/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac)  
- [Fine tune - Unsloth colab example](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)
- [Fine tune - w/o unsloth](https://gathnex.medium.com/mistral-7b-fine-tuning-a-step-by-step-guide-52122cdbeca8) or [wandb](https://wandb.ai/vincenttu/finetuning_mistral7b/reports/Fine-tuning-Mistral-7B-with-W-B--Vmlldzo1NTc3MjMy) or [philschmid](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#6-deploy-the-llm-for-production)
- [Fine tune - impact of parameters S. Raschka](https://lightning.ai/pages/community/lora-insights/)
- [Embeddings - multilingual, latest comparison](https://towardsdatascience.com/openai-vs-open-source-multilingual-embedding-models-e5ccb7c90f05)
- [philschmid ONNX optim](https://github.com/philschmid/optimum-transformers-optimizations/blob/master/notebook.ipynb)

In [1]:
!pip install -q optimum[onnxruntime]

In [2]:
!pip install -q transformers datasets evaluate

In [3]:
import warnings
warnings.filterwarnings('ignore')

import os
from pathlib import Path
from time import perf_counter

import numpy as np

from evaluate import evaluator
from datasets import load_dataset

from optimum.onnxruntime import (
    ORTModelForSequenceClassification,
    AutoOptimizationConfig,
    ORTOptimizer,
    ORTQuantizer
)
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

from transformers import AutoTokenizer, pipeline



#### Filepaths, dataset

In [4]:
dataset_id = "gentilrenard/lmd_ukraine_comments"
model_id = "gentilrenard/multi-e5-base_lmd-comments_v1"
tokenizer_id = "intfloat/multilingual-e5-base"
onnx_path = Path("onnx")

In [5]:
%%capture
# HF Datasets format
ds = load_dataset(dataset_id)

In [6]:
# Extract train and eval datasets from DatasetDict
train_dataset = ds['train']
eval_dataset = ds['validation']

# Define our eval column with ground truth labels
eval_labels = eval_dataset["label"]

# dataset structure
print(ds)

DatasetDict({
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 139
    })
    unlabeled: Dataset({
        features: ['text', 'label'],
        num_rows: 174891
    })
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 323
    })
})


## Model optimization

We closely follow the [philschmid guide](https://github.com/philschmid/optimum-transformers-optimizations/blob/master/notebook.ipynb), the [HF docs](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization) and this [e5 optimization write-up](https://medium.com/nixiesearch/how-to-compute-llm-embeddings-3x-faster-with-model-quantization-25523d9b4ce5).  
TODO (maybe): check for an additional gain with TokenizerFast.

### Transformers to ONNX

In [7]:
# export the transformers checkpoint to ONNX (export=True replaces the older from_transformers=True)
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

config.json:   0%|          | 0.00/937 [00:00<?, ?B/s]

Framework not specified. Using pt to export the model.


model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.1.2+cpu
Overriding 1 configuration item(s)
	- use_cache -> False


tokenizer_config.json:   0%|          | 0.00/418 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

('onnx/tokenizer_config.json',
 'onnx/special_tokens_map.json',
 'onnx/sentencepiece.bpe.model',
 'onnx/added_tokens.json',
 'onnx/tokenizer.json')

In [8]:
vanilla_clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
vanilla_clf("Putin est vraiment très méchant")

[{'label': 'LABEL_0', 'score': 0.9947955012321472}]
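
The `score` in this output is a probability, not a raw logit. For a model with more than one label, the `text-classification` pipeline applies a softmax to the logits by default (unless the config declares a multi-label problem) and reports the top class. A minimal sketch of that post-processing follows; the logits and the three-class head are made up for illustration, and `top_label` is a hypothetical helper, not a pipeline function.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits):
    """Mimic the pipeline output format for a single input."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": f"LABEL_{best}", "score": probs[best]}


# Hypothetical logits for a 3-class head.
print(top_label([4.1, -1.3, -2.0]))
```

This is why small numerical changes introduced by quantization show up as slightly different scores while the predicted label usually stays the same.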

## Quantization *then* optimization

The guides referenced above apply optimization first, then quantization. That order did not work in our case, so we quantize first and then perform graph optimization. On a totally different topic though, [this](https://www.kaggle.com/code/mtitarenko/whisper-cpu-acceleration-tests#Pipe---ONNX,-quantized-+-optimized) shows no significant difference anyway.

### Quantization

In [9]:
# load onnx model
model = ORTModelForSequenceClassification.from_pretrained(onnx_path, file_name="model.onnx")

In [10]:
# create ORTQuantizer and define quantization configuration
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
quantizer = ORTQuantizer.from_pretrained(model)

# apply the quantization configuration to the model
model_quantized_path = quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)

Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: True)
Quantizing model...
Saving quantized model at: onnx (external data format: False)
Configuration saved in onnx/ort_config.json


In [11]:
# Quantization (auto) parameters used. Note that QUInt8 is used.
dqconfig.__dict__

{'is_static': False,
 'format': <QuantFormat.QOperator: 0>,
 'mode': <QuantizationMode.IntegerOps: 0>,
 'activations_dtype': <QuantType.QUInt8: 1>,
 'activations_symmetric': False,
 'weights_dtype': <QuantType.QInt8: 0>,
 'weights_symmetric': True,
 'per_channel': True,
 'reduce_range': False,
 'nodes_to_quantize': [],
 'nodes_to_exclude': [],
 'operators_to_quantize': ['Conv',
  'MatMul',
  'Attention',
  'LSTM',
  'Gather',
  'Transpose',
  'EmbedLayerNormalization'],
 'qdq_add_pair_to_weight': False,
 'qdq_dedicated_pair': False,
 'qdq_op_type_per_channel_support_to_axis': {'MatMul': 1}}
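
The dump above says activations are quantized to unsigned 8-bit with a zero point (`activations_symmetric: False`), and weights to signed 8-bit, symmetric, with one scale per channel (`per_channel: True`). A toy sketch of that arithmetic in plain Python is given below. The helper names are made up here, and this is only an illustration of the scheme, not how ONNX Runtime implements its kernels.

```python
def quantize_weights_s8(row):
    """Symmetric int8: zero point fixed at 0, one scale per output channel."""
    scale = max(abs(w) for w in row) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def quantize_activations_u8(values):
    """Asymmetric uint8: scale and zero point derived from the observed range."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must contain 0
    scale = (hi - lo) / 255 or 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point=0):
    """Map integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


# Toy 2x3 weight matrix: each row (channel) gets its own scale.
weights = [[0.02, -0.5, 0.31], [1.8, -0.04, 0.9]]
for row in weights:
    q, s = quantize_weights_s8(row)
    print(q, [round(w, 3) for w in dequantize(q, s)])

# Toy activations: the zero point shifts the range so that 0.0 maps to an integer.
acts = [-0.7, 0.0, 2.3, 5.1]
qa, sa, zp = quantize_activations_u8(acts)
print(qa, zp, [round(a, 3) for a in dequantize(qa, sa, zp)])
```

Per-channel scales keep the rounding error of a small-magnitude row independent of a large-magnitude one, which is the motivation for `per_channel=True`.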

In [12]:
# load quantized model
model = ORTModelForSequenceClassification.from_pretrained(onnx_path, file_name="model_quantized.onnx")

# create quantized pipeline
quantized_clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
quantized_clf("Putin est vraiment très méchant")

[{'label': 'LABEL_0', 'score': 0.9870927333831787}]

In [13]:
size = os.path.getsize(onnx_path / "model.onnx")/(1024*1024)
quant_size = os.path.getsize(onnx_path / "model_quantized.onnx")/(1024*1024)
print(f"model size (e5-base): {size:.2f} MB, model quantized size: {quant_size:.2f} MB")

model size (e5-base): 1060.93 MB, model quantized size: 266.40 MB
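
The two sizes are close to the 4x ratio one would expect when 4-byte fp32 weights become 1-byte int8 values. A small sketch of the arithmetic, using the figures printed above (`size_reduction` is a hypothetical helper):

```python
def size_reduction(original_mb: float, quantized_mb: float) -> dict:
    """Compression ratio and relative saving between two file sizes."""
    return {
        "ratio": original_mb / quantized_mb,
        "saved_pct": 100 * (1 - quantized_mb / original_mb),
    }


# Sizes printed by the cell above.
stats = size_reduction(1060.93, 266.40)
print(f"{stats['ratio']:.2f}x smaller, {stats['saved_pct']:.1f}% saved")
```

The ratio falls slightly short of 4x, presumably because scales, zero points and the operators left unquantized are still stored in floating point.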


### Graph Optimization

In [14]:
# load onnx quantized model
model = ORTModelForSequenceClassification.from_pretrained(onnx_path, file_name="model_quantized.onnx")

In [15]:
# create ORTOptimizer and define optimization configuration
optimization_config = AutoOptimizationConfig.O2() 
optimizer = ORTOptimizer.from_pretrained(model)

# apply the optimization configuration to the model
optimizer.optimize(
    save_dir=onnx_path,
    optimization_config=optimization_config,
)

Optimizing model...
Configuration saved in onnx/ort_config.json
Optimized model saved at: onnx (external data format: False; saved all tensor to one file: True)


PosixPath('onnx')

In [16]:
# load quant+optimized model
model = ORTModelForSequenceClassification.from_pretrained(onnx_path, file_name="model_quantized_optimized.onnx")

# create quant+optimized pipeline
q8_clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
q8_clf("Putin est vraiment très méchant")

The ONNX file model_quantized_optimized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.


[{'label': 'LABEL_0', 'score': 0.9870927333831787}]

## Performance and speed

Reminder: the non-quantized model runs at around 110 to 130 ms latency (P95 = 180 ms though), with 0.769 accuracy and a 1 GB size.  
The quantized + optimized model retains 98.1% of the vanilla model's accuracy, with 75.5% accuracy, a 266 MB size and 80 ms latency (P95 = 80 ms too): a 1.6x to 2x+ improvement in speed. Latency figures vary between runs; the cells below record a single run.

**Performance**

In [17]:
eval_ = evaluator("text-classification")
results = eval_.compute(
    model_or_pipeline=q8_clf,
    data=eval_dataset,
    metric="accuracy",
    input_column="text",
    label_column="label",
    label_mapping=model.config.label2id,
    strategy="simple",
)
print(results)

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

{'accuracy': 0.7553956834532374, 'total_time_in_seconds': 17.129535970999996, 'samples_per_second': 8.114638962510401, 'latency_in_seconds': 0.12323407173381291}
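
The timing fields in this result are derived from one another and from the 139 validation rows. A quick sketch that recomputes them from the reported total time:

```python
n_samples = 139                      # rows in the validation split
total_time = 17.129535970999996      # total_time_in_seconds reported above

samples_per_second = n_samples / total_time
latency = total_time / n_samples

print(f"{samples_per_second:.3f} samples/s, {1000 * latency:.1f} ms per sample")
```

This per-sample latency is an average over the whole validation set, with comments of varying length, so it is not directly comparable to the single-payload benchmark further down.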


In [18]:
print(f"Vanilla model: 76.9%")
print(f"Quantized model: {results['accuracy']*100:.2f}%")
print(f"The quantized model achieves {results['accuracy']/0.77*100:.2f}% accuracy of the fp32 model")

Vanilla model: 76.9%
Quantized model: 75.54%
The quantized model achieves 98.10% accuracy of the fp32 model
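
Two details are worth keeping in mind when reading these numbers: the cell prints a 76.9% baseline but divides by 0.77, and the validation split only has 139 rows. The sketch below recomputes the retention for both baseline values and adds a rough binomial standard error. This is a back-of-the-envelope check, not part of the original evaluation.

```python
import math

n = 139                       # validation rows
acc_q8 = 0.7553956834532374   # accuracy measured above
correct_q8 = round(acc_q8 * n)

# The baseline is quoted as 76.9% in the text but 0.77 is used in the division.
for baseline in (0.769, 0.77):
    print(f"baseline {baseline}: retains {100 * acc_q8 / baseline:.2f}%")

# Rough binomial standard error of an accuracy measured on n examples.
std_err = math.sqrt(acc_q8 * (1 - acc_q8) / n)
print(f"{correct_q8}/{n} correct, standard error ~ {100 * std_err:.1f} points")
```

With a standard error of a few points on a set this small, the gap between the vanilla and quantized models is within measurement noise.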


**Latency**

In [19]:
payload="""Vous êtes naif si vous croyez que seuls les Russes ont ce genre de comportement en tant de guerre...vous devez être de ceux qui croient en la guerre "propre" que les Occidentaux prétendre faire depuis 40 ans (parfois avec les Russes comme alliés d'ailleurs)."""
print(f'Payload sequence length: {len(tokenizer(payload)["input_ids"])}')

Payload sequence length: 70


In [20]:
def measure_latency(pipe):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(payload)
    # Timed run
    for _ in range(200):
        start_time = perf_counter()
        _ =  pipe(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    time_p95_ms = 1000 * np.percentile(latencies,95)
    return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +\\- {time_std_ms:.2f};", time_p95_ms


In [21]:
vanilla_model=measure_latency(vanilla_clf)
quantized_model=measure_latency(q8_clf)

print(f"Vanilla model: {vanilla_model[0]}")
print(f"Quantized model: {quantized_model[0]}")
print(f"Improvement through quantization: {round(vanilla_model[1]/quantized_model[1],2)}x")

Vanilla model: P95 latency (ms) - 220.64800989992352; Average latency (ms) - 136.08 +\- 36.95;
Quantized model: P95 latency (ms) - 138.0240730499395; Average latency (ms) - 97.38 +\- 21.88;
Improvement through quantization: 1.6x
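
The 1.6x printed above is a ratio of P95 latencies; the ratio of means is lower. Below is a minimal standard-library sketch of the same benchmark that takes the payload as an argument and returns numbers rather than a formatted string, so both ratios can be compared. `benchmark` and `speedup` are hypothetical helpers, and the stand-in workload should be replaced with `vanilla_clf` / `q8_clf` in the notebook.

```python
import statistics
from time import perf_counter


def benchmark(fn, payload, warmup=10, runs=200):
    """Time fn(payload) and return mean, std and P95 in milliseconds."""
    for _ in range(warmup):
        fn(payload)
    latencies_ms = []
    for _ in range(runs):
        start = perf_counter()
        fn(payload)
        latencies_ms.append(1000 * (perf_counter() - start))
    # method="inclusive" interpolates linearly, like np.percentile's default
    p95 = statistics.quantiles(latencies_ms, n=100, method="inclusive")[94]
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "std_ms": statistics.pstdev(latencies_ms),
        "p95_ms": p95,
    }


def speedup(baseline, candidate):
    """Ratio baseline/candidate for the mean and the P95."""
    return {k: baseline[k] / candidate[k] for k in ("mean_ms", "p95_ms")}


# Figures printed by the run above, to compare mean and P95 speed-ups.
vanilla = {"mean_ms": 136.08, "std_ms": 36.95, "p95_ms": 220.648}
q8 = {"mean_ms": 97.38, "std_ms": 21.88, "p95_ms": 138.024}
print({k: round(v, 2) for k, v in speedup(vanilla, q8).items()})

# Stand-in workload so the sketch runs without the models loaded.
print(benchmark(lambda text: sum(map(ord, text)), "hello"))
```

Reporting both ratios makes it clearer how much of the gain comes from a faster typical call and how much from a shorter tail.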


## Push optimized model to Hub

In [22]:
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login

user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("hf_key")
login(token=hf_token)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [23]:
tmp_store_directory = "onnx_hub_repo"
repository_id = "gentilrenard/multi-e5-base_lmd-comments_q8_onnx"

In [24]:
model = ORTModelForSequenceClassification.from_pretrained(onnx_path,file_name="model_quantized_optimized.onnx")
tokenizer = AutoTokenizer.from_pretrained(onnx_path)

model.save_pretrained(tmp_store_directory)
tokenizer.save_pretrained(tmp_store_directory)

The ONNX file model_quantized_optimized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.


('onnx_hub_repo/tokenizer_config.json',
 'onnx_hub_repo/special_tokens_map.json',
 'onnx_hub_repo/sentencepiece.bpe.model',
 'onnx_hub_repo/added_tokens.json',
 'onnx_hub_repo/tokenizer.json')

In [25]:
model.push_to_hub(
    tmp_store_directory,
    repository_id=repository_id,
    use_auth_token=True
)

model_quantized_optimized.onnx:   0%|          | 0.00/279M [00:00<?, ?B/s]