# Sparsifying the BGE-Small Model for Embeddings

BGE models are currently state-of-the-art models for embeddings on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). In this notebook, we will sparsify the [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) model using [Sparsify's](https://github.com/neuralmagic/sparsify) INT8 quantization via its one-shot method. We will also evaluate its accuracy and speed improvements vs. its dense variant after sparsification.To learn more about one-shot, refer to this [guide](https://github.com/neuralmagic/sparsify/blob/main/docs/one-shot-experiment-guide.md).

In [1]:
!pip install -U deepsparse-nightly[sentence_transformers] -q
!pip install git+https://github.com/neuralmagic/sparsify.git -q
!pip install sentence-transformers evaluate -q

# Optimum DeepSparse

In order to use the Dense BGE model in Sparsify, we first have to convert it into ONNX using the Optimum DeepSparse library.

In [2]:
from optimum.deepsparse import DeepSparseModelForFeatureExtraction
from transformers.onnx.utils import get_preprocessor
from pathlib import Path

model_id = "BAAI/bge-small-en-v1.5"

# load model and convert to onnx
model = DeepSparseModelForFeatureExtraction.from_pretrained(model_id, export=True)
tokenizer = get_preprocessor(model_id)

# save onnx checkpoint and tokenizer
onnx_path = Path("bge-small-en-v1.5-dense")
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
	- default: The default ONNX variant.
Using framework PyTorch: 2.0.0+cu117
Overriding 1 configuration item(s)
	- use_cache -> False


verbose: False, log level: Level.ERROR



('bge-small-en-v1.5-dense/tokenizer_config.json',
 'bge-small-en-v1.5-dense/special_tokens_map.json',
 'bge-small-en-v1.5-dense/vocab.txt',
 'bge-small-en-v1.5-dense/added_tokens.json',
 'bge-small-en-v1.5-dense/tokenizer.json')

# Create NPZ files

Sparsify's One-Shot is a post-training sparsification method that utilizes sampled data (~1,000 samples is satisfactory) from a calibration dataset resulting in no further training time and much faster sparsification times compared with Training-Aware Experiments.

The samples need to be stored in the .npz format, which is a file format based on the NumPy library. In the BERT model architecture (such as in the BGE models), Sparsify one-shot requires the input_ids, attention_mask and token_type_ids of each data sample from your dataset to be stored as a dictionary prior to .npz storage. For more information, refer to the Sparsify guide.

For our example, we'll use the popular semantic textual similarity benchmark (stsb) dataset for calibration. Now, let's extract 1,000 samples from the train split and convert them to .npz:

In [3]:
import os
import numpy as np
from datasets import load_dataset
from sentence_transformers import InputExample

# Load the dataset
dataset = load_dataset("glue", "stsb", split="train")

# Adjusted to get the first 1000 examples
n_examples = 1000

# Create the "data" directory if it doesn't exist
if not os.path.exists('data'):
    os.makedirs('data')

# Define a function to create NPZ dictionaries
def create_npz_data(texts, index):
    # Tokenize the texts
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')

    # Extract input_ids, attention_mask, and token_type_ids
    input_ids = inputs['input_ids'].cpu().numpy()[0]
    attention_mask = inputs['attention_mask'].cpu().numpy()[0]
    token_type_ids = inputs.get('token_type_ids', None)
    if token_type_ids is not None:
        token_type_ids = token_type_ids.cpu().numpy()[0]

    # Create the NPZ dictionary
    npz_data = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids if token_type_ids is not None else np.array([]),  # Handle cases where token_type_ids are not present
    }

    # Save the dictionary as an NPZ file
    npz_file_path = f'data/input_{index:04d}.npz'
    np.savez(npz_file_path, **npz_data)

# Create NPZ dictionaries and save them individually
train_examples = []
for i in range(n_examples):

    example = dataset[i]
    train_examples.append(InputExample(texts=[example['sentence1'], example['sentence2']]))

    # Extract texts from InputExample instances
    texts = [example.texts for example in train_examples]

    # Create the NPZ dictionary and save it
    create_npz_data(texts, i)

print(f'Saved {n_examples} npz files to data/')

Saved 1000 npz files to data/


# Login to Sparsify

In [4]:
!sparsify.login EJMfcp88Wpp7efA99WjAfSAFH6jhwfG9

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


INFO:sparsify.login:Logging into sparsify...
INFO:sparsify.utils.helpers:Successfully authenticated with Neural Magic Account API key
INFO:sparsify.login:sparsifyml version 1.6 is already installed, skipping installation from neuralmagic pypi server
INFO:sparsify.login:Logged in successfully, sparsify setup is complete.


# Run Sparsify One-Shot

Pass the model directory and set optim-level to 0.2 which initiate quantization with a single CLI command:

In [5]:
!sparsify.run one-shot --use-case nlp-embeddings --model ./bge-small-en-v1.5-dense/model.onnx --data ./data --optim-level 0.2

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


INFO:sparsify.utils.helpers:Successfully authenticated with Neural Magic Account API key
INFO:sparsify.login:sparsifyml version 1.6 is already installed, skipping installation from neuralmagic pypi server
2023-11-14 09:18:47 deepsparse.utils.onnx INFO     Generating input 'X', type = float32, shape = [1, 3, 32, 32]
INFO:deepsparse.utils.onnx:Generating input 'X', type = float32, shape = [1, 3, 32, 32]
INFO:sparsifyml.one_shot.sparsification.obcq.fast_obcq_modifier:Folded 0 Conv-BatchNormalization blocks
INFO:sparsifyml.one_shot.sparsification.obcq.base_obcq_modifier:FastOBCQModifier: starting compression on layers: ['/encoder/layer.0/attention/self/query/MatMul', '/encoder/layer.0/attention/self/value/MatMul', '/encoder/layer.0/attention/output/dense/MatMul', '/encoder/layer.0/intermediate/dense/MatMul', '/encoder/layer.0/output/dense/MatMul', '/encoder/layer.1/attention/self/key/MatMul', '/encoder/layer.1/attention/self/query/MatMul', '/encoder/layer.1/attention/self/value/MatMul', '/

In [6]:
!mv deployment bge-small-en-v1.5-quant
!cp bge-small-en-v1.5-dense/tokenizer.json bge-small-en-v1.5-quant/
!cp bge-small-en-v1.5-dense/config.json bge-small-en-v1.5-quant/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


# Testing the DeepSparseSentenceTransformers Embeddings Pipeline 

In [8]:
from deepsparse.sentence_transformers import DeepSparseSentenceTransformer

quant = "bge-small-en-v1.5-quant"
sample_text = "I love quantized embedding models!"

quant_pipe = DeepSparseSentenceTransformer(quant, export=False)
quant_infer = quant_pipe.encode(sample_text)

# Get Shapes
print(quant_infer.shape)



Batches:   0%|          | 0/1 [00:00<?, ?it/s]

(384,)


# Evaluate the Accuracy of the Dense vs. Quantized BGE Models on the STSB Dataset

The [DeepSparseSentenceTransformer](https://github.com/neuralmagic/deepsparse/tree/main/src/deepsparse/sentence_transformers) integration allows easy access for compressed models to be evaluated on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). Let's compare the performance of the dense vs. quantized models on the STSB validation split:

In [9]:
!pip install mteb -q

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [11]:
from mteb import MTEB

# Specify the model to use
quant = "bge-small-en-v1.5-quant"
dense = "BAAI/bge-small-en-v1.5"

# DeepSparse Model Evaluation
from deepsparse.sentence_transformers import DeepSparseSentenceTransformer
model = DeepSparseSentenceTransformer(quant, export=False)
evaluation = MTEB(tasks=["STSBenchmark"])
results_ds = evaluation.run(model, output_folder=f"results/ds-{quant}")
print(results_ds)

# Original SentenceTransformers Model Evaluation
import sentence_transformers
model = sentence_transformers.SentenceTransformer(dense)
evaluation = MTEB(tasks=["STSBenchmark"])
results_st = evaluation.run(model, output_folder=f"results/st-{dense}")
print(results_st)



Batches:   0%|          | 0/24 [00:00<?, ?it/s]

Batches:   0%|          | 0/24 [00:00<?, ?it/s]

Batches:   0%|          | 0/22 [00:00<?, ?it/s]

Batches:   0%|          | 0/22 [00:00<?, ?it/s]

{'STSBenchmark': {'mteb_version': '1.1.1', 'dataset_revision': 'b0fddb56ed78048fa8b90373c8a3cfc37b684831', 'mteb_dataset_name': 'STSBenchmark', 'validation': {'cos_sim': {'pearson': 0.8794062860922744, 'spearman': 0.8844550053896325}, 'manhattan': {'pearson': 0.8878536584253526, 'spearman': 0.8895857544820187}, 'euclidean': {'pearson': 0.887834459531111, 'spearman': 0.8895413237473978}, 'evaluation_time': 14.8}, 'test': {'cos_sim': {'pearson': 0.8473532103761534, 'spearman': 0.8583765105094451}, 'manhattan': {'pearson': 0.8608693121882481, 'spearman': 0.8616294524581138}, 'euclidean': {'pearson': 0.8628642729555878, 'spearman': 0.8631236609576122}, 'evaluation_time': 11.84}}}


{'STSBenchmark': {'mteb_version': '1.1.1', 'dataset_revision': 'b0fddb56ed78048fa8b90373c8a3cfc37b684831', 'mteb_dataset_name': 'STSBenchmark', 'validation': {'cos_sim': {'pearson': 0.8828211766495108, 'spearman': 0.8892465763120051}, 'manhattan': {'pearson': 0.886201824808084, 'spearman': 0.8907627276162985}, 'euclidean': {'pearson': 0.8868149931196716, 'spearman': 0.8913096186609996}, 'evaluation_time': 4.43}, 'test': {'cos_sim': {'pearson': 0.8431285123201885, 'spearman': 0.8586295017067542}, 'manhattan': {'pearson': 0.854393933014824, 'spearman': 0.8591549232752812}, 'euclidean': {'pearson': 0.8565471782504085, 'spearman': 0.8612847755343875}, 'evaluation_time': 1.19}}}


The quantized model achieves 99.9% recovery when compared to the dense model on MTEB's `cos_sim` `spearman` metric.

# Benchmark the Dense PyTorch vs. Quantized ONNX Model for Latency

In addition to the MTEB benchmarking, the integration includes a custom script for benchmarking latency and throughput, let's test how the dense vs. quantized model perform against each other. First, git clone deepsparse:

In [13]:
!git clone https://github.com/neuralmagic/deepsparse.git

Cloning into 'deepsparse'...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


remote: Enumerating objects: 18974, done.[K
remote: Counting objects: 100% (5600/5600), done.[K
remote: Compressing objects: 100% (1547/1547), done.[K
remote: Total 18974 (delta 4935), reused 4451 (delta 4037), pack-reused 13374[K
Receiving objects: 100% (18974/18974), 139.80 MiB | 31.52 MiB/s, done.
Resolving deltas: 100% (13356/13356), done.


Now, run this CLI command to benchmmark the models' latency on encoding 100 sentences on a max sequence length=512 and batch size=1:

In [12]:
!python deepsparse/src/deepsparse/sentence_transformers/benchmark_encoding.py --base_model BAAI/bge-small-en-v1.5 --sparse_model bge-small-en-v1.5-quant

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20231110 COMMUNITY | (6c521a73) (release) (optimized) (system=avx2_vnni, binary=avx2)

[SentenceTransformer]
Batch size: 1, Sentence length: 700
Latency: 100 sentences in 23.41 seconds
Throughput: 4.27 sentences/second
Batches: 100%|████████████████████████████████| 100/100 [00:07<00:00, 14.13it/s]

[DeepSparse Optimized]
Batch size: 1, Sentence length: 700
Latency: 100 sentences in 7.09 seconds
Throughput: 14.11 sentences/second


The quantized BGE model is able to improve latency performance against the dense variant on a 10 core laptop by 3.3X! Furthermore, on optimized hardware, especially avx512 with VNNI instructions, up to 5X improvement can be observed.