# Sparsifying the BGE-Small for Embeddings

BGE models are currently state-of-the-art models for embeddings on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). In this notebook, we will sparsify the [bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) model using [Sparsify's](https://github.com/neuralmagic/sparsify) INT8 quantization and unstructured pruning via its One-Shot method. We will also evaluate its accuracy and speed improvements vs. its dense variant after sparsification.To learn more about One-Shot, refer to this [guide](https://github.com/neuralmagic/sparsify/blob/main/docs/one-shot-experiment-guide.md).

In [1]:
!pip install git+https://github.com/neuralmagic/optimum-deepsparse.git -q
!pip install git+https://github.com/neuralmagic/sparsify.git -q
!pip install sentence-transformers evaluate -q

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m46.3/46.3 MB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m301.0/301.0 kB[0m [31m31.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m71.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m46.0/46.0 kB[0m [31m5.7 MB/s[0m eta [36m0:00:00

# Optimum DeepSparse

In order to use the Dense BGE model in Sparsify, we first have to convert it into ONNX using the Optimum DeepSparse library.

In [2]:
from optimum.deepsparse import DeepSparseModelForFeatureExtraction
from transformers.onnx.utils import get_preprocessor
from pathlib import Path

model_id = "BAAI/bge-small-en-v1.5"

# load model and convert to onnx
model = DeepSparseModelForFeatureExtraction.from_pretrained(model_id, export=True)
tokenizer = get_preprocessor(model_id)

# save onnx checkpoint and tokenizer
onnx_path = Path(f"dense-bge-small-en-v1.5")
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

Downloading (…)lve/main/config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

Framework not specified. Using pt to export to ONNX.


Downloading pytorch_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/394 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Using the export variant default. Available variants are:
	- default: The default ONNX variant.
Using framework PyTorch: 2.0.0+cu117
Overriding 1 configuration item(s)
	- use_cache -> False


verbose: False, log level: Level.ERROR



('dense-bge-small-en-v1.5/tokenizer_config.json',
 'dense-bge-small-en-v1.5/special_tokens_map.json',
 'dense-bge-small-en-v1.5/vocab.txt',
 'dense-bge-small-en-v1.5/added_tokens.json',
 'dense-bge-small-en-v1.5/tokenizer.json')

# Create NPZ files

Sparsify's One-Shot is a post-training sparsification method that utilizes sampled data (~1,000 samples is satisfactory) from a calibration dataset resulting in no further training time and much faster sparsification times compared with Training-Aware Experiments.

The samples need to be stored in the .npz format, which is a file format based on the NumPy library. In the BERT model architecture (such as in the BGE models), Sparsify one-shot requires the input_ids, attention_mask and token_type_ids of each data sample from your dataset to be stored as a dictionary prior to .npz storage. For more information, refer to the Sparsify guide.

For our example, we'll use the popular semantic textual similarity benchmark (stsb) dataset for calibration. Now, let's extract 1,000 samples from the train split and convert them to .npz:

In [3]:
import os
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer
from sentence_transformers import InputExample
import torch

# Load the dataset
dataset = load_dataset("glue", "stsb", split="train")

# Adjusted to get the first 1000 examples
n_examples = 1000

# Create the "data" directory if it doesn't exist
if not os.path.exists('data'):
    os.makedirs('data')

# Load AutoTokenizer from Hugging Face model repository
model_name = "BAAI/bge-small-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define a function to create NPZ dictionaries
def create_npz_data(texts, index):
    # Tokenize the texts
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt').to("cpu")

    # Extract input_ids, attention_mask, and token_type_ids
    input_ids = inputs['input_ids'].cpu().numpy()[0]
    attention_mask = inputs['attention_mask'].cpu().numpy()[0]
    token_type_ids = inputs.get('token_type_ids', None)
    if token_type_ids is not None:
        token_type_ids = token_type_ids.cpu().numpy()[0]

    # Create the NPZ dictionary
    npz_data = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids if token_type_ids is not None else np.array([]),  # Handle cases where token_type_ids are not present
    }

    # Save the dictionary as an NPZ file
    npz_file_path = f'data/input_{index:04d}.npz'
    np.savez(npz_file_path, **npz_data)

# Create NPZ dictionaries and save them individually
train_examples = []
for i in range(n_examples):

    example = dataset[i]
    train_examples.append(InputExample(texts=[example['sentence1'], example['sentence2']]))

    # Extract texts from InputExample instances
    texts = [example.texts for example in train_examples]

    # Create the NPZ dictionary and save it
    create_npz_data(texts, i)

print(f'Saved {n_examples} npz files to data/')

Downloading builder script:   0%|          | 0.00/28.8k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/28.7k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/27.9k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/803k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5749 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1379 [00:00<?, ? examples/s]

Downloading (…)8683f/.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)19c878683f/README.md:   0%|          | 0.00/89.1k [00:00<?, ?B/s]

Downloading (…)c878683f/config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Downloading (…)8683f/tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/394 [00:00<?, ?B/s]

Downloading (…)19c878683f/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)878683f/modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

Saved 1000 npz files to data/


# Login to Sparsify

In [4]:
!sparsify.login EJMfcp88Wpp7efA99WjAfSAFH6jhwfG9

INFO:sparsify.login:Logging into sparsify...
INFO:sparsify.utils.helpers:Successfully authenticated with Neural Magic Account API key
INFO:sparsify.login:Installing sparsifyml version 1.6 from neuralmagic pypi server
Looking in indexes: https://nm:****@pypi.neuralmagic.com
Collecting sparsifyml-nightly~=1.6
  Downloading https://pypi.neuralmagic.com/packages/sparsifyml_nightly-1.6.0.20230921-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (855 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m855.2/855.2 kB[0m [31m488.2 kB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sparsifyml-nightly
Successfully installed sparsifyml-nightly-1.6.0.20230921
INFO:sparsify.login:Logged in successfully, sparsify setup is complete.


# Run Sparsify One-Shot

Pass the model directory and set optim-level to 0.5 which will set unstructured pruning (sparsity) at 50% and INT8 quantization with a single CLI command.

In [5]:
!sparsify.run one-shot --use-case nlp-embeddings --model ./dense-bge-small-en-v1.5/model.onnx --data ./data --optim-level 0.5

INFO:sparsify.utils.helpers:Successfully authenticated with Neural Magic Account API key
INFO:sparsify.login:sparsifyml version 1.6 is already installed, skipping installation from neuralmagic pypi server
2023-09-25 13:21:54 deepsparse.utils.onnx INFO     Generating input 'X', type = float32, shape = [1, 3, 32, 32]
INFO:deepsparse.utils.onnx:Generating input 'X', type = float32, shape = [1, 3, 32, 32]
INFO:sparsifyml.one_shot.sparsification.obcq.fast_obcq_modifier:Folded 0 Conv-BatchNormalization blocks
INFO:sparsifyml.one_shot.sparsification.obcq.base_obcq_modifier:FastOBCQModifier: starting compression on layers: ['/encoder/layer.0/attention/self/query/MatMul', '/encoder/layer.0/attention/self/value/MatMul', '/encoder/layer.0/attention/output/dense/MatMul', '/encoder/layer.0/intermediate/dense/MatMul', '/encoder/layer.0/output/dense/MatMul', '/encoder/layer.1/attention/self/key/MatMul', '/encoder/layer.1/attention/self/query/MatMul', '/encoder/layer.1/attention/self/value/MatMul', '/

In [6]:
!mv deployment sparse-bge-small-en-v1.5
!cp dense-bge-small-en-v1.5/tokenizer.json sparse-bge-small-en-v1.5/
!cp dense-bge-small-en-v1.5/config.json sparse-bge-small-en-v1.5/

# Create a Custom Sentence Embeddings Pipeline



In [50]:
from transformers import Pipeline
import torch.nn.functional as F
import torch

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

class SentenceEmbeddingPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs):
        encoded_inputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
        return encoded_inputs

    def _forward(self, model_inputs):
        outputs = self.model(**model_inputs)
        return {"outputs": outputs, "attention_mask": model_inputs["attention_mask"]}

    def postprocess(self, model_outputs):
        # Perform pooling
        sentence_embeddings = mean_pooling(model_outputs["outputs"], model_outputs['attention_mask'])
        # Normalize embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        return sentence_embeddings

In [52]:
dense = "dense-bge-small-en-v1.5"
sparse = "sparse-bge-small-en-v1.5"

dense_model = DeepSparseModelForFeatureExtraction.from_pretrained(dense, export=False)
tokenizer = get_preprocessor(dense)

sparse_model = DeepSparseModelForFeatureExtraction.from_pretrained(sparse, export=False)
tokenizer = get_preprocessor(sparse)

dense_pipe = SentenceEmbeddingPipeline(model=dense_model, tokenizer=tokenizer)
sparse_pipe = SentenceEmbeddingPipeline(model=sparse_model, tokenizer=tokenizer)

sample_text = "I love sparse embedding models!"

dense_infer = dense_pipe(sample_text)
sparse_infer = sparse_pipe(sample_text)

# Get Shapes
print(dense_infer.shape)
print(sparse_infer.shape)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Model is dynamic and has no shapes defined, skipping reshape..
Model is dynamic and has no shapes defined, skipping reshape..


torch.Size([1, 384])
torch.Size([1, 384])


# Evaluate the Dense vs. Sparse BGE Models for Accuracy on STSB

In [9]:
from datasets import load_dataset
from evaluate import load
import torch

eval_dataset = load_dataset("glue","stsb",split="validation")
metric = load('glue', 'stsb')

def compute_sentence_similarity(sentence_1, sentence_2, pipeline):
    embedding_1 = pipeline(sentence_1)
    embedding_2 = pipeline(sentence_2)

    return torch.nn.functional.cosine_similarity(embedding_1, embedding_2, dim=1)

def evaluate_stsb(example):
    default = compute_sentence_similarity(example["sentence1"], example["sentence2"], dense_pipe)
    sparse = compute_sentence_similarity(example["sentence1"], example["sentence2"], sparse_pipe)
    return {
        'reference': (example["label"] - 1) / (5 - 1),
        'default': float(default),
        'sparse': float(sparse),
        }

# run evaluation
result = eval_dataset.map(evaluate_stsb)

# compute metrics
default_acc = metric.compute(predictions=result["default"], references=result["reference"])
sparse = metric.compute(predictions=result["sparse"], references=result["reference"])

print(f"dense model: pearson={default_acc['pearson']}%")
print(f"sparse model: pearson={sparse['pearson']}%")
print(f"The sparse model achieves {round(sparse['pearson']/default_acc['pearson'],2)*100:.2f}% accuracy of the dense model")

Downloading builder script:   0%|          | 0.00/5.75k [00:00<?, ?B/s]



Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

dense model: pearson=0.8913432543187466%
sparse model: pearson=0.8563085613094055%
The sparse model achieves 96.00% accuracy of the dense model


# Benchmark the Dense PyTorch vs. Sparse ONNX Model for Latency

In [54]:
import subprocess
from time import perf_counter
import numpy as np

payload = "Greetings, I'm Jane the robot, residing in the vibrant city of Seattle, USA. " \
    "My journey involves crafting innovative solutions as a Software Architect, " \
    "driving technological progress through collaborative endeavors and cutting-edge research. " \
    "My experience spans across diverse domains, from optimizing supply chain logistics " \
    "to enhancing medical diagnostics. Passionate about exploring AI ethics and " \
    "the human-machine partnership, I'm constantly evolving to pioneer the future of technology. " \
    "In my spare time, I enjoy exploring the beautiful Pacific Northwest, " \
    "with its majestic mountains and pristine forests. I'm an avid hiker and often find " \
    "myself on the trails, seeking inspiration from nature's wonders. " \
    "When it comes to my work, I believe that artificial intelligence " \
    "has the potential to transform industries and improve people's lives. " \
    "I'm particularly interested in natural language processing and " \
    "machine learning, and I'm dedicated to pushing the boundaries of what AI can achieve. " \
    "In addition to my technical pursuits, I'm also a strong advocate " \
    "for diversity and inclusion in the tech industry. I believe that a diverse " \
    "and inclusive workforce leads to better innovation and more equitable " \
    "technological solutions for society. " \
    "I'm an enthusiastic problem solver and love tackling complex challenges. " \
    "My approach to problem-solving involves a combination of creativity, " \
    "data-driven analysis, and a keen understanding of user needs. " \
    "I'm always eager to collaborate with like-minded individuals " \
    "to bring innovative ideas to life. " \
    "When I'm not working on AI projects or exploring the outdoors, " \
    "I can often be found in the kitchen, experimenting with new recipes " \
    "and cooking up delicious meals for friends and family. " \
    "I believe that the joy of creating extends beyond technology " \
    "and into the realms of culinary art. " \
    "My aspiration is to continue pushing the boundaries " \
    "of what AI can achieve while making a positive impact on society."

print(f'Payload sequence length: {len(tokenizer(payload)["input_ids"])}')

def measure_latency(pipe):
    latencies = []

    # Timed run
    for _ in range(100):
        start_time = perf_counter()
        _ = pipe(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)

    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    time_p95_ms = 1000 * np.percentile(latencies, 95)
    return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f};", time_p95_ms

dense_model = measure_latency(dense_pipe)
quantized_model = measure_latency(sparse_pipe)

# Get the number of CPU cores using the nproc command
num_cores = int(subprocess.check_output("nproc").decode().strip())

print(f"dense model latency: {dense_model[0]}")
print(f"sparse model latency: {quantized_model[0]}")
print(f"Latency improvement through one-shot on {num_cores} CPU cores: {round(dense_model[1] / quantized_model[1], 2)}x")


Payload sequence length: 367
dense model latency: P95 latency (ms) - 810.6698678000611; Average latency (ms) - 359.64 +\- 171.19;
sparse model latency: P95 latency (ms) - 375.64537654984633; Average latency (ms) - 321.86 +\- 39.60;
Latency improvement through one-shot on 2 CPU cores: 2.16x
