# Using Sparsify One-Shot for Sparsifying MiniLM for a Semantic Search Use-Case

In this notebook, we aim to explore the capabilities of the innovative Sparsify one-shot method for quantizing a dense [MiniLM](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, thereby simplifying the DevOps workflow. We will also walk you through the process of abstracting ONNX exportation and optimization by utilizing the [DeepSparse Optimum integration](https://github.com/neuralmagic/optimum-deepsparse). Finally, we will evaluate and compare the accuracy and latency of both the dense and quantized MiniLM models. To demonstrate their effectiveness, we'll employ the Weaviate vector database to efficiently index and search embeddings, underlining the preservation of MiniLM's semantic search functionalities despite the use of INT8 quantization and one-shot weight pruning.

## Installation

We'll install the `optimum-deepsparse` library for ONNX exporting, `sentence-transformers` for generating embeddings, the `evaluate` repo for validating accuracy on the stsb dataset, `sparsify` and the `weaviate` Python client.

In [1]:
!pip install git+https://github.com/neuralmagic/optimum-deepsparse.git
!pip install git+https://github.com/neuralmagic/sparsify.git
!pip install sentence-transformers evaluate
!pip install weaviate-client

Collecting git+https://github.com/neuralmagic/optimum-deepsparse.git
  Cloning https://github.com/neuralmagic/optimum-deepsparse.git to /tmp/pip-req-build-p9eoz4nc
  Running command git clone --filter=blob:none --quiet https://github.com/neuralmagic/optimum-deepsparse.git /tmp/pip-req-build-p9eoz4nc
  Resolved https://github.com/neuralmagic/optimum-deepsparse.git to commit 974aa296fdcc2512b26b3e1ed9fbf9f63c85b7a3
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting deepsparse-nightly (from optimum-deepsparse==0.1.0.dev0)
  Downloading deepsparse_nightly-1.6.0.20230825-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (44.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.7/44.7 MB[0m [31m18.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting optimum[exporters]>=1.8.0 (from optimum-deepsparse==0.1.0.dev0)
  Downloading optimum-1.1

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 kB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... [?25l[?25hdone
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125937 sha256=fbbf809b2470b9d2d5daec8558973b9fb7d7a6a2f7116a85bad52c2a8b28ab45
  Stored in directory: /root/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully b

## Convert PyTorch Weights to ONNX

In [2]:
from optimum.deepsparse import DeepSparseModelForFeatureExtraction
from transformers.onnx.utils import get_preprocessor
from pathlib import Path

model_id="sentence-transformers/all-MiniLM-L6-v2"

# load model and convert to onnx
model = DeepSparseModelForFeatureExtraction.from_pretrained(model_id, export=True)
tokenizer = get_preprocessor(model_id)

# save onnx checkpoint and tokenizer
onnx_path = Path("dense_onnx")
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

Downloading (…)lve/main/config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Framework not specified. Using pt to export to ONNX.


Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Using framework PyTorch: 2.0.0+cu117
Overriding 1 configuration item(s)
	- use_cache -> False


verbose: False, log level: Level.ERROR



('dense_onnx/tokenizer_config.json',
 'dense_onnx/special_tokens_map.json',
 'dense_onnx/vocab.txt',
 'dense_onnx/added_tokens.json',
 'dense_onnx/tokenizer.json')

## Data Prep Prior to Using Sparsify One-Shot
Sparsify's One-Shot is a post-training sparsification method that utilizes sampled data (~1,000 samples is satisfactory) from a calibration dataset resulting in no further training time and much faster sparsification times compared with Training-Aware Experiments.

The samples need to be stored in the `.npz` format, which is a file format based on the NumPy library. In the BERT model architecture, Sparsify one-shot requires the `input_ids`, `attention_mask` and `token_type_ids` of each data sample from your dataset to be stored as a dictionary prior to `.npz` storage. For more information, refer to the Sparsify [guide](https://github.com/neuralmagic/sparsify/blob/main/docs/datasets-guide.md#npz).

 For our example, we'll use the popular semantic textual similarity benchmark (`stsb`) dataset for calibration. Now, let's extract 1,000 samples from the train split and convert them to `.npz`:

In [3]:
import os
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample

# Load the dataset
dataset = load_dataset("glue", "stsb", split="train")

# Adjusted to get the first 1000 examples
n_examples = 1000

# Create the "data" directory if it doesn't exist
if not os.path.exists('data'):
    os.makedirs('data')

# Load the Sentence Transformers model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

train_examples = []
for i in range(n_examples):
    example = dataset[i]
    train_examples.append(InputExample(texts=[example['sentence1'], example['sentence2']]))

# Extract texts from InputExample instances
texts = [example.texts for example in train_examples]

# Generate embeddings for the extracted texts
embeddings = model.encode(texts, convert_to_tensor=True)
embeddings_cpu = embeddings.cpu().numpy()

for i in range(n_examples):
    example = dataset[i]

    # Create a dictionary to store the data
    npz_data = {
        'input_ids': embeddings_cpu[i],  # Store embeddings for this example
        'attention_mask': np.ones_like(embeddings_cpu[i]),  # Attention mask for embeddings
        'token_type_ids': np.zeros_like(embeddings_cpu[i])  # Token type IDs for embeddings
    }

    # Save the dictionary as an npz file
    npz_file_path = f'data/input_{i:04d}.npz'
    np.savez(npz_file_path, **npz_data)

print(f'Saved {n_examples} npz files to data/')

Downloading builder script:   0%|          | 0.00/28.8k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/28.7k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/27.9k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/803k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5749 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1379 [00:00<?, ? examples/s]

Downloading (…)e9125/.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)7e55de9125/README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading (…)55de9125/config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading (…)125/data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)e9125/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading (…)9125/train_script.py:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading (…)7e55de9125/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)5de9125/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

Saved 1000 npz files to data/


## Login to Sparsify

Prior to gaining access to the sparsify API, sign up for a free account [here](https://account.neuralmagic.com/signup). Then on your landing page, copy your personal API key and paste into the following command:

In [4]:
!sparsify.login EJMfcp88Wpp7efA99WjAfSAFH6jhwfG9

INFO:sparsify.login:Logging into sparsify...
INFO:sparsify.utils.helpers:Successfully authenticated with Neural Magic Account API key
INFO:sparsify.login:Installing sparsifyml version 1.6 from neuralmagic pypi server
Looking in indexes: https://nm:****@pypi.neuralmagic.com
Collecting sparsifyml-nightly~=1.6
  Downloading https://pypi.neuralmagic.com/packages/sparsifyml_nightly-1.6.0.20230828-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (847 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m847.1/847.1 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sparsifyml-nightly
Successfully installed sparsifyml-nightly-1.6.0.20230828
INFO:sparsify.login:Logged in successfully, sparsify setup is complete.


## Run One-Shot with Sparsify

Running One-Shot requires a simple CLI command pointing to the model directory `./dense_onnx` (we previously created when exporting to ONNX) and the data directory `/data` which stores our 1,000 NPZ files. In addition, we use an optimization level of `0.5` (default). This argument controls how much sparsification is applied to the model (ranging from 0.1 to 0.9), with higher values resulting in faster and more compressed models with only a marginal drop in accuracy. For further inquiry, refer to the [one-shot guide](https://github.com/neuralmagic/sparsify/blob/main/docs/one-shot-experiment-guide.md).

In [5]:
!sparsify.run one-shot --use-case nlp-text-classification --model ./dense_onnx/model.onnx --data ./data --optim-level 0.5

INFO:sparsify.utils.helpers:Successfully authenticated with Neural Magic Account API key
INFO:sparsify.login:sparsifyml version 1.6 is already installed, skipping installation from neuralmagic pypi server
2023-08-28 13:52:00 deepsparse.utils.onnx INFO     Generating input 'X', type = float32, shape = [1, 3, 32, 32]
INFO:deepsparse.utils.onnx:Generating input 'X', type = float32, shape = [1, 3, 32, 32]
INFO:sparsifyml.one_shot.sparsification.obcq.fast_obcq_modifier:Folded 0 Conv-BatchNormalization blocks
INFO:sparsifyml.one_shot.sparsification.obcq.base_obcq_modifier:FastOBCQModifier: starting compression on layers: ['/encoder/layer.0/attention/self/query/MatMul', '/encoder/layer.0/attention/self/value/MatMul', '/encoder/layer.0/attention/output/dense/MatMul', '/encoder/layer.0/intermediate/dense/MatMul', '/encoder/layer.0/output/dense/MatMul', '/encoder/layer.1/attention/self/key/MatMul', '/encoder/layer.1/attention/self/query/MatMul', '/encoder/layer.1/attention/self/value/MatMul', '/

In [6]:
!mv deployment sparse_onnx
!cp dense_onnx/tokenizer.json sparse_onnx/
!cp dense_onnx/config.json sparse_onnx/

## Creating an Inference Pipeline for Sentence Embeddings

Let's now create a sentence embedding inference pipeline using PyTorch and the Hugging Face Pipeline for pooling in post-processing. We'll need this in order to evaluate the performance of the models using the `evaluate` library and the upcoming `weaviate` deployment using sentence embeddings.

In [26]:
from transformers import Pipeline
import torch.nn.functional as F
import torch


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

class SentenceEmbeddingPipeline(Pipeline):
    def _sanitize_parameters(self, **kwargs):
        preprocess_kwargs = {}
        return preprocess_kwargs, {}, {}

    def preprocess(self, inputs):
        encoded_inputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
        return encoded_inputs

    def _forward(self, model_inputs):
        outputs = self.model(**model_inputs)
        return {"outputs": outputs, "attention_mask": model_inputs["attention_mask"]}

    def postprocess(self, model_outputs):
        # Perform pooling
        sentence_embeddings = mean_pooling(model_outputs["outputs"], model_outputs['attention_mask'])
        # Normalize embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        return sentence_embeddings

Initialize model pipelines, and double check if they work for both the dense and sparse models.

In [34]:
dense = "dense_onnx"
sparse = "sparse_onnx"

dense_model = DeepSparseModelForFeatureExtraction.from_pretrained(dense, export=False)
tokenizer = get_preprocessor(dense)

sparse_model = DeepSparseModelForFeatureExtraction.from_pretrained(sparse, export=False)
tokenizer = get_preprocessor(sparse)

dense_pipe = SentenceEmbeddingPipeline(model=dense_model, tokenizer=tokenizer)
sparse_pipe = SentenceEmbeddingPipeline(model=sparse_model, tokenizer=tokenizer)

sample_text = "I love sparse sentence embedding models"

dense_infer = dense_pipe(sample_text)
sparse_infer = sparse_pipe(sample_text)

# print an excerpt from the sentence embedding
print(dense_infer[0][:5])
print(sparse_infer[0][:5])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Model is dynamic and has no shapes defined, skipping reshape..
Model is dynamic and has no shapes defined, skipping reshape..


tensor([ 0.0041, -0.1259,  0.0277,  0.0237,  0.0413])
tensor([-0.0051, -0.0665,  0.0314,  0.0038, -0.0135])


## Evaluate the Dense vs. Sparse Model for Accuracy on STSB

In [9]:
from datasets import load_dataset
from evaluate import load
import torch

eval_dataset = load_dataset("glue","stsb",split="validation")
metric = load('glue', 'stsb')

def compute_sentence_similarity(sentence_1, sentence_2,pipeline):
    embedding_1 = pipeline(sentence_1)
    embedding_2 = pipeline(sentence_2)

    return torch.nn.functional.cosine_similarity(embedding_1, embedding_2, dim=1)

def evaluate_stsb(example):
  default = compute_sentence_similarity(example["sentence1"], example["sentence2"], dense_pipe)
  sparse = compute_sentence_similarity(example["sentence1"], example["sentence2"], sparse_pipe)
  return {
      'reference': (example["label"] - 1) / (5 - 1),
      'default': float(default),
      'sparse': float(sparse),
      }

# run evaluation
result = eval_dataset.map(evaluate_stsb)

# compute metrics
default_acc = metric.compute(predictions=result["default"], references=result["reference"])
sparse = metric.compute(predictions=result["sparse"], references=result["reference"])

print(f"dense model: pearson={default_acc['pearson']}%")
print(f"sparse model: pearson={sparse['pearson']}%")
print(f"The sparse model achieves {round(sparse['pearson']/default_acc['pearson'],2)*100:.2f}% accuracy of the dense model")

Downloading builder script:   0%|          | 0.00/5.75k [00:00<?, ?B/s]



Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

dense model: pearson=0.8696194668251013%
sparse model: pearson=0.8483653311696123%
The sparse model achieves 98.00% accuracy of the dense model


## Benchmark the Dense vs. Sparse Model for Latency

In [12]:
from time import perf_counter
import numpy as np

payload = "Greetings, I'm Jane the robot, residing in the vibrant city of Seattle, USA. " \
"My journey involves crafting innovative solutions as a Software Architect, " \
"driving technological progress through collaborative endeavors and cutting-edge research. " \
"My experience spans across diverse domains, from optimizing supply chain logistics " \
"to enhancing medical diagnostics. Passionate about exploring AI ethics and " \
"the human-machine partnership, I'm constantly evolving to pioneer the future of technology."


print(f'Payload sequence length: {len(tokenizer(payload)["input_ids"])}')

def measure_latency(pipe):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(payload)
    # Timed run
    for _ in range(100):
        start_time = perf_counter()
        _ =  pipe(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    time_p95_ms = 1000 * np.percentile(latencies,95)
    return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f};", time_p95_ms

vanilla_model=measure_latency(dense_pipe)
quantized_model=measure_latency(sparse_pipe)

print(f"dense model: {vanilla_model[0]}")
print(f"quantized model: {quantized_model[0]}")
print(f"Improvement through one-shot: {round(vanilla_model[1]/quantized_model[1],2)}x")


Payload sequence length: 90
dense model: P95 latency (ms) - 55.83691190004174; Average latency (ms) - 49.65 +\- 2.85;
quantized model: P95 latency (ms) - 46.095011049874294; Average latency (ms) - 31.63 +\- 5.87;
Improvement through one-shot: 1.21x


## Connect to Weaviate Client

Replace `url` and `api_key` with your Weaviate credentials.

In [95]:
import torch
from transformers import AutoModel, AutoTokenizer
import weaviate
import time

# initialize weaviate client for importing and searching
client = weaviate.Client(
    url = "https://sparse-minilm-j8lqfvbq.weaviate.network",  # Replace with your endpoint
    auth_client_secret=weaviate.AuthApiKey(api_key="5ulTIjYyHCK3LvT6dJzlV8qaeEozWsHQVwX5"),  # Replace w/ your Weaviate instance API key
)

# Preprocess Dataset

Let's preprocess the 20 Newsgroups dataset for cleaning by removing headers, filtering out short posts, and replacing characters.

In [96]:
import os
import random

def get_post_filenames(limit_objects=100):
    file_names = []
    i=0
    for root, dirs, files in os.walk("./data/20news-bydate-test"):
        for filename in files:
            path = os.path.join(root, filename)
            file_names += [path]

    random.shuffle(file_names)
    limit_objects = min(len(file_names), limit_objects)

    file_names = file_names[:limit_objects]

    return file_names

def read_posts(filenames=[]):
    posts = []
    for filename in filenames:
        f = open(filename, encoding="utf-8", errors='ignore')
        post = f.read()

        # strip the headers (the first occurrence of two newlines)
        post = post[post.find('\n\n'):]

        # remove posts with less than 10 words to remove some of the noise
        if len(post.split(' ')) < 10:
               continue

        post = post.replace('\n', ' ').replace('\t', ' ')
        if len(post) > 1000:
            post = post[:1000]
        posts += [post]

    return posts

## Vectorize Posts from Dataset Using MiniLM

In [97]:
def vectorize_posts(posts=[]):
    post_vectors = []
    before = time.perf_counter()
    for i, post in enumerate(posts):
        vec = sparse_pipe(post)
        post_vectors.append(vec)
        if i % 25 == 0 and i != 0:
            print("So far {} objects vectorized in {:.3f}s".format(i, time.perf_counter() - before))
    after = time.perf_counter()

    print("Vectorized {} items in {:.3f}s".format(len(posts), after - before))

    return post_vectors

## Create Weaviate Schema

In [98]:
def init_weaviate_schema(client):
    # a simple schema containing just a single class for our posts
    schema = {
        "classes": [{
                "class": "Post",
                "vectorizer": "none", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves through our MiniLM model
                "properties": [{
                    "name": "content",
                    "dataType": ["text"],
                }]
        }]
    }

    # cleanup from previous runs
    client.schema.delete_all()
    client.schema.create(schema)

In [99]:
def import_posts_with_vectors(posts, vectors, client):
    if len(posts) != len(vectors):
        raise Exception("len of posts ({}) and vectors ({}) does not match".format(len(posts), len(vectors)))

    for i, post in enumerate(posts):
        try:
            client.data_object.create(
                data_object={"content": post},
                class_name='Post',
                vector=vectors[i]
            )
        except:
            print(res)

In [100]:
def search(query="", limit=3):
    vec_took_start = time.perf_counter()
    vec = sparse_pipe(query)
    vec_took = time.perf_counter() - vec_took_start

    search_took_start = time.perf_counter()
    near_vec = {"vector": vec}
    res = client \
        .query.get("Post", ["content", "_additional {certainty}"]) \
        .with_near_vector(near_vec) \
        .with_limit(limit) \
        .do()
    search_took = time.perf_counter() - search_took_start

    total_time = vec_took + search_took

    print("\nQuery \"{}\" with {} results took {:.3f}s ({:.3f}s to vectorize and {:.3f}s to search)" \
          .format(query, limit, total_time, vec_took, search_took))

    for post in res["data"]["Get"]["Post"]:
        print("{:.4f}: {}".format(post["_additional"]["certainty"], post["content"]))
        print('---')


In [102]:
init_weaviate_schema(client)
posts = read_posts(get_post_filenames(100))
vectors = vectorize_posts(posts)
import_posts_with_vectors(posts, vectors, client)

Vectorized 0 items in 0.000s


In [103]:
search("the best camera lens", 1)
search("motorcycle trip", 1)
search("which software do i need to view jpeg files", 1)
search("windows vs mac", 1)


Query "the best camera lens" with 1 results took 0.153s (0.009s to vectorize and 0.144s to search)

Query "motorcycle trip" with 1 results took 0.144s (0.005s to vectorize and 0.139s to search)

Query "which software do i need to view jpeg files" with 1 results took 0.146s (0.007s to vectorize and 0.139s to search)

Query "windows vs mac" with 1 results took 0.146s (0.005s to vectorize and 0.140s to search)
