In [None]:
!pip install datasets torch transformers sentence-transformers hnswlib

# SQuAD Benchmark HNSW

In this notebook we will work through an **A**pproximate **N**earest **N**eighbors **S**earch (ANNS) benchmark using modern embedding models and datasets. Here we use the **S**tanford **Qu**estion **A**nswering **D**ataset (SQuAD) and an MPNet sentence transformer model trained for question answering.

## Building Embeddings

We start by loading the dataset and creating both the query and context embeddings that we will search with. The dataset is hosted on Hugging Face *Datasets*, and we load it like so:

In [2]:
from datasets import load_dataset

squad = load_dataset('squad', split='train')
squad


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

To create the encodings, we initialize an embedding model.

In [3]:
from sentence_transformers import SentenceTransformer
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = SentenceTransformer('multi-qa-mpnet-base-dot-v1', device=device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

We will be encoding all unique contexts from SQuAD.

In [4]:
contexts = list(set(squad['context']))
len(contexts)

18891

Encoding produces embeddings of dimensionality `768`. The embedding dimensionality is specific to each embedding model, and we can check that this is correct via the `model.get_sentence_embedding_dimension` method.

In [5]:
dims = model.get_sentence_embedding_dimension()
dims

768

Now we encode all of the contexts. We do this in batches to avoid overloading the limited RAM of our machines.

In [6]:
import numpy as np
from tqdm.auto import tqdm

batch_size = 64
# initialize zero array where we later add all context embeddings
encodings = np.zeros((len(contexts), dims))

for i in tqdm(range(0, len(contexts), batch_size)):
    # find the end index of the current batch
    i_end = min(i + batch_size, len(contexts))
    # create encodings
    embeddings = model.encode(contexts[i:i_end])
    # add to encodings array
    encodings[i:i_end] = embeddings

# normalize each embedding to unit length (L2 norm)
encodings = encodings / np.linalg.norm(encodings, axis=1, keepdims=True)
# save to file
with open('squad.npy', 'wb') as f:
    np.save(f, encodings)

encodings.shape

100%|██████████| 296/296 [01:15<00:00,  3.94it/s]


(18891, 768)
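The normalization step above divides each context embedding by its L2 norm. A minimal sketch of why this matters, using small random arrays as hypothetical stand-ins for the real embeddings: once the context vectors have unit length, ranking contexts by inner product gives the same top result as ranking by cosine similarity, even when the query vectors are left unnormalized.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy stand-ins for query and context embeddings (hypothetical data)
toy_queries = rng.normal(size=(4, 8))
toy_contexts = rng.normal(size=(6, 8))

def l2_normalize(x):
    # divide each row by its L2 norm, as in the encoding cell above
    return x / np.linalg.norm(x, axis=1, keepdims=True)

contexts_n = l2_normalize(toy_contexts)

# inner product between raw queries and unit-length contexts
ip_scores = toy_queries @ contexts_n.T

# cosine similarity computed directly from the raw vectors
cos_scores = (toy_queries @ toy_contexts.T) / (
    np.linalg.norm(toy_queries, axis=1)[:, None]
    * np.linalg.norm(toy_contexts, axis=1)[None, :]
)

# the best-matching context per query is the same under both scores
top_ip = np.argmax(ip_scores, axis=1)
top_cos = np.argmax(cos_scores, axis=1)
```

Scaling a query by a positive constant scales all of its scores equally, which is why the query norm does not change which context ranks first.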

We must do the same with our questions.

In [7]:
questions = list(set(squad['question']))

In [8]:
# again, we initialize the zero array where we will add query embeddings
xq_arr = np.zeros((len(questions), dims))

for i in tqdm(range(0, len(questions), batch_size)):
    # find the end index of the current batch
    i_end = min(i + batch_size, len(questions))
    # create encodings
    embeddings = model.encode(questions[i:i_end])
    # add to encodings array
    xq_arr[i:i_end] = embeddings

# save to file
with open('squad_xq.npy', 'wb') as f:
    np.save(f, xq_arr)

xq_arr.shape

100%|██████████| 1365/1365 [00:39<00:00, 34.46it/s]


(87355, 768)

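Now we can begin testing. First, we restrict the BLAS libraries to a single thread so that operations such as NumPy's `matmul` run single-threaded. These environment variables are generally read when the libraries are loaded, so they should be set before NumPy is first imported (for example, at the top of the notebook after restarting the kernel).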

In [None]:
import os

# limit the BLAS libraries (used by NumPy matmul) to a single thread
os.environ["OMP_NUM_THREADS"] = "1"  # shell equivalent: export OMP_NUM_THREADS=1
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # export OPENBLAS_NUM_THREADS=1
os.environ["MKL_NUM_THREADS"] = "1"  # export MKL_NUM_THREADS=1
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"  # export VECLIB_MAXIMUM_THREADS=1
os.environ["NUMEXPR_NUM_THREADS"] = "1"  # export NUMEXPR_NUM_THREADS=1

We do this so that the comparison with `hnswlib`, which we later also run on a single thread, is fair. Now we perform the full (exact) kNN search.

In [9]:
dist = np.matmul(xq_arr, encodings.T)

From these scores we take the exact nearest neighbor of each query, which serves as the kNN baseline for recall@1:

In [10]:
baseline = np.argmax(dist, axis=1).reshape(-1)
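The full `dist` matrix above holds 87355 × 18891 `float64` scores, which is roughly 13 GB. On a machine where that does not fit in memory, the same baseline can be computed in chunks of queries. The following is a minimal sketch under that assumption; the function name `exact_top1`, the `chunk_size` value, and the toy arrays are illustrative and not part of the original notebook.

```python
import numpy as np

def exact_top1(queries, corpus, chunk_size=1024):
    """Return, for each query, the index of the corpus row with the highest inner product."""
    top1 = np.empty(len(queries), dtype=np.int64)
    for start in range(0, len(queries), chunk_size):
        end = min(start + chunk_size, len(queries))
        # only a (chunk_size x len(corpus)) block of scores is held at once
        scores = queries[start:end] @ corpus.T
        top1[start:end] = np.argmax(scores, axis=1)
    return top1

# toy data standing in for xq_arr and encodings (hypothetical)
rng = np.random.default_rng(0)
toy_queries = rng.normal(size=(250, 16))
toy_corpus = rng.normal(size=(100, 16))

chunked = exact_top1(toy_queries, toy_corpus, chunk_size=64)
full = np.argmax(toy_queries @ toy_corpus.T, axis=1)
```

With the real arrays, `exact_top1(xq_arr, encodings)` would be a drop-in replacement for the two cells above.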

Recall is calculated against this baseline, so we now move on to performing the same search with HNSW.

We first initialize an HNSW index.

In [14]:
import hnswlib

index = hnswlib.Index(space='ip', dim=dims)

We then build the index from our context embeddings. The maximum number of elements must be specified up front in `init_index`.

In [15]:
index.init_index(
    max_elements=encodings.shape[0],
    ef_construction=1000,
    M=24
)
index.add_items(encodings)

Different parameters produce different performance for the HNSW index, so we test a range of values to find which works best. Here we vary the search-time `ef` parameter.

In [18]:
import time
import pandas as pd

# collect one row of results per ef value
results = []

index.set_num_threads(1)
ef_vals = [10,20,50,100,110,120,130,135,140,145,150,200,300,400,500,600,700,800,900,1000]

# we will test HNSW with many different ef search values
for ef in tqdm(ef_vals):
    index.set_ef(ef)  # ef should always be > k
    # query the index; k is the number of nearest neighbors (returns 2 numpy arrays)
    t0 = time.time()
    labels, distances = index.knn_query(xq_arr, k=1)
    # calculate queries per second (QPS)
    qps = len(xq_arr) / (time.time() - t0)
    # calculate recall@1 against the exact baseline
    recall = np.sum(
        labels.reshape(-1) == baseline.reshape(-1)
    ) / len(xq_arr)
    # DataFrame.append was removed in pandas 2.0, so we collect dicts instead
    results.append({'ef': ef, 'qps': qps, 'recall@1': recall})

hnsw_perf = pd.DataFrame(results, columns=['ef', 'qps', 'recall@1'])

100%|██████████| 20/20 [39:47<00:00, 119.36s/it]
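The recall@1 computed in the loop above is the fraction of queries for which the approximate search returns the same context as the exact baseline. A minimal sketch of that calculation as a standalone helper; the name `recall_at_1` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def recall_at_1(predicted, exact):
    """Fraction of queries whose predicted top-1 label equals the exact top-1 label."""
    # knn_query with k=1 returns shape (n, 1), so flatten both inputs
    predicted = np.asarray(predicted).reshape(-1)
    exact = np.asarray(exact).reshape(-1)
    return float(np.mean(predicted == exact))

# toy labels: four of the five predictions match the exact result
toy_exact = np.array([3, 1, 4, 1, 5])
toy_pred = np.array([[3], [1], [4], [0], [5]])
toy_recall = recall_at_1(toy_pred, toy_exact)
```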


In [19]:
hnsw_perf

|   | ef | qps | recall@1 |
|---|---|---|---|
| 0 | 10.0 | 9970.085016 | 0.947318 |
| 1 | 20.0 | 6183.901004 | 0.979028 |
| 2 | 50.0 | 3039.892434 | 0.995158 |
| 3 | 100.0 | 1722.959084 | 0.99889 |
| 4 | 110.0 | 1589.131769 | 0.999061 |
| 5 | 120.0 | 1479.290717 | 0.999222 |
| 6 | 130.0 | 1382.523561 | 0.999325 |
| 7 | 135.0 | 1343.248343 | 0.999416 |
| 8 | 140.0 | 1305.419935 | 0.999508 |
| 9 | 145.0 | 1274.720829 | 0.999531 |
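The results above show the usual trade-off: raising `ef` increases recall@1 and lowers QPS. One way to read such a table is to fix a recall target and pick the fastest setting that meets it. The sketch below does this over the rows shown above; the helper `fastest_at_recall` and the 0.999 target are illustrative assumptions, not part of the original benchmark.

```python
# (ef, qps, recall@1) rows copied from the results shown above
rows = [
    (10, 9970.085016, 0.947318),
    (20, 6183.901004, 0.979028),
    (50, 3039.892434, 0.995158),
    (100, 1722.959084, 0.99889),
    (110, 1589.131769, 0.999061),
    (120, 1479.290717, 0.999222),
    (130, 1382.523561, 0.999325),
    (135, 1343.248343, 0.999416),
    (140, 1305.419935, 0.999508),
    (145, 1274.720829, 0.999531),
]

def fastest_at_recall(rows, min_recall):
    """Return the highest-QPS row whose recall@1 is at least min_recall, or None."""
    candidates = [r for r in rows if r[2] >= min_recall]
    return max(candidates, key=lambda r: r[1]) if candidates else None

# hypothetical target: at least 99.9% recall@1
best = fastest_at_recall(rows, 0.999)
```

For these rows and this target, `best` is the `ef=110` entry, since `ef=100` falls just short of 0.999.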


---