
# Similar Questions Retrieval

This notebook is inspired by the [similar questions retrieval example from Sentence-Transformers](https://www.sbert.net/examples/applications/semantic-search/README.html#similar-questions-retrieval), adapted to use the approximate nearest neighbor (ANN) algorithms from [RAFT](https://github.com/rapidsai/raft).

The model was pre-trained on the [Natural Questions dataset](https://ai.google.com/research/NaturalQuestions), which consists of about 100k real Google search queries, each paired with an annotated Wikipedia passage that provides the answer. This is an example of an asymmetric search task: short questions are matched against longer passages. As the corpus, we use the smaller [Simple English Wikipedia](http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz) so that it fits easily into memory.

The steps to install the latest stable `pylibraft` package are available in the [documentation](https://docs.rapids.ai/api/raft/stable/build).

## Setup

In [1]:
!nvidia-smi

Tue Apr  2 23:24:59 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   29C    P0    58W / 300W |  12900MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import time
import gzip
import os
import torch
import pylibraft
from pylibraft.neighbors import ivf_flat, ivf_pq, brute_force
#from pylibraft.neighbors.brute_force import knn

pylibraft.config.set_output_as(lambda device_ndarray: device_ndarray.copy_to_host())

if not torch.cuda.is_available():
  print("Warning: No GPU found. Please add a GPU to your notebook.")



### Download data

In [3]:
# As the dataset, we use Simple English Wikipedia. Compared to the full English Wikipedia, it has only about 170k articles.
wikipedia_filepath = 'data/simplewiki-2020-11-01.jsonl.gz'
if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

# We split these articles into paragraphs and encode them with the bi-encoder
passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # We encode the passages as [title, text]
            passages.append([data['title'], paragraph])

# Print some records
print("Passages:", len(passages))
passages[:10]


Passages: 509663


[['Ted Cassidy',
  'Ted Cassidy (July 31, 1932 - January 16, 1979) was an American actor. He was best known for his roles as Lurch and Thing on "The Addams Family".'],
 ['Aileen Wuornos',
  'Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956\xa0– October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six men in Florida and was executed in Florida State Prison by lethal injection for the murders. Wuornos said that the men she killed had raped her or tried to rape her while she was working as a prostitute.'],
 ['Aileen Wuornos',
  'Wuornos was diagnosed with antisocial personality disorder and borderline personality disorder.'],
 ['Aileen Wuornos',
  'The movie, "Monster" is about her life. Two documentaries were made about her.'],
 ['Aileen Wuornos',
  'Wuornos was born Aileen Carol Pittman in Rochester, Michigan. She never met her father. Wuornos was adopted by her grandparents. When she was 13 she became pre

### Create embeddings

In [4]:
bi_encoder = SentenceTransformer('nq-distilbert-base-v1')

# Compute the corpus_embeddings from scratch (which can take a while depending on the GPU)
#corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)

# To speed things up, pre-computed embeddings are downloaded.
# The provided file encoded the passages with the model 'nq-distilbert-base-v1'
embeddings_filepath = 'simplewiki-2020-11-01-nq-distilbert-base-v1.pt'
if not os.path.exists(embeddings_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01-nq-distilbert-base-v1.pt', embeddings_filepath)

# Load embeddings
corpus_embeddings = torch.load(embeddings_filepath)
corpus_embeddings = corpus_embeddings.float()  # Convert embedding file to float
corpus_embeddings = corpus_embeddings.to('cuda')

# Print some embeddings
print("Embeddings:", corpus_embeddings.shape)
print(corpus_embeddings)

Embeddings: torch.Size([509663, 768])
tensor([[-0.7202,  0.7744, -0.8594,  ...,  1.1328,  0.4304, -0.3044],
        [ 0.4700, -0.6401,  0.1376,  ...,  0.0387,  0.1492, -0.1414],
        [ 0.2217, -0.2505,  0.3696,  ..., -0.1370,  0.0877, -0.2290],
        ...,
        [-0.4033, -0.1987, -0.0778,  ..., -0.4033, -0.1087,  0.3325],
        [ 0.2500,  0.6196,  0.0111,  ..., -0.2964, -0.5605,  0.5493],
        [ 0.8735,  0.9517, -0.0433,  ...,  0.3572, -0.5850,  0.1927]],
       device='cuda:0')


## Vector Search

In [5]:
import cupy as cp
from pylibraft.common import DeviceResources
from pylibraft.neighbors.brute_force import knn
n_samples = 50000
n_features = 50
n_queries = 1000
dataset = cp.random.random_sample((n_samples, n_features),
                                  dtype=cp.float32)
# Random query vectors; brute-force search scans the dataset directly, so no index is built
queries = cp.random.random_sample((n_queries, n_features),
                                  dtype=cp.float32)
k = 40
distances, neighbors = knn(dataset, queries, k)
distances = cp.asarray(distances)
neighbors = cp.asarray(neighbors)
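
To make the returned `distances` and `neighbors` concrete, here is a minimal pure-Python sketch of what an exact k-nearest-neighbor search computes. It assumes squared Euclidean distance; the function name `brute_force_knn` and the toy data are hypothetical and only mirror the shape of the result, not the GPU implementation.

```python
import heapq


def brute_force_knn(dataset, queries, k):
    """Exact k-nearest-neighbor search by scanning every dataset row.

    Returns (distances, neighbors): for each query, the k smallest squared
    L2 distances in ascending order and the matching dataset row indices.
    """
    all_distances, all_neighbors = [], []
    for q in queries:
        # Squared Euclidean distance from this query to every row.
        scored = [
            (sum((a - b) ** 2 for a, b in zip(row, q)), idx)
            for idx, row in enumerate(dataset)
        ]
        top = heapq.nsmallest(k, scored)  # sorted ascending by distance
        all_distances.append([d for d, _ in top])
        all_neighbors.append([i for _, i in top])
    return all_distances, all_neighbors


# Hypothetical toy data: four 2-D points and one query.
toy_dataset = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_queries = [[0.9, 0.1]]
toy_distances, toy_neighbors = brute_force_knn(toy_dataset, toy_queries, k=2)
print(toy_neighbors[0])
```

Each query costs one pass over the whole dataset, which is why the later sections trade some accuracy for index structures that scan fewer vectors.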

### Brute Force

In [6]:
%%time

def search_raft_knn(query, top_k = 5):
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    
    start_time = time.time()
    hits = brute_force.knn(corpus_embeddings, question_embedding[None], top_k)
    end_time = time.time()

    # Output of top-k hits
    print("Input question:", query)
    print("Results (after {:.3f} seconds):".format(end_time - start_time))
    for k in range(top_k):
        print("\t{:.3f}\t{}".format(hits[0][0, k], passages[hits[1][0, k]]))

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.29 µs


In [7]:
%%time
search_raft_knn(query="Who was Grace Hopper?")

Input question: Who was Grace Hopper?
Results (after 0.009 seconds):
	181.649	['Grace Hopper', 'Hopper was born in New York, USA. Hopper graduated from Vassar College in 1928 and Yale University in 1934 with a Ph.D degree in mathematics. She joined the US Navy during the World War II in 1943. She worked on computers in the Navy for 43 years. She then worked in other private industry companies after 1949. She retired from the Navy in 1986 and died on January 1, 1992.']
	192.946	['Leona Helmsley', 'Leona Helmsley (July 4, 1920 – August 20, 2007) was an American businesswoman. She was known for having a flamboyant personality. She had a reputation for tyrannical behavior; she was nicknamed the Queen of Mean.']
	194.951	['Grace Hopper', 'Grace Murray Hopper (December 9 1906 – January 1 1992) was an American computer scientist and United States Navy officer.']
	202.192	['Nellie Bly', 'Elizabeth Cochrane Seaman (born Elizabeth Jane Cochran; May 5, 1864 – January 27, 1922), better known by he

### IVF-Flat

In [8]:
%%time
params = ivf_flat.IndexParams(n_lists=150)
flat_index = ivf_flat.build(params, corpus_embeddings)
search_params = ivf_flat.SearchParams()

def search_raft_flat(query, top_k = 5):
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    
    start_time = time.time()
    hits = ivf_flat.search(search_params, flat_index, question_embedding[None], top_k)
    end_time = time.time()

    # Output of top-k hits
    print("Input question:", query)
    print("Results (after {:.3f} seconds):".format(end_time - start_time))
    for k in range(top_k):
        print("\t{:.3f}\t{}".format(hits[0][0, k], passages[hits[1][0, k]]))

CPU times: user 762 ms, sys: 36.7 ms, total: 798 ms
Wall time: 794 ms
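
As a rough illustration of the idea behind an inverted-file (IVF) index, the sketch below assigns each vector to its nearest centroid and, at query time, scans only the lists of the closest centroids. It is a toy under stated assumptions: the centroids are written by hand rather than learned by clustering, and `build_ivf`, `search_ivf`, and the data are hypothetical names, not the RAFT implementation.

```python
def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(dataset, centroids):
    """Assign every vector to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for idx, vec in enumerate(dataset):
        nearest = min(range(len(centroids)), key=lambda c: sq_l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists


def search_ivf(dataset, centroids, lists, query, k, n_probes):
    """Scan only the n_probes lists whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda c: sq_l2(query, centroids[c]))
    candidates = [idx for c in probe_order[:n_probes] for idx in lists[c]]
    scored = sorted((sq_l2(dataset[idx], query), idx) for idx in candidates)
    return scored[:k]


# Hypothetical toy data: two hand-picked centroids and five 2-D points.
toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_dataset = [[0.0, 1.0], [1.0, 0.0], [9.0, 10.0], [10.0, 9.0], [5.2, 5.2]]
toy_lists = build_ivf(toy_dataset, toy_centroids)

toy_query = [4.8, 4.8]
one_probe = search_ivf(toy_dataset, toy_centroids, toy_lists, toy_query, k=1, n_probes=1)
two_probes = search_ivf(toy_dataset, toy_centroids, toy_lists, toy_query, k=1, n_probes=2)
print(one_probe, two_probes)
```

In this toy, the true nearest point sits just across the cluster boundary, so probing a single list misses it while probing both lists finds it. Probing more lists improves recall at the cost of scanning more vectors.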


In [9]:
%%time
search_raft_flat(query="Who was Grace Hopper?")

Input question: Who was Grace Hopper?
Results (after 0.002 seconds):
	181.649	['Grace Hopper', 'Hopper was born in New York, USA. Hopper graduated from Vassar College in 1928 and Yale University in 1934 with a Ph.D degree in mathematics. She joined the US Navy during the World War II in 1943. She worked on computers in the Navy for 43 years. She then worked in other private industry companies after 1949. She retired from the Navy in 1986 and died on January 1, 1992.']
	192.946	['Leona Helmsley', 'Leona Helmsley (July 4, 1920 – August 20, 2007) was an American businesswoman. She was known for having a flamboyant personality. She had a reputation for tyrannical behavior; she was nicknamed the Queen of Mean.']
	194.951	['Grace Hopper', 'Grace Murray Hopper (December 9 1906 – January 1 1992) was an American computer scientist and United States Navy officer.']
	202.192	['Nellie Bly', 'Elizabeth Cochrane Seaman (born Elizabeth Jane Cochran; May 5, 1864 – January 27, 1922), better known by he

In [10]:
%%time
search_raft_flat(query="Who was Alan Turing?")

Input question: Who was Alan Turing?
Results (after 0.002 seconds):
	106.131	['Alan Turing', 'Alan Mathison Turing OBE FRS (London, 23 June 1912 – Wilmslow, Cheshire, 7 June 1954) was an English mathematician and computer scientist. He was born in Maida Vale, London.']
	158.646	['William Kahan', 'William Morton Kahan (born June 5, 1933) is a Canadian mathematician and computer scientist. He received the Turing Award in 1989 for ""his fundamental contributions to numerical analysis"." He was named an ACM Fellow in 1994, and added to the National Academy of Engineering in 2005.']
	165.094	['Alan Turing', 'A brilliant mathematician and cryptographer Alan was to become the founder of modern-day computer science and artificial intelligence; designing a machine at Bletchley Park to break secret Enigma encrypted messages used by the Nazi German war machine to protect sensitive commercial, diplomatic and military communications during World War 2. Thus, Turing made the single biggest contribut

In [11]:
%%time
search_raft_flat(query = "What is creating tides?")

Input question: What is creating tides?
Results (after 0.002 seconds):
	94.909	['Tide', "A tide is the periodic rising and falling of Earth's ocean surface caused mainly by the gravitational pull of the Moon acting on the oceans. Tides cause changes in the depth of marine and estuarine (river mouth) waters. Tides also make oscillating currents known as tidal streams (~'rip tides'). This means that being able to predict the tide is important for coastal navigation. The strip of seashore that is under water at high tide and exposed at low tide, called the intertidal zone, is an important ecological product of ocean tides."]
	159.539	['Tidal energy', "Many things affect tides. The pull of the Moon is the largest effect, and most of the energy comes from the slowing of the Earth's spin."]
	159.740	['Storm surge', 'A storm surge is a sudden rise of water hitting areas close to the coast. Storm surges are usually created by a hurricane or other tropical cyclone. The surge happens because a s

### IVF-PQ

In [12]:
%%time
params = ivf_pq.IndexParams(n_lists=150, pq_dim=96)
pq_index = ivf_pq.build(params, corpus_embeddings)
search_params = ivf_pq.SearchParams()

def search_raft_pq(query, top_k = 5):
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)

    hits = ivf_pq.search(search_params, pq_index, question_embedding[None], top_k)

    # Output of top-k hits
    print("Input question:", query)
    for k in range(top_k):
        print("\t{:.3f}\t{}".format(hits[0][0, k], passages[hits[1][0, k]]))

CPU times: user 3.33 s, sys: 1.19 s, total: 4.52 s
Wall time: 4.49 s
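
To show where the compression in product quantization (PQ) comes from, here is a small sketch that splits a vector into sub-vectors and stores only the id of the nearest codebook entry for each one. The codebooks are hand-written and the helpers `pq_encode` and `pq_decode` are hypothetical; a real implementation learns the codebooks from data and may add steps (such as encoding residuals relative to cluster centers) that this toy leaves out.

```python
def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Replace each sub-vector with the id of its nearest codebook entry."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for m, codebook in enumerate(codebooks):
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        codes.append(min(range(len(codebook)), key=lambda j: sq_l2(sub, codebook[j])))
    return codes


def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen entries."""
    return [x for m, j in enumerate(codes) for x in codebooks[m][j]]


# Hypothetical codebooks: 2 sub-spaces with 2 entries each (1 bit per code).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
toy_vector = [0.9, 0.8, 0.1, 0.7]
toy_codes = pq_encode(toy_vector, toy_codebooks)
toy_approx = pq_decode(toy_codes, toy_codebooks)
print(toy_codes, toy_approx)
```

Here four floats are stored as two small integers, and the decoded vector is only an approximation of the original. By the same logic, `pq_dim=96` on 768-dimensional embeddings corresponds to 96 codes per vector, one for each group of 8 dimensions.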


With IVF-PQ, the goal is to reduce the memory footprint of the index while keeping good accuracy.

In [13]:
# pq_bits is the size of one code in bits, so divide by 8 to get bytes.
# This counts only the PQ codes, not cluster centers, codebooks, or vector ids.
pq_index_mem = pq_index.pq_dim * pq_index.size * pq_index.pq_bits / 8
print("IVF-PQ memory footprint: {:.1f} MB".format(pq_index_mem / 2**20))

# float32 embeddings use 4 bytes per value
original_mem = corpus_embeddings.shape[0] * corpus_embeddings.shape[1] * 4
print("Original dataset: {:.1f} MB".format(original_mem / 2**20))

print("Memory saved: {:.1f}%".format(100 * (1 - pq_index_mem / original_mem)))

IVF-PQ memory footprint: 46.7 MB
Original dataset: 1493.2 MB
Memory saved: 96.9%


In [14]:
%%time
search_raft_pq(query="Who was Grace Hopper?")

Input question: Who was Grace Hopper?
	188.568	['Leona Helmsley', 'Leona Helmsley (July 4, 1920 – August 20, 2007) was an American businesswoman. She was known for having a flamboyant personality. She had a reputation for tyrannical behavior; she was nicknamed the Queen of Mean.']
	194.847	['Grace Hopper', 'Hopper was born in New York, USA. Hopper graduated from Vassar College in 1928 and Yale University in 1934 with a Ph.D degree in mathematics. She joined the US Navy during the World War II in 1943. She worked on computers in the Navy for 43 years. She then worked in other private industry companies after 1949. She retired from the Navy in 1986 and died on January 1, 1992.']
	204.002	['Grace Hopper', 'Grace Murray Hopper (December 9 1906 – January 1 1992) was an American computer scientist and United States Navy officer.']
	206.905	['Abbie Hoffman', 'Abbot Howard "Abbie" Hoffman (November 30, 1936 – April 12, 1989) was an American social and political activist.']
	207.403	['Ida B. We

In [15]:
%%time
search_raft_pq(query="Who was Alan Turing?")

Input question: Who was Alan Turing?
	135.342	['Alan Turing', 'Alan Mathison Turing OBE FRS (London, 23 June 1912 – Wilmslow, Cheshire, 7 June 1954) was an English mathematician and computer scientist. He was born in Maida Vale, London.']
	165.409	['William Kahan', 'William Morton Kahan (born June 5, 1933) is a Canadian mathematician and computer scientist. He received the Turing Award in 1989 for ""his fundamental contributions to numerical analysis"." He was named an ACM Fellow in 1994, and added to the National Academy of Engineering in 2005.']
	167.513	['Rolf Noskwith', 'Rolf Noskwith (19 June 1919 – 3 January 2017) was a British businessman. During the Second World War, he worked under Alan Turing as a cryptographer at the British military base Bletchley Park in Milton Keynes, Buckinghamshire.']
	171.538	['Alan Turing', 'A brilliant mathematician and cryptographer Alan was to become the founder of modern-day computer science and artificial intelligence; designing a machine at Blet

In [16]:
%%time
search_raft_pq(query = "What is creating tides?")

Input question: What is creating tides?
	127.162	['Tide', "A tide is the periodic rising and falling of Earth's ocean surface caused mainly by the gravitational pull of the Moon acting on the oceans. Tides cause changes in the depth of marine and estuarine (river mouth) waters. Tides also make oscillating currents known as tidal streams (~'rip tides'). This means that being able to predict the tide is important for coastal navigation. The strip of seashore that is under water at high tide and exposed at low tide, called the intertidal zone, is an important ecological product of ocean tides."]
	159.660	['Storm surge', 'A storm surge is a sudden rise of water hitting areas close to the coast. Storm surges are usually created by a hurricane or other tropical cyclone. The surge happens because a storm has fast winds and low atmospheric pressure. Water is pushed on shore, and the water level rises. Strong storm surges can flood coastal towns and destroy homes. A storm surge is considered th
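
The IVF-PQ rankings are not always identical to the exact ones: for "Who was Grace Hopper?", the brute-force search ranks a Grace Hopper passage first, while the IVF-PQ run above ranks Leona Helmsley first. One common way to quantify this loss is recall@k, the fraction of exact top-k neighbors that the approximate search also returns. The sketch below uses a hypothetical helper `recall_at_k` and made-up neighbor ids purely for illustration.

```python
def recall_at_k(exact_neighbors, approx_neighbors):
    """Fraction of exact top-k ids that the approximate search also returned."""
    hits = 0
    total = 0
    for exact_row, approx_row in zip(exact_neighbors, approx_neighbors):
        hits += len(set(exact_row) & set(approx_row))
        total += len(exact_row)
    return hits / total


# Hypothetical neighbor ids for two queries with k = 4.
exact = [[10, 11, 12, 13], [20, 21, 22, 23]]
approx = [[10, 12, 11, 99], [20, 21, 22, 23]]
toy_recall = recall_at_k(exact, approx)
print(toy_recall)
```

In this notebook, the exact ids could come from the brute-force search and the approximate ids from the IVF-PQ search over the same queries. Recall ignores the order within the top-k, so a swapped ranking like the one above still counts as a hit.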

### CAGRA

In [22]:
from pylibraft.neighbors import cagra

In [23]:
params = cagra.IndexParams(intermediate_graph_degree=128, graph_degree=64, build_algo="nn_descent")
cagra_index = cagra.build(params, corpus_embeddings)
search_params = cagra.SearchParams()
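
CAGRA is a graph-based index: each vector is linked to a fixed number of neighbors, and a query is answered by walking that graph toward closer and closer nodes. The sketch below shows only this general idea with a single-path greedy walk on a hand-built graph. It is not the CAGRA algorithm, which runs in parallel on the GPU and tracks many candidates at once, and `greedy_graph_search` and the toy graph are hypothetical.

```python
def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_graph_search(dataset, graph, query, start):
    """Walk the neighbor graph, always moving to the closest neighbor, and
    stop when no neighbor is closer to the query than the current node."""
    current = start
    current_dist = sq_l2(dataset[current], query)
    while True:
        best, best_dist = current, current_dist
        for nb in graph[current]:
            d = sq_l2(dataset[nb], query)
            if d < best_dist:
                best, best_dist = nb, d
        if best == current:
            return current, current_dist
        current, current_dist = best, best_dist


# Hypothetical 1-D points on a line, each linked to its immediate neighbors.
toy_dataset = [[0.0], [1.0], [2.0], [3.0], [4.0]]
toy_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
toy_node, toy_dist = greedy_graph_search(toy_dataset, toy_graph, query=[3.2], start=0)
print(toy_node)
```

The walk visits only the nodes along its path instead of the whole dataset. A denser graph gives the walk more ways to reach the true neighbor, at the cost of more memory and more distance computations per step.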

In [24]:
def search_raft_cagra(query, top_k = 5):
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)

    hits = cagra.search(search_params, cagra_index, question_embedding[None], top_k)

    # Output of top-k hits
    print("Input question:", query)
    for k in range(top_k):
        print("\t{:.3f}\t{}".format(hits[0][0, k], passages[hits[1][0, k]]))

In [25]:
%%time
search_raft_cagra(query="Who was Grace Hopper?")

Input question: Who was Grace Hopper?
	181.649	['Grace Hopper', 'Hopper was born in New York, USA. Hopper graduated from Vassar College in 1928 and Yale University in 1934 with a Ph.D degree in mathematics. She joined the US Navy during the World War II in 1943. She worked on computers in the Navy for 43 years. She then worked in other private industry companies after 1949. She retired from the Navy in 1986 and died on January 1, 1992.']
	192.946	['Leona Helmsley', 'Leona Helmsley (July 4, 1920 – August 20, 2007) was an American businesswoman. She was known for having a flamboyant personality. She had a reputation for tyrannical behavior; she was nicknamed the Queen of Mean.']
	194.951	['Grace Hopper', 'Grace Murray Hopper (December 9 1906 – January 1 1992) was an American computer scientist and United States Navy officer.']
	202.192	['Nellie Bly', 'Elizabeth Cochrane Seaman (born Elizabeth Jane Cochran; May 5, 1864 – Jan