
# Semantic Search


This walkthrough shows how to use Qdrant for semantic search. To begin, install the required libraries:

### Install libraries

In [1]:
!pip install -qU \
  datasets==2.12.0 \
  sentence-transformers==2.2.2 \
  qdrant-client==1.2.0

### Import libraries

In [2]:
import torch
import time
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.http import models



## Data Preprocessing

The dataset preparation process requires a few steps:

1. We download the SQuAD (Stanford Question Answering Dataset) dataset from Hugging Face Datasets.

2. The question content of the dataset is embedded into vectors.

3. We reformat the data into an `(id, vector, payload)` structure. Points are the central entity that Qdrant operates on. Each point is a record consisting of an id, a vector, and an optional payload.
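
As a rough illustration of the record structure in step 3, the sketch below models a point as a plain Python dataclass and checks its ID. `Point` and `is_valid_point_id` are hypothetical names used only for illustration and are not part of `qdrant-client`. The check assumes Qdrant's documented rule that point IDs are either unsigned 64-bit integers or UUIDs.

```python
import uuid
from dataclasses import dataclass, field


def is_valid_point_id(point_id) -> bool:
    """Return True for an unsigned 64-bit integer or a UUID string."""
    if isinstance(point_id, bool):  # bool is a subclass of int; reject it
        return False
    if isinstance(point_id, int):
        return 0 <= point_id < 2**64
    if isinstance(point_id, str):
        try:
            uuid.UUID(point_id)
            return True
        except ValueError:
            return False
    return False


@dataclass
class Point:
    """Hypothetical stand-in for one (id, vector, payload) record."""
    id: int
    vector: list
    payload: dict = field(default_factory=dict)


point = Point(id=0, vector=[0.1, 0.2, 0.3], payload={"question": "example?"})
print(is_valid_point_id(point.id))  # an integer ID
print(is_valid_point_id("0"))       # a bare numeric string is not a UUID
```

In practice the client library's own point types would be used; the dataclass here only makes the three fields explicit.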


### Load dataset

In [3]:
dataset = load_dataset("squad", split="train[0:87000]")
dataset



Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87000
})

The dataset contains ~87K questions posed by crowdworkers on a set of Wikipedia articles.

### Questions from dataset

In [4]:
# Print the first 5 questions in the dataset
dataset[:5]["question"]

['To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'What is in front of the Notre Dame Main Building?',
 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?',
 'What is the Grotto at Notre Dame?',
 'What sits on top of the Main Building at Notre Dame?']

We can extract all questions into a single `questions` list and remove duplicates, since we do not want the same question repeated in our query results.

### Store questions in a list

In [5]:
# Initialize an empty list to store the questions
questions = []

# Iterate over each record in the dataset
for record in dataset:
    question = record["question"]  # Extract the question from the current record
    questions.append(question)  # Append the question to the list of questions

# Remove duplicates so that each question is unique
questions = list(set(questions))

# Print the first 5 unique questions using '\n' as a separator
print("First 5 unique questions--")
print("\n".join(questions[:5]))

First 5 unique questions--
Do all regions perceive that term "black people" the same?
Who is the director of the Genome Center at Washington University?
Who argued in 2003 that all clades are by definition monophyletic groups?
In which color model is green one of the additive primary colors?
How long did the Hollywood round air for in season eight of American Idol?
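
Note that `set` does not preserve order, so the five questions printed above can differ between runs (Python randomizes string hashing per process by default). If a reproducible order matters, one option is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each item. This is a minimal sketch; `unique_in_order` and the sample strings are illustrative only.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


sample = ["What is Qdrant?", "What is a vector?", "What is Qdrant?"]
print(unique_in_order(sample))
```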


With our questions ready, we can move on to steps **2** and **3** above.


To create our embeddings we will use the `all-MiniLM-L6-v2` sentence transformer model, a compact and efficient semantic similarity embedding model available through the `sentence-transformers` library.

### Instantiate SentenceTransformer model

In [6]:
# Check if a CUDA-enabled GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
if device != "cuda":
    print(
        f"You are using {device}. This is much slower than using "
        "a CUDA-enabled GPU. If on Colab you can change this by "
        "clicking Runtime > Change runtime type > GPU."
    )

# Instantiate the SentenceTransformer model
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

There are *three* interesting bits of information in the above model printout. Those are:

* `max_seq_length` is `256`. This is the maximum number of tokens (words or word pieces) that can be encoded into a single vector embedding. Any input beyond this length is truncated.

* `word_embedding_dimension` is `384`. This number is the dimensionality of vectors output by this model. It is important that we know this number later when initializing our Qdrant vector collection.

* `Normalize()`. This final step indicates that all vectors produced by the model are normalized to unit length. As a result, where we would typically measure similarity with *cosine similarity*, we can equally use the *dot product* metric: for normalized vectors the two are equivalent.
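
The equivalence of cosine similarity and dot product for normalized vectors can be checked with a few lines of plain Python. This is a small self-contained sketch on made-up two-dimensional vectors, not on the model's actual 384-dimensional output.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


a = [3.0, 4.0]
b = [1.0, 2.0]
a_hat, b_hat = normalize(a), normalize(b)

# Cosine of the raw vectors vs. dot product of the normalized ones
print(cosine(a, b), dot(a_hat, b_hat))
```

Both printed values agree up to floating-point error, because dividing by the lengths is exactly what normalization has already done.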

Moving on, we can create a sentence embedding using this model like so:

### Building Embeddings

In [7]:
# example on how to encode query
query = "which city is the most populated in the world?"

# encoding the query into a vector
encoded_query = model.encode(query)

# getting the dimensionality of the vector
encoded_query.shape

(384,)

Encoding this single sentence leaves us with a `384` dimensional sentence embedding (aligned to the `word_embedding_dimension` above).

To prepare this for `upsert` to Qdrant, all we do is this:

### Building Upsert Format

In [8]:
# Unique identifier for the vector (Qdrant IDs are unsigned integers or UUIDs)
_id = 0

# encoding the query into a vector
encoded_query = model.encode(query)

# Additional payload associated with the vector
payload = {"question": query}

# List of tuples representing the points to be upserted
points = [(_id, encoded_query, payload)]

Later, when we upsert our data to Qdrant, we will do so in a single batch using the `Batch` class from the `models` module.
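
A batch is columnar: it holds parallel lists of ids, vectors, and payloads rather than a list of per-point tuples. The sketch below shows one way to convert between the two layouts in plain Python. `to_columns` and the sample data are hypothetical and only illustrate the reshaping step.

```python
def to_columns(points):
    """Split [(id, vector, payload), ...] into three parallel lists."""
    if not points:
        return [], [], []
    ids, vectors, payloads = zip(*points)
    return list(ids), [list(v) for v in vectors], list(payloads)


example_points = [
    (0, [0.1, 0.2], {"question": "first?"}),
    (1, [0.3, 0.4], {"question": "second?"}),
]

ids, vectors, payloads = to_columns(example_points)
print(ids)
print(vectors)
print(payloads)
```

The three lists must stay the same length and in the same order, since position `i` in each list describes the same point.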

We begin by initializing our connection to Qdrant. Here we use an in-memory instance:

### Initializing Qdrant client

In [9]:
# Initialize Qdrant client
client = QdrantClient(":memory:")

Now we create a new collection called `semantic-search`. It's important that the collection's vector `size` and `distance` parameters match the output of the `all-MiniLM-L6-v2` model.

## Creating a Collection

Now that the data is ready, we can set up our collection to store it.

In [10]:
question_collection = "semantic-search"

collections = client.get_collections()
print(collections)

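# Only create the collection if it doesn't already exist.
# get_collections() returns a response object, so compare against the names.
existing_names = [c.name for c in collections.collections]
if question_collection not in existing_names: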
    client.recreate_collection(
        collection_name=question_collection,
        vectors_config=models.VectorParams(
            size=model.get_sentence_embedding_dimension(),  # specifying dimensionality of vectors output by model
            distance=models.Distance.COSINE,
        ),
    )
collections = client.get_collections()
print(collections)

collections=[]
collections=[CollectionDescription(name='semantic-search')]


Now we upsert the data.

### Upserting vectors

Create IDs and payloads for your vectors.

Note: the batch upload below is given native Python lists rather than NumPy arrays, which is why `.tolist()` is called on the embedding matrix.

In [11]:
# Start the timer to measure total time elapsed.
start_time = time.time()

ids = list(range(len(questions)))  # creating index for vectors
encoded_queries = model.encode(questions)  # encoding the questions into vectors

# Use each question as the payload of its point
payload = [{"question": question} for question in questions]

client.upsert(
    collection_name=question_collection,
    points=models.Batch(ids=ids, vectors=encoded_queries.tolist(), payloads=payload),
)

elapsed_time = time.time() - start_time
print("Elapsed time:", elapsed_time, "seconds")
print(
    "vector count in collection- ",
    client.get_collection(collection_name=question_collection).vectors_count,
)

Elapsed time: 156.37646007537842 seconds
vector count in collection-  86764
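
The cell above encodes and uploads all questions in one call. For larger datasets, a common alternative is to work in fixed-size chunks so that only one chunk of vectors is held in memory at a time. The following is a minimal sketch of the chunking logic only. `chunked`, the sample data, and the batch size of 4 are illustrative assumptions; in the notebook, each chunk would be encoded with `model.encode` and passed to `client.upsert`.

```python
def chunked(items, batch_size):
    """Yield (start_index, sublist) pairs of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


sample_questions = [f"question {i}" for i in range(10)]

for start, batch in chunked(sample_questions, batch_size=4):
    # IDs continue from where the previous chunk stopped
    batch_ids = list(range(start, start + len(batch)))
    # In the notebook, this is where the chunk would be encoded and upserted
    print(batch_ids, batch)
```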


## Making Queries

Now that our collection is populated we can begin making queries. We are performing a semantic search for *similar questions*, so we should embed and search with another question. Let's begin.

In [12]:
query = "who is the inventor of light bulb?"
# create the query vector
encoded_query = model.encode(query).tolist()


# now query
def search(encoded_query):
    try:
        return client.search(
            collection_name=question_collection,
            query_vector=encoded_query,
            limit=5,
        )
    except Exception as e:
        print(f"Search failed: {e}")
        return []


query_results = search(encoded_query)
if len(query_results) == 0:
    print("No results found. Check the client search arguments and collection upserts.")
else:
    # printing the 5 results of the query
    print("Top 5 Results--")
    for i, result in enumerate(query_results):
        print(f"• {i+1}: {result}")

Top 5 Results--
• 1: id=49359 version=0 score=0.7751039089283264 payload={'question': 'Which inventors patented the tungsten filament lamp?'} vector=None
• 2: id=12305 version=0 score=0.7499778499609704 payload={'question': 'How many inventors came up with electric lamps before Thomas Edison?'} vector=None
• 3: id=23455 version=0 score=0.7359869975580777 payload={'question': 'Who patented an incandescent light bulb in Russia in 1874?'} vector=None
• 4: id=27396 version=0 score=0.7244679924060471 payload={'question': 'What company invented the tantalum light filament?'} vector=None
• 5: id=49122 version=0 score=0.6791491146434465 payload={'question': 'Who first patented a method to produce high-brightness blue LEDs?'} vector=None
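
Conceptually, each `score` above is the cosine similarity between the query vector and a stored vector, and the search returns the highest-scoring points. The sketch below reproduces that idea as an exhaustive scan in plain Python on made-up two-dimensional vectors. It illustrates the ranking only: a Qdrant server typically avoids a full scan by using an approximate nearest-neighbour index (HNSW). `top_k` and the sample data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k):
    """Score every (id, vector) pair and return the k best as (id, score)."""
    scored = [(point_id, cosine(query_vector, vec)) for point_id, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


stored = [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (2, [0.7, 0.7])]
results = top_k([1.0, 0.1], stored, k=2)

for point_id, score in results:
    print(point_id, round(score, 2))
```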


In the returned response `query_results` we can see the most relevant questions to our particular query. We can reformat this response to be a little easier to read:

### Getting scores of query results

In [13]:
print("S.no." + " " * 1 + "Score   Similar Questions")
# printing only the scores and questions of the 5 results
for i, result in enumerate(query_results):
    print(f"• {i+1} : {round(result.score, 2)}--> {result.payload['question']}")

S.no. Score   Similar Questions
• 1 : 0.78--> Which inventors patented the tungsten filament lamp?
• 2 : 0.75--> How many inventors came up with electric lamps before Thomas Edison?
• 3 : 0.74--> Who patented an incandescent light bulb in Russia in 1874?
• 4 : 0.72--> What company invented the tantalum light filament?
• 5 : 0.68--> Who first patented a method to produce high-brightness blue LEDs?


These are good results. Let's change the wording of the query to see whether we still surface similar results.

### Modifying the query while keeping the meaning the same

In [14]:
query = "which person is accredited with the invention of light bulb?"
encoded_query = model.encode(query).tolist()

# now query
query_results = search(encoded_query)

if len(query_results) == 0:
    print("No results found. Check the client search arguments and collection upserts.")
else:
    print("S.no." + " " * 1 + "Score    Similar Questions")
    for i, result in enumerate(query_results):
        print(f"• {i+1} : {round(result.score, 2)} --> {result.payload['question']}")

S.no. Score    Similar Questions
• 1 : 0.71 --> Which inventors patented the tungsten filament lamp?
• 2 : 0.7 --> How many inventors came up with electric lamps before Thomas Edison?
• 3 : 0.67 --> Who patented an incandescent light bulb in Russia in 1874?
• 4 : 0.63 --> Who first patented a method to produce high-brightness blue LEDs?
• 5 : 0.62 --> What company invented the tantalum light filament?


Here the query uses different terms from those in the returned questions.

Despite the very limited term overlap between the query and the returned questions, we still get highly relevant results. This is the power of *semantic search*.
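
To make the low term overlap concrete, the sketch below compares the words in the rephrased query with the words in its top-ranked result using Jaccard similarity (shared words divided by all distinct words). The `tokens` and `jaccard` helpers are illustrative and use a deliberately simple lowercase word split.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of distinct words that two texts have in common."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


query = "which person is accredited with the invention of light bulb?"
top_hit = "Which inventors patented the tungsten filament lamp?"

shared = tokens(query) & tokens(top_hit)
print(sorted(shared), round(jaccard(query, top_hit), 2))  # ['the', 'which'] 0.13
```

The only shared words are function words, so a purely keyword-based match would have little to go on here, while the embedding similarity for the same pair is 0.71.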

You can go ahead and ask more questions above. When you're done, delete the collection to save resources:

### Delete collection to save resources

In [15]:
client.delete_collection(collection_name=question_collection)

True

---