# L4: Measuring Search Relevance

In [1]:
import warnings
warnings.filterwarnings('ignore')

### WANDS Annotation Dataset
We need to download the <font color=red>**three csv files**</font> to the <font color=red>**data/**</font> folder of this project

https://github.com/wayfair/WANDS/tree/main/dataset  

WANDS (Wayfair ANnotation Dataset) is a human-annotated relevance dataset that we use as the ground truth for evaluating search  

The data is stored in the dataset folder in three files:  
  
1. **product.csv** - Stores all candidate products, columns include:  
a. product_id - ID of a product  
b. product_name - String of product name  
c. product_class - Category which product falls under  
d. category_hierarchy - Parent categories of product, delimited by /  
e. product_description - String description of product  
f. product_features - | delimited string of attribute:value pairs which describe the product   
g. rating_count - Number of user ratings for product  
h. average_rating - Average rating the product received  
i. review_count - Number of user reviews for product  
    
2. **query.csv** - Stores search queries, columns include:  
a. query_id - unique ID for each query  
b. query - query string  
c. query_class - category the query falls under  
  
3. **label.csv** - Stores annotated relevance judgements for (query, product) pairs, columns include:  
a. id - Unique ID for each annotation    
b. query_id - ID of the query this annotation is for    
c. product_id - ID of the product this annotation applies to  
d. label - Relevance label, one of 'Exact', 'Partial', or 'Irrelevant'  
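
The `product_features` column packs several attributes into one string. Below is a minimal sketch of how such a string could be unpacked, assuming every `|`-separated part has the form `attribute:value`. The helper name `parse_features` is hypothetical, parts without a colon are skipped, and a repeated attribute name simply overwrites the earlier value.

```python
def parse_features(raw: str) -> dict[str, str]:
    """Split a '|'-delimited 'attribute:value' string into a dict."""
    features = {}
    for part in raw.split("|"):
        # Split on the first colon only, so a value may itself contain colons
        name, sep, value = part.partition(":")
        if not sep:
            continue  # no colon found: not an attribute:value pair
        features[name.strip()] = value.strip()
    return features


# Two attributes in the style of the product_features column
example = "capacityquarts:7|producttype : slow cooker"
print(parse_features(example))
```

This is only needed if you want to filter or display individual attributes; the embeddings in this lesson use `product_name` and `product_description` only.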

In [2]:
import pandas as pd

products_df = pd.read_csv(
    "data/WANDS/product.csv",
    sep="\t", # Tab separated columns
    index_col="product_id", # Consider it as an index
    keep_default_na=False,  # Some products do not have a description
)
products_df.head()

product_id,product_name,product_class,category hierarchy,product_description,product_features,rating_count,average_rating,review_count
0,solid wood platform bed,Beds,Furniture / Bedroom Furniture / Beds & Headboa...,"good , deep sleep can be quite difficult to ha...",overallwidth-sidetoside:64.7|dsprimaryproducts...,15.0,4.5,15.0
1,all-clad 7 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,"create delicious slow-cooked meals , from tend...",capacityquarts:7|producttype : slow cooker|pro...,100.0,2.0,98.0
2,all-clad electrics 6.5 qt . slow cooker,Slow Cookers,Kitchen & Tabletop / Small Kitchen Appliances ...,prepare home-cooked meals on any schedule with...,features : keep warm setting|capacityquarts:6....,208.0,3.0,181.0
3,all-clad all professional tools pizza cutter,"Slicers, Peelers And Graters",Browse By Brand / All-Clad,this original stainless tool was designed to c...,overallwidth-sidetoside:3.5|warrantylength : l...,69.0,4.5,42.0
4,baldwin prestige alcott passage knob with roun...,Door Knobs,Home Improvement / Doors & Door Hardware / Doo...,the hardware has a rich heritage of delivering...,compatibledoorthickness:1.375 '' |countryofori...,70.0,5.0,42.0


<p style="background-color:#edfbff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px"> For semantic search, we embed <b>product_name</b> and <b>product_description</b>, the two free-text columns that best describe a product</p>

In [3]:
# We have large strings in product_description column
# This is the 6th string in the column (for demonstration)
print(len(products_df['product_description'][5]))

1642


In [4]:
num_products = 5000

In [5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

product_name_embeddings = model.encode(products_df["product_name"][0:num_products].tolist())
product_name_embeddings.shape

(5000, 384)

In [6]:
product_description_embeddings = model.encode(products_df["product_description"][0:num_products].tolist())
product_description_embeddings.shape

(5000, 384)

### How `SentenceTransformer` Generates Embeddings per Sentence (RECAP)

When using `SentenceTransformer`, the model encodes each product name (or sentence) as a single vector, capturing the entire sentence’s meaning. Here’s a breakdown of the process using the examples `"Winter Jacket"` and `"Cozy Sweater"`.

#### Step-by-Step Process

1. **Tokenization**: The model splits each product name into tokens based on its vocabulary.
   - `"Winter Jacket"` → `["winter", "jacket"]`
   - `"Cozy Sweater"` → `["cozy", "sweater"]`
   
   Each word (or subword, in the case of uncommon words) is represented by tokens that are mapped to unique IDs corresponding to the model’s pretrained embeddings.

2. **Embedding Calculation**:
   - After tokenization, each token ID is looked up in the model's embedding table and passed through the transformer layers, which produce one contextual vector per token.
   - A pooling step then combines the token vectors into a **single fixed-size vector** for each sentence. For `all-MiniLM-L6-v2` this is mean pooling (averaging), which gives the 384-dimensional vectors seen above.

#### Final Output
For each product name, a single vector represents the combined meaning of the sentence, reflecting the relationships between the tokens.

#### Example
- `"Winter Jacket"` results in **one embedding vector** representing the whole phrase.
- `"Cozy Sweater"` results in **one embedding vector** for the entire phrase.

The model does <font color=red>**not create separate embeddings**</font> for each word within a product name.
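
To make the pooling step concrete, here is a toy sketch of averaging token vectors into one sentence vector. The 3-dimensional vectors are made up for illustration (a real model first computes contextual token vectors with its transformer layers), and `mean_pool` is a hypothetical helper, not part of the `sentence_transformers` API.

```python
def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average token vectors component-wise into one sentence vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]


# Made-up token vectors standing in for "winter" and "jacket"
winter = [0.25, 0.5, 0.0]
jacket = [0.75, 0.0, 0.5]

sentence_vector = mean_pool([winter, jacket])
print(sentence_vector)  # [0.5, 0.25, 0.25]
```

However many tokens go in, the result has the same dimensionality as a single token vector, which is why every product name ends up as one fixed-size vector.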


## Building the collection

In [7]:
products_df.index.shape

(42994,)

In [9]:
from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")

In [10]:
# Check the existence of collection
client.collection_exists(collection_name="wands-products")

False

In [11]:
client.delete_collection("wands-products") # Delete any pre-existing collection named wands-products

False

In [12]:
client.collection_exists(collection_name="wands-products")

False

In [13]:
# Create the collection
client.create_collection(
    collection_name="wands-products",
    vectors_config={
        "product_name": models.VectorParams(size=384, distance=models.Distance.COSINE),
        "product_description": models.VectorParams(size=384, distance=models.Distance.COSINE),
    },
    optimizers_config=models.OptimizersConfigDiff(default_segment_number=2, indexing_threshold=1000),
)

True

In [14]:
client.collection_exists(collection_name="wands-products")

True

In [15]:
# Check the details of the collection
print(client.get_collection(collection_name="wands-products"))

status=<CollectionStatus.GREEN: 'green'> optimizer_status=<OptimizersStatusOneOf.OK: 'ok'> vectors_count=None indexed_vectors_count=0 points_count=0 segments_count=2 config=CollectionConfig(params=CollectionParams(vectors={'product_description': VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None), 'product_name': VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}, shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors=None), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=2, max_segment_siz

<p style="background-color:orange; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px"> Note: "<b>ids</b>=products_df.<b>index</b>.tolist()" and the payload cover all 42,994 products, whereas the embeddings cover only the first 5,000. After the upload the collection reports <b>points count = 5000</b>, so only the rows that have embeddings appear to be stored. Whether the surplus ids and payloads have any side effect in the db was not verified, although search works as expected</p>
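
One way to rule out the length mismatch is to truncate all three inputs to a common length before uploading. A minimal sketch, with plain lists standing in for the real ids, payloads, and embeddings; `align_inputs` is a hypothetical helper, not a qdrant-client function.

```python
def align_inputs(ids, payloads, vectors):
    """Truncate ids, payloads and vectors to the length of the shortest one."""
    n = min(len(ids), len(payloads), len(vectors))
    return ids[:n], payloads[:n], vectors[:n]


ids = list(range(10))  # stand-in for products_df.index.tolist()
payloads = [{"product_name": f"item {i}"} for i in range(10)]
vectors = [[0.0, 1.0]] * 4  # embeddings exist only for the first 4 items

ids, payloads, vectors = align_inputs(ids, payloads, vectors)
print(len(ids), len(payloads), len(vectors))  # 4 4 4
```

In the notebook itself, the same effect could be had by slicing the DataFrame with `[:num_products]` before building the ids and the payload.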

In [16]:
client.upload_collection(
    collection_name="wands-products",
    vectors={
        "product_name": product_name_embeddings,
        "product_description": product_description_embeddings,
    },
    payload=products_df.to_dict(orient="records"), # Converting from an existing df
    ids=products_df.index.tolist(), # Index of the existing df
    batch_size=64, # The insert will be performed as batches of 64
)

In [17]:
client.count("wands-products") # Total count of the products

CountResult(count=5000)

In [18]:
# Check the details of the collection
print(client.get_collection(collection_name="wands-products"))

status=<CollectionStatus.GREEN: 'green'> optimizer_status=<OptimizersStatusOneOf.OK: 'ok'> vectors_count=None indexed_vectors_count=8704 points_count=5000 segments_count=2 config=CollectionConfig(params=CollectionParams(vectors={'product_description': VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None), 'product_name': VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}, shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors=None), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=2, max_segme

In [19]:
import time

time.sleep(1.0)
collection = client.get_collection("wands-products")
while collection.status != models.CollectionStatus.GREEN: # Green = Vector DB Index ready
    time.sleep(1.0)
    collection = client.get_collection("wands-products")
    
collection # Collection info, read once the index is ready

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=None, indexed_vectors_count=8704, points_count=5000, segments_count=2, config=CollectionConfig(params=CollectionParams(vectors={'product_description': VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None), 'product_name': VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}, shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors=None), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segmen

## Test queries

<div>
<img src="img/l4_relevancy_1.jpg" width="400"/>
</div>

<p style="background-color:#edfbff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px"> Some relevance metrics: <br>
    <b><font color="blue">Precision</font></b> = Relevant results returned, out of all <b>returned results</b> (predicted positives) <br>
    <b><font color="blue">Recall</font></b> = Relevant results returned, out of all <b>relevant documents</b> (actual positives) <br>
    <b><font color="blue">MRR</font></b> = Average, across all queries, of the <b>reciprocal rank</b> of the first relevant result <br>
    <b><font color="blue">NDCG</font></b> = Measures ranking quality by giving higher weight to relevant documents at the <b>top ranks</b></p>
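
The definitions above can be sketched in plain Python for a single query. This is an illustrative version with made-up document ids and scores; it uses the linear-gain form of DCG, so its numbers are not guaranteed to match a library such as ranx in every detail. All function names are hypothetical.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked, relevant, k):
    """1 / rank of the first relevant result within the top k, or 0 if none."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0


def dcg_at_k(ranked, gains, k):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(
        gains.get(doc, 0) / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
    )


def ndcg_at_k(ranked, gains, k):
    """DCG divided by the DCG of the ideal (best possible) ordering."""
    ideal = sorted(gains, key=gains.get, reverse=True)
    ideal_dcg = dcg_at_k(ideal, gains, k)
    return dcg_at_k(ranked, gains, k) / ideal_dcg if ideal_dcg else 0.0


# Made-up ground truth and ranking for one query
gains = {"doc_a": 10, "doc_b": 5, "doc_c": 0}
relevant = {doc for doc, gain in gains.items() if gain > 0}
ranked = ["doc_c", "doc_a", "doc_x", "doc_b"]

print(precision_at_k(ranked, relevant, 3))   # 1 of the top 3 is relevant
print(recall_at_k(ranked, relevant, 3))      # 1 of the 2 relevant docs is found
print(reciprocal_rank(ranked, relevant, 3))  # first relevant result is at rank 2
print(ndcg_at_k(ranked, gains, 3))
```

MRR is the mean of `reciprocal_rank` over all queries; the other metrics are likewise averaged over queries when reported for a whole test set.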

In [20]:
queries_df = pd.read_csv(
    "data/WANDS/query.csv", 
    sep="\t", 
    index_col="query_id",
)
queries_df.head()

query_id,query,query_class
0,salon chair,Massage Chairs
1,smart coffee table,Coffee & Cocktail Tables
2,dinosaur,Kids Wall Décor
3,turquoise pillows,Accent Pillows
4,chair and a half recliner,Recliners


## Ground truth

In [21]:
labels_df = pd.read_csv(
    "data/WANDS/label.csv", 
    sep="\t", 
)
labels_df.sample(n=5)

Unnamed: 0,id,query_id,product_id,label
228745,229312,127,35528,Irrelevant
191198,191765,446,4616,Irrelevant
67418,67569,26,39562,Exact
165338,165905,416,29435,Partial
216603,217170,52,31211,Partial


Now we have  
- `products_df`: all products and their details
- `queries_df`: all search queries
- `labels_df`: the ground truth, with one `query-product-label` row per annotated pair

In [22]:
# Printing the query_id 2
print(f'Query id 2 : {queries_df.loc[2]}')

# To see a particular query in multiple entries showing different level of matches
print(labels_df[(labels_df['query_id'] == 2) & (labels_df['label'] == 'Exact')])

Query id 2 : query                 dinosaur
query_class    Kids Wall Décor
Name: 2, dtype: object
          id  query_id  product_id  label
252      252         2        4205  Exact
253      253         2        4202  Exact
254      254         2        4204  Exact
255      255         2       36622  Exact
256      256         2       29777  Exact
...      ...       ...         ...    ...
406      406         2       42516  Exact
21668  21668         2       18868  Exact
21669  21669         2       24099  Exact
21670  21670         2       21055  Exact
21671  21671         2       23033  Exact

[151 rows x 4 columns]


In [23]:
relevancy_scores = {
    "Exact": 10,
    "Partial": 5,
    "Irrelevant": 0,
}

labels_df["score"] = labels_df["label"].map(relevancy_scores.get)  # Look up the numeric score for each label
labels_df["query_id"] = labels_df["query_id"].map(lambda x: f"query_{x}")  # e.g. 2 -> "query_2"
labels_df["product_id"] = labels_df["product_id"].map(lambda x: f"doc_{x}")  # e.g. 4205 -> "doc_4205"
labels_df.sample(n=5)  # Shows a random sample of 5 rows

Unnamed: 0,id,query_id,product_id,label,score
84255,84406,query_128,doc_37815,Partial,5
91721,91872,query_152,doc_31116,Partial,5
136678,136852,query_331,doc_36119,Partial,5
68873,69024,query_27,doc_3263,Partial,5
218514,219081,query_77,doc_42465,Partial,5


## ranx

<p style="background-color:#edfbff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px"> The evaluation metrics operate on <b>Qrels = query relevance judgements</b> (the ground truth) and <b>Runs = the ranked output of a model</b>, and compare the two to compute the metrics <br> </p>
The <font color="blue">Qrels</font> are structured as: <br>
<ul>
    <li>query_0 : [all the docs and their <b>True</b> relevance scores to query_0]</li>
    <li>query_1 : [all the docs and their <b>True</b> relevance scores to query_1]</li>
    <li>...</li>
    <li>query_n : [all the docs and their <b>True</b> relevance scores to query_n]</li>
</ul>

The <font color="blue">Runs</font> have the same structure, but hold the <b>Predicted</b> relevance scores instead
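
As a rough picture of the two structures, both can be written as nested dictionaries keyed by query id and then by document id. The ids and scores below are made up, and treating a document that is missing from the ground truth as non-relevant (score 0) is an assumption of this sketch.

```python
# Ground truth: query id -> {doc id -> true relevance score}
qrels_dict = {
    "query_0": {"doc_12": 10, "doc_25": 5, "doc_7": 0},
    "query_1": {"doc_3": 10, "doc_12": 0},
}

# Model output: query id -> {doc id -> predicted similarity score}
run_dict = {
    "query_0": {"doc_25": 0.91, "doc_7": 0.80, "doc_12": 0.42},
    "query_1": {"doc_3": 0.77, "doc_99": 0.65},
}

# Rank the documents of one query by predicted score, then look up their true scores
ranking = sorted(run_dict["query_0"], key=run_dict["query_0"].get, reverse=True)
for doc_id in ranking:
    true_score = qrels_dict["query_0"].get(doc_id, 0)  # unjudged docs count as 0 here
    print(doc_id, true_score)
```

The metrics only need these two mappings: the run supplies the order of the documents, and the qrels supply how relevant each ranked document really is.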

In [24]:
from ranx import Qrels

# Create a Qrels (Query Relevance) dictionary from df 
qrels = Qrels.from_df(
    labels_df.astype({"query_id": "str", "product_id": "str"}),
    q_id_col="query_id",
    doc_id_col="product_id", 
    score_col="score",
)

In [25]:
print(f'Number of docs in query_0 : {len(qrels["query_0"])}')

import random

num_items_to_print = 10 # Take a random sample of 10 query-document-labels from the Qrels

# Print only random 10 query-document-labels labels against query 0 
for key in random.sample(list(qrels['query_0'].keys()), num_items_to_print):
    print(f'qrels["query_0"][{key}] = {qrels["query_0"][key]}')

Number of docs in query_0 : 3302
qrels["query_0"][doc_22463] = 0
qrels["query_0"][doc_21387] = 5
qrels["query_0"][doc_8593] = 5
qrels["query_0"][doc_34978] = 5
qrels["query_0"][doc_34140] = 5
qrels["query_0"][doc_5352] = 5
qrels["query_0"][doc_13501] = 5
qrels["query_0"][doc_21610] = 5
qrels["query_0"][doc_18274] = 5
qrels["query_0"][doc_4761] = 5


### Running all the queries

In [26]:
# Encode every query and store them in a separate column in the queries_df
queries_df["query_embedding"] = model.encode(queries_df["query"].tolist()).tolist()
queries_df.sample(n=5)

query_id,query,query_class,query_embedding
427,lowes tile,Floor & Wall Tile,"[-0.05529312044382095, 0.06466194987297058, -0..."
168,printers,Carts & Stands,"[-0.08087825775146484, 0.03678595647215843, -0..."
309,moen 5995 arbor one,Kitchen Faucets,"[-0.11051099747419357, -0.020648468285799026, ..."
405,queen ann style living room chair,Accent Chairs,"[0.038474176079034805, 0.0024251751601696014, ..."
162,turquoise chair,Accent Chairs,"[-0.008627337403595448, 0.04856478422880173, 0..."


In [27]:
# Just a single search for demonstration
single_query_result = client.search(
        collection_name="wands-products",
        query_vector=models.NamedVector( # models.NamedVector is used to specify the name of the vector field to search against
            name="product_name", # Similarity search for product_name column
            vector=queries_df.loc[0]["query_embedding"] # Embedding of the query with query_id 0
        ),
        # Do not inundate with unnecessary metadata in the results
        with_vectors=False,
        with_payload=False,
        limit=5,
)

print(f'Query = {queries_df.loc[0]["query"]}')
print(f'Top 5 results : {single_query_result}') # Returns a list of ScoredPoint objects consisting of the matched documents
print(f'\nMatching Name : {products_df.loc[single_query_result[3].id]["product_name"]}') # Only accessing the 4th result 
print(f'Matching Description : {products_df.loc[single_query_result[3].id]["product_description"]}') # Only accessing the 4th result 

Query = salon chair
Top 5 results : [ScoredPoint(id=4410, version=68, score=0.7512184, payload=None, vector=None, shard_key=None, order_value=None), ScoredPoint(id=4034, version=63, score=0.7441742, payload=None, vector=None, shard_key=None, order_value=None), ScoredPoint(id=251, version=3, score=0.72697634, payload=None, vector=None, shard_key=None, order_value=None), ScoredPoint(id=2187, version=34, score=0.72043705, payload=None, vector=None, shard_key=None, order_value=None), ScoredPoint(id=975, version=15, score=0.6929848, payload=None, vector=None, shard_key=None, order_value=None)]

Matching Name : fashion casual lift chair office work chair beauty salon chair
Matching Description : it is a new product with absolutely no damage .


In [None]:
from collections import defaultdict

# defaultdict(dict) simplifies the nested dictionary structure and ensures
# reliable access without manually initializing each new key, making the code more robust and easier to read
name_run_dict = defaultdict(dict)
for query_idx, row in queries_df.iterrows():  # query_idx avoids shadowing the built-in id()
    query_id = f"query_{query_idx}"

    #  Initiates a search query in the wands-products collection, based on the input query_vector
    results = client.search(
        collection_name="wands-products",
        query_vector=models.NamedVector( # models.NamedVector is used to specify the name of the vector field to search against
            name="product_name", # At first we store the similarity results for product_name column
            vector=row["query_embedding"] # Each row represents one query text (query_id)
        ),
        # Do not inundate with unnecessary metadata in the results
        with_vectors=False,
        with_payload=False,
        limit=100,
    )

    for point in results:
        document_id = f"doc_{point.id}"
        name_run_dict[query_id][document_id] = point.score  
    
name_run_dict

In [29]:
from ranx import Run

product_name_run = Run(name_run_dict, name="product_name")

In [30]:
description_run_dict = defaultdict(dict)
for query_idx, row in queries_df.iterrows():  # query_idx avoids shadowing the built-in id()
    query_id = f"query_{query_idx}"

    #  Initiates a search query in the wands-products collection, based on the input query_vector
    results = client.search(
        collection_name="wands-products",
        query_vector=models.NamedVector( # models.NamedVector is used to specify the name of the vector field to search against
            name="product_description", # Now we store the similarity results for product_description column
            vector=row["query_embedding"]
        ),
        with_vectors=False,
        with_payload=False,
        limit=100, # Top 100 most similar items
    )

    for point in results:
        document_id = f"doc_{point.id}"
        description_run_dict[query_id][document_id] = point.score 

product_description_run = Run(
    description_run_dict, 
    name="product_description"
)

<p style="background-color:#edfbff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px"> We created <b>run dictionaries</b> for the <b>product_name</b> and <b>product_description</b> vectors and wrapped each in a <b>Run</b>. The compare function evaluates <b>multiple runs</b> against the <b>ground truth</b> at once and shows <font color="blue"><b>which one performs best</b></font> on each metric</p>

In [31]:
from ranx import compare

compare(
    qrels=qrels,
    runs=[
        product_name_run, 
        product_description_run
    ],
    metrics=[
        "precision@10", 
        "recall@10", 
        "mrr@10",
        "dcg@10", 
        "ndcg@10",
    ],
)

#    Model                P@10    Recall@10    MRR@10    DCG@10    NDCG@10
---  -------------------  ------  -----------  --------  --------  ---------
a    product_name         0.616ᵇ  0.034ᵇ       0.807ᵇ    19.132ᵇ   0.517ᵇ
b    product_description  0.457   0.024        0.688     14.497    0.388