# L4: Measuring Search Relevance

In [1]:
import warnings
warnings.filterwarnings('ignore')

### WANDS Annotation Dataset
https://github.com/wayfair/WANDS/tree/main/dataset  

WANDS (Wayfair ANnotation Dataset) is a human-annotated product search dataset that we can treat as a gold standard for relevance evaluation.  

The data is stored in the dataset folder in three files:  
  
1. **product.csv** - Stores all candidate products, columns include:  
a. product_id - ID of a product  
b. product_name - String of product name  
c. product_class - Category which product falls under  
d. category_hierarchy - Parent categories of product, delimited by /  
e. product_description - String description of product  
f. product_features - | delimited string of attribute:value pairs which describe the product   
g. rating_count - Number of user ratings for product  
h. average_rating - Average rating the product received  
i. review_count - Number of user reviews for product  
    
2. **query.csv** - Stores search queries, columns include:  
a. query_id - unique ID for each query  
b. query - query string  
c. query_class - category the query falls under  
  
3. **label.csv** - Stores annotated (query, product) relevance judgements, columns include:  
a. id - Unique ID for each annotation    
b. query_id - ID of the query this annotation is for    
c. product_id - ID of the product this annotation applies to  
d. label - Relevance label, one of 'Exact', 'Partial', or 'Irrelevant'  

In [3]:
import pandas as pd

products_df = pd.read_csv(
    "data/wands/product.csv",
    sep="\t",  # Columns are tab-separated
    index_col="product_id",  # Use product_id as the row index
    keep_default_na=False,  # Keep missing descriptions as empty strings instead of NaN
)
products_df.head()

| product_id | product_name | product_class | category hierarchy | product_description | product_features | rating_count | average_rating | review_count |
|---|---|---|---|---|---|---|---|---|
| 0 | solid wood platform bed | Beds | Furniture / Bedroom Furniture / Beds & Headboa... | good , deep sleep can be quite difficult to ha... | overallwidth-sidetoside:64.7\|dsprimaryproducts... | 15.0 | 4.5 | 15.0 |
| 1 | all-clad 7 qt . slow cooker | Slow Cookers | Kitchen & Tabletop / Small Kitchen Appliances ... | create delicious slow-cooked meals , from tend... | capacityquarts:7\|producttype : slow cooker\|pro... | 100.0 | 2.0 | 98.0 |
| 2 | all-clad electrics 6.5 qt . slow cooker | Slow Cookers | Kitchen & Tabletop / Small Kitchen Appliances ... | prepare home-cooked meals on any schedule with... | features : keep warm setting\|capacityquarts:6.... | 208.0 | 3.0 | 181.0 |
| 3 | all-clad all professional tools pizza cutter | Slicers, Peelers And Graters | Browse By Brand / All-Clad | this original stainless tool was designed to c... | overallwidth-sidetoside:3.5\|warrantylength : l... | 69.0 | 4.5 | 42.0 |
| 4 | baldwin prestige alcott passage knob with roun... | Door Knobs | Home Improvement / Doors & Door Hardware / Doo... | the hardware has a rich heritage of delivering... | compatibledoorthickness:1.375 '' \|countryofori... | 70.0 | 5.0 | 42.0 |
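
The `product_features` column packs several `attribute:value` pairs into one `|`-delimited string. A minimal sketch of how such a string could be unpacked is shown below; the helper name `parse_product_features` is hypothetical and not part of the WANDS tooling, and it assumes an attribute may appear more than once, so values are collected in lists.

```python
def parse_product_features(raw):
    """Split a '|'-delimited string of attribute:value pairs into a dict of lists."""
    features = {}
    for pair in raw.split("|"):
        if ":" not in pair:
            continue  # Skip fragments that are not attribute:value pairs
        key, value = pair.split(":", 1)  # Split on the first colon only
        features.setdefault(key.strip(), []).append(value.strip())
    return features


# Prefix of the product_features value shown for product 1 above
sample_features = parse_product_features("capacityquarts:7|producttype : slow cooker")
print(sample_features)
```

Splitting on the first colon only keeps values that themselves contain a colon intact.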


<p style="background-color:#edfbff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px"> For semantic search, we will embed <b>product_name</b> and <b>product_description</b>, because they are the free-text fields that describe each product</p>

In [4]:
num_products = 5000

In [5]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

product_name_embeddings = model.encode(products_df["product_name"][0:num_products].tolist())
product_name_embeddings.shape

(5000, 384)

In [6]:
product_description_embeddings = model.encode(products_df["product_description"][0:num_products].tolist())
product_description_embeddings.shape

(5000, 384)

### How `SentenceTransformer` Generates Embeddings per Sentence (RECAP)

When using `SentenceTransformer`, the model encodes each product name (or sentence) as a single vector, capturing the entire sentence’s meaning. Here’s a breakdown of the process using the examples `"Winter Jacket"` and `"Cozy Sweater"`.

#### Step-by-Step Process

1. **Tokenization**: The model splits each product name into tokens based on its vocabulary.
   - `"Winter Jacket"` → `["winter", "jacket"]`
   - `"Cozy Sweater"` → `["cozy", "sweater"]`
   
   Each word (or subword piece, for uncommon words) becomes a token, and each token maps to a unique ID in the model’s vocabulary.

2. **Embedding Calculation**:
   - Each token ID is looked up in the model's embedding table, and the transformer layers turn these into context-aware token embeddings.
   - The model then pools the token embeddings into a **single fixed-size vector** for each sentence. `all-MiniLM-L6-v2` uses mean pooling, which yields the 384-dimensional vectors shown above.

#### Final Output
For each product name, a single vector represents the combined meaning of the sentence, reflecting the relationships between the tokens.

#### Example
- `"Winter Jacket"` results in **one embedding vector** representing the whole phrase.
- `"Cozy Sweater"` results in **one embedding vector** for the entire phrase.

The model does <font color=red>**not return separate embeddings**</font> for each word within a product name; token-level embeddings exist only internally, before pooling.
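
To make the pooling step concrete, here is a minimal sketch of mean pooling in plain Python. The 3-dimensional token vectors are made up for illustration (the real model produces 384-dimensional contextual embeddings), and `mean_pool` is a hypothetical helper, not a `sentence_transformers` function.

```python
def mean_pool(token_vectors):
    """Average a list of equally sized token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Made-up token embeddings: two tokens for one phrase, three for the other
winter_jacket_tokens = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
cozy_sweater_tokens = [[6.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]

winter_jacket_vector = mean_pool(winter_jacket_tokens)
cozy_sweater_vector = mean_pool(cozy_sweater_tokens)

# Both phrases end up with a vector of the same size, regardless of token count
print(len(winter_jacket_vector), len(cozy_sweater_vector))
```

This is why inputs of any length can be stored in the same fixed-size vector field.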


## Building the collection

In [None]:
from qdrant_client import QdrantClient, models

client = QdrantClient("http://localhost:6333")
client.delete_collection("wands-products")  # Drop any previous copy of the collection
client.create_collection(
    collection_name="wands-products",
    # Two named vectors per point, one for each embedded text field
    vectors_config={
        "product_name": models.VectorParams(size=384, distance=models.Distance.COSINE),
        "product_description": models.VectorParams(size=384, distance=models.Distance.COSINE),
    },
    optimizers_config=models.OptimizersConfigDiff(
        default_segment_number=2,
        indexing_threshold=1000,
    ),
)
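
Both vector fields are compared with cosine distance. As a rough illustration of what that similarity measures, the sketch below computes it by hand in plain Python; `cosine_similarity` is a hypothetical helper written for this example and is not how Qdrant implements it internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])  # Length does not matter
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # No shared direction
print(same_direction, orthogonal)
```

Because only the direction of the vectors matters, a long description and a short query can still score as highly similar.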

In [7]:
client.upload_collection(
    collection_name="wands-products",
    vectors={
        "product_name": product_name_embeddings,
        "product_description": product_description_embeddings,
    },
    payload=products_df.iloc[:num_products].to_dict(orient="records"),  # One payload per embedded product
    ids=products_df.index[:num_products].tolist(),  # product_id values of the embedded products
    batch_size=64, # The insert will be performed as batches of 64
)


In [None]:
client.count("wands-products") # Total count of the products

In [None]:
import time

time.sleep(1.0)
collection = client.get_collection("wands-products")
while collection.status != models.CollectionStatus.GREEN:  # GREEN = indexing finished
    time.sleep(1.0)
    collection = client.get_collection("wands-products")
    
collection  # Display the collection info once the index is ready

## Test queries


In [None]:
queries_df = pd.read_csv(
    "shared_data/WANDS/query.csv", 
    sep="\t", 
    index_col="query_id",
)
queries_df.head()

## Ground truth

In [None]:
labels_df = pd.read_csv(
    "shared_data/WANDS/label.csv", 
    sep="\t", 
)
labels_df.sample(n=5)

In [None]:
relevancy_scores = {
    "Exact": 10,
    "Partial": 5,
    "Irrelevant": 0,
}

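# Turn the text labels into graded numeric relevance scores
labels_df["score"] = labels_df["label"].map(relevancy_scores.get)
# Prefix the IDs so queries and documents are unambiguous string keys
labels_df["query_id"] = labels_df["query_id"].map(lambda x: f"query_{x}")
labels_df["product_id"] = labels_df["product_id"].map(lambda x: f"doc_{x}")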
labels_df.sample(n=5)

## ranx

In [None]:
from ranx import Qrels

qrels = Qrels.from_df(
    labels_df.astype({"query_id": "str", "product_id": "str"}),
    q_id_col="query_id",
    doc_id_col="product_id", 
    score_col="score",
)
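
It can help to picture what these objects hold. Both `Qrels` and `Run` are built around the same nested mapping, query ID to document ID to a number: a relevance score for qrels, a retrieval score for a run. The sketch below builds that shape by hand from a few made-up rows; the rows and the name `toy_qrels_dict` are illustrative only.

```python
from collections import defaultdict

# Made-up (query_id, product_id, score) rows in the same shape as labels_df
toy_labels = [
    ("query_0", "doc_1", 10),
    ("query_0", "doc_2", 5),
    ("query_1", "doc_2", 0),
]

toy_qrels_dict = defaultdict(dict)
for query_id, product_id, score in toy_labels:
    toy_qrels_dict[query_id][product_id] = score

# A plain nested dict like this is what Qrels(...) and Run(...) accept directly
print(dict(toy_qrels_dict))
```

The run dictionaries built in the next cells follow the same layout, with similarity scores in place of relevance labels.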

### Running all the queries

In [None]:
queries_df["query_embedding"] = model.encode(
    queries_df["query"].tolist()
).tolist()
queries_df.sample(n=5)

In [None]:
from collections import defaultdict

name_run_dict = defaultdict(dict)
for query_idx, row in queries_df.iterrows():  # Avoid shadowing the built-in id()
    query_id = f"query_{query_idx}"
    
    results = client.search(
        collection_name="wands-products",
        query_vector=models.NamedVector(
            name="product_name", 
            vector=row["query_embedding"]
        ),
        with_vectors=False,
        with_payload=False,
        limit=100,
    )

    for point in results:
        document_id = f"doc_{point.id}"
        name_run_dict[query_id][document_id] = point.score  
    
name_run_dict

In [None]:
from ranx import Run

product_name_run = Run(name_run_dict, name="product_name")

In [None]:
description_run_dict = defaultdict(dict)
for query_idx, row in queries_df.iterrows():  # Avoid shadowing the built-in id()
    query_id = f"query_{query_idx}"
    
    results = client.search(
        collection_name="wands-products",
        query_vector=models.NamedVector(
            name="product_description", 
            vector=row["query_embedding"]
        ),
        with_vectors=False,
        with_payload=False,
        limit=100,
    )

    for point in results:
        document_id = f"doc_{point.id}"
        description_run_dict[query_id][document_id] = point.score 

product_description_run = Run(
    description_run_dict, 
    name="product_description"
)

In [None]:
from ranx import compare

compare(
    qrels=qrels,
    runs=[
        product_name_run, 
        product_description_run
    ],
    metrics=[
        "precision@10", 
        "recall@10", 
        "mrr@10",
        "dcg@10", 
        "ndcg@10",
    ],
)
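
For intuition about the metrics requested above, here is a minimal plain-Python sketch of their textbook definitions for a single query. It is not ranx's implementation, and ranx's exact conventions (for example the relevance threshold or the DCG gain function) may differ; all function names and the toy data are assumptions made for illustration.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Share of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Share of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant result within the top k."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def dcg_at_k(ranked, gains, k):
    """Graded gains discounted by log2(rank + 1)."""
    return sum(
        gains.get(doc, 0) / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
    )


def ndcg_at_k(ranked, gains, k):
    """DCG divided by the DCG of the best possible ordering."""
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    ideal_dcg = sum(
        gain / math.log2(rank + 1) for rank, gain in enumerate(ideal_gains, start=1)
    )
    return dcg_at_k(ranked, gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgements using the Exact=10 / Partial=5 / Irrelevant=0 scale
toy_gains = {"doc_1": 10, "doc_2": 5, "doc_3": 0, "doc_4": 10}
toy_relevant = {doc for doc, gain in toy_gains.items() if gain > 0}
toy_ranked = ["doc_3", "doc_1", "doc_5", "doc_2"]  # doc_5 has no judgement

print(precision_at_k(toy_ranked, toy_relevant, 4))
print(recall_at_k(toy_ranked, toy_relevant, 4))
print(mrr_at_k(toy_ranked, toy_relevant, 4))
print(ndcg_at_k(toy_ranked, toy_gains, 4))
```

Precision, recall, and MRR only ask whether a result is relevant, while DCG and NDCG also reward placing "Exact" matches above "Partial" ones.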