# E-Commerce Search Engine Exercise

> A minimal, reproducible demo that embeds products with a pretrained sentence-transformer model, stores the embeddings in a FAISS index,
> and evaluates retrieval quality with MAP@10 (exact and partial matches).
> The notebook also covers fine-tuning and re-ranking with a cross-encoder.

## Table of Contents
- [1. Imports, Data Loading, preliminary Transformations](#1-imports-data-loading-preliminary-transformations)
- [2. Embedding Model and FAISS Vector Store](#2-embedding-model-and-faiss-vector-store)
- [3. Helper Functions and first Test](#3-helper-functions-and-first-test)
- [4. Compute MAP@10 Score – Only “Exact” matches](#4-compute-map10-score-only-exact-matches)
- [5. Compute MAP@10 Score – “Exact” and “Partial” matches](#5-compute-map10-score-exact-and-partial-matches)
- [6. Summary and Ideas for Improvement](#6-summary-and-ideas-for-improvement)
- [7. Appendix 1: Fine - tuning](#7-appendix-1-fine-tuning)
- [8. Appendix 2: Re-ranking](#8-appendix-2-re-ranking)

## 1. Imports, Data Loading, preliminary Transformations
<a id="1-imports-data-loading-preliminary-transformations"></a>

In [42]:
# --------------------------------------------------------------------------- #
# Imports
# --------------------------------------------------------------------------- #
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
import numpy as np
import faiss
import warnings
warnings.filterwarnings('ignore')

from typing import List, Iterable
import logging

# --------------------------------------------------------------------------- #
# Logging
# --------------------------------------------------------------------------- #
logging.basicConfig(
    level=logging.INFO,
    format="[%(levelname)s] %(asctime)s – %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger(__name__)

# --------------------------------------------------------------------------- #
# Data Loading
# --------------------------------------------------------------------------- #
# !git clone https://github.com/wayfair/WANDS.git
query_df = pd.read_csv("WANDS/dataset/query.csv", sep='\t')
print("Query DF: ",query_df.shape)
# query_df.head()

product_df = pd.read_csv("WANDS/dataset/product.csv", sep='\t')
print("Product DF: ",product_df.shape)
# product_df.head()

# Load the manually labelled ground-truth labels
label_df = pd.read_csv("WANDS/dataset/label.csv", sep='\t')
print("Label DF: ",label_df.shape)
# label_df.head()

# --------------------------------------------------------------------------- #
# Build a single passage for each product
# --------------------------------------------------------------------------- #
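# Create one combined text passage per product
def build_passage(row):
    fields = [row["product_name"], row["product_description"], row["product_features"]]
    # Missing values arrive as NaN, which is truthy, so `value or ""` would
    # turn them into the literal string "nan"; skip them with pd.notna instead.
    parts = [str(f).strip().lower() for f in fields if pd.notna(f)]
    # keep only non-empty parts and join them with single spaces
    return " ".join(p for p in parts if p)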

product_df["passage"] = product_df.apply(build_passage, axis=1)

# Drop products with an empty passage
product_df = product_df[product_df["passage"] != ""]
logger.info(f"Products with non‑empty passages: {len(product_df)}")

Query DF:  (480, 3)
Product DF:  (42994, 9)
Label DF:  (233448, 4)


[INFO] 2025-11-03 13:26:49 – Products with non‑empty passages: 42994
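
The passage builder has to cope with missing text fields, which pandas represents as NaN. The sketch below is a pandas-free illustration of that pitfall: NaN is truthy, so `value or ""` does not replace it. The helper names `clean_field` and `join_fields` are hypothetical and not part of the notebook.

```python
import math

def clean_field(value) -> str:
    """Return a stripped, lower-cased string, or "" for a missing value."""
    # None and float NaN both count as missing. NaN is truthy, so
    # `value or ""` would keep it and str() would produce the text "nan".
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value).strip().lower()

def join_fields(fields) -> str:
    """Join the non-empty cleaned fields with single spaces."""
    cleaned = [clean_field(f) for f in fields]
    return " ".join(c for c in cleaned if c)

# The naive pattern keeps NaN because bool(nan) is True.
naive = str(math.nan or "")

# The explicit check drops both kinds of missing value.
passage = join_fields(["Gail Armchair ", math.nan, None])

print(repr(naive), repr(passage))
```

In the notebook itself, `pd.notna` plays the role of the explicit check.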


## 2. Embedding Model and FAISS Vector Store
<a id="2-embedding-model-and-faiss-vector-store"></a>

In [43]:
# --------------------------------------------------------------------------- #
# Embedding model - base model
# Converts text into numerical vectors that capture semantic meaning 
# --------------------------------------------------------------------------- #
MODEL_NAME = "all-MiniLM-L6-v2"   # compact model with good performance; produces 384-dimensional embeddings
model = SentenceTransformer(MODEL_NAME, device="cpu")

# # --------------------------------------------------------------------------- #
# # Uncomment only if you want to work with the **fine-tuned model**.
# # In that case, comment out the base model above.
# # --------------------------------------------------------------------------- #
# from torch.utils.data import DataLoader
# from sentence_transformers import SentenceTransformer, losses
# MODEL_NAME = "fine_tuned_wands_model3"
# model = SentenceTransformer(MODEL_NAME, device="cpu")

# --------------------------------------------------------------------------- #
# Encode passages
# --------------------------------------------------------------------------- #
product_ids = product_df["product_id"].tolist()
passages    = product_df["passage"].tolist()

embeddings = model.encode(
    passages,
    batch_size=64,
    show_progress_bar=True,
    normalize_embeddings=True   
)

[INFO] 2025-11-03 13:26:49 – Load pretrained SentenceTransformer: all-MiniLM-L6-v2
Batches: 100%|██████████| 672/672 [06:18<00:00,  1.78it/s]


In [44]:
# --------------------------------------------------------------------------- #
# Build FAISS index
# FAISS stores a large collection of precomputed vectors and supports fast similarity queries.
# All vectors must have the same dimensionality; "Flat" means each vector is stored as-is (no partitioning).
# --------------------------------------------------------------------------- #
dim = embeddings.shape[1] # The dimensionality of each vector, created by the embedding model , embeddings.shape[0] would be number of documents
index = faiss.IndexFlatIP(dim)  # Inner product ≈ cosine after normalisation; index will return results ranked by cosine similarity

# Add all vectors – FAISS expects a numpy array of shape (N, dim)
index.add(np.vstack(embeddings).astype("float32"))

# --------------------------------------------------------------------------- #
# Mapping from FAISS id to product_id
# --------------------------------------------------------------------------- #
faiss_id_to_pid = {idx: pid for idx, pid in enumerate(product_ids)}
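
The index above relies on the fact that the inner product of L2-normalised vectors equals their cosine similarity. The following is a minimal pure-Python sketch of that idea together with an exhaustive ("flat") top-k search. It does not use FAISS, and the toy vectors and function names are assumptions made for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k_inner_product(query, vectors, k):
    """Exhaustive search: score every stored vector, return the k best positions."""
    scores = [dot(query, v) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return order[:k]

raw_vectors = [[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]]   # toy "product" vectors
raw_query = [2.0, 1.0]

unit_vectors = [l2_normalize(v) for v in raw_vectors]
unit_query = l2_normalize(raw_query)

# After normalisation the inner-product ranking is the cosine ranking.
ranking = top_k_inner_product(unit_query, unit_vectors, k=2)

# Without normalisation, long vectors can win on magnitude alone.
raw_ranking = top_k_inner_product(raw_query, raw_vectors, k=2)

print(ranking, raw_ranking)
```

The two rankings differ in this toy case, which is why the embeddings are created with `normalize_embeddings=True` before they are added to an inner-product index.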

## 3. Helper Functions and first Test
<a id="3-helper-functions-and-first-test"></a>

In [None]:
# --------------------------------------------------------------------------- #
# Query helper
# Find similar Products: !!!IMPORTANT!!!!
# This function is heavily used later. Similarity based on embeddings.
# --------------------------------------------------------------------------- #
def search_query(query_text: str, k: int = 10):
    """
    Return the top‑k product_ids for a given query string.

    Parameters
    ----------
    query_text : str
        The raw query supplied by the user.
    k : int, optional (default=10)
        How many results to retrieve.

    Returns
    -------
    List[int]
        Ranked list of product_ids.
    """
    query_vec = model.encode([query_text], normalize_embeddings=True)
    distances, indices = index.search(query_vec, k)   # shape (1, k)
    ids = [faiss_id_to_pid[idx] for idx in indices[0]]
    return ids

# --------------------------------------------------------------------------- #
# Quick sanity‑check
# --------------------------------------------------------------------------- #
test_query = "armchair"
top_ids = search_query(test_query,k=5)
logger.info(f"Top {len(top_ids)} results for query: '{test_query}'")
print(f"Top products for '{test_query}':")
for pid in top_ids:
    name = product_df.loc[product_df["product_id"] == pid, "product_name"].values[0]
    logger.info(f"  {pid} – {name}")

Batches: 100%|██████████| 1/1 [00:00<00:00, 150.11it/s]
[INFO] 2025-11-03 13:33:12 – Top 5 results for query: 'armchair'
[INFO] 2025-11-03 13:33:12 –   24318 – gail 29.5 '' wide armchair
[INFO] 2025-11-03 13:33:12 –   32917 – 41 '' wide armchair
[INFO] 2025-11-03 13:33:12 –   28200 – wilmont 20.08 '' wide armchair
[INFO] 2025-11-03 13:33:12 –   22900 – alison 32.3 '' wide armchair
[INFO] 2025-11-03 13:33:12 –   1140 – charnley 47 '' wide chenille armchair


Top products for 'armchair':


-> `search_query` seems to work: the returned results are plausible. See "Ideas for Improvement" for more thoughts on better queries.

## 4. Compute MAP@10 Score – Only “Exact” matches
<a id="4-compute-map10-score-only-exact-matches"></a>

In [46]:
# --------------------------------------------------------------------------- #
# MAP@K calculation
# --------------------------------------------------------------------------- #
def map_at_k(true_ids, predicted_ids, k=10):
    """
    Calculate Average Precision at K (AP@K) for a single query.
    Averaging this value over all queries gives MAP@K.

    Parameters:
    true_ids (list): List of relevant product IDs.
    predicted_ids (list): Ranked list of predicted product IDs.
    k (int): Number of top elements to consider.
    Returns:
    float: AP@K score for this query.
    """
    #if either list is empty, return 0
    if not len(true_ids) or not len(predicted_ids):
        return 0.0

    score = 0.0
    num_hits = 0.0

    for i, p_id in enumerate(predicted_ids[:k]):
        if p_id in true_ids and p_id not in predicted_ids[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)

    return score / min(len(true_ids), k)
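
To make the metric concrete, here is a small worked example. It is a standalone sketch that mirrors the logic of `map_at_k` above under a different, hypothetical name (`average_precision_at_k`); the product IDs are made up.

```python
def average_precision_at_k(relevant, predicted, k=10):
    """AP@k for one query: sum of precision@rank at each hit, normalised."""
    relevant = set(relevant)
    if not relevant or not predicted:
        return 0.0
    hits, score, seen = 0, 0.0, set()
    for rank, pid in enumerate(predicted[:k], start=1):
        if pid in relevant and pid not in seen:
            hits += 1
            score += hits / rank        # precision at this rank
        seen.add(pid)                   # a repeated ID is never counted twice
    return score / min(len(relevant), k)

# Query 1: hits at ranks 1 and 3, three relevant products in total.
# Contributions: 1/1 and 2/3, normalised by min(3, 5) = 3.
ap_1 = average_precision_at_k([101, 102, 103], [101, 999, 102, 998, 997], k=5)

# Query 2: the single relevant product is found at rank 2.
ap_2 = average_precision_at_k([7], [5, 7], k=5)

# MAP@k is the mean of the per-query values.
mean_ap = (ap_1 + ap_2) / 2
print(round(ap_1, 4), ap_2, round(mean_ap, 4))
```

Because the denominator is `min(len(relevant), k)`, a query with many relevant products is penalised for every relevant product that is missing from the top k, not only for poor ordering.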

In [47]:
# --------------------------------------------------------------------------- #
# Exact‑match helper
# --------------------------------------------------------------------------- #
grouped_label_df = label_df.groupby('query_id') #created grouped object for further exact matches calculation on query basis

# grouped_label_df = label_df_eval.groupby('query_id')  # after **fine-tuning**, use this line instead of the one above

def get_exact_matches_for_query(query_id): #implementing a function to retrieve exact match product IDs for a query_id
    query_group = grouped_label_df.get_group(query_id)
    exact_matches = query_group.loc[query_group['label'] == 'Exact']['product_id'].values
    return exact_matches

# --------------------------------------------------------------------------- #
# Add predictions & relevance lists to query_df
# --------------------------------------------------------------------------- #
query_df['top_product_ids'] = query_df['query'].apply(search_query) #applying the function to obtain top product IDs and adding top K product IDs to the dataframe 
query_df['relevant_ids'] = query_df['query_id'].apply(get_exact_matches_for_query) #adding the list of exact match product_IDs from labels_df

# --------------------------------------------------------------------------- #
# Compute MAP@10 per query
# --------------------------------------------------------------------------- #
query_df['map@k'] = query_df.apply(lambda x: map_at_k(x['relevant_ids'], x['top_product_ids'], k=10), axis=1)
# query_df.head()
overall_map10 = query_df["map@k"].mean()   # calculate the MAP across the entire query set
logger.info(f"MAP@10 (exact matches only) = {overall_map10:.4f}")

Batches: 100%|██████████| 1/1 [00:00<00:00, 189.94it/s]

## 5. Compute MAP@10 Score – “Exact” and “Partial” matches
<a id="5-compute-map10-score-exact-and-partial-matches"></a>

In [48]:
# --------------------------------------------------------------------------- #
# Pre‑processing helpers
# --------------------------------------------------------------------------- #

query_df_weighted = query_df.copy()

def get_relevant_weights(query_id: int):
    """
    Return a dictionary {product_id: weight} for all relevant products
    of a given query.  Weight is 1.0 for 'Exact', 0.5 for 'Partial'.
    """
    grp = grouped_label_df.get_group(query_id)
    # keep only Exact/Partial labels
    rel = grp[grp['label'].isin(['Exact', 'Partial'])]
    # convert to dict: product_id -> weight
    return dict(zip(rel['product_id'], rel['label'].map({'Exact': 1.0,
                                                         'Partial': 0.5})))

# --------------------------------------------------------------------------- #
# Weighted MAP@K
# --------------------------------------------------------------------------- #

def weighted_map_at_k(relevant_weights, predicted_ids, k=10):
    """
    Compute the weighted average precision at k for a single query.

    Parameters
    ----------
    relevant_weights : dict
        product_id → relevance weight (1.0 for Exact, 0.5 for Partial)
    predicted_ids   : list[int]
        list of product ids in ranked order
    k               : int
        number of top results to consider (default 10)

    Returns
    -------
    float
        weighted MAP@k for this single query
    """
    if not relevant_weights or not predicted_ids:
        return 0.0

    score = 0.0
    hit_count = 0.0          # cumulative weighted hits up to i
    considered = set()       # to avoid counting the same product twice

    for i, pid in enumerate(predicted_ids[:k], start=1):
        if pid in relevant_weights and pid not in considered:
            weight = relevant_weights[pid]
            hit_count += weight
            # precision up to position i uses *cumulative weighted* hits
            precision_at_i = hit_count / i
            score += precision_at_i * weight   # weight appears twice:
            # 1. as part of precision_at_i
            # 2. as a multiplicative factor to make exact 1.0, partial 0.5
            considered.add(pid)

    # Normalise by the total available relevance weight for this query
    # (min(#relevant_items, k) would be the unweighted version).
    max_relevant_weight = min(sum(relevant_weights.values()), k)  # 1.0×#Exact + 0.5×#Partial
    if max_relevant_weight == 0:
        return 0.0
    return score / max_relevant_weight
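
The effect of the partial-match weight is easiest to see on a toy query. The sketch below re-implements the same weighting scheme under a hypothetical name (`weighted_ap_at_k`) and evaluates one made-up ranking for several partial weights; the labels and IDs are illustrative assumptions, not WANDS data.

```python
def weighted_ap_at_k(weights, predicted, k=10):
    """Weighted AP@k: cumulative weighted precision times the hit's own weight."""
    if not weights or not predicted:
        return 0.0
    score, cumulative, seen = 0.0, 0.0, set()
    for rank, pid in enumerate(predicted[:k], start=1):
        if pid in weights and pid not in seen:
            w = weights[pid]
            cumulative += w
            score += (cumulative / rank) * w
            seen.add(pid)
    denom = min(sum(weights.values()), k)
    return score / denom if denom else 0.0

labels = {1: "Exact", 2: "Partial", 3: "Exact"}      # toy ground truth

def weights_for(partial_weight):
    return {pid: (1.0 if lbl == "Exact" else partial_weight)
            for pid, lbl in labels.items()}

predicted = [1, 2, 9]                                 # toy ranking

# Partial weight 0.5: (1.0/1)*1.0 + (1.5/2)*0.5 = 1.375, divided by 2.5.
score_half = weighted_ap_at_k(weights_for(0.5), predicted)

# The same ranking scored under different partial weights.
sweep = {w: weighted_ap_at_k(weights_for(w), predicted) for w in (0.1, 0.5, 0.9)}

# Even an ideal ordering (Exact, Exact, Partial) stays below 1.0 here.
ideal = weighted_ap_at_k(weights_for(0.5), [1, 3, 2])

print(score_half, sweep, round(ideal, 4))
```

In this toy case the score grows with the partial weight, and the ideal ordering does not reach 1.0, so weighted scores computed with different weights are not directly comparable.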

In [None]:
# --------------------------------------------------------------------------- #
# Build mapping: query_id → {product_id: weight}
# --------------------------------------------------------------------------- #
query_to_weights = {qid: get_relevant_weights(qid) for qid in label_df['query_id'].unique()}
# query_to_weights = {qid: get_relevant_weights(qid) for qid in label_df_eval['query_id'].unique()}  # after **fine-tuning**, use this line instead of the one above

# --------------------------------------------------------------------------- #
# Evaluate each query
# --------------------------------------------------------------------------- #
def eval_query(row):
    qid          = row['query_id']
    pred_ids     = row['top_product_ids']           # ranked list: !!!IMPORTANT!!! this was generated using search_query, based on embeddings and vector store!
    rel_weights  = query_to_weights.get(qid, {})
    return weighted_map_at_k(rel_weights, pred_ids, k=10)

query_df_weighted['map@k_weighted'] = query_df_weighted.apply(eval_query, axis=1)

# --------------------------------------------------------------------------- #
# Final mean over all queries
# --------------------------------------------------------------------------- #
mean_weighted_map10 = query_df_weighted['map@k_weighted'].mean()
logger.info(f"Weighted MAP@10 = {mean_weighted_map10:.4f}")
# query_df_weighted.head()  

Weighted MAP@10 : 0.4235


## 6. Summary and Ideas for Improvement
<a id="6-summary-and-ideas-for-improvement"></a>

Summary:
<ul type="disc">
  <li> Changes for improvement of MAP score:</li>
</ul>
<p style="margin-left: 2rem;">- Using an embedding model and a vector store improves performance (MAP score: 0.34) over TF-IDF because it captures the semantic meaning of the text instead of relying only on word frequencies.</p>
<p style="margin-left: 2rem;">- Including partial matches increases the MAP score to 0.40, but the partial-match weight of 0.5 is an arbitrary choice applied to every query. An LLM could be used to estimate a more precise value (or results could be re-ranked with a cross-encoder, see the discussion below). The MAP score varies with the partial-match weight as expected: weight 0.1 -> MAP score 0.286, weight 0.9 -> MAP score 0.63.</p>
<p style="margin-left: 2rem;">- Summarizing query and query_class together with all product features leads to worse MAP scores (Exact: 0.31, Exact+Partial: 0.38); summarizing only the product features gives MAP scores similar to the base version (Exact: 0.35, Exact+Partial: 0.39).</p>

<ul type="disc">
  <li> Ideas for improvement:</li>
</ul>
<p style="margin-left: 2rem;">- Allow queries that are whole sentences, which gives more context to match against product features and descriptions, e.g. "I would like to find an armchair that is cushioned well and warm for the winter".</p>
<p style="margin-left: 2rem;">- Fine-tune the embedding model, as it was trained on a generic web corpus and not on domain-specific vocabulary such as product names, features, or query style.</p>
<p style="margin-left: 2rem;">- Re-rank the results with a cross-encoder. Cosine-similarity ranking only considers vector distance, whereas a cross-encoder scores each query-passage pair with full attention over both texts, capturing subtle matching cues.
A lightweight way to do this is to take the top-N candidates from FAISS (e.g., N=50–200) and re-score them with a cross-encoder that looks at the joint query-passage representation.</p>


Remark:
- This notebook was built with the help of code assistants and the provided notebook code; the overall code logic, structure, and code customizations were created by the owner.

## 7. Appendix 1: Fine - tuning
<a id="7-appendix-1-fine-tuning"></a>

In the following, samples are taken from the original dataset to fine-tune the same embedding model as above. This makes the embeddings aware of the WANDS dataset context. The sampled pairs are then removed from the original dataset, after which the code above (everything after imports and data loading) can be re-run.

In [None]:
# --------------------------------------------------------------------------- #
# Imports
# --------------------------------------------------------------------------- #
import random
from sentence_transformers import InputExample

# --------------------------------------------------------------------------- #
# Feature-Value Pairs (Training Samples)
# --------------------------------------------------------------------------- #
train_pairs = []
for _, row in label_df.iterrows():
    label_value = 1.0 if row['label'] == 'Exact' else 0.5 if row['label'] == 'Partial' else 0.0
    # Only the IDs are stored here; the texts are looked up later for the sampled pairs
    train_pairs.append((row['query_id'], row['product_id'], label_value))

N_SUBSAMPLE = 5000  # Only a small subsample is used, to limit runtime and laptop heat-up
if len(train_pairs) > N_SUBSAMPLE:
    sampled_train_pairs = random.sample(train_pairs, N_SUBSAMPLE)
else:
    sampled_train_pairs = train_pairs

print(f"Using {len(sampled_train_pairs)} pairs for fine‑tuning.")

train_examples = [
    InputExample(
        texts=[
            query_df.loc[query_df['query_id'] == qid, 'query'].values[0],
            product_df.loc[product_df['product_id'] == pid, 'passage'].values[0],
        ],
        label=lbl
    )
    for qid, pid, lbl in sampled_train_pairs
]

# Identify which query_id–product_id pairs were used for training
train_query_ids  = [q for q, p, _ in sampled_train_pairs]
train_product_ids = [p for q, p, _ in sampled_train_pairs]

# Remove those pairs from label_df (build the lookup set once, not once per row)
train_pair_set = {(q, p) for q, p, _ in sampled_train_pairs}
mask = ~label_df.apply(lambda r: (r['query_id'], r['product_id']) in train_pair_set, axis=1)
label_df_eval = label_df[mask].reset_index(drop=True)

print(f"Evaluation label_df size reduced from {len(label_df)} to {len(label_df_eval)}")

Using 5000 pairs for fine‑tuning.
Evaluation label_df size reduced from 233448 to 228378
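
The hold-out step can be sketched without pandas as follows. This is an illustrative assumption of how a pair-level split behaves, with a hypothetical helper `split_pairs` and made-up rows. It also shows one possible reason why the evaluation set can shrink by more rows than were sampled: if the same (query_id, product_id) pair occurs more than once, every copy is removed.

```python
import random

def split_pairs(rows, n_train, seed=0):
    """Sample training rows, then drop every row sharing a sampled (query, product) key."""
    rng = random.Random(seed)                       # seeded for reproducibility
    train = rng.sample(rows, min(n_train, len(rows)))
    train_keys = {(q, p) for q, p, _ in train}      # built once, O(1) lookups
    eval_rows = [r for r in rows if (r[0], r[1]) not in train_keys]
    return train, eval_rows

rows = [
    (1, 10, "Exact"),
    (1, 11, "Partial"),
    (2, 10, "Irrelevant"),
    (2, 12, "Exact"),
    (1, 10, "Exact"),      # repeated pair: removed together with its twin if sampled
]

train, eval_rows = split_pairs(rows, n_train=2, seed=0)
train_keys = {(q, p) for q, p, _ in train}
eval_keys = {(q, p) for q, p, _ in eval_rows}
print(len(train), len(eval_rows), train_keys & eval_keys)
```

Fixing the seed also makes the sampled training set, and therefore the reported scores, repeatable across runs.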


`label_df_eval`: this DataFrame then has to be substituted at two locations above (use CTRL+F to find them).

In [None]:
# --------------------------------------------------------------------------- #
# More imports and instantiation of the model for fine-tuning
# --------------------------------------------------------------------------- #
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
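# Note: MultipleNegativesRankingLoss does not use the `label` values; it treats every
# (query, passage) pair as a positive, including pairs labelled 0.5 or 0.0.
train_loss = losses.MultipleNegativesRankingLoss(model)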

In [None]:
# Fit the model; with 5000 samples this takes about 2 minutes (the laptop heats up after ca. 5-6 min)
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=100,
    show_progress_bar=True
)

100%|██████████| 314/314 [02:04<00:00,  2.52it/s]

{'train_runtime': 124.4874, 'train_samples_per_second': 80.329, 'train_steps_per_second': 2.522, 'train_loss': 1.4076224163079718, 'epoch': 2.0}





In [None]:
# Save the model under a new name in the same folder. This name is referenced in Section 2 when the fine-tuned model is used.
model.save("fine_tuned_wands_model3")

                                                                     

Result: 
- Exact MAP@10: 0.34 (same)
- Exact+Partial MAP@10: 0.4235 (increase)

The increase is plausible because the model was fine-tuned on Exact+Partial pairs. With only 5000 samples, an improvement of 5.8% is achieved.

## 8. Appendix 2: Re-ranking
<a id="8-appendix-2-re-ranking"></a>

Re-ranking with a CrossEncoder is a common second stage in retrieval pipelines and often improves ranking quality.

In [27]:
from sentence_transformers import CrossEncoder
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", device="cpu")

In [31]:
def rerank_query(query_text, top_k=10, n_candidates=50):
    # 1. Fast FAISS pass (top-n_candidates)
    query_vec = model.encode([query_text], normalize_embeddings=True)
    _, indices = index.search(query_vec, n_candidates)
    candidate_ids = [faiss_id_to_pid[idx] for idx in indices[0]]
    candidate_passages = [
        product_df.loc[product_df['product_id'] == pid, 'passage'].values[0]
        for pid in candidate_ids
    ]
    # 2. Cross‑encoder scoring
    pairs = [[query_text, passage] for passage in candidate_passages]
    scores = cross.predict(pairs) # higher = better match
    # 3. Sort and keep top_k
    ranked = sorted(zip(candidate_ids, scores), key=lambda x: x[1], reverse=True)
    return [pid for pid, _ in ranked[:top_k]]
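
The two-stage pattern can be illustrated without any models. In the sketch below, `word_overlap` stands in for the fast first stage and `phrase_match` for the slower, more precise second stage; both scorers, the helper `retrieve_then_rerank`, and the toy corpus are hypothetical and only serve to show how the shortlist size limits what re-ranking can recover.

```python
def retrieve_then_rerank(query, corpus, cheap_score, precise_score,
                         n_candidates, top_k):
    # Stage 1: rank the whole corpus with the cheap scorer, keep a shortlist.
    shortlist = sorted(corpus, key=lambda d: cheap_score(query, d),
                       reverse=True)[:n_candidates]
    # Stage 2: re-score only the shortlist with the expensive scorer.
    reranked = sorted(shortlist, key=lambda d: precise_score(query, d),
                      reverse=True)
    return reranked[:top_k]

def word_overlap(query, doc):
    """Stand-in for embedding similarity: number of shared words."""
    return len(set(query.split()) & set(doc.split()))

def phrase_match(query, doc):
    """Stand-in for a cross-encoder: also rewards the exact phrase."""
    return (2 if query in doc else 0) + word_overlap(query, doc)

corpus = [
    "chair arm rest cover",
    "blue velvet arm chair",
    "arm chair with ottoman",
    "dining table",
]
query = "arm chair"

# A shortlist of 2 never sees the third document, so it cannot be promoted.
narrow = retrieve_then_rerank(query, corpus, word_overlap, phrase_match,
                              n_candidates=2, top_k=2)

# A shortlist of 3 lets the second stage surface both phrase matches.
wide = retrieve_then_rerank(query, corpus, word_overlap, phrase_match,
                            n_candidates=3, top_k=2)
print(narrow)
print(wide)
```

The same trade-off applies to `n_candidates` in `rerank_query`: a larger shortlist gives the cross-encoder more to work with, at the cost of more pair scorings per query.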

In [None]:
# A random sub-sample is used for demonstration purposes only; this works with the full query_df as well.
N_SAMPLES = 200
sampled_idx = random.sample(list(query_df.index), N_SAMPLES)
sampled_df = query_df.loc[sampled_idx].copy()

tqdm.pandas(desc="Reranking queries")

# Only 25 candidates are chosen for re-ranking: the 25 FAISS matches are re-scored and the best 10 are kept for the MAP@10 score.
sampled_df['top_product_ids_cr'] = sampled_df['query'].progress_apply(
    lambda q: rerank_query(q, top_k=10, n_candidates=25))

sampled_df['map@k_cr'] = sampled_df.apply(
    lambda row: map_at_k(row['relevant_ids'], row['top_product_ids_cr'], k=10),
    axis=1)
print(f"MAP@10 after re‑rank (sample of {N_SAMPLES} queries): {sampled_df['map@k_cr'].mean():.4f}")


Reranking queries: 100%|██████████| 200/200 [03:18<00:00,  1.01it/s]

MAP@10 after re‑rank (sample of 200 queries): 0.3999





Result:
- Re-ranking was only applied to the Exact MAP@10 score due to time constraints.
- MAP@10 improved from 0.34 (all 480 queries) to 0.40 (random sample of 200 queries), so the comparison is indicative rather than exact.