# 3) Baseline Search & Metrics

In this notebook we create a query set and run a `multi_match` query of type `best_fields` with a reasonable set of search fields and weights.

In addition to the ESCI product data, we use the ESCI judgments, which we map to numeric values.

For each query we calculate search metrics (MAP and NDCG) and then average them over all queries.

This gives us a baseline that we can use as our foundation when exploring different hybrid search configurations in the next notebook.
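
One of the metrics computed later in this notebook is NDCG. As a rough illustration of what that number measures, here is a minimal sketch in plain Python. It assumes linear gains and a log2 rank discount, with the ideal ranking built from all judged documents of a query; the function names `dcg` and `ndcg` are illustrative, and the exact values reported by `pytrec_eval` may differ in details such as tie handling.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranked_gains, judged_gains):
    """DCG of the returned ranking divided by the DCG of the ideal ranking.

    ranked_gains: judgment of each returned document, in result order
                  (0 for documents without a judgment)
    judged_gains: all judgments available for the query
    """
    ideal = dcg(sorted(judged_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

# A perfect ranking versus one where the second-best document slips to rank 3
perfect = ndcg([3, 2, 0], [3, 2, 0])
swapped = ndcg([3, 0, 2], [3, 2, 0])
print(perfect, swapped)
```

A ranking that places the most relevant documents first scores close to 1, and every relevant document that slips down the list lowers the score.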

In [62]:
import pandas as pd
import numpy as np
import mercury as mr
import requests
import json
import pytrec_eval
from opensearchpy import OpenSearch

In [63]:
DATA_DIR = '/Users/danielwrigley/work/Testing/git_repos/esci-data/shopping_queries_dataset/'

In [64]:
df_examples = pd.read_parquet(DATA_DIR + 'shopping_queries_dataset_examples.parquet')

## Query set
A query set has the columns:

* `query_set_id`
* `query`

The query set currently has no associated date, and queries are sampled uniformly at random rather than in proportion to their frequency.
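
If a frequency count per query were available, frequency-proportional sampling could look roughly like the sketch below. The `query_counts` mapping and the function name `sample_by_frequency` are hypothetical, and the counts are invented for illustration. The function draws one query at a time with probability proportional to its count and removes it from the pool, so no query is picked twice.

```python
import random

def sample_by_frequency(query_counts, n, seed=10):
    """Sample up to n distinct queries, weighted by their (positive) counts."""
    rng = random.Random(seed)
    remaining = dict(query_counts)
    sampled = []
    for _ in range(min(n, len(remaining))):
        queries = list(remaining)
        weights = [remaining[q] for q in queries]
        # Draw one query with probability proportional to its count
        pick = rng.choices(queries, weights=weights, k=1)[0]
        sampled.append(pick)
        # Remove it so the next draw cannot return it again
        del remaining[pick]
    return sampled

# Hypothetical counts, invented for illustration only
query_counts = {"usb camera": 120, "reusable produce bags": 45, "runtz": 3, "uniball vision elite": 12}
top_queries = sample_by_frequency(query_counts, n=2)
print(top_queries)
```

With a fixed seed the sample is reproducible, in the same spirit as the `np.random.seed(10)` call used below.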

In [65]:
# We only use English queries for now
df_queries_us = df_examples[df_examples['product_locale'] == 'us']

In [66]:
np.random.seed(10)

In [67]:
# Sample query sets

query_sets = [("sampled_queries", 200), ("top_queries", 20)]

res = []

for query_set_id, n_query_set in query_sets:
    # todo: sampling proportional to frequency
    query_set = np.random.choice(df_queries_us["query"].unique(), n_query_set, replace=False)

    df = pd.DataFrame({"query": query_set})
    df["query_set_id"] = query_set_id
    res.append(df)
df_query_set = pd.concat(res)
df_query_set.head(10)

Unnamed: 0,query,query_set_id
0,runtz,sampled_queries
1,trooper bandana shoe,sampled_queries
2,tcl a1x phone case straight talk,sampled_queries
3,bose headphones replacement cord,sampled_queries
4,uniball vision elite,sampled_queries
5,definitely not paid enough for this,sampled_queries
6,raid deep reach fogger,sampled_queries
7,usb camera,sampled_queries
8,reusable produce bags,sampled_queries
9,latex dental dam,sampled_queries


## Judgments
The judgments dataset has one row per query and document. The full format has the following columns, of which the code below only populates `query`, `document` and `judgment`:

* datetime: date of the query/document instance
* query_id: identifier of the query instance
* query: the query
* document: identifier of a document result
* judgment: the numeric relevance score. We use the proposed ESCI mapping for DCG: the labels are first indexed as `{"E": 0, "S": 1, "C": 2, "I": 3}` and the index is then looked up in the score list `[3, 2, 1, 0]`, so `E` (Exact) gets the highest score 3, `S` (Substitute) 2, `C` (Complement) 1 and `I` (Irrelevant) 0
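
To make the label mapping easier to follow, here is a small sketch that writes the same mapping as a single dictionary and checks it against the two-step lookup used in the notebook. The names `ESCI_GAIN` and `label_to_gain` are illustrative and not part of the notebook.

```python
# Two-step lookup as used in the notebook
label_num = {"E": 0, "S": 1, "C": 2, "I": 3}
label_score = [3, 2, 1, 0]

# Equivalent direct mapping (illustrative name): higher means more relevant
ESCI_GAIN = {"E": 3, "S": 2, "C": 1, "I": 0}

def label_to_gain(label):
    """Return the numeric judgment for an ESCI label."""
    return ESCI_GAIN[label]

labels = ["E", "I", "C", "S"]
gains = [label_to_gain(label) for label in labels]
print(gains)  # [3, 0, 1, 2]
```

Both variants give the same scores; the direct dictionary just makes it visible at a glance that an exact match is rated highest.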

In [68]:
# Select the judgments for the sampled queries and map each esci_label to a score.
# Not implemented yet: judgments per day in a date range, and noise in the score.

label_num = {"E": 0, "S": 1, "C": 2, "I": 3}  # ESCI label -> index into label_score
label_score = [3, 2, 1, 0]                     # score per index: E=3, S=2, C=1, I=0
label_p_noise = 0.1                            # reserved for score noise, unused in this notebook

def label_to_score(label):
    return label_score[label_num[label]]

df_judge = df_examples[df_examples["query"].isin(set(df_query_set["query"].values))].copy()
df_judge["judgment"] = df_judge.esci_label.apply(label_to_score)
df_judge["document"] = df_judge.product_id
df_judge = df_judge[["query", "document", "judgment"]].reset_index(drop=True)
df_judge.head(20)

Unnamed: 0,query,document,judgment
0,$30 roblox gift card not digital,B07RX6FBFR,3
1,$30 roblox gift card not digital,B09194H44R,0
2,$30 roblox gift card not digital,B08R5N6W6B,2
3,$30 roblox gift card not digital,B07Y693ND1,0
4,$30 roblox gift card not digital,B07RZ75JW3,2
5,$30 roblox gift card not digital,B07RZ74VLR,2
6,$30 roblox gift card not digital,B07M9XQ9YB,0
7,$30 roblox gift card not digital,B078RJ1KZ6,0
8,$30 roblox gift card not digital,B01N6RK9UE,0
9,$30 roblox gift card not digital,B016Y2BVKA,3


## Transform the judgments to the qrels format that `trec_eval` can work with

### Group by query and export the queries together with their index, which serves as the query id

In [69]:
df_queries = df_judge.groupby(by='query', as_index=False).agg({
    'judgment': ['count']
})
df_query_idx = df_queries['query']
name = 'queries.txt'

df_query_idx.to_csv(name, sep="\t", header=False)

### Add the query ids to the original ratings

In [70]:
df_query_idx = pd.DataFrame(df_query_idx)

In [71]:
df_query_idx = df_query_idx.reset_index().rename(columns={'index': 'idx'})

df_merged = pd.merge(df_judge, df_query_idx, on='query', how='left')
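# rename() returns a new DataFrame, so adding columns to df_ratings later does not trigger a SettingWithCopyWarning
df_ratings = df_merged[['idx', 'document', 'judgment']].rename(columns={'document': 'docid', 'judgment': 'rating'})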

In [72]:
name = 'ratings.qrels'

df_ratings.to_csv(name, sep="\t", header=False, index=False)
df_ratings

Unnamed: 0,idx,docid,rating
0,0,B07RX6FBFR,3
1,0,B09194H44R,0
2,0,B08R5N6W6B,2
3,0,B07Y693ND1,0
4,0,B07RZ75JW3,2
...,...,...,...
4060,219,B00JX10Q2O,2
4061,219,B00KY41UHO,3
4062,219,B00QXJOUL2,3
4063,219,B00UY14WCM,3
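
The file written above has three tab-separated columns. If it is meant to be read by the `trec_eval` command-line tool rather than by `pytrec_eval` in memory, note that the conventional qrels layout has four whitespace-separated columns: query id, an iteration field that is usually `0`, document id, and relevance. A minimal sketch of that layout, with `to_qrels_lines` as an illustrative helper name:

```python
def to_qrels_lines(ratings):
    """Format (query_id, doc_id, rating) tuples as four-column qrels lines."""
    # The second column is the iteration field, conventionally set to 0
    return [f"{query_id} 0 {doc_id} {rating}" for query_id, doc_id, rating in ratings]

# First rows of the ratings table shown above
ratings = [(0, "B07RX6FBFR", 3), (0, "B09194H44R", 0), (0, "B08R5N6W6B", 2)]
lines = to_qrels_lines(ratings)
print("\n".join(lines))
```

The later cells pass the judgments to `pytrec_eval` as a dictionary, so this only matters when the exported file is used outside the notebook.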


## Query OpenSearch with the Baseline Configuration

We use a simple `multi_match` query with a couple of fields and field weights.

This will serve as our baseline. We get the first 10 results for each query we have in `df_judge` and store the results in a format that `trec_eval` can work with later when evaluating the results.

This quantifies the quality of the baseline configuration, against which we will compare our hybrid search configurations.
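
The query construction and result collection described above can be sketched as two small helper functions. This is an illustration under assumptions, not the notebook's implementation: `build_baseline_query` and `hits_to_rows` are hypothetical names, the `size` parameter is made explicit here while the notebook relies on the engine default, and the response in the example is a hand-written stand-in rather than real search output.

```python
BASELINE_FIELDS = [
    "product_id^100",
    "product_bullet_point^3",
    "product_color^2",
    "product_brand^5",
    "product_description",
    "product_title^10",
]

def build_baseline_query(query_string, size=10):
    """Return the multi_match request body for one query string."""
    return {
        "size": size,
        "_source": {"excludes": ["title_embedding"]},
        "query": {
            "multi_match": {
                "type": "best_fields",
                "fields": BASELINE_FIELDS,
                "operator": "and",
                "query": query_string,
            }
        },
    }

def hits_to_rows(query_id, query_string, response, run_name="default"):
    """Turn one search response into one row per hit, with its position."""
    return [
        {
            "query_id": str(query_id),
            "query_string": query_string,
            "product_id": hit["_id"],
            "position": position,
            "relevance": hit["_score"],
            "run": run_name,
        }
        for position, hit in enumerate(response["hits"]["hits"])
    ]

# Hand-written stand-in for a search response (not real engine output)
fake_response = {"hits": {"hits": [{"_id": "doc-a", "_score": 2.5}, {"_id": "doc-b", "_score": 1.0}]}}
body = build_baseline_query("usb camera")
rows = hits_to_rows(1, "usb camera", fake_response)
print(rows[0])
```

With the `opensearch-py` client imported at the top of the notebook, the response could come from `client.search(index="ecommerce", body=build_baseline_query(q))`. Collecting all rows in one list and calling `pd.DataFrame(all_rows)` once after the loop also avoids a `pd.concat` per hit.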

In [73]:
url = "http://localhost:9200/ecommerce/_search"

headers = {
    'Content-Type': 'application/json'
}

df_relevance = pd.DataFrame()

for query in df_query_idx.itertuples():

    payload = {
      "_source": {
        "excludes": [
          "title_embedding"
        ]
      },
      "query": {
        "multi_match": {
          "type": "best_fields",
          "fields": [
            "product_id^100",
            "product_bullet_point^3",
            "product_color^2",
            "product_brand^5",
            "product_description",
            "product_title^10"
          ],
          "operator": "and",
          "query": query[2]
        }
      }
    }

    response = requests.request("POST", url, headers=headers, data=json.dumps(payload)).json()
    
    position = 0
    for hit in response['hits']['hits']:
        # create a new row for the DataFrame and append it
        row = { 'query_id' : str(query[1]), 'query_string': query[2], 'product_id' : hit["_id"], 'position' : str(position), 'relevance' : hit["_score"], 'run': 'default' }
    
        new_row_df = pd.DataFrame([row])
        df_relevance = pd.concat([df_relevance, new_row_df], ignore_index=True)
        position += 1
    
# work with two for loops:
# 1) one to iterate over the list of queries and have a query id instead of a query
# 2) another one to iterate over the result sets to have the position of the result in the result set 

# DataFrame with columns:
# query_id: the id of the query as the trec_eval tool needs a numeric id rather than a query string as an identifier
# product_id: the id of the product in the hit list
# position: the position of the product in the result set
# relevance: relevance as given by the search engine
# run: the name of the query run

In [74]:
df_relevance

Unnamed: 0,query_id,query_string,product_id,position,relevance,run
0,1,(fiction without frontiers),1787581780,0,309.121900,default
1,1,(fiction without frontiers),B082VGLV18,1,309.121900,default
2,1,(fiction without frontiers),B07GJVWWWR,2,309.121900,default
3,1,(fiction without frontiers),B08C5MQFCY,3,309.121900,default
4,1,(fiction without frontiers),1787583325,4,298.643860,default
...,...,...,...,...,...,...
1614,219,yarn purple and pink,B079J4V4G8,5,49.612446,default
1615,219,yarn purple and pink,B07HG28KLX,6,49.244354,default
1616,219,yarn purple and pink,B08B9K99R9,7,44.207966,default
1617,219,yarn purple and pink,B01ILAACNU,8,43.978270,default


In [75]:
name = 'baseline_results'
df_relevance.to_csv(name, sep="\t", header=False, index=False)

## Transform data to meet the `pytrec_eval` requirements

### Convert string ids to integer values

In [76]:
unique_ids = pd.Series(pd.concat([df_relevance['product_id'], df_ratings['docid']]).unique())

# Create a mapping of each unique identifier to an integer
id_to_int = {id_val: idx for idx, id_val in enumerate(unique_ids, start=1)}

# Map the identifiers in both DataFrames
df_relevance['product_id_int'] = df_relevance['product_id'].map(id_to_int)
df_ratings['docid_int'] = df_ratings['docid'].map(id_to_int)

In [77]:
df_relevance.head(3)

Unnamed: 0,query_id,query_string,product_id,position,relevance,run,product_id_int
0,1,(fiction without frontiers),1787581780,0,309.1219,default,1
1,1,(fiction without frontiers),B082VGLV18,1,309.1219,default,2
2,1,(fiction without frontiers),B07GJVWWWR,2,309.1219,default,3


In [78]:
df_ratings

Unnamed: 0,idx,docid,rating,docid_int
0,0,B07RX6FBFR,3,1613
1,0,B09194H44R,0,1614
2,0,B08R5N6W6B,2,1615
3,0,B07Y693ND1,0,1616
4,0,B07RZ75JW3,2,1617
...,...,...,...,...
4060,219,B00JX10Q2O,2,5022
4061,219,B00KY41UHO,3,5023
4062,219,B00QXJOUL2,3,5024
4063,219,B00UY14WCM,3,5025


In [79]:
# Drop the docid column as it is not needed
df_pytrec_qrels = df_ratings.drop(columns=['docid'])

df_pytrec_qrels['docid_int'] = df_pytrec_qrels['docid_int'].astype(str)

# Initialize an empty dictionary to store the final qrel structure
qrel = {}

# Group by 'idx'
for idx, group in df_pytrec_qrels.groupby('idx'):
    # Create a dictionary for each group where 'docid_int' is the key and 'rating' is the value
    qrel[str(idx)] = dict(zip(group['docid_int'], group['rating']))

In [80]:
df_pytrec_results = df_relevance.drop(columns=['position', 'run', 'product_id'])

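# Casting to int truncates the scores, so hits with close scores (e.g. 49.61 and 49.24) end up tied
df_pytrec_results['relevance'] = df_pytrec_results['relevance'].astype(int)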
df_pytrec_results['product_id_int'] = df_pytrec_results['product_id_int'].astype(str)

# Initialize an empty dictionary to store the final 'run' structure
run = {}

# Group by 'query_id' 
for query_id, group in df_pytrec_results.groupby('query_id'):
    # Create a dictionary for each group where 'product_id_int' is the key and 'relevance' is the value
    run[query_id] = dict(zip(group['product_id_int'], group['relevance']))
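
Both `qrel` and `run` end up as nested dictionaries: query id (string) → document id (string) → value, where the value is an integer judgment in `qrel` and a score in `run`. The sketch below builds the same shape from plain tuples to make the structure explicit. The helper name `to_nested` is illustrative; the example values are taken from the tables shown above.

```python
def to_nested(rows):
    """Group (query_id, doc_id, value) tuples into {query_id: {doc_id: value}}."""
    nested = {}
    for query_id, doc_id, value in rows:
        # Both levels use string keys
        nested.setdefault(str(query_id), {})[str(doc_id)] = value
    return nested

# (idx, docid_int, rating) rows from df_ratings
qrel_example = to_nested([(0, 1613, 3), (0, 1614, 0)])
# (query_id, product_id_int, relevance) rows from df_relevance
run_example = to_nested([(1, 1, 309.1219), (1, 2, 309.1219)])

print(qrel_example)  # {'0': {'1613': 3, '1614': 0}}
print(run_example)   # {'1': {'1': 309.1219, '2': 309.1219}}
```

Only query ids that appear in both dictionaries can receive metric values, which is why the keys of `qrel` and `run` need to use the same id scheme.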

In [81]:
evaluator = pytrec_eval.RelevanceEvaluator(
    qrel, {'map', 'ndcg'})

data = evaluator.evaluate(run)

print(json.dumps(data, indent=1))

{
 "1": {
  "map": 0.14285714285714285,
  "ndcg": 0.2998871210371002
 },
 "10": {
  "map": 0.7219907407407408,
  "ndcg": 0.8479117780988195
 },
 "100": {
  "map": 0.4,
  "ndcg": 0.5577409001984587
 },
 "101": {
  "map": 0.625,
  "ndcg": 0.7441140557478908
 },
 "102": {
  "map": 0.18181818181818182,
  "ndcg": 0.2275100331754748
 },
 "103": {
  "map": 0.3279761904761905,
  "ndcg": 0.5427847748201621
 },
 "104": {
  "map": 0.022222222222222223,
  "ndcg": 0.11033110561045088
 },
 "105": {
  "map": 0.1,
  "ndcg": 0.29740901205864334
 },
 "106": {
  "map": 0.0,
  "ndcg": 0.0
 },
 "107": {
  "map": 0.0125,
  "ndcg": 0.06812988141911934
 },
 "109": {
  "map": 0.06212797619047619,
  "ndcg": 0.18379934077708113
 },
 "112": {
  "map": 0.0,
  "ndcg": 0.0
 },
 "113": {
  "map": 0.5414930555555556,
  "ndcg": 0.6627929946637657
 },
 "114": {
  "map": 0.04513888888888889,
  "ndcg": 0.15263019522688698
 },
 "115": {
  "map": 0.046875,
  "ndcg": 0.15499424445272175
 },
 "116": {
  "map": 0.2581845238095

In [82]:
ndcg_sum = 0
map_sum = 0
num_queries = len(data)

# Iterate over the dictionary and sum the 'ndcg' and 'map' values
for query, metrics in data.items():
    ndcg_sum += metrics['ndcg']
    map_sum += metrics['map']

# Calculate the averages
average_ndcg = ndcg_sum / num_queries
average_map = map_sum / num_queries

# Print the results
print("Baseline metrics")
print(f"Average NDCG: {average_ndcg}")
print(f"Average MAP: {average_map}")

Baseline metrics
Average NDCG: 0.29311923637469645
Average MAP: 0.18515902813023388
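
The averages above are taken over the queries present in `data`. Some query ids (for example 108, 110 and 111) do not appear in the per-query output, presumably because those queries returned no hits, so they do not count towards the average. The following sketch shows one way to compare both views: averaging only over evaluated queries, or over all judged queries with missing ones counted as zero. The function name `mean_metric` and the toy values are assumptions made for illustration.

```python
def mean_metric(per_query, metric, all_query_ids=None):
    """Average one metric over queries.

    per_query: {query_id: {metric_name: value}} as returned by the evaluator
    all_query_ids: optional list of every judged query id; ids missing from
                   per_query contribute 0.0 to the average
    """
    ids = list(per_query) if all_query_ids is None else list(all_query_ids)
    if not ids:
        return 0.0
    total = sum(per_query.get(query_id, {}).get(metric, 0.0) for query_id in ids)
    return total / len(ids)

# Toy values for illustration only
toy = {"1": {"ndcg": 0.5, "map": 0.25}, "2": {"ndcg": 1.0, "map": 0.75}}
evaluated_only = mean_metric(toy, "ndcg")
all_judged = mean_metric(toy, "ndcg", all_query_ids=["1", "2", "3", "4"])
print(evaluated_only, all_judged)  # 0.75 0.375
```

Which of the two averages is appropriate depends on whether a query without results should count as a failure of the search configuration; whichever is chosen should be applied consistently to the baseline and to the hybrid configurations.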
