# 4 Best Hybrid Search Configuration

This notebook runs different hybrid search configurations, calculates the metrics for each configuration, and compares the results to the baseline metrics calculated in the previous notebook.

We use the same query set so that the comparison is fair.

## Imports and Setup

In [74]:
import pandas as pd
import numpy as np
import requests
import json
import mercury as mr
import itertools
from tqdm.notebook import tqdm_notebook

app = mr.App(title="Let's Run a Hybrid Search", static_notebook=True)

## Load Query Sets and Ratings

Use the query sets and created ratings/judgements from notebook 3.

In [75]:
# Set small_b to True to use the small dataset or to False to use the large dataset.
small_b = False

if small_b:
    df_train = pd.read_csv('../data/query_train_small.csv')
    df_test = pd.read_csv('../data/query_test_small.csv')
else:
    df_train = pd.read_csv('../data/query_train.csv')
    df_test = pd.read_csv('../data/query_test.csv')
df_query_set = pd.concat([df_train, df_test])

In [76]:
# Import the ratings generated in the previous notebook
df_ratings = pd.read_csv('../data/ratings.csv', sep="\t", names=['query', 'docid', 'rating', 'idx'])
df_ratings

Unnamed: 0,query,docid,rating,idx
0,#8 tags without string,B0751KS4ZW,0,0
1,#8 tags without string,B07541MJRV,2,0
2,#8 tags without string,B075WX3LFF,2,0
3,#8 tags without string,B076BBJYRN,2,0
4,#8 tags without string,B0772W8DPP,0,0
...,...,...,...,...
93904,眼镜框,B06XCPKJTV,3,4999
93905,眼镜框,B01LZROY37,2,4999
93906,眼镜框,B01A2O6AGQ,0,4999
93907,眼镜框,B01030BYCE,3,4999


In [77]:
df_queries = df_ratings.groupby(by='query', as_index=False).agg({
    'rating': ['count']
})
df_query_idx = df_queries['query']

In [78]:
df_query_idx = pd.DataFrame(df_query_idx)

df_query_idx = df_query_idx.reset_index()

df_query_idx = df_query_idx.rename(columns={'index': 'idx'})
df_query_idx

Unnamed: 0,idx,query
0,0,#8 tags without string
1,1,$1 dollar toys not fidgets
2,2,$30 roblox gift card not digital
3,3,$60 ps4 that’s not gonna be on amazon
4,4,'m team jesus i'm not religious shirt
...,...,...
4995,4995,zoom eyepiece for telescope
4996,4996,zumba shoes
4997,4997,zwave front door lock kwikset
4998,4998,zyrtec


## Query OpenSearch with the Hybrid Search Configurations

Let's make sure that we can execute hybrid search queries by creating a search pipeline and using it in a query.

In [79]:
keyword_weight = 0.3

In [80]:
neural_weight = round(1.0 - keyword_weight, 2)
print(f"Keyword Weight is {keyword_weight} and Neural Weight is {neural_weight}")

Keyword Weight is 0.3 and Neural Weight is 0.7


In [81]:
# Get model_id
# We are assuming that the installation has only one model. Change this if you have more models 
# and need to pick a specific one

url = "http://localhost:9200/_plugins/_ml/models/_search"

headers = {
    'Content-Type': 'application/json'
}

payload = {
  "query": {
    "match_all": {}
  },
  "size": 1
}

response = requests.request("POST", url, headers=headers, data=json.dumps(payload))

model_id = response.json()['hits']['hits'][0]['_source']['model_id']

In [82]:
# Name of the search pipeline. The pipeline below uses min_max normalization and
# arithmetic_mean combination with the keyword and neural weights set above.
pipeline = 'hybrid-search-pipeline'

url = "http://localhost:9200/_search/pipeline/" + pipeline

print(f"Setting default model id to: {model_id}")
payload = {
  "request_processors": [
    {
      "neural_query_enricher" : {
        "description": "Sets the default model ID at index and field levels",
        "default_model_id": model_id,
        "neural_field_default_id": {
           "title_embeddings": model_id
        }
      }
    }
  ],
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              keyword_weight,
              neural_weight
            ]
          }
        }
      }
    }
  ]    
}


response = requests.request("PUT", url, headers=headers, data=json.dumps(payload))
mr.JSON(response.json(), level=4)

Setting default model id to: i6jHTZMBflg_ePyfu9EK


In [83]:
url = "http://localhost:9200/ecommerce/_search?search_pipeline=" + pipeline
    
# send a single hybrid search query to OpenSearch using the pipeline created above
query = "iphone case"

payload = {
  "_source": {
    "excludes": [
      "title_embedding"
    ]
  },
  "query": {
    "hybrid": {
      "queries": [
        {
          "multi_match" : {
              "type":       "best_fields",
              "fields":     [
                "product_id^100",
                "product_bullet_point^3",
                "product_color^2",
                "product_brand^5",
                "product_description",
                "product_title^10"
              ],
              "operator":   "and",
              "query":      query
            }
        },
        {
          "neural": {
            "title_embedding": {
              "query_text": query,
              "k": 100
            }
          }
        }
      ]
    }
  },
  "size": 100
}

response = requests.request("POST", url, headers=headers, data=json.dumps(payload)).json()

In [84]:
mr.JSON(response, level=2)

## Try out all Hybrid Search Configurations

This notebook tries out 66 parameter combinations for hybrid search (2 normalization techniques × 3 combination techniques × 11 weight pairs), drawn from the following sets:
* normalization technique: [`l2`, `min_max`]
* combination technique: [`arithmetic_mean`, `harmonic_mean`, `geometric_mean`]
* keyword search weight: [`0.0`, `0.1`, `0.2`, `0.3`, `0.4`, `0.5`, `0.6`, `0.7`, `0.8`, `0.9`, `1.0`]
* neural search weight: [`1.0`, `0.9`, `0.8`, `0.7`, `0.6`, `0.5`, `0.4`, `0.3`, `0.2`, `0.1`, `0.0`]

Neural and keyword search weights always add up to `1.0`, so a keyword search weight of `0.1` automatically comes with a neural search weight of `0.9`, a keyword search weight of `0.2` comes with a neural search weight of `0.8`, etc.
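To make these parameters concrete, here is a minimal sketch of what the normalization and combination techniques compute. It is plain Python for illustration and not OpenSearch's implementation; the function names and the handling of edge cases (identical scores, non-positive scores) are assumptions of this sketch.

```python
import math

def min_max(scores):
    """Rescale scores to [0, 1] using the minimum and maximum of the list."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption of this sketch: identical scores all map to 1.0
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def l2(scores):
    """Divide each score by the Euclidean (L2) norm of the list."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm if norm > 0 else 0.0 for s in scores]

def combine(scores, weights, technique):
    """Combine the normalized per-subquery scores of one document into one score."""
    if technique == "arithmetic_mean":
        return sum(w * s for s, w in zip(scores, weights)) / sum(weights)
    # Assumption: skip non-positive scores and weights, where these means are undefined
    pairs = [(s, w) for s, w in zip(scores, weights) if s > 0 and w > 0]
    if not pairs:
        return 0.0
    total = sum(w for _, w in pairs)
    if technique == "geometric_mean":
        return math.exp(sum(w * math.log(s) for s, w in pairs) / total)
    if technique == "harmonic_mean":
        return total / sum(w / s for s, w in pairs)
    raise ValueError(f"unknown technique: {technique}")

# One document with a normalized keyword score of 0.5 and a neural score of 1.0,
# combined with keyword weight 0.3 and neural weight 0.7
for technique in ("arithmetic_mean", "harmonic_mean", "geometric_mean"):
    print(technique, round(combine([0.5, 1.0], [0.3, 0.7], technique), 4))
```

With the same inputs, the three combination techniques produce different scores, which is why each of them is tried with every weight pair.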

### Create a DataFrame with all possible combinations of hybrid search configurations

In [85]:
# Define the possible values for each column
normalization_values = ['min_max', 'l2']
combination_values = ['arithmetic_mean', 'harmonic_mean', 'geometric_mean']
keyword_values = [round(i * 0.1, 1) for i in range(11)]

# Create all possible combinations of normalization, combination, and keyword
combinations = list(itertools.product(normalization_values, combination_values, keyword_values))

# Calculate the vector weight as 1.0 - keyword, rounded to avoid floating-point artifacts such as 0.30000000000000004
data = [(norm, comb, kw, round(1.0 - kw, 1)) for norm, comb, kw in combinations]

# Create DataFrame
df_hybrid_search_params = pd.DataFrame(data, columns=['normalization', 'combination', 'keyword', 'vector'])

# Create a column with a pipeline name made up of its components
df_hybrid_search_params['pipeline'] = df_hybrid_search_params.normalization.apply(str) + \
    df_hybrid_search_params.combination.apply(str) + df_hybrid_search_params.keyword.apply(str)

df_hybrid_search_params.head()

Unnamed: 0,normalization,combination,keyword,vector,pipeline
0,min_max,arithmetic_mean,0.0,1.0,min_maxarithmetic_mean0.0
1,min_max,arithmetic_mean,0.1,0.9,min_maxarithmetic_mean0.1
2,min_max,arithmetic_mean,0.2,0.8,min_maxarithmetic_mean0.2
3,min_max,arithmetic_mean,0.3,0.7,min_maxarithmetic_mean0.3
4,min_max,arithmetic_mean,0.4,0.6,min_maxarithmetic_mean0.4


### Iterate over all hybrid search configurations

We execute each query from the training data set against each of the hybrid search configurations and store the first 100 results in a DataFrame for the upcoming metrics calculation.
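The step of turning a search response into rows can be sketched in isolation. The helper name `hits_to_rows` and the hand-written `sample_response` are hypothetical and only mimic the `hits.hits` structure of a search response; the column names follow the ones used in this notebook.

```python
def hits_to_rows(response, query_id, query_string, run):
    """Flatten one search response into one row per hit, in ranking order."""
    return [
        {
            "query_id": str(query_id),
            "query_string": query_string,
            "product_id": hit["_id"],
            "position": position,
            "relevance": hit["_score"],
            "run": run,
        }
        for position, hit in enumerate(response["hits"]["hits"])
    ]

# Hand-written stand-in for a response; a real one comes from the _search endpoint
sample_response = {"hits": {"hits": [
    {"_id": "B07BZ7PHYL", "_score": 1.0},
    {"_id": "B082VY8HHP", "_score": 0.936527},
]}}
rows = hits_to_rows(sample_response, 0, "#8 tags without string", "min_maxarithmetic_mean0.0")
print(rows[1]["product_id"], rows[1]["position"])
```

Using `enumerate` ties the position to the ranking order of the hits, so no separate counter has to be maintained.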

In [86]:
from concurrent.futures import ThreadPoolExecutor

In [87]:
# filtered_queries = df_query_idx[df_query_idx['query'].isin(df_train['query_string'])].head(100)
filtered_queries = df_query_idx[df_query_idx['query'].isin(df_train['query_string'])]

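def process_config(config):
    # config is a row from df_hybrid_search_params.itertuples():
    # (index, normalization, combination, keyword, vector, pipeline)
    # The queries are read from the global filtered_queries DataFrame.
    results = []
    norm = config[1]
    combi = config[2]
    keywordness = round(config[3], 2)
    neuralness = round(config[4], 2)
    pipeline_name = config[5]

    # The search pipeline is defined inline in each request, so no named pipeline is created here
    url = "http://localhost:9200/ecommerce/_search"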
    
    # iterate over all query strings and send a hybrid search query to OpenSearch with the set pipeline
    for query in filtered_queries.itertuples():
    
        payload = {
          "_source": {
            "excludes": [
              "title_embedding"
            ]
          },
          "query": {
            "hybrid": {
              "queries": [
                {
                  "multi_match" : {
                      "type":       "best_fields",
                      "fields":     [
                        "product_id^100",
                        "product_bullet_point^3",
                        "product_color^2",
                        "product_brand^5",
                        "product_description",
                        "product_title^10"
                      ],
                      "operator":   "and",
                      "query":      query[2]
                    }
                },
                {
                  "neural": {
                    "title_embedding": {
                      "query_text": query[2],
                      "k": 100
                    }
                  }
                }
              ]
            }
          },
          "search_pipeline": {
            "request_processors": [
              {
                "neural_query_enricher" : {
                  "description": "one of many search pipelines for experimentation",
                  "default_model_id": model_id,
                  "neural_field_default_id": {
                    "title_embeddings": model_id
                  }
                }
              }
            ],
            "phase_results_processors": [
              {
                "normalization-processor": {
                  "normalization": {
                    "technique": norm
                  },
                  "combination": {
                    "technique": combi,
                    "parameters": {
                      "weights": [
                        keywordness,
                        neuralness
                      ]
                    }
                  }
                }
              }
            ]    
          },
          "size": 100
        }
    
        response = requests.request("POST", url, headers=headers, data=json.dumps(payload)).json()

        # Store one row per hit with the columns:
        # query_id: the id of the query as the trec_eval tool needs a numeric id rather than a query string as an identifier
        # query_string: the text of the query
        # product_id: the id of the product in the hit list
        # position: the position of the product in the result set
        # relevance: relevance as given by the search engine
        # run: the name of the query pipeline
        for position, hit in enumerate(response['hits']['hits']):
            row = { 'query_id' : str(query[1]), 'query_string': query[2], 'product_id' : hit["_id"], 'position' : str(position), 'relevance' : hit["_score"], 'run': pipeline_name }
            results.append(row)
    return results  # Returns the list of rows for this config

In [88]:
output_file = '../data/results_large_qs-2024-12-03_DCG_fix.csv'

# Results are appended (mode='a'), so delete an existing output file before rerunning this cell
# to avoid duplicate rows
with ThreadPoolExecutor() as executor:
    for config_results in tqdm_notebook(executor.map(process_config, df_hybrid_search_params.itertuples())):
        # Append each config's results to the CSV file
        pd.DataFrame(config_results).to_csv(output_file, mode='a', header=False, index=False)



In [89]:
df_relevance = pd.read_csv(output_file, names=["query_id", "query_string", "product_id", "position", "relevance", "run"])

In [90]:
df_relevance.head(3)

Unnamed: 0,query_id,query_string,product_id,position,relevance,run
0,0,#8 tags without string,B07BZ7PHYL,0,1.0,min_maxarithmetic_mean0.0
1,0,#8 tags without string,B082VY8HHP,1,0.936527,min_maxarithmetic_mean0.0
2,0,#8 tags without string,B00006IBQ5,2,0.912902,min_maxarithmetic_mean0.0


Each request asks for 100 results (`"size": 100`), so when every query returns a full result set there are _number of queries_ * 100 rows per pipeline in the resulting DataFrame.
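The row count only matches this product if every query returns a full result set; a query with fewer matches returns fewer hits. A small sketch of a check for that, using a hypothetical helper name and toy rows instead of the real results file:

```python
from collections import Counter

def short_result_sets(rows, expected=100):
    """Return the (run, query_id) pairs that have fewer than `expected` rows."""
    counts = Counter((row["run"], row["query_id"]) for row in rows)
    return {key: n for key, n in counts.items() if n < expected}

# Toy data: query "0" has 3 results, query "1" only 2, with 3 expected per query
rows = (
    [{"run": "l2arithmetic_mean0.4", "query_id": "0"}] * 3
    + [{"run": "l2arithmetic_mean0.4", "query_id": "1"}] * 2
)
print(short_result_sets(rows, expected=3))
```

Queries without any hits produce no rows at all, so they would not show up in this check and have to be found by comparing against the list of queries.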

In [91]:
df_relevance[df_relevance['run'] == "min_maxarithmetic_mean0.0"].shape[0]

400000

In [92]:
df_relevance.shape[0]

26400000

## Calculate Metrics per Pipeline

In [93]:
df_ratings.columns = ['query_string', 'product_id', 'rating', 'query_id']
df_ratings.head(3)

Unnamed: 0,query_string,product_id,rating,query_id
0,#8 tags without string,B0751KS4ZW,0,0
1,#8 tags without string,B07541MJRV,2,0
2,#8 tags without string,B075WX3LFF,2,0


In [94]:
# Make sure ids are strings - otherwise the merge operation might cause an error
df_relevance['query_id'] = df_relevance['query_id'].astype(str)
df_relevance['position'] = df_relevance['position'].astype(int)
df_ratings['query_id'] = df_ratings['query_id'].astype(str)
# Remove duplicates from the ratings DataFrame
df_unique_ratings = df_ratings.drop_duplicates(subset=['product_id', 'query_id'])

In [95]:
# Merge results on query_id and product_id so that the resulting DataFrame has the ratings together with the search results.
# The validation makes sure that there is only one rating for each query-doc pair: identical query-doc pairs occur once per
# search pipeline, but each pair can only have one rating.
df_merged = df_relevance.merge(df_unique_ratings, on=['query_id', 'product_id'], how='left', validate='many_to_one')
df_merged = df_merged.drop(columns=['query_string_y'])
df_merged = df_merged.rename(columns={"query_string_x": "query_string"})

In [96]:
# Count the rows without ratings - the higher the count is the less reliable the results will be
nan_count_rating = df_merged['rating'].isna().sum()
print(f"There are {df_merged.shape[0]} rows and {nan_count_rating} do not contain a rating")

There are 26400000 rows and 24703488 do not contain a rating


## Calculate Metrics

Iterate over the queries in the query set, calculate the metrics DCG@10, NDCG@10 and Precision@10 together with the share of results that have a rating (`ratio_of_ratings`), and store the results for every query in a DataFrame.
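The metric functions live in `utils/metrics.py`, which is not shown in this notebook. As a rough illustration of what such metrics compute, here is a self-contained sketch. The function names, the exponential gain `2^rating - 1`, treating unrated results as rating 0, and the precision cutoff are all assumptions of this sketch and not necessarily what the shared utils file does.

```python
import math

def dcg_at_k(ratings, k=10):
    """DCG with exponential gain; unrated results (None) count as rating 0."""
    return sum(
        (2 ** (r or 0) - 1) / math.log2(position + 2)
        for position, r in enumerate(ratings[:k])
    )

def ndcg_at_k(ratings, reference_ratings, k=10):
    """DCG divided by the DCG of the best possible ordering of the known ratings."""
    ideal = dcg_at_k(sorted(reference_ratings, reverse=True), k)
    return dcg_at_k(ratings, k) / ideal if ideal > 0 else 0.0

def precision_at_k(ratings, k=10, threshold=1):
    """Share of the top k results rated at least `threshold` (an assumed cutoff)."""
    return sum(1 for r in ratings[:k] if r is not None and r >= threshold) / k

# Ratings of the top 10 results in ranking order; None marks an unrated result
top10 = [3, None, None, None, 2, None, None, None, None, None]
print(round(dcg_at_k(top10), 6))
```

The `top10` list mirrors the ratings that this notebook shows further down for the query `- we are not such things` with the pipeline `l2arithmetic_mean0.4`: a rating of 3 at position 0, a rating of 2 at position 4, and no rating for the rest.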

In [97]:
df_ratings.head(3)

Unnamed: 0,query_string,product_id,rating,query_id
0,#8 tags without string,B0751KS4ZW,0,0
1,#8 tags without string,B07541MJRV,2,0
2,#8 tags without string,B075WX3LFF,2,0


In [98]:
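# Import the metric functions from the shared utils file metrics.py.
# The module is imported under an alias so that the list below does not shadow it
# and the cell can be rerun.
from utils import metrics as metrics_utils

metrics = [
    ("dcg", metrics_utils.dcg_at_10, None),
    ("ndcg", metrics_utils.ndcg_at_10, None),
    ("prec@10", metrics_utils.precision_at_k, None),
    ("ratio_of_ratings", metrics_utils.ratio_of_ratings, None)
]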

In [99]:
reference = {query: df for query, df in df_ratings.groupby("query_string")}

df_metrics = []
for m_name, m_function, ref_search in metrics:
    for (query_string, run), df_gr in df_merged.groupby(["query_string", "run"]):
        metric = m_function(df_gr, reference=reference[query_string])
        df_metrics.append(pd.DataFrame({
            "query": [query_string],
            "pipeline": [run],
            "metric": [m_name],
            "value": [metric],
        }))
df_metrics = pd.concat(df_metrics)

In [100]:
df_metrics

Unnamed: 0,query,pipeline,metric,value
0,#8 tags without string,l2arithmetic_mean0.0,dcg,7.902232
0,#8 tags without string,l2arithmetic_mean0.1,dcg,26.818740
0,#8 tags without string,l2arithmetic_mean0.2,dcg,26.818740
0,#8 tags without string,l2arithmetic_mean0.3,dcg,26.818740
0,#8 tags without string,l2arithmetic_mean0.4,dcg,26.592616
...,...,...,...,...
0,眼镜框,min_maxharmonic_mean0.6,ratio_of_ratings,0.000000
0,眼镜框,min_maxharmonic_mean0.7,ratio_of_ratings,0.000000
0,眼镜框,min_maxharmonic_mean0.8,ratio_of_ratings,0.000000
0,眼镜框,min_maxharmonic_mean0.9,ratio_of_ratings,0.000000


In [101]:
df_metrics.to_csv('../data/metrics_query_train_large_qs-2024-12-03_DCG_fix.csv', index=False)

## Calculate Metrics per Pipeline by Averaging the Query Metrics

In [102]:
df_metrics_per_pipeline = df_metrics.pivot_table(index="pipeline", columns="metric", values="value", aggfunc=lambda x: x.mean().round(2))
df_metrics_per_pipeline = df_metrics_per_pipeline.reset_index()

### Top five Pipelines by NDCG@10 Descending

In [103]:
df_metrics_per_pipeline.sort_values(by='ndcg', ascending=False).head(5)

metric,pipeline,dcg,ndcg,prec@10,ratio_of_ratings
8,l2arithmetic_mean0.8,9.23,0.25,0.27,0.28
3,l2arithmetic_mean0.3,9.28,0.25,0.27,0.28
4,l2arithmetic_mean0.4,9.38,0.25,0.28,0.29
5,l2arithmetic_mean0.5,9.34,0.25,0.28,0.29
6,l2arithmetic_mean0.6,9.3,0.25,0.27,0.29


### Top five Pipelines by DCG@10 Descending

In [104]:
df_metrics_per_pipeline.sort_values(by='dcg', ascending=False).head(5)

metric,pipeline,dcg,ndcg,prec@10,ratio_of_ratings
4,l2arithmetic_mean0.4,9.38,0.25,0.28,0.29
5,l2arithmetic_mean0.5,9.34,0.25,0.28,0.29
6,l2arithmetic_mean0.6,9.3,0.25,0.27,0.29
3,l2arithmetic_mean0.3,9.28,0.25,0.27,0.28
7,l2arithmetic_mean0.7,9.26,0.25,0.27,0.28


### Top five Pipelines by Precision@10 Descending

In [105]:
df_metrics_per_pipeline.sort_values(by='prec@10', ascending=False).head(5)

metric,pipeline,dcg,ndcg,prec@10,ratio_of_ratings
4,l2arithmetic_mean0.4,9.38,0.25,0.28,0.29
5,l2arithmetic_mean0.5,9.34,0.25,0.28,0.29
38,min_maxarithmetic_mean0.5,9.09,0.24,0.27,0.28
8,l2arithmetic_mean0.8,9.23,0.25,0.27,0.28
42,min_maxarithmetic_mean0.9,9.08,0.25,0.27,0.28


In [106]:
df_merged.to_csv('../data/results_and_ratings_query_set_large_qs-2024-12-03_DCG_fix.csv')

In [107]:
# Use a query from the query set to see the results by pipeline

query = '#8 tags without string'

df_merged[(df_merged['query_string'] == query) & (df_merged['run'] == 'min_maxarithmetic_mean0.0')]

Unnamed: 0,query_id,query_string,product_id,position,relevance,run,rating
0,0,#8 tags without string,B07BZ7PHYL,0,1.000000,min_maxarithmetic_mean0.0,
1,0,#8 tags without string,B082VY8HHP,1,0.936527,min_maxarithmetic_mean0.0,
2,0,#8 tags without string,B00006IBQ5,2,0.912902,min_maxarithmetic_mean0.0,
3,0,#8 tags without string,B073SMS37Z,3,0.808007,min_maxarithmetic_mean0.0,
4,0,#8 tags without string,B07JNL4DYC,4,0.785583,min_maxarithmetic_mean0.0,
...,...,...,...,...,...,...,...
95,0,#8 tags without string,B07K68NQKM,95,0.011963,min_maxarithmetic_mean0.0,
96,0,#8 tags without string,B0788MTQLS,96,0.009620,min_maxarithmetic_mean0.0,
97,0,#8 tags without string,B07FZKH3R4,97,0.006934,min_maxarithmetic_mean0.0,
98,0,#8 tags without string,B07P2DJ6K3,98,0.005196,min_maxarithmetic_mean0.0,


In [108]:
df_metrics[(df_metrics['query'] == query) & (df_metrics['pipeline'] == 'min_maxarithmetic_mean0.0')]

Unnamed: 0,query,pipeline,metric,value
0,#8 tags without string,min_maxarithmetic_mean0.0,dcg,7.902232
0,#8 tags without string,min_maxarithmetic_mean0.0,ndcg,0.157726
0,#8 tags without string,min_maxarithmetic_mean0.0,prec@10,0.4
0,#8 tags without string,min_maxarithmetic_mean0.0,ratio_of_ratings,0.4


## Evaluate the Best Hybrid Search Configuration

We identified the best hybrid search configuration by running the training set against all parameter combinations.

Now we take the winning configuration and run the test set against it. The resulting metrics can then be compared with the baseline from notebook 3.
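The notebook does not spell out the rule used to pick the winner. One plausible rule, sketched here as an assumption, is to compare pipelines by NDCG first and break ties by DCG and then precision. The values are copied from the DCG@10 top-five table above; `pick_winner` is a hypothetical helper name.

```python
# Per-pipeline averages copied from the DCG@10 top-five table of the training run
pipelines = [
    {"pipeline": "l2arithmetic_mean0.4", "dcg": 9.38, "ndcg": 0.25, "prec@10": 0.28},
    {"pipeline": "l2arithmetic_mean0.5", "dcg": 9.34, "ndcg": 0.25, "prec@10": 0.28},
    {"pipeline": "l2arithmetic_mean0.6", "dcg": 9.30, "ndcg": 0.25, "prec@10": 0.27},
    {"pipeline": "l2arithmetic_mean0.3", "dcg": 9.28, "ndcg": 0.25, "prec@10": 0.27},
    {"pipeline": "l2arithmetic_mean0.7", "dcg": 9.26, "ndcg": 0.25, "prec@10": 0.27},
]

def pick_winner(rows, order=("ndcg", "dcg", "prec@10")):
    """Return the row with the highest metrics, compared in the given order."""
    return max(rows, key=lambda row: tuple(row[m] for m in order))

best = pick_winner(pipelines)
print(best["pipeline"])
```

Because the averages are rounded to two decimals, all five candidates tie on NDCG and the tie-breakers decide the outcome; comparing unrounded averages may rank the pipelines differently.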

In [109]:
# Set parameters for best hybrid search config:

norm = "l2"
combi = "arithmetic_mean"
keywordness = 0.4
neuralness = 0.6
pipeline_name = "l2arithmetic_mean0.4"

In [110]:
df_relevance_test = pd.DataFrame()
url = "http://localhost:9200/ecommerce/_search"

# iterate over all query strings in the test set and send a hybrid search query to OpenSearch with the set configuration
for query in tqdm_notebook(df_query_idx[df_query_idx['query'].isin(df_test['query_string'])].itertuples()):

    payload = {
      "_source": {
        "excludes": [
          "title_embedding"
        ]
      },
      "query": {
        "hybrid": {
          "queries": [
            {
              "multi_match" : {
                  "type":       "best_fields",
                  "fields":     [
                    "product_id^100",
                    "product_bullet_point^3",
                    "product_color^2",
                    "product_brand^5",
                    "product_description",
                    "product_title^10"
                  ],
                  "operator":   "and",
                  "query":      query[2]
                }
            },
            {
              "neural": {
                "title_embedding": {
                  "query_text": query[2],
                  "k": 100
                }
              }
            }
          ]
        }
      },
      "search_pipeline": {
        "request_processors": [
          {
            "neural_query_enricher" : {
              "description": "one of many search pipelines for experimentation",
              "default_model_id": model_id,
              "neural_field_default_id": {
                "title_embeddings": model_id
              }
            }
          }
        ],
        "phase_results_processors": [
          {
            "normalization-processor": {
              "normalization": {
                "technique": norm
              },
              "combination": {
                "technique": combi,
                "parameters": {
                  "weights": [
                    keywordness,
                    neuralness
                  ]
                }
              }
            }
          }
        ]    
      },
      "size": 100
    }

    response = requests.request("POST", url, headers=headers, data=json.dumps(payload)).json()
    # Store one row per hit with the columns:
    # query_id: the id of the query as the trec_eval tool needs a numeric id rather than a query string as an identifier
    # query_string: the text of the query
    # product_id: the id of the product in the hit list
    # position: the position of the product in the result set
    # relevance: relevance as given by the search engine
    # run: the name of the query pipeline
    rows = [
        { 'query_id' : str(query[1]), 'query_string': query[2], 'product_id' : hit["_id"], 'position' : str(position), 'relevance' : hit["_score"], 'run': pipeline_name }
        for position, hit in enumerate(response['hits']['hits'])
    ]
    # Append all hits of this query at once instead of concatenating row by row
    if rows:
        df_relevance_test = pd.concat([df_relevance_test, pd.DataFrame(rows)], ignore_index=True)


In [111]:
df_relevance_test.head(3)

Unnamed: 0,query_id,query_string,product_id,position,relevance,run
0,8,- we are not such things,0812994507,0,0.268424,l2arithmetic_mean0.4
1,8,- we are not such things,B00L4B5MWA,1,0.06973,l2arithmetic_mean0.4
2,8,- we are not such things,B08VJM1568,2,0.06718,l2arithmetic_mean0.4


In [112]:
# Make sure ids are strings - otherwise the merge operation might cause an error
df_relevance_test['query_id'] = df_relevance_test['query_id'].astype(str)
df_relevance_test['position'] = df_relevance_test['position'].astype(int)

In [113]:
# Merge results on query_id and product_id so that the resulting DataFrame has the ratings together with the search results.
# The validation makes sure that there is only one rating for each query-doc pair: identical query-doc pairs occur once per
# search pipeline, but each pair can only have one rating.
df_merged_test = df_relevance_test.merge(df_unique_ratings, on=['query_id', 'product_id'], how='left', validate='many_to_one')
df_merged_test = df_merged_test.drop(columns=['query_string_y'])
df_merged_test = df_merged_test.rename(columns={"query_string_x": "query_string"})

In [114]:
df_metrics_test_set = []
for m_name, m_function, ref_search in metrics:
    for (query_string, run), df_gr in df_merged_test.groupby(["query_string", "run"]):
        metric = m_function(df_gr, reference=reference[query_string])
        df_metrics_test_set.append(pd.DataFrame({
            "query": [query_string],
            "pipeline": [run],
            "metric": [m_name],
            "value": [metric],
        }))
df_metrics_test_set = pd.concat(df_metrics_test_set)

In [115]:
df_metrics_test_set.head(3)

Unnamed: 0,query,pipeline,metric,value
0,- we are not such things,l2arithmetic_mean0.4,dcg,8.160558
0,02cool spray water bottle for drinking not fla...,l2arithmetic_mean0.4,dcg,9.853369
0,0pi nail polish i'm not really a waitress,l2arithmetic_mean0.4,dcg,7.0


In [116]:
df_metrics_test_set.to_csv('../data/metrics_query_test_large_qs-2024-12-03_DCG_fix.csv', index=False)

In [117]:
df_metrics_test_per_pipeline = df_metrics_test_set.pivot_table(index="pipeline", columns="metric", values="value", aggfunc=lambda x: x.mean().round(2))
df_metrics_test_per_pipeline = df_metrics_test_per_pipeline.reset_index()

In [118]:
df_metrics_test_per_pipeline

metric,pipeline,dcg,ndcg,prec@10,ratio_of_ratings
0,l2arithmetic_mean0.4,9.3,0.25,0.27,0.28


## Metrics for Large Query Set

Compared to the baseline of the previous notebook, this is an improvement:

| Metric | Baseline BM25 | Global Hybrid Search Optimizer |
| -------- | ------- | ------- |
| DCG | 6.03 | 6.21 |
| NDCG | 0.26 | 0.28 |
| Precision | 0.30 | 0.32 |


## Run Test Set for All Pipeline Configurations 

In [119]:
# process_config reads the global filtered_queries, so point it at the test queries
filtered_queries = df_query_idx[df_query_idx['query'].isin(df_test['query_string'])]

In [120]:
filtered_queries.shape[0]

1000

In [121]:
output_file = '../data/results_all_pipelines_test_large_qs-2024-12-03_DCG_fix.csv'

# Results are appended (mode='a'), so delete an existing output file before rerunning this cell
# to avoid duplicate rows
with ThreadPoolExecutor() as executor:
    for config_results in tqdm_notebook(executor.map(process_config, df_hybrid_search_params.itertuples())):
        # Append each config's results to the CSV file
        pd.DataFrame(config_results).to_csv(output_file, mode='a', header=False, index=False)


In [122]:
df_relevance = pd.read_csv(output_file, names=["query_id", "query_string", "product_id", "position", "relevance", "run"])

In [123]:
df_relevance.head(3)

Unnamed: 0,query_id,query_string,product_id,position,relevance,run
0,8,- we are not such things,B00L4B5MWA,0,1.0,min_maxarithmetic_mean0.0
1,8,- we are not such things,B08VJM1568,1,0.788532,min_maxarithmetic_mean0.0
2,8,- we are not such things,B07RWSH4BP,2,0.783474,min_maxarithmetic_mean0.0


In [124]:
df_relevance[df_relevance['run'] == "min_maxarithmetic_mean0.0"].shape[0]

100000

In [125]:
df_relevance.shape[0]

6600000

## Calculate Metrics per Pipeline

In [126]:
df_ratings.columns = ['query_string', 'product_id', 'rating', 'query_id']
df_ratings.head(3)

Unnamed: 0,query_string,product_id,rating,query_id
0,#8 tags without string,B0751KS4ZW,0,0
1,#8 tags without string,B07541MJRV,2,0
2,#8 tags without string,B075WX3LFF,2,0


In [127]:
# Make sure ids are strings - otherwise the merge operation might cause an error
df_relevance['query_id'] = df_relevance['query_id'].astype(str)
df_relevance['position'] = df_relevance['position'].astype(int)

In [128]:
# Merge results on query_id and product_id so that the resulting DataFrame has the ratings together with the search results.
# The validation makes sure that there is only one rating for each query-doc pair: identical query-doc pairs occur once per
# search pipeline, but each pair can only have one rating.
df_merged = df_relevance.merge(df_unique_ratings, on=['query_id', 'product_id'], how='left', validate='many_to_one')
df_merged = df_merged.drop(columns=['query_string_y'])
df_merged = df_merged.rename(columns={"query_string_x": "query_string"})

In [129]:
# Count the rows without ratings - the higher the count is the less reliable the results will be
nan_count_rating = df_merged['rating'].isna().sum()
print(f"There are {df_merged.shape[0]} rows and {nan_count_rating} do not contain a rating")

There are 6600000 rows and 6181241 do not contain a rating


## Calculate Metrics

Iterate over the queries in the query set, calculate the metrics DCG@10, NDCG@10 and Precision@10 together with the share of results that have a rating (`ratio_of_ratings`), and store the results for every query in a DataFrame.

In [130]:
df_ratings.head(3)

Unnamed: 0,query_string,product_id,rating,query_id
0,#8 tags without string,B0751KS4ZW,0,0
1,#8 tags without string,B07541MJRV,2,0
2,#8 tags without string,B075WX3LFF,2,0


In [131]:
df_merged_test[df_merged_test['run'] == 'l2arithmetic_mean0.4'].head(10)

Unnamed: 0,query_id,query_string,product_id,position,relevance,run,rating
0,8,- we are not such things,0812994507,0,0.268424,l2arithmetic_mean0.4,3.0
1,8,- we are not such things,B00L4B5MWA,1,0.06973,l2arithmetic_mean0.4,
2,8,- we are not such things,B08VJM1568,2,0.06718,l2arithmetic_mean0.4,
3,8,- we are not such things,B07RWSH4BP,3,0.067119,l2arithmetic_mean0.4,
4,8,- we are not such things,B08WRDMYTH,4,0.065025,l2arithmetic_mean0.4,2.0
5,8,- we are not such things,B08TV3XGZ4,5,0.063589,l2arithmetic_mean0.4,
6,8,- we are not such things,B07MRKPHKR,6,0.063589,l2arithmetic_mean0.4,
7,8,- we are not such things,1937578313,7,0.062948,l2arithmetic_mean0.4,
8,8,- we are not such things,B078JNBVFY,8,0.062948,l2arithmetic_mean0.4,
9,8,- we are not such things,035813143X,9,0.062112,l2arithmetic_mean0.4,


In [132]:
df_merged[df_merged['run'] == 'l2arithmetic_mean0.4'].head(10)

Unnamed: 0,query_id,query_string,product_id,position,relevance,run,rating
3700000,8,- we are not such things,0812994507,0,0.268424,l2arithmetic_mean0.4,3.0
3700001,8,- we are not such things,B00L4B5MWA,1,0.06973,l2arithmetic_mean0.4,
3700002,8,- we are not such things,B08VJM1568,2,0.06718,l2arithmetic_mean0.4,
3700003,8,- we are not such things,B07RWSH4BP,3,0.067119,l2arithmetic_mean0.4,
3700004,8,- we are not such things,B08WRDMYTH,4,0.065025,l2arithmetic_mean0.4,2.0
3700005,8,- we are not such things,B08TV3XGZ4,5,0.063589,l2arithmetic_mean0.4,
3700006,8,- we are not such things,B07MRKPHKR,6,0.063589,l2arithmetic_mean0.4,
3700007,8,- we are not such things,1937578313,7,0.062948,l2arithmetic_mean0.4,
3700008,8,- we are not such things,B078JNBVFY,8,0.062948,l2arithmetic_mean0.4,
3700009,8,- we are not such things,035813143X,9,0.062112,l2arithmetic_mean0.4,


In [133]:
df_metrics = []
for m_name, m_function, ref_search in metrics:
    for (query_string, run), df_gr in df_merged.groupby(["query_string", "run"]):
        metric = m_function(df_gr, reference=reference[query_string])
        df_metrics.append(pd.DataFrame({
            "query": [query_string],
            "pipeline": [run],
            "metric": [m_name],
            "value": [metric],
        }))
df_metrics = pd.concat(df_metrics)

In [134]:
df_metrics[df_metrics['pipeline'] == 'l2arithmetic_mean0.4']

Unnamed: 0,query,pipeline,metric,value
0,- we are not such things,l2arithmetic_mean0.4,dcg,8.160558
0,02cool spray water bottle for drinking not fla...,l2arithmetic_mean0.4,dcg,9.853369
0,0pi nail polish i'm not really a waitress,l2arithmetic_mean0.4,dcg,7.000000
0,1 14 brown leather belt without buckle,l2arithmetic_mean0.4,dcg,3.000000
0,1 flexible 8x10 mirror not sheet,l2arithmetic_mean0.4,dcg,2.446395
...,...,...,...,...
0,z590 motherboards,l2arithmetic_mean0.4,ratio_of_ratings,0.600000
0,zinc liquid supplement,l2arithmetic_mean0.4,ratio_of_ratings,0.700000
0,zoom eyepiece for telescope,l2arithmetic_mean0.4,ratio_of_ratings,0.500000
0,zumba shoes,l2arithmetic_mean0.4,ratio_of_ratings,0.300000


In [135]:
df_metrics.head(5)

Unnamed: 0,query,pipeline,metric,value
0,- we are not such things,l2arithmetic_mean0.0,dcg,1.29203
0,- we are not such things,l2arithmetic_mean0.1,dcg,1.29203
0,- we are not such things,l2arithmetic_mean0.2,dcg,8.160558
0,- we are not such things,l2arithmetic_mean0.3,dcg,8.160558
0,- we are not such things,l2arithmetic_mean0.4,dcg,8.160558


In [136]:
df_metrics.to_csv('../data/metrics_query_test_large_qs_all_pipelines-2024-12-03_DCG_fix.csv', index=False)

## Calculate Metrics per Pipeline by Averaging the Query Metrics

In [137]:
df_metrics_per_pipeline = df_metrics.pivot_table(index="pipeline", columns="metric", values="value", aggfunc=lambda x: x.mean().round(2))
df_metrics_per_pipeline = df_metrics_per_pipeline.reset_index()
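
Because the averages are rounded to two decimals inside `aggfunc`, pipelines whose means differ only in the third decimal end up tied, and the order among tied rows in the rankings that follow carries no information. A minimal sketch of an alternative, assuming the same long-format layout as `df_metrics`: aggregate at full precision, sort, and round only for display. The toy values and the names `toy`, `per_pipeline`, `ranked` and `display_table` are invented for illustration.

```python
import pandas as pd

# Toy long-format metrics; the values are made up and are not notebook results
toy = pd.DataFrame({
    "query": ["q1", "q2", "q1", "q2"],
    "pipeline": ["a", "a", "b", "b"],
    "metric": ["ndcg"] * 4,
    "value": [0.251, 0.252, 0.249, 0.2535],
})

# Average per pipeline without rounding
per_pipeline = toy.pivot_table(
    index="pipeline", columns="metric", values="value", aggfunc="mean"
).reset_index()

# Rank on the unrounded means, then round a copy for presentation only
ranked = per_pipeline.sort_values(by="ndcg", ascending=False)
display_table = ranked.round(2)
print(display_table)
```

In this toy case both pipelines display as 0.25, yet the unrounded means still place `a` ahead of `b`.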

### Top Five Pipelines by NDCG@10 Descending

In [138]:
df_metrics_per_pipeline.sort_values(by='ndcg', ascending=False).head(5)

metric,pipeline,dcg,ndcg,prec@10,ratio_of_ratings
42,min_maxarithmetic_mean0.9,9.12,0.25,0.26,0.28
7,l2arithmetic_mean0.7,9.23,0.25,0.27,0.28
39,min_maxarithmetic_mean0.6,9.2,0.25,0.27,0.28
41,min_maxarithmetic_mean0.8,9.21,0.25,0.27,0.28
9,l2arithmetic_mean0.9,9.2,0.25,0.27,0.28


### Top Five Pipelines by DCG@10 Descending

In [139]:
df_metrics_per_pipeline.sort_values(by='dcg', ascending=False).head(5)

metric,pipeline,dcg,ndcg,prec@10,ratio_of_ratings
5,l2arithmetic_mean0.5,9.31,0.25,0.27,0.28
4,l2arithmetic_mean0.4,9.3,0.25,0.27,0.28
6,l2arithmetic_mean0.6,9.27,0.25,0.27,0.28
7,l2arithmetic_mean0.7,9.23,0.25,0.27,0.28
40,min_maxarithmetic_mean0.7,9.22,0.25,0.27,0.28


### Top Five Pipelines by Precision@10 Descending

In [140]:
df_metrics_per_pipeline.sort_values(by='prec@10', ascending=False).head(5)

metric,pipeline,dcg,ndcg,prec@10,ratio_of_ratings
39,min_maxarithmetic_mean0.6,9.2,0.25,0.27,0.28
3,l2arithmetic_mean0.3,9.17,0.25,0.27,0.28
4,l2arithmetic_mean0.4,9.3,0.25,0.27,0.28
5,l2arithmetic_mean0.5,9.31,0.25,0.27,0.28
6,l2arithmetic_mean0.6,9.27,0.25,0.27,0.28


In [141]:
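# index=False keeps the export consistent with the other CSV files written in this notebook
df_merged.to_csv('../data/results_and_ratings_query_set_large_qs_test_all_pipelines-2024-12-03_DCG_fix.csv', index=False)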

In [142]:
# Pick a query from the query set and inspect its results for a single pipeline

query = '- we are not such things'

df_merged[(df_merged['query_string'] == query) & (df_merged['run'] == 'min_maxarithmetic_mean0.0')]

Unnamed: 0,query_id,query_string,product_id,position,relevance,run,rating
0,8,- we are not such things,B00L4B5MWA,0,1.000000,min_maxarithmetic_mean0.0,
1,8,- we are not such things,B08VJM1568,1,0.788532,min_maxarithmetic_mean0.0,
2,8,- we are not such things,B07RWSH4BP,2,0.783474,min_maxarithmetic_mean0.0,
3,8,- we are not such things,B08WRDMYTH,3,0.609875,min_maxarithmetic_mean0.0,2.0
4,8,- we are not such things,B08TV3XGZ4,4,0.490763,min_maxarithmetic_mean0.0,
...,...,...,...,...,...,...,...
95,8,- we are not such things,B07TKNKPNX,95,0.008315,min_maxarithmetic_mean0.0,
96,8,- we are not such things,1635053595,96,0.008315,min_maxarithmetic_mean0.0,
97,8,- we are not such things,B01IMJHXKU,97,0.004558,min_maxarithmetic_mean0.0,
98,8,- we are not such things,B01BUNPH9Y,98,0.004558,min_maxarithmetic_mean0.0,


In [143]:
df_metrics[(df_metrics['query'] == query) & (df_metrics['pipeline'] == 'min_maxarithmetic_mean0.0')]

Unnamed: 0,query,pipeline,metric,value
0,- we are not such things,min_maxarithmetic_mean0.0,dcg,1.29203
0,- we are not such things,min_maxarithmetic_mean0.0,ndcg,0.100029
0,- we are not such things,min_maxarithmetic_mean0.0,prec@10,0.1
0,- we are not such things,min_maxarithmetic_mean0.0,ratio_of_ratings,0.1
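
The four values above fit together under two assumptions that are not confirmed by this section: that `ratio_of_ratings` is the share of the top-10 results that have a judgement, and that `ndcg` follows the standard definition `dcg / ideal_dcg`. The sketch below checks both readings. `ratio_of_ratings_at_k` is a hypothetical helper, and positions 5 to 9 are not shown in the output above, so they are assumed to be unrated.

```python
import pandas as pd

def ratio_of_ratings_at_k(df_group, k=10):
    """Hypothetical helper: fraction of the top-k results that carry a judgement."""
    top = df_group[df_group["position"] < k]
    return top["rating"].notna().sum() / k

# Top 10 for the query under min_maxarithmetic_mean0.0.
# Positions 0-4 come from the displayed rows; positions 5-9 are assumed unrated.
top10 = pd.DataFrame({
    "position": range(10),
    "rating": [None, None, None, 2.0, None] + [None] * 5,
})
ratio = ratio_of_ratings_at_k(top10)

# With ndcg = dcg / ideal_dcg, the two reported values imply the ideal DCG
implied_ideal_dcg = 1.29203 / 0.100029
print(ratio, round(implied_ideal_dcg, 2))
```

Under these assumptions the single judged result accounts for the 0.1 ratio, and the reference ranking for this query would have an ideal DCG of roughly 12.92.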


In [144]:
df_merged_test

Unnamed: 0,query_id,query_string,product_id,position,relevance,run,rating
0,8,- we are not such things,0812994507,0,0.268424,l2arithmetic_mean0.4,3.0
1,8,- we are not such things,B00L4B5MWA,1,0.069730,l2arithmetic_mean0.4,
2,8,- we are not such things,B08VJM1568,2,0.067180,l2arithmetic_mean0.4,
3,8,- we are not such things,B07RWSH4BP,3,0.067119,l2arithmetic_mean0.4,
4,8,- we are not such things,B08WRDMYTH,4,0.065025,l2arithmetic_mean0.4,2.0
...,...,...,...,...,...,...,...
99995,4998,zyrtec,B086SPSTPT,95,0.055459,l2arithmetic_mean0.4,
99996,4998,zyrtec,B0000AN9L7,96,0.055442,l2arithmetic_mean0.4,
99997,4998,zyrtec,B00DTX8G1U,97,0.055439,l2arithmetic_mean0.4,
99998,4998,zyrtec,B00MWCF05E,98,0.055400,l2arithmetic_mean0.4,


In [145]:
df_merged_test.to_csv('../data/results_and_ratings_query_set_large_qs_test_best_pipeline-2024-12-03_DCG_fix.csv', index=False)

In [146]:
df_merged[df_merged['run'] == 'l2arithmetic_mean0.4'].to_csv('../data/results_and_ratings_query_set_large_qs_test_all_pipelines_l2a04-2024-12-03_DCG_fix.csv', index=False)