- Goal: Manual Search
    1. Initialize range for alpha (0 to 1)
    2. Combine sparse and dense scores for each value of alpha, weighting the sparse score by alpha and the dense score by 1 - alpha
    3. Use evaluation metric (MRR) to find best combination of weights
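
The three steps above can be sketched on toy data. This is only an illustration, not the notebook's implementation: the scores, the document IDs, and the helper names `combine_scores` and `reciprocal_rank` are made up, and alpha is applied to the sparse score, following the convention of the `best_weights` cell.

```python
def combine_scores(dense, sparse, alpha):
    """Weighted sum per document: (1 - alpha) * dense + alpha * sparse.

    A document missing from one model contributes 0 for that model.
    """
    docs = set(dense) | set(sparse)
    return {d: (1 - alpha) * dense.get(d, 0.0) + alpha * sparse.get(d, 0.0) for d in docs}


def reciprocal_rank(scores, relevant_doc):
    """1 / rank of the relevant document, or 0.0 if it was not retrieved."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 1 / (ranked.index(relevant_doc) + 1) if relevant_doc in ranked else 0.0


# Hypothetical scores for a single query (both already on a 0-1 scale)
dense = {"d1": 0.9, "d2": 0.6, "d3": 0.2}
sparse = {"d2": 0.7, "d3": 0.9, "d4": 0.4}
relevant = "d3"

alphas = [i / 10 for i in range(11)]  # step 1: range for alpha
best_alpha, best_rr = 0.0, 0.0
for alpha in alphas:
    # steps 2 and 3: combine the scores, then evaluate the resulting ranking
    rr = reciprocal_rank(combine_scores(dense, sparse, alpha), relevant)
    if rr > best_rr:
        best_alpha, best_rr = alpha, rr

print(best_alpha, best_rr)
```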

- Evaluation metric: NDCG (Normalized Discounted Cumulative Gain)
    - Rewards rankings that place relevant documents near the top: the gain of each relevant document is discounted logarithmically by its rank
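
To make the rank discount concrete, the sketch below computes NDCG by hand for binary relevance labels. It is an illustration on assumed toy rankings, not the notebook's evaluation code; the function names `dcg` and `ndcg` are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# One relevant document placed at rank 1, rank 2, and rank 4 (toy rankings)
print(ndcg([1, 0, 0, 0]))  # relevant document at rank 1
print(ndcg([0, 1, 0, 0]))  # relevant document at rank 2
print(ndcg([0, 0, 0, 1]))  # relevant document at rank 4
```

With a single relevant document, NDCG reduces to 1 / log2(rank + 1), while the reciprocal rank is 1 / rank. Both decrease as the relevant document moves down the list, but the reciprocal rank drops faster.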

Questions:

- For both Splade and BM25: since qrels gives us only one confirmed relevant document, should we use MRR as the evaluation metric, given that it only looks at the first relevant document?
    - use mrr

- how to handle weighting when one model ranks a document that the other does not rank at all? 
    - the second model gives it a zero
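
A small sketch of the zero-fill rule, using an outer merge like the one in `best_weights`. The two toy frames and their values are made up for illustration.

```python
import pandas as pd

# Toy rankings for one query; the IDs and scores are invented
dense = pd.DataFrame({"Document ID": [1, 2], "Score": [0.9, 0.6]})
sparse = pd.DataFrame({"Document ID": [2, 3], "Score": [0.8, 0.4]})

# An outer merge keeps documents ranked by either model;
# a missing score shows up as NaN and is replaced by 0
merged = dense.merge(
    sparse, on="Document ID", how="outer", suffixes=("_dense", "_sparse")
).fillna(0)
print(merged)
```

Document 1 is only ranked by the dense model and document 3 only by the sparse model, so each receives 0 from the model that did not rank it.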

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import ndcg_score
import os

Load Data

In [2]:
data_dir = "/Users/hannahzhang/Desktop/Github Repos/ERSP-TeamYang/data/"

data = []

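# Note: os.listdir returns files in arbitrary order, so the indices used
# below (data[1], data[2], data[3]) depend on the listing printed here
for file in os.listdir(data_dir):
    if file.endswith((".tsv", ".trec")):
        data.append(os.path.join(data_dir, file))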

print(data)


['/Users/hannahzhang/Desktop/Github Repos/ERSP-TeamYang/data/bm25-t5-dev.trec', '/Users/hannahzhang/Desktop/Github Repos/ERSP-TeamYang/data/splade_llama_rerank.tsv', '/Users/hannahzhang/Desktop/Github Repos/ERSP-TeamYang/data/qrels.dev.tsv', '/Users/hannahzhang/Desktop/Github Repos/ERSP-TeamYang/data/splade-dev.trec', '/Users/hannahzhang/Desktop/Github Repos/ERSP-TeamYang/data/collection.tsv']


Splade dataframe

In [3]:
splade_df = pd.read_csv(data[3], sep="\t", names=['Query ID', 'Q0', 'Document ID', 'Rank', 'Score', 'R0'])
splade_df = splade_df.drop(splade_df.columns[[1,5]], axis=1)
print(splade_df.head)

<bound method NDFrame.head of          Query ID  Document ID  Rank   Score
0         1048585      7187155     0  104472
1         1048585      7187160     1  100811
2         1048585      7187157     2   99206
3         1048585      7187158     3   98698
4         1048585      3100835     4   86255
...           ...          ...   ...     ...
6979995   1048565      4838288   995   66246
6979996   1048565      2133477   996   66245
6979997   1048565      5753707   997   66239
6979998   1048565      1472257   998   66238
6979999   1048565      5637117   999   66238

[6980000 rows x 4 columns]>


In [4]:
dense_df = pd.read_csv(data[1], sep="\t", names=['Query ID', 'Document ID', 'Score'])
print(dense_df.head)

<bound method NDFrame.head of          Query ID  Document ID     Score
0         1048585      7187157  0.866932
1         1048585      7187158  0.863535
2         1048585      7187155  0.861530
3         1048585      7187160  0.858853
4         1048585      7187163  0.840336
...           ...          ...       ...
6979995   1048565      4529995  0.705006
6979996   1048565      8496497  0.704949
6979997   1048565      5713758  0.699297
6979998   1048565      1778769  0.695161
6979999   1048565      5713765  0.689829

[6980000 rows x 3 columns]>


In [5]:
dense_df_sample = dense_df.iloc[980:1020,]

dense_df_sample

Unnamed: 0,Query ID,Document ID,Score
980,1048585,5758678,0.65965
981,1048585,6855987,0.659532
982,1048585,3047385,0.657969
983,1048585,1894260,0.657868
984,1048585,669482,0.657537
985,1048585,7840605,0.656992
986,1048585,8293552,0.656816
987,1048585,5509227,0.656089
988,1048585,7686270,0.655758
989,1048585,8327435,0.655513


In [6]:
for index, row in dense_df_sample.iterrows():
    # Restart the counter at the first row of the sample and whenever the query changes
    if (index == 980) or (dense_df_sample.at[index, 'Query ID'] != dense_df_sample.at[(index - 1), 'Query ID']):
        counter = 1
    else:
        counter += 1
    dense_df_sample.at[index, 'Rank'] = counter

dense_df_sample

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dense_df_sample.at[index, 'Rank'] = counter


Unnamed: 0,Query ID,Document ID,Score,Rank
980,1048585,5758678,0.65965,1.0
981,1048585,6855987,0.659532,2.0
982,1048585,3047385,0.657969,3.0
983,1048585,1894260,0.657868,4.0
984,1048585,669482,0.657537,5.0
985,1048585,7840605,0.656992,6.0
986,1048585,8293552,0.656816,7.0
987,1048585,5509227,0.656089,8.0
988,1048585,7686270,0.655758,9.0
989,1048585,8327435,0.655513,10.0


In [138]:
dense_df['Rank'] = dense_df['Rank'].astype(int)
dense_df.head

<bound method NDFrame.head of          Query ID  Document ID     Score  Rank
0         1048585      7187157  0.866932     1
1         1048585      7187158  0.863535     2
2         1048585      7187155  0.861530     3
3         1048585      7187160  0.858853     4
4         1048585      7187163  0.840336     5
...           ...          ...       ...   ...
6979995   1048565      4529995  0.705006   996
6979996   1048565      8496497  0.704949   997
6979997   1048565      5713758  0.699297   998
6979998   1048565      1778769  0.695161   999
6979999   1048565      5713765  0.689829  1000

[6980000 rows x 4 columns]>
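
The row-by-row ranking loop above can also be written without iteration. The sketch below assumes, as appears to be the case for `dense_df`, that rows are already sorted by descending score within each query; the toy frame is made up.

```python
import pandas as pd

# Toy stand-in for dense_df: two queries, rows already sorted by descending score
toy = pd.DataFrame({
    "Query ID": [10, 10, 10, 20, 20],
    "Document ID": [101, 102, 103, 201, 202],
    "Score": [0.9, 0.8, 0.7, 0.6, 0.5],
})

# cumcount() numbers the rows of each group 0, 1, 2, ...; adding 1 gives 1-based ranks
toy["Rank"] = toy.groupby("Query ID").cumcount() + 1
print(toy)
```

This also yields an integer column directly, so no later `astype(int)` conversion is needed.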

In [10]:
qrels_df = pd.read_csv(data[2], sep="\t", names=['Query ID', '0', 'Document ID', "Relevance"])
qrels_df = qrels_df.drop(columns=['0'])

qrels_df.head

<bound method NDFrame.head of        Query ID  Document ID  Relevance
0       1102432      2026790          1
1       1102431      7066866          1
2       1102431      7066867          1
3       1090282      7066900          1
4         39449      7066905          1
...         ...          ...        ...
59268    150337      8009410          1
59269     22241      8009429          1
59270    129177      8009442          1
59271    190655      3576091          1
59272    371455      8009476          1

[59273 rows x 3 columns]>

In [26]:
print((qrels_df["Document ID"] == 7395566).sum())

1


In [11]:
dense_query_ids = []
sparse_query_ids = []


for x in dense_df['Query ID'].unique():
    dense_query_ids.append(int(x))

for x in splade_df['Query ID'].unique():
    sparse_query_ids.append(int(x))

query_ids = list(set(dense_query_ids) & set(sparse_query_ids))

print(query_ids)
print(len(query_ids))


[2, 1048585, 458771, 163860, 458774, 917536, 524332, 786477, 65583, 65584, 393268, 491585, 1048642, 622658, 852037, 327750, 163912, 589903, 950355, 262232, 786520, 688218, 65627, 557157, 196720, 622725, 884870, 458885, 262280, 622734, 884878, 524447, 393420, 1016013, 1016015, 852179, 295135, 1081569, 426214, 65770, 786674, 393462, 1081595, 983299, 1081609, 917789, 1048876, 622893, 917825, 1048917, 196949, 557401, 885081, 1016154, 786786, 196963, 557417, 426347, 98682, 1081730, 328072, 983438, 524699, 983451, 197024, 885153, 819618, 1048995, 65957, 786857, 524722, 557492, 164282, 524733, 1016254, 885184, 754113, 426442, 983499, 1016281, 393696, 786918, 295406, 754166, 983543, 786937, 1049085, 98817, 721409, 688644, 426504, 131597, 950799, 459280, 754191, 459291, 98847, 524835, 524848, 885301, 885308, 688711, 131665, 1016406, 1081946, 590433, 688739, 66154, 1049200, 66161, 426622, 1049221, 1016460, 1082002, 393881, 983708, 164528, 623281, 131768, 885433, 230082, 459481, 393954, 1016547, 

In [12]:
# Named query_dict so the built-in dict is not shadowed
query_dict = {el: 0 for el in query_ids}
print(query_dict)

{2: 0, 1048585: 0, 458771: 0, 163860: 0, 458774: 0, 917536: 0, 524332: 0, 786477: 0, 65583: 0, 65584: 0, 393268: 0, 491585: 0, 1048642: 0, 622658: 0, 852037: 0, 327750: 0, 163912: 0, 589903: 0, 950355: 0, 262232: 0, 786520: 0, 688218: 0, 65627: 0, 557157: 0, 196720: 0, 622725: 0, 884870: 0, 458885: 0, 262280: 0, 622734: 0, 884878: 0, 524447: 0, 393420: 0, 1016013: 0, 1016015: 0, 852179: 0, 295135: 0, 1081569: 0, 426214: 0, 65770: 0, 786674: 0, 393462: 0, 1081595: 0, 983299: 0, 1081609: 0, 917789: 0, 1048876: 0, 622893: 0, 917825: 0, 1048917: 0, 196949: 0, 557401: 0, 885081: 0, 1016154: 0, 786786: 0, 196963: 0, 557417: 0, 426347: 0, 98682: 0, 1081730: 0, 328072: 0, 983438: 0, 524699: 0, 983451: 0, 197024: 0, 885153: 0, 819618: 0, 1048995: 0, 65957: 0, 786857: 0, 524722: 0, 557492: 0, 164282: 0, 524733: 0, 1016254: 0, 885184: 0, 754113: 0, 426442: 0, 983499: 0, 1016281: 0, 393696: 0, 786918: 0, 295406: 0, 754166: 0, 983543: 0, 786937: 0, 1049085: 0, 98817: 0, 721409: 0, 688644: 0, 426504

In [13]:
alpha_values = np.arange(0, 1.01, 0.01)
print(alpha_values)

[0.   0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1  0.11 0.12 0.13
 0.14 0.15 0.16 0.17 0.18 0.19 0.2  0.21 0.22 0.23 0.24 0.25 0.26 0.27
 0.28 0.29 0.3  0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4  0.41
 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5  0.51 0.52 0.53 0.54 0.55
 0.56 0.57 0.58 0.59 0.6  0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69
 0.7  0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8  0.81 0.82 0.83
 0.84 0.85 0.86 0.87 0.88 0.89 0.9  0.91 0.92 0.93 0.94 0.95 0.96 0.97
 0.98 0.99 1.  ]
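
With a float step, `np.arange` can produce values that are off by a rounding error, such as the `0.47000000000000003` that shows up among the best alphas later in this notebook. If exact two-decimal values matter (for example when printing or de-duplicating), one option is to build the grid from integers. A pure-Python sketch, where the name `alpha_grid` is hypothetical:

```python
# Build the grid from integers so every value is the closest float to k / 100
alpha_grid = [k / 100 for k in range(101)]

print(len(alpha_grid))   # 101 values from 0.0 to 1.0
print(alpha_grid[47])    # 0.47
print(47 * 0.01)         # 0.47000000000000003 (multiplying by a float step)
```

The NumPy equivalent of the same idea would be `np.arange(101) / 100`.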


In [35]:
def best_weights(dense_df, sparse_df, query_id, alpha_values, qrels_df):
    """Grid-search alpha for one query and return (best_alpha, best_mrr).

    alpha weights the sparse score and (1 - alpha) weights the dense score.
    """
    best_alpha = 0
    best_mrr = 0

    # Filter by query ID once, outside the alpha loop
    filtered_dense_df = dense_df[dense_df["Query ID"] == query_id]
    filtered_sparse_df = sparse_df[sparse_df["Query ID"] == query_id]

    # Merge rankings; a document missing from one model gets a score of 0
    merged = filtered_dense_df.merge(filtered_sparse_df, on="Document ID", how="outer", suffixes=("_dense", "_sparse")).fillna(0)

    # qrels can list more than one relevant document per query; only the first is used
    relevant_doc = qrels_df[qrels_df["Query ID"] == query_id]["Document ID"].iloc[0]

    for alpha in alpha_values:
        # Find weighted scores
        merged["Final Score"] = (1 - alpha) * merged["Score_dense"] + alpha * merged["Score_sparse"]

        # Rank documents
        ranked_results = merged.sort_values("Final Score", ascending=False)
        ranked_docs = ranked_results["Document ID"].tolist()

        # Reciprocal rank of the relevant document (0 if it was not retrieved)
        if relevant_doc not in ranked_docs:
            mrr_score = 0
        else:
            rank = ranked_docs.index(relevant_doc) + 1
            mrr_score = 1 / rank

        # Update alpha and MRR; ties keep the smallest alpha seen so far
        if mrr_score > best_mrr:
            best_mrr = mrr_score
            best_alpha = alpha

    return best_alpha, best_mrr
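
In the loaded data the Splade scores are on the order of 10^5 while the dense scores lie between 0 and 1, so even a small alpha lets the sparse score dominate the weighted sum. One common remedy is to rescale each model's scores per query before combining them. This is an assumption for illustration, not something this notebook does. A minimal min-max sketch with illustrative scores and a hypothetical helper `min_max`:

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores equal: avoid dividing by zero
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


# Illustrative scores on very different scales, mimicking the two models
sparse = {"d1": 104472, "d2": 99206, "d3": 86255}
dense = {"d1": 0.8615, "d2": 0.8669, "d3": 0.8403}

print(min_max(sparse))
print(min_max(dense))
```

After rescaling, both models contribute values in the same range, so alpha can be read more directly as a relative weight.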


In [38]:
best_alphas = []

for query in query_ids:
    best_alpha, best_mrr = best_weights(dense_df, splade_df, query, alpha_values, qrels_df)
    print(f"Query {query}: Best alpha: {best_alpha}, Best MRR: {best_mrr}")
    best_alphas.append(best_alpha)


Query 2: Best alpha: 0.0, Best MRR: 1.0
Query 1048585: Best alpha: 0.0, Best MRR: 0.5
Query 458771: Best alpha: 0.0, Best MRR: 1.0
Query 163860: Best alpha: 0.01, Best MRR: 0.25
Query 458774: Best alpha: 0.0, Best MRR: 0.5
Query 917536: Best alpha: 0.01, Best MRR: 1.0
Query 524332: Best alpha: 0.0, Best MRR: 0.05
Query 786477: Best alpha: 0.0, Best MRR: 1.0
Query 65583: Best alpha: 0.0, Best MRR: 0.5
Query 65584: Best alpha: 0.0, Best MRR: 1.0
Query 393268: Best alpha: 0.0, Best MRR: 0.5
Query 491585: Best alpha: 0.01, Best MRR: 0.16666666666666666
Query 1048642: Best alpha: 0.01, Best MRR: 0.08333333333333333
Query 622658: Best alpha: 0.01, Best MRR: 0.022222222222222223
Query 852037: Best alpha: 0.0, Best MRR: 0.5
Query 327750: Best alpha: 0.0, Best MRR: 1.0
Query 163912: Best alpha: 0.0, Best MRR: 0.14285714285714285
Query 589903: Best alpha: 0.0, Best MRR: 0.03225806451612903
Query 950355: Best alpha: 0.01, Best MRR: 1.0
Query 262232: Best alpha: 0, Best MRR: 0
Query 786520: Best a
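
The loop above reports the best alpha separately for each query. To pick a single alpha for the whole query set, the reciprocal ranks can be averaged per alpha (the mean reciprocal rank) and the maximum taken. A sketch with a hypothetical table `rr_by_alpha`; the numbers are invented and do not come from this notebook's data:

```python
from statistics import mean

# Hypothetical reciprocal ranks: rr_by_alpha[alpha][query_id]
rr_by_alpha = {
    0.0: {"q1": 1.0, "q2": 0.25, "q3": 0.5},
    0.5: {"q1": 0.5, "q2": 1.0, "q3": 1.0},
    1.0: {"q1": 0.2, "q2": 1.0, "q3": 0.5},
}

# Mean reciprocal rank of each alpha over all queries
mrr_by_alpha = {alpha: mean(rr.values()) for alpha, rr in rr_by_alpha.items()}
best_global_alpha = max(mrr_by_alpha, key=mrr_by_alpha.get)

print(mrr_by_alpha)
print(best_global_alpha)
```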

In [46]:
res = list(dict.fromkeys(best_alphas))

In [45]:
for i in res:
    print(float(i))

0.0
0.01
0.02
0.05
0.04
1.0
0.09
0.03
0.22
0.15
0.1
0.07
0.08
0.06
0.3
0.11
0.13
0.17
0.23
0.47000000000000003
0.14
0.16
0.29
0.12
0.18
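
The list above shows which alpha values occur but not how often. A frequency count makes the distribution visible. The list `sample_best_alphas` below is a made-up stand-in for `best_alphas`:

```python
from collections import Counter

# Stand-in for best_alphas from the loop above (values invented)
sample_best_alphas = [0.0, 0.0, 0.01, 0.0, 0.01, 0.05, 0.0]

# Count how many queries selected each alpha, most frequent first
counts = Counter(sample_best_alphas)
for alpha, n in counts.most_common():
    print(f"alpha={alpha}: {n} queries")
```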
