# Coherence-based QPP Predictors for Dense Retrieval

##### TCT-ColBERT full process

In this notebook, we demonstrate how we obtain our results for the dense coherence-based QPP predictors, namely pairRatio and A-pairRatio. We use the retrieval results of the TCT-ColBERT retrieval method, as shown below. Readers can instead produce their own results files and substitute them for the corresponding CSV files in the arguments.

First, install pyterrier_dr from https://github.com/terrierteam/pyterrier_dr/tree/master. 

In [None]:
%pip install --force-reinstall --no-deps git+https://github.com/terrierteam/pyterrier_dr.git@docids

In [None]:
# %pip install 'numpy<2'

In [None]:
%pip install -q sentence_transformers

In [1]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Make sure you have installed the latest version of pyterrier.

In [2]:
#%pip install --upgrade git+https://github.com/terrier-org/pyterrier.git
import pyterrier as pt
pt.init()

PyTerrier 0.10.1 has loaded Terrier 5.9 (built by craigm on 2024-05-02 17:40) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [None]:
#pt.__version__

In [None]:
#%conda install -c pytorch faiss-gpu

In [5]:
import pyterrier_dr

Here, we define the retrieval model using the corresponding index for TCT (please check that you have put the right index before proceeding).

In [6]:
model = pyterrier_dr.TctColBert('castorini/tct_colbert-v2-hnp-msmarco')
index = pyterrier_dr.NumpyIndex("/nfs/global_indices/msmarco-passage.tct-hnp", docids=True)
retr_pipeline = model >> index



In [10]:
index.docids

True

In [11]:
def get_embs(row):
    return index.docnos_and_data()[1][row.docid]
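
The lookup in `get_embs` can be illustrated without the real index. The sketch below is only an illustration: `toy_matrix` is a hypothetical stand-in for the embedding array that `index.docnos_and_data()[1]` appears to return, and plain pandas `apply` plays the role of a row-wise PyTerrier transformer.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the index's embedding matrix: 5 documents, 4 dimensions.
toy_matrix = np.arange(20, dtype=np.float32).reshape(5, 4)

def toy_get_embs(row):
    # Mirrors get_embs: the integer docid is used as a row offset into the matrix.
    return toy_matrix[row.docid]

results = pd.DataFrame({"qid": ["q1", "q1"], "docid": [3, 0], "rank": [0, 1]})

# Row-wise application: each result row receives the vector of its document.
results["doc_embs"] = results.apply(toy_get_embs, axis=1)
print(results)
```

This only works when `docid` is the document's position in the embedding matrix, which is why the index is loaded with `docids=True`.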

Import the required libraries.

In [12]:
import pandas as pd
import numpy as np
import torch
from scipy import stats
from scipy.stats import spearmanr,kendalltau
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial import distance_matrix
from math import sqrt, log
from pyterrier.measures import *
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

## Get Exp_Results

We get the experimental results with all three evaluation metrics for the TREC DL datasets.

In [16]:
dataset = pt.get_dataset('trec-deep-learning-passages')

In [34]:
per_query_results = pt.Experiment(
    [retr_pipeline],
    dataset.get_topics("test-2019"),
    dataset.get_qrels("test-2019"),
    [MAP(rel=2)@100, nDCG@10, RR(rel=2)@10],
    perquery=True, filter_by_qrels=True,
)

In [47]:
per_query_results.head()

Unnamed: 0,name,qid,measure,value
27,Compose(<pyterrier_dr.tctcolbert_model.TctColB...,1037798,nDCG@10,0.370031
28,Compose(<pyterrier_dr.tctcolbert_model.TctColB...,1037798,AP(rel=2)@100,0.178571
29,Compose(<pyterrier_dr.tctcolbert_model.TctColB...,1037798,RR(rel=2)@10,1.0
63,Compose(<pyterrier_dr.tctcolbert_model.TctColB...,104861,nDCG@10,1.0
64,Compose(<pyterrier_dr.tctcolbert_model.TctColB...,104861,AP(rel=2)@100,0.514095
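
The per-query results are in long format, with one row per query and measure. If a wide layout is easier to inspect, a pivot along the following lines may help. This is a sketch on a toy frame that reuses the values printed above; `long_df` and `wide_df` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy long-format frame with the same columns as per_query_results.
long_df = pd.DataFrame({
    "name": ["tct"] * 5,
    "qid": ["1037798", "1037798", "1037798", "104861", "104861"],
    "measure": ["nDCG@10", "AP(rel=2)@100", "RR(rel=2)@10", "nDCG@10", "AP(rel=2)@100"],
    "value": [0.370031, 0.178571, 1.0, 1.0, 0.514095],
})

# One row per query, one column per measure; missing combinations become NaN.
wide_df = long_df.pivot(index="qid", columns="measure", values="value")
print(wide_df)
```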


In [48]:
per_query_results_2020 = pt.Experiment(
    [retr_pipeline],
    dataset.get_topics("test-2020"),
    dataset.get_qrels("test-2020"),
    [MAP(rel=2)@100, nDCG@10, RR(rel=2)@10],
    perquery=True, filter_by_qrels=True,
)

## Get Test Topics

Get query sets as follows:

In [49]:
test_topics = dataset.get_topics("test-2019").merge(dataset.get_qrels("test-2019")[['qid']].drop_duplicates())
test_topics.head()

Unnamed: 0,qid,query
0,156493,do goldfish grow
1,1110199,what is wifi vs bluetooth
2,1063750,why did the us volunterilay enter ww1
3,130510,definition declaratory judgment
4,489204,right pelvic pain causes
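
The merge above keeps only the topics that have at least one relevance judgement. A minimal sketch of the same pattern on toy data (all names and values below are made up for illustration):

```python
import pandas as pd

topics = pd.DataFrame({"qid": ["1", "2", "3"], "query": ["a", "b", "c"]})
qrels = pd.DataFrame({
    "qid": ["1", "1", "3"],
    "docno": ["d1", "d2", "d9"],
    "label": [2, 0, 1],
})

# Inner merge on the shared 'qid' column; drop_duplicates avoids repeating a
# topic once per judged document.
judged = topics.merge(qrels[["qid"]].drop_duplicates())
print(judged)
```

Topic `2` has no judgements in the toy qrels, so it is dropped from `judged`.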


In [50]:
test_topics_2020 = dataset.get_topics("test-2020").merge(dataset.get_qrels("test-2020")[['qid']].drop_duplicates())

Define our Retrieval Pipeline

In [25]:
new_pipe = retr_pipeline >> pt.apply.doc_embs(get_embs) >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')

The following cell adds the query embedding vectors to the query sets.

In [45]:
test_queries = model.transform(test_topics)  # get query embeddings
test_queries_2020 = model.transform(test_topics_2020)

### Get retrieved results using our pipeline.

In [52]:
new_pipe_100 = (retr_pipeline % 100) >> pt.apply.doc_embs(get_embs) >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')
all_res_embs_100 = new_pipe_100.transform(test_topics)
combined_100 = test_queries.merge(all_res_embs_100)

This is what the final retrieved results look like: doc_embs holds the embedded representation of each document retrieved for a query, while query_vec holds the embedded representation of the query itself.

In [53]:
combined_100.head()

Unnamed: 0,qid,query,query_vec,docno,score,docid,rank,doc_embs,text
0,156493,do goldfish grow,"[0.09275106, 0.31872433, 0.20686068, -0.016680...",2928707,81.193695,2928707,0,"[0.100064516, 0.19567434, 0.30560207, -0.01379...",Goldfish Only Grow to the Size of Their Enclos...
1,156493,do goldfish grow,"[0.09275106, 0.31872433, 0.20686068, -0.016680...",1960257,81.051804,1960257,1,"[0.08526538, 0.21067119, 0.3275369, -0.0055650...",Goldfish Only Grow to the Size of Their Enclos...
2,156493,do goldfish grow,"[0.09275106, 0.31872433, 0.20686068, -0.016680...",1960255,81.024101,1960255,2,"[0.07921311, 0.20023422, 0.24291284, 0.0113921...",Rating Newest Oldest. Best Answer: Goldfish do...
3,156493,do goldfish grow,"[0.09275106, 0.31872433, 0.20686068, -0.016680...",8182162,80.989685,8182162,3,"[0.050824236, 0.21521352, 0.31407806, -0.07441...","Depending on his type and his environment, gol..."
4,156493,do goldfish grow,"[0.09275106, 0.31872433, 0.20686068, -0.016680...",2612493,80.742928,2612493,4,"[0.09987633, 0.18474764, 0.25994685, -0.015145...","In clean, uncrowded conditions in tanks or pon..."
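
Both predictors below start from the pairwise similarities between the documents retrieved for one query. As a sketch of that first step, the toy example here stacks the per-document vectors of each query and computes cosine similarities with plain NumPy. The frame `toy` and the helper `cosine_matrix` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy result list for one query with 2-dimensional document vectors.
toy = pd.DataFrame({
    "qid": ["q1", "q1", "q1"],
    "rank": [0, 1, 2],
    "doc_embs": [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 2.0])],
})

def cosine_matrix(group):
    # Stack the vectors into an (n_docs, dim) matrix, in rank order.
    embs = np.vstack(group.sort_values("rank")["doc_embs"].tolist())
    # Normalise each row to unit length; the dot products are then cosines.
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return unit @ unit.T

sims = {qid: cosine_matrix(group) for qid, group in toy.groupby("qid")}
print(np.round(sims["q1"], 3))
```

Entry `(i, j)` of each matrix is the cosine similarity between the documents at ranks `i` and `j`, so the diagonal is always 1.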


### Dense Coherence-based QPP predictors: Calculation

Now, we show how we calculate our proposed predictors. First, we define a function for pairRatio.

In [38]:
def pair_ratio(df_embs, lim_1, lim_2, lim_3, measure_x, per_query_res):
    rows = []

    for qid, group in pt.tqdm(df_embs.groupby('qid'), unit='q'):
        # Pairwise cosine similarities between the retrieved documents
        # (rows of the group are assumed to be in rank order)
        embs_list = np.vstack(group.doc_embs.tolist())
        W_mat = pd.DataFrame(cosine_similarity(embs_list, dense_output=True))
        # Mean similarity within the top block and within the bottom block
        mean_top = W_mat.iloc[0:lim_1, 0:lim_1].values.mean()
        mean_bottom = W_mat.iloc[lim_2:lim_3, lim_2:lim_3].values.mean()
        mean_vs = mean_top / mean_bottom
        rows.append([qid, mean_vs])
    df_sim = pd.DataFrame(rows, columns=['qid', 'mean_vs'])
    # Join the predictor values with the per-query effectiveness, keep one measure
    merged = df_sim.merge(per_query_res, on='qid')
    merged = merged[merged.measure == measure_x]
    #corr_pearson = stats.pearsonr(merged['value'], merged['mean_vs'])[0]
    #corr_spearman = spearmanr(merged['value'], merged['mean_vs']).correlation
    corr_kendall = kendalltau(merged['value'], merged['mean_vs'])
    print(corr_kendall)

In the function above, lim_3 is the rank cutoff of the retrieved results, lim_1 is the rank at which the upper (top-ranked) block of the similarity matrix ends, and lim_2 is the rank at which the lower block starts. For measure_x, pass the metric of interest from per_query_results: AP(rel=2)@100, nDCG@10, or RR(rel=2)@10.
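
To make the role of the three limits concrete, here is a small worked example on a hand-built 6x6 similarity matrix. It is a sketch of the block-mean ratio only; `block_mean_ratio` is a hypothetical helper and the similarity values are invented.

```python
import numpy as np

def block_mean_ratio(sim, lim_1, lim_2, lim_3):
    # Mean similarity among the top lim_1 documents ...
    top = sim[0:lim_1, 0:lim_1].mean()
    # ... divided by the mean similarity among ranks lim_2 .. lim_3 - 1.
    bottom = sim[lim_2:lim_3, lim_2:lim_3].mean()
    return top / bottom

# Six documents: the top three are similar to each other (0.8),
# all other pairs are only weakly similar (0.2).
sim = np.full((6, 6), 0.2)
sim[0:3, 0:3] = 0.8
np.fill_diagonal(sim, 1.0)

ratio = block_mean_ratio(sim, lim_1=3, lim_2=3, lim_3=6)
print(round(ratio, 4))
```

The top block averages 7.8/9 and the bottom block 4.2/9, so the ratio is above 1: the top of the ranking is more coherent than the bottom.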

Test pairRatio: here we use AP(rel=2)@100 with a rank cutoff of 100. Replace the measure to see the results for nDCG@10 and RR(rel=2)@10 (MRR@10), or adjust the limits to try other cutoffs. You can also evaluate on TREC DL 2020 by passing per_query_results_2020, together with results retrieved for test_topics_2020. To get a different correlation metric, uncomment the lines for Pearson's and Spearman's correlation in the function and print their values.
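
For reference, the three correlation coefficients mentioned here can be computed side by side with SciPy. The values below are invented toy numbers, not results from this notebook.

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

# Toy predictor values and per-query effectiveness for five queries.
predictor = [0.9, 1.4, 1.1, 1.2, 2.0]
effectiveness = [0.10, 0.25, 0.40, 0.30, 0.80]

tau, tau_p = kendalltau(effectiveness, predictor)   # rank agreement over pairs
rho, rho_p = spearmanr(effectiveness, predictor)    # correlation of the ranks
r, r_p = pearsonr(effectiveness, predictor)         # linear correlation

print(f"Kendall tau={tau:.3f}, Spearman rho={rho:.3f}, Pearson r={r:.3f}")
```

Kendall's tau and Spearman's rho depend only on the rank order of the values, whereas Pearson's r is sensitive to their magnitudes.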

Here, we demonstrate a test with a rank cutoff of 100. For a top-50 result list, lim_1 and lim_2 would instead range from 5 to 35 in steps of 5, with lim_3 set to 50.

In [39]:
for lim1 in [10, 20, 30, 40, 50, 60, 70, 80]:
    print("lim1 %d" % lim1)
    for lim2 in [10, 20, 30, 40, 50, 60, 70, 80, 90]:
        print("lim2 %d" % lim2)
        pair_ratio(combined_100, lim1, lim2, 100, 'AP(rel=2)@100', per_query_results)
        print("")

lim1 10
lim2 10
SignificanceResult(statistic=0.20265780730897007, pvalue=0.055470577997765856)

lim2 20
SignificanceResult(statistic=0.19379844961240308, pvalue=0.06703361454097385)

lim2 30
SignificanceResult(statistic=0.20265780730897007, pvalue=0.055470577997765856)

lim2 40
SignificanceResult(statistic=0.20265780730897007, pvalue=0.055470577997765856)

lim2 50
SignificanceResult(statistic=0.19822812846068658, pvalue=0.06102555232367203)

lim2 60
SignificanceResult(statistic=0.17829457364341084, pvalue=0.09200155630483083)

lim2 70
SignificanceResult(statistic=0.16722037652270208, pvalue=0.11404310198857621)

lim2 80
SignificanceResult(statistic=0.1627906976744186, pvalue=0.12394673344282209)

lim2 90
SignificanceResult(statistic=0.17386489479512734, pvalue=0.1003682399820881)

lim1 20
lim2 10
SignificanceResult(statistic=0.15614617940199335, pvalue=0.14004478013981814)

lim2 20
SignificanceResult(statistic=0.1583610188261351, pvalue=0.1345090520109157)

lim2 30
SignificanceResult(statistic=0.14285714285714285, pvalue=0.17700339443308466)

lim2 40
SignificanceResult(statistic=0.1539313399778516, pvalue=0.1457541464430843)

lim2 50
SignificanceResult(statistic=0.1539313399778516, pvalue=0.1457541464430843)

lim2 60
SignificanceResult(statistic=0.1450719822812846, pvalue=0.1703842810882269)

lim2 70
SignificanceResult(statistic=0.16943521594684385, pvalue=0.10933056125598256)

lim2 80
SignificanceResult(statistic=0.13399778516057584, pvalue=0.20540001319103585)

lim2 90
SignificanceResult(statistic=0.13842746400885936, pvalue=0.19081309011196146)

lim1 30
lim2 10
SignificanceResult(statistic=0.1317829457364341, pvalue=0.21299024938601496)

lim2 20
SignificanceResult(statistic=0.1406423034330011, pvalue=0.18381220775928653)

lim2 30
SignificanceResult(statistic=0.1317829457364341, pvalue=0.21299024938601496)

lim2 40
SignificanceResult(statistic=0.1317829457364341, pvalue=0.21299024938601496)

lim2 50
SignificanceResult(statistic=0.12956810631229235, pvalue=0.22078093380603558)

lim2 60
SignificanceResult(statistic=0.1184939091915836, pvalue=0.2627990015663074)

lim2 70
SignificanceResult(statistic=0.1362126245847176, pvalue=0.19800830674339343)

lim2 80
SignificanceResult(statistic=0.12070874861572535, pvalue=0.2539819504245272)

lim2 90
SignificanceResult(statistic=0.10741971207087486, pvalue=0.3100363372653202)

lim1 40
lim2 10
SignificanceResult(statistic=0.14728682170542634, pvalue=0.1639524008255926)

lim2 20
SignificanceResult(statistic=0.1406423034330011, pvalue=0.18381220775928653)

lim2 30
SignificanceResult(statistic=0.1317829457364341, pvalue=0.21299024938601496)

lim2 40
SignificanceResult(statistic=0.12513842746400886, pvalue=0.23697067930681837)

lim2 50
SignificanceResult(statistic=0.1229235880398671, pvalue=0.2453729183587694)

lim2 60
SignificanceResult(statistic=0.10299003322259136, pvalue=0.3304121188571688)

lim2 70
SignificanceResult(statistic=0.10741971207087486, pvalue=0.3100363372653202)

lim2 80
SignificanceResult(statistic=0.10963455149501661, pvalue=0.3001663937035294)

lim2 90
SignificanceResult(statistic=0.09856035437430785, pvalue=0.35163509508635105)

lim1 50
lim2 10
SignificanceResult(statistic=0.10963455149501661, pvalue=0.3001663937035294)

lim2 20
SignificanceResult(statistic=0.1273532668881506, pvalue=0.22877385768686798)

lim2 30
SignificanceResult(statistic=0.1140642303433001, pvalue=0.2810612862169707)

lim2 40
SignificanceResult(statistic=0.10741971207087486, pvalue=0.3100363372653202)

lim2 50
SignificanceResult(statistic=0.10520487264673312, pvalue=0.3201182320001086)

lim2 60
SignificanceResult(statistic=0.08305647840531562, pvalue=0.4325083682789408)

lim2 70
SignificanceResult(statistic=0.08527131782945736, pvalue=0.42033662205299704)

lim2 80
SignificanceResult(statistic=0.09191583610188261, pvalue=0.3850495934739234)

lim2 90
SignificanceResult(statistic=0.11627906976744184, pvalue=0.2718251430502646)

lim1 60
lim2 10
SignificanceResult(statistic=0.11184939091915835, pvalue=0.29050817751372915)

lim2 20
SignificanceResult(statistic=0.12513842746400886, pvalue=0.23697067930681837)

lim2 30
SignificanceResult(statistic=0.0963455149501661, pvalue=0.3625633178937435)

lim2 40
SignificanceResult(statistic=0.08748615725359911, pvalue=0.4083684458435356)

lim2 50
SignificanceResult(statistic=0.08970099667774085, pvalue=0.3966055892323337)

lim2 60
SignificanceResult(statistic=0.06090808416389811, pvalue=0.5648868466104321)

lim2 70
SignificanceResult(statistic=0.06533776301218161, pvalue=0.5369314822780484)

lim2 80
SignificanceResult(statistic=0.07641196013289035, pvalue=0.470224382087596)

lim2 90
SignificanceResult(statistic=0.08970099667774085, pvalue=0.3966055892323337)

lim1 70
lim2 10
SignificanceResult(statistic=0.14285714285714285, pvalue=0.17700339443308466)

lim2 20
SignificanceResult(statistic=0.1539313399778516, pvalue=0.1457541464430843)

lim2 30
SignificanceResult(statistic=0.10520487264673312, pvalue=0.3201182320001086)

lim2 40
SignificanceResult(statistic=0.10963455149501661, pvalue=0.3001663937035294)

lim2 50
SignificanceResult(statistic=0.09856035437430785, pvalue=0.35163509508635105)

lim2 60
SignificanceResult(statistic=0.08305647840531562, pvalue=0.4325083682789408)

lim2 70
SignificanceResult(statistic=0.08084163898117386, pvalue=0.44488172534138504)

lim2 80
SignificanceResult(statistic=0.07641196013289035, pvalue=0.470224382087596)

lim2 90
SignificanceResult(statistic=0.08748615725359911, pvalue=0.4083684458435356)

lim1 80
lim2 10
SignificanceResult(statistic=0.1184939091915836, pvalue=0.2627990015663074)

lim2 20
SignificanceResult(statistic=0.1273532668881506, pvalue=0.22877385768686798)

lim2 30
SignificanceResult(statistic=0.1140642303433001, pvalue=0.2810612862169707)

lim2 40
SignificanceResult(statistic=0.11627906976744184, pvalue=0.2718251430502646)

lim2 50
SignificanceResult(statistic=0.1184939091915836, pvalue=0.2627990015663074)

lim2 60
SignificanceResult(statistic=0.08527131782945736, pvalue=0.42033662205299704)

lim2 70
SignificanceResult(statistic=0.052048726467331115, pvalue=0.6228078793721249)

lim2 80
SignificanceResult(statistic=0.07641196013289035, pvalue=0.470224382087596)

lim2 90
SignificanceResult(statistic=0.08305647840531562, pvalue=0.4325083682789408)

In [42]:
def adjusted_pair_ratio(df_embs, lim_1, lim_2, lim_3, measure_x, per_query_res):
    rows = []

    for qid, group in pt.tqdm(df_embs.groupby('qid'), unit='q'):
        embs_list = np.vstack(group.doc_embs.tolist())
        query_embs = group.iloc[0].query_vec  # not used below
        score_list = group.score
        # Pairwise cosine similarities between the retrieved documents
        W_mat = cosine_similarity(embs_list, dense_output=True)
        # Outer product of the retrieval scores: pair_mat[i, j] = score_i * score_j
        score_exp = np.expand_dims(score_list, axis=1)
        pair_mat = np.dot(score_exp, score_exp.T)
        # Note: '@' is a matrix product, not an element-wise product
        weighted_mat = W_mat @ pair_mat
        W_mat_new = pd.DataFrame(weighted_mat)
        mean_top = W_mat_new.iloc[0:lim_1, 0:lim_1].values.mean()
        mean_bottom = W_mat_new.iloc[lim_2:lim_3, lim_2:lim_3].values.mean()
        mean_vs = mean_top / mean_bottom
        rows.append([qid, mean_vs])
    df_sim = pd.DataFrame(rows, columns=['qid', 'mean_vs'])
    merged = df_sim.merge(per_query_res, on='qid')
    merged = merged[merged.measure == measure_x]
    #corr_pearson = stats.pearsonr(merged['value'], merged['mean_vs'])[0]
    #corr_spearman = spearmanr(merged['value'], merged['mean_vs']).correlation
    corr_kendall = kendalltau(merged['value'], merged['mean_vs'])
    print(corr_kendall)
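
To see what the score weighting in the function above computes, the following sketch reproduces the three matrix operations on a 2x2 toy case. The similarity and score values are invented for illustration.

```python
import numpy as np

W = np.array([[1.0, 0.5],
              [0.5, 1.0]])          # toy cosine similarities
scores = np.array([2.0, 1.0])       # toy retrieval scores, in rank order

score_col = np.expand_dims(scores, axis=1)   # shape (2, 1)
pair_mat = score_col @ score_col.T           # outer product: score_i * score_j
weighted = W @ pair_mat                      # matrix product, as in the function

print(pair_mat)
print(weighted)
```

Because `pair_mat` is the outer product of the score vector with itself, the matrix product `W @ pair_mat` equals `np.outer(W @ scores, scores)`: entry `(i, j)` is the score-weighted sum of document `i`'s similarities, multiplied by the score of document `j`.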

Test A-pairRatio in the same way as pairRatio above.

In [43]:
for lim1 in [10, 20, 30, 40, 50, 60, 70, 80]:
    print("lim1 %d" % lim1)
    for lim2 in [10, 20, 30, 40, 50, 60, 70, 80, 90]:
        print("lim2 %d" % lim2)
        adjusted_pair_ratio(combined_100, lim1, lim2, 100, 'AP(rel=2)@100', per_query_results)
        print("")

lim1 10
lim2 10
SignificanceResult(statistic=0.26688815060908083, pvalue=0.011663453960855275)

lim2 20
SignificanceResult(statistic=0.2757475083056478, pvalue=0.009163596363952935)

lim2 30
SignificanceResult(statistic=0.2646733111849391, pvalue=0.012376084980840824)

lim2 40
SignificanceResult(statistic=0.25802879291251385, pvalue=0.014750685638928979)

lim2 50
SignificanceResult(statistic=0.2535991140642303, pvalue=0.01654882193551084)

lim2 60
SignificanceResult(statistic=0.2535991140642303, pvalue=0.01654882193551084)

lim2 70
SignificanceResult(statistic=0.2535991140642303, pvalue=0.01654882193551084)

lim2 80
SignificanceResult(statistic=0.2358803986710963, pvalue=0.025804952455033163)

lim2 90
SignificanceResult(statistic=0.24695459579180506, pvalue=0.019606784919992522)

lim1 20
lim2 10
SignificanceResult(statistic=0.2757475083056478, pvalue=0.009163596363952935)

lim2 20
SignificanceResult(statistic=0.2757475083056478, pvalue=0.009163596363952935)

lim2 30
SignificanceResult(statistic=0.26245847176079734, pvalue=0.013127016780085222)

lim2 40
SignificanceResult(statistic=0.26245847176079734, pvalue=0.013127016780085222)

lim2 50
SignificanceResult(statistic=0.2535991140642303, pvalue=0.01654882193551084)

lim2 60
SignificanceResult(statistic=0.25138427464008856, pvalue=0.0175180489193538)

lim2 70
SignificanceResult(statistic=0.24916943521594684, pvalue=0.01853668656977206)

lim2 80
SignificanceResult(statistic=0.24695459579180506, pvalue=0.019606784919992522)

lim2 90
SignificanceResult(statistic=0.22259136212624583, pvalue=0.03541776107269042)

lim1 30
lim2 10
SignificanceResult(statistic=0.22702104097452933, pvalue=0.0319199840249392)

lim2 20
SignificanceResult(statistic=0.2336655592469546, pvalue=0.02723007466276913)

lim2 30
SignificanceResult(statistic=0.22259136212624583, pvalue=0.03541776107269042)

lim2 40
SignificanceResult(statistic=0.22702104097452933, pvalue=0.0319199840249392)

lim2 50
SignificanceResult(statistic=0.22702104097452933, pvalue=0.0319199840249392)

lim2 60
SignificanceResult(statistic=0.20487264673311184, pvalue=0.052855112350349634)

lim2 70
SignificanceResult(statistic=0.21816168327796234, pvalue=0.039237464954529684)

lim2 80
SignificanceResult(statistic=0.21151716500553708, pvalue=0.04561968148630561)

lim2 90
SignificanceResult(statistic=0.21151716500553708, pvalue=0.04561968148630561)

lim1 40
lim2 10
SignificanceResult(statistic=0.22923588039867107, pvalue=0.030285059012930795)

lim2 20
SignificanceResult(statistic=0.23145071982281282, pvalue=0.02872260880772164)

lim2 30
SignificanceResult(statistic=0.22702104097452933, pvalue=0.0319199840249392)

lim2 40
SignificanceResult(statistic=0.2336655592469546, pvalue=0.02723007466276913)

lim2 50
SignificanceResult(statistic=0.20487264673311184, pvalue=0.052855112350349634)

lim2 60
SignificanceResult(statistic=0.19822812846068658, pvalue=0.06102555232367203)

lim2 70
SignificanceResult(statistic=0.19822812846068658, pvalue=0.06102555232367203)

lim2 80
SignificanceResult(statistic=0.20265780730897007, pvalue=0.055470577997765856)

lim2 90
SignificanceResult(statistic=0.20265780730897007, pvalue=0.055470577997765856)

lim1 50
lim2 10
SignificanceResult(statistic=0.21816168327796234, pvalue=0.039237464954529684)

lim2 20
SignificanceResult(statistic=0.2336655592469546, pvalue=0.02723007466276913)

lim2 30
SignificanceResult(statistic=0.21816168327796234, pvalue=0.039237464954529684)

lim2 40
SignificanceResult(statistic=0.2070874861572536, pvalue=0.05034351393549388)

lim2 50
SignificanceResult(statistic=0.20044296788482835, pvalue=0.058193013393482464)

lim2 60
SignificanceResult(statistic=0.16057585825027684, pvalue=0.1291440320672913)

lim2 70
SignificanceResult(statistic=0.18715393133997782, pvalue=0.07695128818374385)

lim2 80
SignificanceResult(statistic=0.18715393133997782, pvalue=0.07695128818374385)

lim2 90
SignificanceResult(statistic=0.19601328903654486, pvalue=0.06397135691589576)

lim1 60
lim2 10
SignificanceResult(statistic=0.20044296788482835, pvalue=0.058193013393482464)

lim2 20
SignificanceResult(statistic=0.22480620155038758, pvalue=0.03362999632002421)

lim2 30
SignificanceResult(statistic=0.19601328903654486, pvalue=0.06397135691589576)

lim2 40
SignificanceResult(statistic=0.19822812846068658, pvalue=0.06102555232367203)

lim2 50
SignificanceResult(statistic=0.16943521594684385, pvalue=0.10933056125598256)

lim2 60
SignificanceResult(statistic=0.1627906976744186, pvalue=0.12394673344282209)

lim2 70
SignificanceResult(statistic=0.17829457364341084, pvalue=0.09200155630483083)

lim2 80
SignificanceResult(statistic=0.17165005537098557, pvalue=0.10477333682093702)

lim2 90
SignificanceResult(statistic=0.17829457364341084, pvalue=0.09200155630483083)

lim1 70
lim2 10
SignificanceResult(statistic=0.22480620155038758, pvalue=0.03362999632002421)

lim2 20
SignificanceResult(statistic=0.24916943521594684, pvalue=0.01853668656977206)

lim2 30
SignificanceResult(statistic=0.21816168327796234, pvalue=0.039237464954529684)

lim2 40
SignificanceResult(statistic=0.20930232558139533, pvalue=0.04793271426816844)

lim2 50
SignificanceResult(statistic=0.18715393133997782, pvalue=0.07695128818374385)

lim2 60
SignificanceResult(statistic=0.1760797342192691, pvalue=0.09611205744619103)

lim2 70
SignificanceResult(statistic=0.19379844961240308, pvalue=0.06703361454097385)

lim2 80
SignificanceResult(statistic=0.16943521594684385, pvalue=0.10933056125598256)

lim2 90
SignificanceResult(statistic=0.19158361018826134, pvalue=0.07021553454588961)

lim1 80
lim2 10
SignificanceResult(statistic=0.22702104097452933, pvalue=0.0319199840249392)

lim2 20
SignificanceResult(statistic=0.24695459579180506, pvalue=0.019606784919992522)

lim2 30
SignificanceResult(statistic=0.2203765227021041, pvalue=0.03728599498237718)

lim2 40
SignificanceResult(statistic=0.21151716500553708, pvalue=0.04561968148630561)

lim2 50
SignificanceResult(statistic=0.20044296788482835, pvalue=0.058193013393482464)

lim2 60
SignificanceResult(statistic=0.17386489479512734, pvalue=0.1003682399820881)

lim2 70
SignificanceResult(statistic=0.17386489479512734, pvalue=0.1003682399820881)

lim2 80
SignificanceResult(statistic=0.17386489479512734, pvalue=0.1003682399820881)

lim2 90
SignificanceResult(statistic=0.19379844961240308, pvalue=0.06703361454097385)

### Top1(monoT5)

In addition to our dense coherence-based predictors, we propose a supervised baseline predictor. Below, we show how we obtain it.

First, install the pyterrier plugin for Mono and Duo T5 from https://github.com/terrierteam/pyterrier_t5.

In [None]:
%pip install --upgrade git+https://github.com/terrierteam/pyterrier_t5.git

In [55]:
from pyterrier_t5 import MonoT5ReRanker, DuoT5ReRanker

In [56]:
monoT5 = MonoT5ReRanker()



Now, we update the pipeline by getting the document embeddings (first line). Then, in the second line, we define the cross-encoder pipeline for the proposed predictor.

In [57]:
new_pipe = retr_pipeline >> pt.apply.doc_embs(get_embs) >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')
cross_encoder_pipe = (new_pipe % 1) >> monoT5 >> pt.apply.qpp_monot5(lambda row: row["score"])

We then transform the queries to get the final results, and we merge with the evaluation metrics to get the final correlation.
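
The idea behind this predictor can be sketched without running monoT5: keep the top-ranked passage per query, re-score it with a cross-encoder, and use that score as the prediction. In the toy example below, every name and value is hypothetical, and `fake_cross_encoder` merely stands in for the re-ranker's output.

```python
import pandas as pd

# Toy dense-retrieval results; all values are made up for illustration.
dense = pd.DataFrame({
    "qid": ["q1", "q1", "q2", "q2"],
    "docno": ["d3", "d8", "d7", "d1"],
    "rank": [0, 1, 0, 1],
    "score": [81.2, 80.9, 79.5, 79.1],
})

# Keep only the top-ranked passage per query (the role of `% 1` in the pipeline).
top1 = dense[dense["rank"] == 0].copy()

# Hypothetical cross-encoder scores standing in for monoT5's output.
fake_cross_encoder = {("q1", "d3"): -0.2, ("q2", "d7"): -4.5}
top1["score"] = [fake_cross_encoder[(q, d)] for q, d in zip(top1["qid"], top1["docno"])]

# The predictor value is the re-ranker's score of that single passage.
top1["qpp_monot5"] = top1["score"]
print(top1[["qid", "docno", "qpp_monot5"]])
```

A higher re-ranker score for the top passage is read as a sign that the dense ranking is likely to be effective for that query.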

In [58]:
allres_t5 = cross_encoder_pipe.transform(test_topics)
allres_t5_2020 = cross_encoder_pipe.transform(test_topics_2020)

monoT5: 100%|██████████| 11/11 [00:08<00:00,  1.33batches/s]
monoT5: 100%|██████████| 14/14 [00:11<00:00,  1.17batches/s]


In [59]:
merged = allres_t5.merge(per_query_results)
merged = merged[merged.measure=='AP(rel=2)@100']
corr_kendall = kendalltau(merged['value'], merged['qpp_monot5'])
print(corr_kendall)

SignificanceResult(statistic=0.058693244739756366, pvalue=0.5791222556173692)


In [60]:
merged = allres_t5_2020.merge(per_query_results_2020)
merged = merged[merged.measure=='AP(rel=2)@100']
corr_kendall = kendalltau(merged['value'], merged['qpp_monot5'])
#corr_pearson = stats.pearsonr(merged['value'], merged['qpp_monot5'])[0]
#corr_spearman = spearmanr(merged['value'], merged['qpp_monot5']).correlation
print(corr_kendall)

SignificanceResult(statistic=0.27962252669275495, pvalue=0.0028428020282334435)


To get the Pearson and Spearman correlations as well, uncomment the corresponding lines and print their values.