# IR Lab Tutorial: Statistical Analysis

This tutorial shows how to conduct a hypothesis test to compare two retrieval approaches.
The two runs compared in this example are loaded from the TIRA cache.

## Step 1: Ensure that libraries are imported

In [8]:
! echo $JAVA_HOME
! pip install  python-terrier

/usr/local/sdkman/candidates/java/current


In [7]:
import pyterrier as pt
pt.init()

Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.10 (build: craigm 2024-08-22 17:33), helper_version=0.0.8]
java is now started automatically with default settings. To force initialisation early, run:
pt.java.init() # optional, forces java initialisation
  pt.init()


In [8]:
# Start PyTerrier only if it has not been started yet.

from pyterrier import started, init

if not started():
    init()
    
! pip install ir_datasets





## Step 2: Load the dataset

In [9]:
! pip install ir-datasets



In [23]:
! pip install --upgrade ir-datasets
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
tira = Client()

from pyterrier import get_dataset

dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')
dataset




IRDSDataset('ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')

## Step 3: Create the retrieval pipeline with TIRA

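In this example, we compare two lexical rankers that also exist as retrieval components in TIREx: BM25 and DirichletLM.
Below, we build a local Terrier index of the corpus and retrieve from it; the commented-out cell at the end of this step sketches how an approach could instead be loaded via the TIRA API.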

In [28]:
from pyterrier import IterDictIndexer

indexer = IterDictIndexer(
    # Store the index in the `index` directory.
    "../data/index",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)
index = indexer.index(dataset.get_corpus_iter())






Download from Zenodo: https://zenodo.org/records/14254044/files/subsampled-ms-marco-deep-learning-20241201-training-inputs.zip


Download: 100%|██████████| 9.51M/9.51M [00:00<00:00, 44.3MiB/s]


Download finished. Extract...
Extraction finished:  /home/codespace/.tira/extracted_datasets/ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training/


ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents:  38%|███▊      | 25807/68261 [00:07<00:08, 4827.00it/s]



ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [00:13<00:00, 5168.99it/s] 


21:30:04.293 [ForkJoinPool-1-worker-1] WARN org.terrier.structures.indexing.Indexer -- Indexed 1 empty documents


In [29]:
from pyterrier import BatchRetrieve

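bm25 = BatchRetrieve(index, wmodel="BM25")
# DirichletLM ranker on the same index; it is compared against BM25 in Step 4.
dlm = BatchRetrieve(index, wmodel="DirichletLM")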




In [None]:
# bm25 = tira.pt.from_retriever_submission(
#     approach='ir-benchmarks/tira-ir-starter/BM25 (tira-ir-starter-pyterrier)',
#     dataset='argsme-touche-2020-task-1-20230209-training',
# )
# bm25

<tira.pyterrier_util.TiraSourceTransformer at 0x75bfad047dd0>

## Step 4: Measure effectiveness

Now let us measure the nDCG@10 effectiveness of both systems on the Touché 2020 task 1 dataset.

In [None]:
from pyterrier.pipelines import Experiment

experiment = Experiment(
    retr_systems=[
        dlm,
        bm25,
    ],
    topics=dataset.get_topics("query"),
    qrels=dataset.get_qrels(),
    eval_metrics=["ndcg_cut_10"],
    names=[
        "DirichletLM",
        "BM25",
    ],
    perquery=True,
)
experiment.sample(n=10)

ValueError: For topics in dataset trec-robust-2004, there are no variants, but you specified query

This data frame shows the nDCG@10 values measured for each query and both systems (DirichletLM and BM25). \
So we have pairs of measurements where the same metric (i.e., nDCG@10) is measured on the same input (e.g., query #1) but for two different systems.
Let's rearrange the data frame so that the BM25 and DirichletLM values are in separate columns instead of separate rows.

In [28]:
experiment_bm25 = experiment[experiment["name"] == "BM25"]\
    .drop(columns=["name"])
experiment_dlm = experiment[experiment["name"] == "DirichletLM"]\
    .drop(columns=["name"])

experiment_paired = experiment_bm25.merge(
    experiment_dlm,
    on=["qid", "measure"],
    suffixes=("_bm25", "_dlm"),
)
experiment_paired.head(n=10)

| | qid | measure | value_bm25 | value_dlm |
|---|---|---|---|---|
| 0 | 1 | ndcg_cut_10 | 0.661871 | 0.8805 |
| 1 | 10 | ndcg_cut_10 | 0.158507 | 0.63322 |
| 2 | 11 | ndcg_cut_10 | 0.309352 | 0.752969 |
| 3 | 12 | ndcg_cut_10 | 0.061113 | 0.19279 |
| 4 | 13 | ndcg_cut_10 | 0.31488 | 0.434739 |
| 5 | 14 | ndcg_cut_10 | 0.355866 | 0.408224 |
| 6 | 15 | ndcg_cut_10 | 0.094788 | 0.542364 |
| 7 | 16 | ndcg_cut_10 | 0.208744 | 0.443535 |
| 8 | 17 | ndcg_cut_10 | 0.0 | 0.686715 |
| 9 | 18 | ndcg_cut_10 | 0.540948 | 0.699474 |
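
The same long-to-wide reshaping can also be written with pandas' `pivot`. The following is a minimal sketch on a hypothetical toy data frame (the values and the names `toy` and `toy_paired` are made up for illustration); it assumes the column layout `name`, `qid`, `measure`, `value` used by the per-query results above.

```python
import pandas as pd

# Hypothetical per-query results in long format: one row per (system, query).
toy = pd.DataFrame({
    "name": ["BM25", "BM25", "DirichletLM", "DirichletLM"],
    "qid": ["1", "2", "1", "2"],
    "measure": ["ndcg_cut_10"] * 4,
    "value": [0.5, 0.2, 0.75, 0.4],
})

# Wide format: one row per (qid, measure), one column per system.
toy_paired = toy.pivot(
    index=["qid", "measure"],
    columns="name",
    values="value",
).reset_index()
toy_paired.columns.name = None  # drop the leftover "name" label of the column index

print(toy_paired)
```

Unlike the merge above, `pivot` names the value columns after the systems, so the approach extends to more than two systems without extra suffixes.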


## Step 5: Conduct hypothesis tests

On this _paired_ measurement data, we can now conduct _paired_ t-tests to test for statistical significance of given hypotheses.
Remember that the choice of your test depends (amongst other factors) on how the hypothesis is formulated.
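
To see why the pairing matters, it can help to note that a paired t-test is equivalent to a one-sample t-test on the per-query differences. The sketch below illustrates this on synthetic scores (the arrays `scores_a` and `scores_b` are randomly generated and are not results of the systems in this tutorial).

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Synthetic per-query scores for two hypothetical systems (not real results).
rng = np.random.default_rng(seed=0)
scores_a = rng.uniform(0.0, 1.0, size=50)
scores_b = np.clip(scores_a + rng.normal(0.05, 0.1, size=50), 0.0, 1.0)

# Paired t-test on the two score lists ...
paired = ttest_rel(scores_a, scores_b)
# ... and a one-sample t-test that checks whether the mean difference is zero.
on_differences = ttest_1samp(scores_a - scores_b, popmean=0.0)

print("paired:        ", paired.statistic, paired.pvalue)
print("on differences:", on_differences.statistic, on_differences.pvalue)
```

Both calls report the same test statistic and p-value, because the paired test only looks at how each query's score changes between the two systems.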

Let us test some hypotheses to get a feeling of what this means:

#### Hypothesis 1: BM25 has a significantly different nDCG@10 on Touché 2020 task 1 than DirichletLM.

Significance test: two-sided paired t-test \
Significance level: $\alpha = 0.05$ (i.e., the effect is only considered significant if $p < 0.05$)

In [29]:
from scipy.stats import ttest_rel

# Hypothesis 1: Two-sided paired t-test to check if BM25 and DirichletLM are significantly different
p_value_1 = ttest_rel(
    experiment_paired["value_bm25"],  # BM25 values
    experiment_paired["value_dlm"],   # DirichletLM values
    alternative='two-sided'           # Two-sided test
).pvalue

# Significance level
alpha = 0.05

print("Hypothesis 1: Two-sided t-test p-value:", p_value_1)
if p_value_1 < alpha:
    print("Reject the null hypothesis - BM25 and DirichletLM have significantly different nDCG@10.")
else:
    print("Fail to reject the null hypothesis - BM25 and DirichletLM do not have significantly different nDCG@10.")


1.0865032406710116e-08

The above value is the $p$-value: the probability of observing a difference at least as extreme as the measured one, assuming that the null hypothesis (no difference between the two systems) is true. \
Because $p$ is lower than our significance level $\alpha$, we reject the null hypothesis, which supports hypothesis 1. \
BM25 and DirichletLM lead to significantly different nDCG@10 scores.
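
For reference, the test statistic behind the paired t-test can be written out as follows. This is the standard textbook definition; the symbols $x_i$ (nDCG@10 of BM25 on query $i$), $y_i$ (nDCG@10 of DirichletLM on query $i$), and $n$ (number of queries) are notation introduced here and do not appear elsewhere in this tutorial.

$$
d_i = x_i - y_i, \qquad
\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2}, \qquad
t = \frac{\bar{d}}{s_d / \sqrt{n}}
$$

Under the null hypothesis that the mean difference is zero, $t$ follows a Student's t-distribution with $n - 1$ degrees of freedom, and the two-sided $p$-value is $P(|T| \geq |t|)$.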

Now it would be great to find out which is better. \
One way could be to formulate a hypothesis with a predefined "direction". In this example we assume BM25 to be better.

#### Hypothesis 2: BM25 has a significantly higher nDCG@10 on Touché 2020 task 1 than DirichletLM.

Significance test: one-sided paired t-test \
Significance level: $\alpha = 0.05$ (i.e., the effect is only considered significant if $p < 0.05$)

In [30]:
from scipy.stats import ttest_rel

ttest_rel(
    experiment_paired["value_bm25"],
    experiment_paired["value_dlm"],
    alternative='greater',
).pvalue

0.9999999945674838

This time, $p$ is much higher than our significance level $\alpha$. \
So we cannot reject the null hypothesis, and hypothesis 2 is not supported.

Last, we test the opposite direction: BM25 could be worse w.r.t. nDCG@10 than DirichletLM.

#### Hypothesis 3: BM25 has a significantly lower nDCG@10 on Touché 2020 task 1 than DirichletLM.

Significance test: one-sided paired t-test \
Significance level: $\alpha = 0.05$ (i.e., the effect is only considered significant if $p < 0.05$)

In [31]:
from scipy.stats import ttest_rel

ttest_rel(
    experiment_paired["value_bm25"],
    experiment_paired["value_dlm"],
    alternative='less',
).pvalue

5.432516203355058e-09

Here, $p$ is lower than our significance level $\alpha$. We reject the null hypothesis, which supports hypothesis 3: BM25 has a significantly lower nDCG@10 than DirichletLM.
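
The three p-values reported above are not independent of each other: the two one-sided p-values add up to 1 (0.9999999945674838 and 5.432516203355058e-09), and the two-sided p-value is twice the smaller one-sided p-value (1.0865032406710116e-08). The following sketch checks this relation on synthetic scores (randomly generated, not the results of this tutorial).

```python
import numpy as np
from scipy.stats import ttest_rel

# Synthetic per-query scores for two hypothetical systems (not real results).
rng = np.random.default_rng(seed=1)
scores_a = rng.uniform(0.0, 1.0, size=40)
scores_b = np.clip(scores_a + rng.normal(0.1, 0.2, size=40), 0.0, 1.0)

p_two_sided = ttest_rel(scores_a, scores_b, alternative="two-sided").pvalue
p_greater = ttest_rel(scores_a, scores_b, alternative="greater").pvalue
p_less = ttest_rel(scores_a, scores_b, alternative="less").pvalue

# The two one-sided p-values are complementary.
print("p_greater + p_less:   ", p_greater + p_less)
# The two-sided p-value is twice the smaller one-sided p-value.
print("2 * min(one-sided):   ", 2 * min(p_greater, p_less))
print("two-sided:            ", p_two_sided)
```

In practice, the direction of a one-sided test should be fixed before looking at the data; running both directions and reporting the smaller p-value would effectively double the chance of a false positive.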