# IN4325 Information Retrieval CoreIR Project

### Authors

- Vishruty Mittal (5584825)
- Philip Kuo (5675553)
- Lian Wu (5354285)

### Our Job for the Project

Given a standard test collection (MS-Marco) and task (either document ranking or passage ranking), implement 2 reasonable baselines (required: one probabilistic model and one learning to rank model), index the test collection and evaluate the 200 test queries (the relevance judgments for those 200 queries are available from the same web page) with the test collection's commonly employed evaluation metrics. When you employ a machine learning-based approach, ensure that you do not mix up training/validation and test datasets.

Given these initial results, conduct an error analysis on either a sample of validation or training queries for at least one of your models: sample a few topics that have achieved very low retrieval effectiveness and explore why this has likely happened. This requires looking at the top retrieved documents and using your knowledge of the retrieval model to hypothesize what is the likely cause. A good starting point for this analysis is provided by the topic failure analysis template (Figure 1 in the linked report) of the Reliable Information Access Final Workshop Report.
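As a minimal sketch of the sampling step described above, the snippet below picks the queries with the lowest per-query effectiveness so that their top retrieved documents can be inspected by hand. The function name `worst_queries` and the scores in `example_scores` are hypothetical and only illustrate the idea; real values would come from a per-query evaluation of one of the baselines.

```python
def worst_queries(per_query_scores, k=5):
    """Return the k (qid, score) pairs with the lowest score, worst first.

    Ties are broken by query id so the selection is deterministic.
    """
    ranked = sorted(per_query_scores.items(), key=lambda item: (item[1], item[0]))
    return ranked[:k]


# Hypothetical per-query reciprocal-rank values, not real results.
example_scores = {"q1": 1.0, "q2": 0.0, "q3": 0.25, "q4": 0.5, "q5": 0.05}

for qid, score in worst_queries(example_scores, k=3):
    print(qid, score)
```

The selected query ids can then be used to look up the query text and the top-ranked passages for manual failure analysis.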

Based on the error analysis, propose two improvements to at least one of your baselines (e.g. via neural net approaches, pseudo-relevance feedback, informed document priors, semantic approaches, external data sources such as Wikipedia … take a dive into the literature - many runs submitted to the MS-Marco leaderboard are accompanied by papers; alternatively, look at the TREC 2019 papers that made use of this dataset) and empirically evaluate those new models. Report for how many queries your improved retrieval model outperformed the baseline. Were you able to improve topics that you had identified as being not served well by your baseline? Discuss your findings. Feel free to add additional points of analysis (e.g. the impact of smoothing).
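One way to report for how many queries an improved model outperformed the baseline is to compare per-query scores of a single metric. The sketch below assumes two dictionaries mapping query ids to metric values; the name `compare_per_query`, the tolerance `tol`, and the numbers are illustrative assumptions, not results from this project.

```python
def compare_per_query(baseline, improved, tol=1e-9):
    """Count queries where the improved model wins, ties, or loses.

    Only queries present in both score dictionaries are compared.
    Differences smaller than `tol` in absolute value count as ties.
    """
    wins = ties = losses = 0
    for qid in baseline.keys() & improved.keys():
        diff = improved[qid] - baseline[qid]
        if diff > tol:
            wins += 1
        elif diff < -tol:
            losses += 1
        else:
            ties += 1
    return {"wins": wins, "ties": ties, "losses": losses}


# Hypothetical per-query average-precision values, not real results.
baseline_scores = {"q1": 0.2, "q2": 0.5, "q3": 1.0}
improved_scores = {"q1": 0.5, "q2": 0.5, "q3": 0.5}

print(compare_per_query(baseline_scores, improved_scores))
```

Restricting the same comparison to the topics flagged during error analysis shows whether the improvement helped the queries it was designed for.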

- [Course Project Webpage (BrightSpace)](https://brightspace.tudelft.nl/d2l/le/content/400612/Home)
- [Our older Google Colab doc](https://colab.research.google.com/drive/1EnYPVEXYDGvlzxqPzzIhEpqxhtFmeX2T?usp=sharing)

### Referencing/Learning Resources

- [PyTerrier demonstration for msmarco_passage](http://data.terrier.org/indices/msmarco_passage/retrieval.html)
- [PyTerrier CIKM 2021 Tutorial Notebook](https://github.com/terrier-org/cikm2021tutorial/blob/main/notebooks/notebook3.ipynb)
- [Available Datasets](https://pyterrier.readthedocs.io/en/latest/datasets.html#available-datasets)

In [1]:
!sudo apt update && sudo apt upgrade -y
!sudo apt-get install -y openjdk-11-jre
!sudo apt-get install -y openjdk-11-jdk
!pip install -q python-terrier
import pyterrier as pt
if not pt.started():
    pt.init()

from pyterrier.measures import *
dataset = pt.get_dataset('msmarco_passage')

## Ranking MSMARCO Passages with Different Improvements

- [MSMARCO Passage Ranking Variants](http://data.terrier.org/msmarco_passage.dataset.html#terrier_unstemmed)

In [None]:
# For TF_IDF

tfidf = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_unstemmed', wmodel='TF_IDF')

Downloading msmarco_passage index to /root/.pyterrier/corpora/msmarco_passage/index/terrier_unstemmed
  warn("Downloading index of > 2GB.")

In [None]:
# For BM25

bm25 = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_unstemmed', wmodel='BM25')

In [None]:
# For BM25 with stemming

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='BM25')

In [None]:
# For BM25 with stemming + docT5query
# Repo: Document Expansion by Query Prediction (https://github.com/castorini/docTTTTTquery)

bm25_terrier_stemmed_docT5query = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_docT5query', wmodel='BM25')

13:47:54.016 [main] WARN org.terrier.structures.BaseCompressingMetaIndex - OutOfMemoryError: Structure meta reading data file directly from disk


In [None]:
# For BM25 with stemming + DeepCT

bm25_terrier_stemmed_deepct = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_deepct', wmodel='BM25')

13:47:56.443 [main] WARN org.terrier.structures.BaseCompressingMetaIndex - OutOfMemoryError: Structure meta reading data file directly from disk


## Running PyTerrier Analytics

- [Documentation on doing Analytics](https://pyterrier.readthedocs.io/en/latest/experiments.html#running-experiments)
- [Available Evaluation Measures](https://pyterrier.readthedocs.io/en/latest/experiments.html#available-evaluation-measures)

### Available Evaluation Measures (TODO: We have to choose)

- Mean Average Precision (map).
- Mean Reciprocal Rank (recip_rank).
- Normalized Discounted Cumulative Gain (ndcg), or calculated at a given rank cutoff (e.g. ndcg_cut_5).
- Number of queries (num_q) - not averaged.
- Number of retrieved documents (num_ret) - not averaged.
- Number of relevant documents (num_rel) - not averaged.
- Number of relevant documents retrieved (num_rel_ret) - not averaged.
- Interpolated recall precision curves (iprec_at_recall). This is a family of measures, so requesting iprec_at_recall will output measurements for IPrec@0.00, IPrec@0.10, etc.
- Precision at rank cutoff (e.g. P_5).
- Recall (recall) will generate recall at different cutoffs, such as recall_5, etc.
- Mean response time (mrt) will report the average number of milliseconds to conduct a query (this is calculated by pt.Experiment() directly, not pytrec_eval).
- trec_eval measure families such as official, set and all_trec will be expanded. These result in many measures being returned. For instance, asking for official produces a very wide output reporting the usual default metrics of trec_eval.
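
To make two of the listed measures concrete, here is a small sketch that computes reciprocal rank and average precision for a single query; averaging them over all queries gives recip_rank and map. It assumes binary relevance and a ranking given as a list of document ids. The helper names and the toy ranking are hypothetical, and this is not how pytrec_eval is implemented internally.

```python
def reciprocal_rank(ranked_ids, relevant):
    """Return 1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_ids, relevant):
    """Average the precision at each relevant hit, divided by the number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Toy example: relevant documents appear at ranks 2 and 4.
ranking = ["d3", "d1", "d7", "d2"]
relevant_docs = {"d1", "d2"}

print(reciprocal_rank(ranking, relevant_docs))
print(average_precision(ranking, relevant_docs))
```

For graded judgments, a binarisation threshold has to be chosen before applying these definitions; that choice is left open here.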

In [None]:
pt.Experiment(
    [tfidf, bm25, bm25_terrier_stemmed, bm25_terrier_stemmed_docT5query, bm25_terrier_stemmed_deepct],
    pt.get_dataset('msmarco_passage').get_topics('test-2019'),
    pt.get_dataset('msmarco_passage').get_qrels('test-2019'),
    batch_size=200,
    filter_by_qrels=True,
    eval_metrics=["map", "recip_rank"],
    names=['TF_IDF', 'BM25', 'BM25 Stemmed', 'BM25 Stemmed + docT5query', 'BM25 Stemmed + DeepCT'])

KernelInterrupted: Execution interrupted by the Jupyter kernel.
