# Introduction to PyTerrier

This notebook introduces the Python package `pyterrier`, which builds on the Java-based retrieval toolkit Terrier. All of the code can be run on Google Colab without uploading any additional data. It should also be possible to run this notebook on your local machine without much technical preparation.

PyTerrier is mature software with an API that is easy to learn, and it provides easy access to IR evaluations, even for those who are new to IR research. In this notebook, we use PyTerrier to build an experimental environment that handles data indexing, retrieval, and evaluation for you.

Even though PyTerrier provides bindings for modern neural retrieval methods and a sophisticated way to declare experimental pipelines, we focus on more traditional lexical retrieval methods. They provide a solid basis for your ideas on how to improve a baseline ranking with network analysis and bibliometric data.

**Scope of this notebook**:

- PyTerrier installation and configuration,
- Downloading and indexing the TREC Covid / CORD19 collection,
- Using the BatchRetrieve transformer for searching an index,
- Conducting an Experiment,
- Interactive search examples,
- Writing custom reranking pipelines based on different criteria,
- Short introduction to `ir_datasets`,
- Helpful resources at the end of this notebook.

**HINT**: Some of the examples are taken from the ECIR21 tutorial on PyTerrier, which is a good resource for further material. The link is provided in the resources.

### Install PyTerrier and initialize the package

After installation, PyTerrier must be initialized with the `init()` method, which downloads Terrier's jar file and starts the Java virtual machine.

In [None]:
!pip install python-terrier

import pyterrier as pt
if not pt.started():
  pt.init()

### Download and index the dataset

Our document collection CORD19 is publicly available and can be downloaded from Semantic Scholar. In the code cell below, PyTerrier handles the download, extraction, and indexing. All of these are standard operations, but they would still cause a lot of hassle if you had to implement them from scratch.

In [None]:
import os

dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
pt_index_path = './indices/cord19'

if not os.path.exists(pt_index_path + "/data.properties"):
  indexer = pt.index.IterDictIndexer(pt_index_path, blocks=True)
  index_ref = indexer.index(dataset.get_corpus_iter(), 
                            fields=['title', 'doi', 'abstract'], 
                            meta=('docno',))
  
else:
  index_ref = pt.IndexRef.of(pt_index_path + "/data.properties")
  
index = pt.IndexFactory.of(index_ref)

## Run a simple batch retriever

Let's run a retrieval method (BM25) for a batch of queries. In a typical IR experiment, systems are evaluated over 50 (or more) queries, which is also the case for experiments based on TREC Covid.

After initialization, simply hand over all the `title` queries to the BM25 transformer. It outputs a DataFrame with rankings up to a fixed rank (`num_results`) for each of the 50 queries.

An explanation of the column names is given below.

In [None]:
bm25 = pt.BatchRetrieve(index_ref , wmodel='BM25', num_results=10)
res = bm25.transform(dataset.get_topics('title'))
res

- `qid`: id of the query or topic
- `docid`: Terrier's internal integer id for each document
- `docno`: the external (string) unique identifier for each document
- `score`: score of the retrieval method (match between query and document)
- `rank`: the position of the document in the ranking, ordered by descending score
- `query`: the input query
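
To make the relation between `score` and `rank` concrete, the following sketch rebuilds the `rank` column from toy scores in plain Python. The rows, the helper `add_ranks`, and the starting rank of 0 are assumptions made for illustration; compare them with the DataFrame returned above rather than taking them as PyTerrier's exact behavior.

```python
# Toy result rows mimicking the columns described above (values are made up).
results = [
    {"qid": "1", "docno": "doc_a", "score": 7.2},
    {"qid": "1", "docno": "doc_b", "score": 9.1},
    {"qid": "2", "docno": "doc_c", "score": 3.3},
    {"qid": "1", "docno": "doc_c", "score": 8.0},
    {"qid": "2", "docno": "doc_a", "score": 4.5},
]


def add_ranks(rows, first_rank=0):
    """Return new rows with a 'rank' field, best score first within each qid."""
    by_query = {}
    for row in rows:
        by_query.setdefault(row["qid"], []).append(row)

    ranked = []
    for group in by_query.values():
        ordered = sorted(group, key=lambda r: r["score"], reverse=True)
        for offset, row in enumerate(ordered):
            ranked.append({**row, "rank": first_rank + offset})
    return ranked


ranked_results = add_ranks(results)
for row in ranked_results:
    print(row["qid"], row["docno"], row["score"], row["rank"])
```

Ranks are assigned per query, so the same `docno` can appear at different ranks for different `qid` values.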

## Run an experiment

Often it is more interesting to compare several retrieval systems to determine which one performs best. For this purpose, PyTerrier can evaluate multiple systems with standard evaluation measures. As PyTerrier also provides the ground-truth relevance labels (qrels), an entire retrieval benchmark can be run with the cell below.

The results are returned in a DataFrame. Depending on the evaluation use case, you can select different evaluation measures, conduct significance tests, and apply the corresponding correction methods. In the cell below, `baseline=0` selects the first system (TF_IDF) as the reference for the significance tests, and `correction='bonferroni'` adjusts the p-values for multiple testing.

In [None]:
TF_IDF = pt.BatchRetrieve(index_ref, wmodel="TF_IDF") 
BM25 = pt.BatchRetrieve(index_ref, wmodel="BM25") 
DFRee = pt.BatchRetrieve(index_ref, wmodel="DFRee") 

systems = [
    TF_IDF,
    BM25,
    DFRee
]

topics = dataset.get_topics('title')

qrels = dataset.get_qrels()

# eval_metrics=['P_20', 'ndcg_cut_20', 'map']
eval_metrics=['map']

exp_res = pt.Experiment(
    systems,
    topics,
    qrels,
    eval_metrics=eval_metrics,
    baseline=0,
    correction='bonferroni'
)

exp_res
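
To give an intuition for what measures such as `P_20` and `map` report, here is a minimal plain-Python sketch of precision at k and (mean) average precision on toy data. The function names, rankings, and relevance sets are made up for illustration; this is not the implementation that PyTerrier or `trec_eval` uses, and details such as graded relevance are ignored.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(1 for docno in ranking[:k] if docno in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / position
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, qrels):
    """Average the per-query average precision over all queries."""
    per_query = [average_precision(rankings[qid], qrels.get(qid, set()))
                 for qid in rankings]
    return sum(per_query) / len(per_query)


# Hypothetical rankings and relevance judgments for two queries.
toy_rankings = {"1": ["d1", "d2", "d3", "d4"], "2": ["d5", "d6"]}
toy_qrels = {"1": {"d1", "d3", "d9"}, "2": {"d6"}}

print(precision_at_k(toy_rankings["1"], toy_qrels["1"], k=2))
print(average_precision(toy_rankings["1"], toy_qrels["1"]))
print(mean_average_precision(toy_rankings, toy_qrels))
```

Note that `d9` is relevant but never retrieved, which lowers the average precision of query 1 even though it does not affect precision at 2.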

## Interactive search and calibration

Retrieval methods often need to be calibrated for the document collection to be more effective. In the example below, you can try different parameterizations of BM25 and see how the ranking for a single query changes.

BM25 according to Robertson et al.

 $\sum_{t \in q} \log \left(1+\frac{N-d f_{t}+0.5}{d f_{t}+0.5}\right) \cdot \frac{t f_{t d}}{k_{1} \cdot\left(1-b+b \cdot\left(\frac{L_{\text {d}}}{L_{a v g}}\right)\right)+t f_{t d}}$

$k_1$ calibrates the term frequency scaling. It is a positive tuning parameter.
$k_1=0$ means the term frequency is not considered, whereas large values make the score grow almost linearly with the raw term frequency.

$b$ ranges between 0 and 1 and scales the term weight by the document length $L_d$ relative to the average document length $L_{avg}$. With $b=0$, there is no length normalization.
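
The formula above can be transcribed term by term into a few lines of plain Python, which makes the effect of $k_1$ and $b$ easy to check. This is only a sketch of the formula as written in this notebook, not Terrier's implementation; the function name, argument names, and the default values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Score contribution of one query term for one document."""
    if tf == 0:
        return 0.0  # the term does not occur in the document
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * tf / (k1 * length_norm + tf)


# Made-up collection statistics for a single term.
stats = dict(df=100, avg_doc_len=200, num_docs=10_000)

# k1 = 0: the term frequency no longer matters.
print(bm25_term_score(tf=1, doc_len=200, k1=0, **stats),
      bm25_term_score(tf=10, doc_len=200, k1=0, **stats))

# b = 0: the document length no longer matters.
print(bm25_term_score(tf=3, doc_len=50, b=0, **stats),
      bm25_term_score(tf=3, doc_len=500, b=0, **stats))

# b = 1: full length normalization, longer documents score lower.
print(bm25_term_score(tf=3, doc_len=50, b=1, **stats),
      bm25_term_score(tf=3, doc_len=500, b=1, **stats))
```

The document score is the sum of these contributions over all query terms.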

In [None]:
_b = 0.5
_k1 = 0.95

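# "c" is the Terrier control that sets BM25's b parameter
bm25 = pt.BatchRetrieve(index_ref, wmodel='BM25', controls={"c": _b, "bm25.k_1": _k1, "bm25.k_3": 0.75})

query = 'vaccine'
res = bm25.search(query)  # use the calibrated retriever defined above, not the default BM25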
res 

## Boost by publication date

In this section, we want to rerank a first-stage ranking by a custom criterion. More specifically, we boost the ranking score by a recency criterion (the publication date).

Even though the index contains some document fields besides the full text, it does not contain all of the metadata available for the CORD19 dataset.

As the dataset was downloaded earlier, we can use the metadata CSV file in the `~/.ir_datasets/` directory.

In [None]:
import pandas as pd
metadata = pd.read_csv('~/.ir_datasets/cord19/2020-07-16/metadata.csv')
metadata

In [None]:
metadata.columns

In [None]:
metadata[metadata['cord_uid'] == res.iloc[0]['docno']]

Let us define the custom score boost based on the publication year and integrate it into the ranking.

In [None]:
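from datetime import datetime

def date_boost(docno):
  publish_time = metadata[metadata['cord_uid'] == docno]['publish_time'].iloc[0]

  # publish_time may be missing (NaN), a bare year, or a full date
  if isinstance(publish_time, str) and publish_time:
    date_format = '%Y-%m-%d' if len(publish_time) > 4 else '%Y'
    date_object = datetime.strptime(publish_time, date_format).date()

    # boost recent publications
    if date_object.year > 2015:
      return 1.5

  # penalize older publications and those without a date
  return 0.75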

bm25 = pt.BatchRetrieve(index_ref , wmodel='BM25', controls={"c" : _b, "bm25.k_1": _k1, "bm25.k_3": 0.75}, num_results=10)

boost = lambda doc: doc['score'] * date_boost(doc['docno'])

reranker = bm25 >> pt.apply.doc_score(boost)

reranker.search('vaccine')
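
What the reranking step above does can be imitated on toy data without an index: every score is multiplied by its boost factor, and the ranking is then re-sorted. The sketch below is a plain-Python illustration under that assumption. The helper `rerank`, the document ids, scores, and publication years are all hypothetical, and it does not reproduce the internals of `pt.apply.doc_score`.

```python
# Hypothetical publication years for three documents.
publish_years = {"doc_old": 2009, "doc_new": 2020, "doc_mid": 2016}


def toy_date_boost(docno):
    """Same thresholds as date_boost above, but on the toy lookup table."""
    return 1.5 if publish_years[docno] > 2015 else 0.75


def rerank(results, boost_fn):
    """Multiply each score by its boost, then sort and renumber the ranks."""
    boosted = [{**doc, "score": doc["score"] * boost_fn(doc["docno"])}
               for doc in results]
    boosted.sort(key=lambda doc: doc["score"], reverse=True)
    for rank, doc in enumerate(boosted):
        doc["rank"] = rank
    return boosted


# Made-up first-stage ranking, best score first.
first_stage = [
    {"docno": "doc_old", "score": 10.0},
    {"docno": "doc_new", "score": 8.0},
    {"docno": "doc_mid", "score": 4.0},
]

reranked = rerank(first_stage, toy_date_boost)
for doc in reranked:
    print(doc["rank"], doc["docno"], doc["score"])
```

In this toy case the recent `doc_new` overtakes `doc_old`, while `doc_mid` stays last despite its boost because its first-stage score is too low.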

## Boost by number of authors

In this section, we want to rerank a first-stage ranking by a custom criterion. More specifically, we boost the ranking score by the number of authors a publication has.

In [None]:
def author_boost(docno):
    raw_authors = metadata[metadata['cord_uid'] == docno]['authors'].iloc[0]

    # authors are stored as one semicolon-separated string; missing values are NaN
    if isinstance(raw_authors, str):
        authors = raw_authors.split(';')
        num_authors = len(authors)
        return num_authors

    return 1

bm25 = pt.BatchRetrieve(index_ref , wmodel='BM25', controls={"c" : _b, "bm25.k_1": _k1, "bm25.k_3": 0.75}, num_results=10)

boost = lambda doc: doc['score'] * author_boost(doc['docno'])

reranker = bm25 >> pt.apply.doc_score(boost)

reranker.search('vaccine')
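
Because the raw author count multiplies the score directly, a paper with twenty authors receives a twenty-fold boost, which can easily outweigh the BM25 score. One possible variation, which is not part of the original example, is to dampen the count with a logarithm. The sketch below works on plain strings; the function names and the example author strings are hypothetical.

```python
import math


def count_authors(raw_authors):
    """Count authors in a semicolon-separated string; 1 if the value is missing."""
    if not isinstance(raw_authors, str) or not raw_authors.strip():
        return 1
    return len([name for name in raw_authors.split(";") if name.strip()])


def damped_author_boost(raw_authors):
    """Boost of 1.0 for a single author, growing logarithmically with the count."""
    return 1.0 + math.log(count_authors(raw_authors))


# Made-up author strings, including a missing value as pandas would report it.
examples = ["Smith, J.", "Smith, J.; Lee, K.; Garcia, M.", float("nan")]
for raw in examples:
    print(repr(raw), count_authors(raw), round(damped_author_boost(raw), 3))
```

Whether a damped or a raw boost works better is an empirical question that the experiment template further below can answer.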

## Boost by citation counts

In this section, we want to rerank a first-stage ranking by a custom criterion. More specifically, we boost the ranking score by the citation count of a publication. This section should give you an idea of what other kinds of reranking criteria are possible.

To get the citation count, we use `scholarly`, an unofficial Python API for Google Scholar. Please do not hammer the server; otherwise your IP may get blocked. Note the random sleep interval between requests.

**Highly experimental: sometimes it works, sometimes not.** Later, it may be more practical to scrape bibliometric data in advance and ingest it into a database that can be queried on demand.

In [None]:
# https://github.com/scholarly-python-package/scholarly
# https://scholarly.readthedocs.io/en/latest/index.html
!pip3 install scholarly

In [None]:
from scholarly import scholarly 
import time
from random import randint

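def cite_boost(docno):
    _title = metadata[metadata['cord_uid'] == docno]['title'].iloc[0]
    pubs = scholarly.search_pubs(_title)
    time.sleep(randint(2,4))  # be polite: pause between requests
    first_hit = next(pubs, None)  # None if Google Scholar returns no match

    # fall back to a neutral boost if no citation count is available
    if first_hit is None or first_hit.get('num_citations') is None:
        return 1

    return first_hit['num_citations']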

bm25 = pt.BatchRetrieve(index_ref , wmodel='BM25', controls={"c" : _b, "bm25.k_1": _k1, "bm25.k_3": 0.75}, num_results=10)

boost = lambda doc: doc['score'] * cite_boost(doc['docno'])

reranker = bm25 >> pt.apply.doc_score(boost)

reranker.search('vaccine')
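
The suggestion above to collect bibliometric data in advance and query it later can be sketched with Python's built-in `sqlite3` module. The table layout, the function names, and the citation counts below are assumptions made for illustration; the counts are made up and would have to be replaced by data you collected yourself.

```python
import sqlite3


def open_citation_db(path=":memory:"):
    """Open (or create) a small database that maps cord_uid to a citation count."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations ("
        "cord_uid TEXT PRIMARY KEY, num_citations INTEGER NOT NULL)"
    )
    return conn


def store_citations(conn, counts):
    """Insert or update the citation counts given as {cord_uid: count}."""
    conn.executemany("INSERT OR REPLACE INTO citations VALUES (?, ?)",
                     list(counts.items()))
    conn.commit()


def cached_cite_boost(conn, docno, default=1):
    """Look up the stored citation count; fall back to a neutral boost."""
    row = conn.execute(
        "SELECT num_citations FROM citations WHERE cord_uid = ?", (docno,)
    ).fetchone()
    return row[0] if row is not None else default


conn = open_citation_db()  # in-memory database; pass a file path to persist it
store_citations(conn, {"uid_a": 12, "uid_b": 0})  # hypothetical ids and counts

print(cached_cite_boost(conn, "uid_a"))
print(cached_cite_boost(conn, "uid_missing"))
```

A lookup like `cached_cite_boost` avoids one web request per document at query time, which also makes an evaluation over all topics feasible.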

## TL;DR: Experimentation with a custom reranking criterion

Use this template as a quickstart for your experiments.

In [None]:
!pip install python-terrier

import pyterrier as pt
if not pt.started():
  pt.init()

import os

dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
pt_index_path = './indices/cord19'

if not os.path.exists(pt_index_path + "/data.properties"):
  indexer = pt.index.IterDictIndexer(pt_index_path, blocks=True)
  index_ref = indexer.index(dataset.get_corpus_iter(), 
                            fields=['title', 'doi', 'date', 'abstract'], 
                            meta=('docno',))
  
else:
  index_ref = pt.IndexRef.of(pt_index_path + "/data.properties")
  
index = pt.IndexFactory.of(index_ref)

In [None]:
def custom_boost(docno):
    # INSERT YOUR IDEAS HERE

    return 1


baseline = pt.BatchRetrieve(index_ref , wmodel='BM25', num_results=10)

boost = lambda doc: doc['score'] * custom_boost(doc['docno'])

reranker = baseline >> pt.apply.doc_score(boost)

systems = [
    baseline,
    reranker
]

names = ['Baseline', 'Reranker']

topics = dataset.get_topics('title')

qrels = dataset.get_qrels()

eval_metrics=['P_20', 'ndcg_cut_20', 'map']

exp_res = pt.Experiment(
    systems,
    topics,
    qrels,
    eval_metrics,
    names=names,
    baseline=0
)

exp_res

## ir_datasets

`ir_datasets` is integrated into PyTerrier and is the main component that facilitates data handling. As PyTerrier's API is comprehensive, you do not necessarily need to use `ir_datasets` explicitly, but depending on your implementation ideas, it may be helpful.

The example below shows how to get a DocStore from a defined dataset in order to look up single documents by their `docno`.

In [None]:
import ir_datasets

_dataset = ir_datasets.load("cord19")
docstore = _dataset.docs_store()

res = reranker.search('vaccine')
docno = res.iloc[0]['docno']

docstore.get(docno)

## Writing run files to disk

In this section, we conduct the IR evaluation the old-fashioned way, with files stored on disk and the software `trec_eval`. We include this section for illustrative purposes, so that you get an idea of the hassle that PyTerrier avoids. However, depending on your approach, it may be helpful to know the more traditional way.

First, let's make rankings for a batch of queries and write them into a text file, a so-called run file.

In [None]:
bm25 = pt.BatchRetrieve(index_ref , wmodel='BM25', num_results=1000)
res = bm25.transform(dataset.get_topics('title'))
res = res.drop_duplicates(subset=['qid', 'docno']) # the index still contains duplicates; keep each document once per query
run_name = 'bm25'
file_name = run_name
pt.io.write_results(res, file_name, format='trec',run_name=run_name)
!head -n 10 bm25
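
A run file in TREC format has six whitespace-separated columns per line: query id, the constant `Q0`, document id, rank, score, and run name. The sketch below formats and parses such lines in plain Python to show the layout. The helper names and the toy rows are hypothetical, and the exact number formatting written by `pt.io.write_results` may differ.

```python
def format_trec_run(rows, run_name):
    """Turn result rows into lines of the form 'qid Q0 docno rank score run_name'."""
    return [
        f'{row["qid"]} Q0 {row["docno"]} {row["rank"]} {row["score"]} {run_name}'
        for row in rows
    ]


def parse_trec_run_line(line):
    """Split one run-file line back into its six fields."""
    qid, _q0, docno, rank, score, run_name = line.split()
    return {"qid": qid, "docno": docno, "rank": int(rank),
            "score": float(score), "run_name": run_name}


# Made-up ranking for a single query.
toy_rows = [
    {"qid": "1", "docno": "doc_b", "rank": 0, "score": 9.1},
    {"qid": "1", "docno": "doc_c", "rank": 1, "score": 8.0},
]

run_lines = format_trec_run(toy_rows, run_name="bm25")
for line in run_lines:
    print(line)

print(parse_trec_run_line(run_lines[0]))
```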

Before evaluation, you need to obtain the ground-truth relevance labels (qrels) from NIST servers.

In [None]:
!wget https://ir.nist.gov/trec-covid/data/qrels-covid_d5_j0.5-5.txt && head -n 10 qrels-covid_d5_j0.5-5.txt

While not necessary at this point, it is sometimes helpful to have a look at the queries and corresponding topic texts.

In [None]:
!wget https://ir.nist.gov/trec-covid/data/topics-rnd5.xml && head -n 6 topics-rnd5.xml

Finally, we need to download and compile the evaluation software `trec_eval`.

In [None]:
!git clone https://github.com/usnistgov/trec_eval.git && cd trec_eval && make

Once the compilation has finished, we can evaluate the run file. 

In [None]:
!./trec_eval/trec_eval qrels-covid_d5_j0.5-5.txt bm25

# T5 Transformer-based Reranking

PyTerrier's extensions offer an out-of-the-box experience for rerankings based on modern neural language models. The example below illustrates how default (not fine-tuned) T5-based rerankers can be integrated into the evaluations.

In [None]:
# Reranking based on a Mono- or Duo-T5 reranker
# https://github.com/terrierteam/pyterrier_t5
# https://colab.research.google.com/github/terrierteam/pyterrier_t5/blob/master/pyterrier_t5_trec-covid.ipynb#scrollTo=5fMRKXSjjd1w
# https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf
# see also the documentation: https://pyterrier.readthedocs.io/en/latest/neural.html

!pip install --upgrade git+https://github.com/terrierteam/pyterrier_t5.git

from pyterrier_t5 import MonoT5ReRanker, DuoT5ReRanker
monoT5 = MonoT5ReRanker(text_field='abstract')
duoT5 = DuoT5ReRanker(text_field='abstract')

bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25") % 100
mono_pipeline = bm25 >> pt.text.get_text(dataset, "abstract") >> monoT5
duo_pipeline = mono_pipeline % 10 >> duoT5

# Resources




## TREC Covid and CORD19
- **TREC Covid website**: https://ir.nist.gov/trec-covid/
- **Data resources**: https://ir.nist.gov/trec-covid/data.html
- **CORD19 dataset**: https://github.com/allenai/cord19
- **CORD19 paper**: https://api.semanticscholar.org/CorpusID:216056360

## Pyterrier
- **Pyterrier documentation**: https://pyterrier.readthedocs.io/en/latest/
- **Transformer documentation**: https://pyterrier.readthedocs.io/en/latest/transformer.html (**not** the neural language models)
- **GitHub repository**: https://github.com/terrier-org/pyterrier
- **arXiv publication**: https://arxiv.org/abs/2007.14271
- **Javadocs of matching models**: http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html 
- **ECIR21 tutorial:** https://github.com/terrier-org/ecir2021tutorial

## ir_datasets
- **Website / catalog**: https://ir-datasets.com/
- **CORD19 subpage**: https://ir-datasets.com/cord19.html
- **Python API**: https://ir-datasets.com/python.html

## Terrier's matching models

cf. http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html

In [None]:
# TF_IDF = pt.BatchRetrieve(index, wmodel="TF_IDF") 
# BM25 = pt.BatchRetrieve(index, wmodel="BM25") 
# Tf = pt.BatchRetrieve(index, wmodel="Tf") 
# BM25F = pt.BatchRetrieve(index, wmodel="BM25F") 
# XSqrA_M = pt.BatchRetrieve(index, wmodel="XSqrA_M") 
# DirichletLM = pt.BatchRetrieve(index, wmodel="DirichletLM") 
# LemurTF_IDF = pt.BatchRetrieve(index, wmodel="LemurTF_IDF") 
# PL2 = pt.BatchRetrieve(index, wmodel="PL2") 
# PL2F = pt.BatchRetrieve(index, wmodel="PL2F") 
# BB2 = pt.BatchRetrieve(index, wmodel="BB2") 
# Dl = pt.BatchRetrieve(index, wmodel="Dl") 
# DLH = pt.BatchRetrieve(index, wmodel="DLH") 
# DLH13 = pt.BatchRetrieve(index, wmodel="DLH13") 
# DPH = pt.BatchRetrieve(index, wmodel="DPH") 
# CoordinateMatch = pt.BatchRetrieve(index, wmodel="CoordinateMatch") 
# DFIC = pt.BatchRetrieve(index, wmodel="DFIC") 
# DFIZ = pt.BatchRetrieve(index, wmodel="DFIZ") 
# DFR_BM25 = pt.BatchRetrieve(index, wmodel="DFR_BM25") 
# DFRee = pt.BatchRetrieve(index, wmodel="DFRee") 
# DFReeKLIM = pt.BatchRetrieve(index, wmodel="DFReeKLIM") 
# DFRWeightingModel = pt.BatchRetrieve(index, wmodel="DFRWeightingModel") 
# In_expB2 = pt.BatchRetrieve(index, wmodel="In_expB2") 
# In_expC2 = pt.BatchRetrieve(index, wmodel="In_expC2") 
# InB2 = pt.BatchRetrieve(index, wmodel="InB2") 
# InL2 = pt.BatchRetrieve(index, wmodel="InL2") 
# LGD = pt.BatchRetrieve(index, wmodel="LGD") 
# MDL2 = pt.BatchRetrieve(index, wmodel="MDL2") 
# ML2 = pt.BatchRetrieve(index, wmodel="ML2") 
# Hiemstra_LM = pt.BatchRetrieve(index, wmodel="Hiemstra_LM") 
# IFB2 = pt.BatchRetrieve(index, wmodel="IFB2") 
# null = pt.BatchRetrieve(index, wmodel="Null") 