# PyTerrier ECIR Tutorial Notebook - Part 1 (Adapted for IR LAB in Leipzig SoSe 2023)

This is one of a series of Colab notebooks created for the [ECIR 2021](https://www.ecir2021.eu) Tutorial entitled '**IR From Bag-of-words to BERT and Beyond through Practical Experiment**'. It demonstrates the use of [PyTerrier](https://github.com/terrier-org/pyterrier). We adapted it so that it runs directly on the corpus and setup of the IR Lab.

This notebook has the following learning outcomes:
  - indexing a collection that was imported to TIRA via ir_datasets
  - accessing an index
  - using the `BatchRetrieve` transformer for searching an index
  - conducting an `Experiment` 


Related Reading:
 - [Pandas documentation](https://pandas.pydata.org/docs/)
 - [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/)


PyTerrier is a Python framework that uses the underlying [Terrier information retrieval toolkit](http://terrier.org) for many indexing and retrieval operations. While PyTerrier was new in 2020, Terrier is written in Java and has a long history dating back to 2001. PyTerrier makes it easy to perform IR experiments in Python while relying on the mature Terrier platform for the expensive indexing and retrieval operations.

In the following, we introduce everything you need to know about PyTerrier, and also provide appropriate links to relevant parts of the [PyTerrier documentation](https://pyterrier.readthedocs.io/en/latest/).


### Installation & Configuration

Please start the Jupyter notebook on your system via:

```
docker run -p 8888:8888 --rm -ti -v ${PWD}:/workspace --entrypoint jupyter webis/tira-ir-starter-pyterrier:0.0.2-base  notebook --allow-root --ip 0.0.0.0
```

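The next step is to initialise PyTerrier. This is normally performed using PyTerrier's `init()` method, which is needed because PyTerrier must download Terrier's jar file and start the Java virtual machine; checking `started()` prevents `init()` from being called more than once. In this notebook, the TIRA helper `ensure_pyterrier_is_loaded()` performs the initialisation instead, as explained in the comments of the following cell.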

In [1]:
import pyterrier as pt

# We use PyTerrier inside TIRA.
# To avoid some common pitfalls, we use two methods from the tira third_party_integrations:
# - ensure_pyterrier_is_loaded:
#    loads PyTerrier without internet connection
#    (in TIRA, retrieval approaches have no access to the internet to improve reproducibility)
#
# - get_input_directory_and_output_directory:
#   A software in TIRA is expected to read the data from an input directory and write the results (i.e., the run file) to an output directory.
#   Both input and output directories are passed as arguments when the software is executed within TIRA,
#   so this method ensures that you can run the same notebook locally for development as in TIRA by
#   returning the passed input directory (that might be mounted) if the software is not executed in TIRA.
#   Here, we will not use the output directory.
#
# You do not have to use any of those methods; in the end, a software only has to "generate an output from an input".
# We are of course also happy for pull requests that help to improve the handling of frequently used patterns.
# Please find the documentation here: https://github.com/tira-io/tira/blob/main/python-client/tira/third_party_integrations.py
#
from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory

ensure_pyterrier_is_loaded()
input_directory, output_directory = get_input_directory_and_output_directory('./iranthology-dataset-tira')


Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


I will use a small hardcoded example located in ./iranthology-dataset-tira.
The output directory is /tmp/


### Documents, Indexing and Indexes

First, we extract the corpus (we added the zipped corpus to the repository as `dataset.zip`).

In [3]:
!apt-get install -y zip
!rm -Rf {input_directory}
!unzip dataset.zip

Reading package lists... Done
Building dependency tree       
Reading state information... Done
zip is already the newest version (3.0-11build1).
0 upgraded, 0 newly installed, 0 to remove and 19 not upgraded.
Archive:  dataset.zip
   creating: iranthology-dataset-tira/
  inflating: iranthology-dataset-tira/documents.jsonl  
  inflating: iranthology-dataset-tira/queries.jsonl  
  inflating: iranthology-dataset-tira/metadata.json  
  inflating: iranthology-dataset-tira/queries.xml  
  inflating: iranthology-dataset-tira/qrels.txt  


Unzipping the archive has extracted the dataset to the location that the `input_directory` variable points to:

In [4]:
!ls -lha {input_directory}

total 77M
drwxr-xr-x 2 root root 4.0K May  3 04:46 .
drwxr-xr-x 6 1000 1000 4.0K May  3 04:47 ..
-rw-r--r-- 1 root root  77M May  2 15:33 documents.jsonl
-rw-r--r-- 1 root root   41 May  2 15:33 metadata.json
-rw-r--r-- 1 root root  433 May  3 04:46 qrels.txt
-rw-r--r-- 1 root root 1.6K May  2 15:33 queries.jsonl
-rw-r--r-- 1 root root 2.1K May  2 15:33 queries.xml


Much of PyTerrier's view of the world is wrapped up in Pandas dataframes. Let's consider some textual documents in a dataframe.


In [5]:
# we need to import pandas. We commonly rename it to pd, to make commands shorter
import pandas as pd

# let's not truncate output too much
pd.set_option('display.max_colwidth', 150)

docs_df = pd.read_json(f'{input_directory}/documents.jsonl', lines=True)

docs_df.head(5)

Unnamed: 0,docno,text,original_document
0,2019.sigirconf_workshop-2019birndl.0,Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL...,"{'doc_id': '2019.sigirconf_workshop-2019birndl.0', 'abstract': '', 'title': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Inform..."
1,2019.sigirconf_workshop-2019birndl.1,Preface: 4th Joint Workshop on BIRNDL at SIGIR 2019,"{'doc_id': '2019.sigirconf_workshop-2019birndl.1', 'abstract': '', 'title': 'Preface: 4th Joint Workshop on BIRNDL at SIGIR 2019', 'authors': [], ..."
2,2019.sigirconf_workshop-2019birndl.2,"Personalized Feed/Query-formulation, Predictive Impact, and Ranking The Meta discovery system is designed to aid biomedical researchers in keeping...","{'doc_id': '2019.sigirconf_workshop-2019birndl.2', 'abstract': 'The Meta discovery system is designed to aid biomedical researchers in keeping up ..."
3,2019.sigirconf_workshop-2019birndl.3,"Discourse Processing for Text Analysis: Recent Successes, Current Challenges Computational discourse processing has come a long way in the 10 year...","{'doc_id': '2019.sigirconf_workshop-2019birndl.3', 'abstract': 'Computational discourse processing has come a long way in the 10 years since I spo..."
4,2019.sigirconf_workshop-2019birndl.4,Distant Supervision for Silver Label Generation of Software Mentions in Social Scientific Publications Many scientific investigations rely on soft...,"{'doc_id': '2019.sigirconf_workshop-2019birndl.4', 'abstract': 'Many scientific investigations rely on software for a range of different tasks inc..."
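For readers unfamiliar with the JSON Lines format, the following is a minimal sketch of what reading with `lines=True` implies: every line of the file is parsed as one JSON object. The two records below are made up for illustration (the `docno` values are hypothetical), and only the `docno` and `text` fields seen above are assumed.

```python
import io
import json

# Two made-up records in JSON Lines format (one JSON object per line).
sample_jsonl = io.StringIO(
    '{"docno": "d1", "text": "first toy document"}\n'
    '{"docno": "d2", "text": "second toy document"}\n'
)

# Parse every non-empty line into a dict, as a line-oriented reader would.
records = [json.loads(line) for line in sample_jsonl if line.strip()]

for record in records:
    print(record["docno"], "->", record["text"])
```

Each parsed dict corresponds to one row of the dataframe, with the keys becoming column names.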


Before any search engine can estimate which documents are most likely to be relevant for a given query, it must index the documents. 

In the following cell, we index the dataframe's documents. The index, with all its data structures, is written into a directory called `index_ir_docs`. 

In [6]:
indexer = pt.DFIndexer("./index_ir_docs", overwrite=True, meta={'docno' : 100}, verbose=True)
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index_ref.toString()

 31%|████████████████████████████▉                                                                 | 16543/53673 [00:18<00:34, 1085.32documents/s]



100%|█████████████████████████████████████████████████████████████████████████████████████████████▊| 53555/53673 [00:46<00:00, 1291.56documents/s]

04:49:00.154 [main] WARN org.terrier.structures.indexing.Indexer - Indexed 3 empty documents


100%|██████████████████████████████████████████████████████████████████████████████████████████████| 53673/53673 [00:47<00:00, 1120.95documents/s]


'./index_ir_docs/data.properties'

An `IndexRef` is essentially a string saying where an index is stored. Indeed, we can look in the `index_ir_docs` directory and see that it has created various files:

In [7]:
!ls -lh index_ir_docs/

total 21M
-rw-r--r-- 1 root root 2.5M May  3 04:48 data.direct.bf
-rw-r--r-- 1 root root 892K May  3 04:48 data.document.fsarrayfile
-rw-r--r-- 1 root root 2.2M May  3 04:49 data.inverted.bf
-rw-r--r-- 1 root root 3.4M May  3 04:49 data.lexicon.fsomapfile
-rw-r--r-- 1 root root 1017 May  3 04:49 data.lexicon.fsomaphash
-rw-r--r-- 1 root root 158K May  3 04:49 data.lexicon.fsomapid
-rw-r--r-- 1 root root 8.8M May  3 04:48 data.meta-0.fsomapfile
-rw-r--r-- 1 root root 420K May  3 04:48 data.meta.idx
-rw-r--r-- 1 root root 2.6M May  3 04:48 data.meta.zdata
-rw-r--r-- 1 root root 4.1K May  3 04:49 data.properties


Given an `IndexRef`, we can load the actual index. The method `pt.IndexFactory.of()` is the relevant factory.

In [8]:
index = pt.IndexFactory.of(index_ref)

# let's see what type index is.
type(index)

jnius.reflect.org.terrier.structures.Index

Ok, so this object refers to Terrier's [`Index`](http://terrier.org/docs/current/javadoc/org/terrier/structures/Index.html) type. Check the linked Javadoc – you will see that this Java object has methods such as:
 - `getCollectionStatistics()`
 - `getInvertedIndex()`
 - `getLexicon()`

Let's see what is returned by the `getCollectionStatistics()` method:

In [9]:
print(index.getCollectionStatistics().toString())

Number of documents: 53673
Number of terms: 40253
Number of postings: 1788410
Number of fields: 0
Number of tokens: 2703938
Field names: []
Positions:   false



We have 53673 documents and 40253 unique terms, which form our vocabulary.

Let's now think about the inverted index. Remember that the inverted index tells us in which *documents* each term occurs. The `LexiconEntry` is the pointer that tells us where to find the postings for that term in the inverted index.
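To make the idea concrete, here is a minimal pure-Python sketch of an inverted index over a made-up toy collection. It is not how Terrier stores its data structures: tokenisation is plain whitespace splitting, and the `lexicon` dictionary is only a loose analogue of Terrier's lexicon entries (real indexers typically also apply steps such as stopword removal and stemming).

```python
from collections import Counter, defaultdict

# Toy collection; the docids and texts are made up for illustration.
toy_docs = {
    0: "the first document",
    1: "the second document about a document",
    2: "something else entirely",
}

# term -> list of (docid, term frequency) postings
inverted_index = defaultdict(list)
for docid, text in toy_docs.items():
    for term, tf in Counter(text.split()).items():
        inverted_index[term].append((docid, tf))

# A lexicon-like summary per term: document frequency (df)
# and total number of occurrences in the collection (cf).
lexicon = {
    term: {"df": len(postings), "cf": sum(tf for _, tf in postings)}
    for term, postings in inverted_index.items()
}

print(inverted_index["document"])  # [(0, 1), (1, 2)]
print(lexicon["document"])         # {'df': 2, 'cf': 3}
```

Looking up a term first goes through the lexicon-like summary and then to its postings, which mirrors the role that the text above assigns to the `LexiconEntry`.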

### Searching an Index

Our way into search in PyTerrier is called `BatchRetrieve`. BatchRetrieve is configured by specifying an index and a weighting model (`Tf` in our example). We then search for a single-word query, `"document"`.

In [10]:
br = pt.BatchRetrieve(index, wmodel="Tf")
br.search("document")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,403,2001.sigirconf_workshop-2001w1.0,0,59.0,document
1,1,10167,1997.sigirconf_conference-97.33,1,33.0,document
2,1,53013,2019.tois_journal-ir0anthology0volumeA37A1.8,2,31.0,document
3,1,10107,2016.sigirconf_conference-2016.208,3,21.0,document
4,1,7485,2008.sigirconf_conference-2008.35,4,19.0,document
...,...,...,...,...,...,...
995,1,1369,2010.clef_workshop-2010w.58,995,4.0,document
996,1,1380,2010.clef_workshop-2010w.69,996,4.0,document
997,1,1725,2008.clef_workshop-2008w.158,997,4.0,document
998,1,1746,2005.clef_workshop-2005w.16,998,4.0,document


So the `search()` method returns a dataframe with columns:
 - `qid`: this is by default "1", since it's our first and only query
 - `docid`: Terrier's internal integer identifier for each document
 - `docno`: the external (string) unique identifier for each document
 - `score`: since we use the `Tf` weighting model, this score corresponds to the total frequency of the query terms in each document
 - `rank`: a handy attribute giving the position of each document in the ranking, starting at 0 and ordered by descending score
 - `query`: the input query

As expected, the `Tf` weighting model used here only counts the frequencies of the query terms in each document, i.e.:
$$
score(d,q) = \sum_{t \in q} tf_{t,d}
$$
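The formula above can be sketched in a few lines of plain Python. This is an illustration on made-up documents with hypothetical docnos, not Terrier's implementation; dropping documents that contain none of the query terms is an assumption of this sketch.

```python
from collections import Counter

# Made-up toy documents; the docnos are hypothetical.
toy_docs = {
    "d1": "the first document",
    "d2": "a document about a document in a document collection",
    "d3": "something else entirely",
}

def tf_score(query, text):
    """Sum the frequencies of all query terms in one document."""
    counts = Counter(text.split())
    return sum(counts[term] for term in query.split())

def toy_search(query, docs):
    """Return (rank, docno, score) tuples, best score first, rank starting at 0."""
    scored = [(docno, tf_score(query, text)) for docno, text in docs.items()]
    # Assumption of this sketch: documents without any query term are dropped.
    matching = [(docno, score) for docno, score in scored if score > 0]
    ranked = sorted(matching, key=lambda pair: pair[1], reverse=True)
    return [(rank, docno, score) for rank, (docno, score) in enumerate(ranked)]

results = toy_search("first document", toy_docs)
print(results)  # [(0, 'd2', 3), (1, 'd1', 2)]
```

Note that `d2` outranks `d1` although it does not contain "first": raw term frequency rewards repetition of a single query term.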


We can also pass a dataframe of one or more queries to the `transform()` method (rather than the `search()` method) of a transformer, with queries numbered "q1", "q2", etc.

In [11]:
import pandas as pd
queries = pd.DataFrame([["q1", "document"], ["q2", "first document"]], columns=["qid", "query"])
br.transform(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,q1,403,2001.sigirconf_workshop-2001w1.0,0,59.0,document
1,q1,10167,1997.sigirconf_conference-97.33,1,33.0,document
2,q1,53013,2019.tois_journal-ir0anthology0volumeA37A1.8,2,31.0,document
3,q1,10107,2016.sigirconf_conference-2016.208,3,21.0,document
4,q1,7485,2008.sigirconf_conference-2008.35,4,19.0,document
...,...,...,...,...,...,...
1995,q2,28429,2013.ictir_conference-2013.7,995,5.0,first document
1996,q2,28444,2013.ictir_conference-2013.22,996,5.0,first document
1997,q2,28497,2016.ictir_conference-2016.43,997,5.0,first document
1998,q2,28628,2007.wwwconf_conference-2007.26,998,5.0,first document


In fact, we are usually calling `transform()`, so it's the default method – i.e. 
`br.transform(queries)` can be more succinctly written as `br(queries)`.

In [12]:
br(queries)

Unnamed: 0,qid,docid,docno,rank,score,query
0,q1,403,2001.sigirconf_workshop-2001w1.0,0,59.0,document
1,q1,10167,1997.sigirconf_conference-97.33,1,33.0,document
2,q1,53013,2019.tois_journal-ir0anthology0volumeA37A1.8,2,31.0,document
3,q1,10107,2016.sigirconf_conference-2016.208,3,21.0,document
4,q1,7485,2008.sigirconf_conference-2008.35,4,19.0,document
...,...,...,...,...,...,...
1995,q2,28429,2013.ictir_conference-2013.7,995,5.0,first document
1996,q2,28444,2013.ictir_conference-2013.22,996,5.0,first document
1997,q2,28497,2016.ictir_conference-2016.43,997,5.0,first document
1998,q2,28628,2007.wwwconf_conference-2007.26,998,5.0,first document


To continue this tutorial, we now use the topics/queries from the sample solution.
We do not have reliable relevance judgments for those topics/queries yet. For each query, we annotated one relevant document by looking at a BM25 ranking and picking the first relevant document. These judgments are very unreliable for later experiments, so we need to make more judgments as part of milestone 2.



In [13]:
queries = pt.io.read_topics(f'{input_directory}/queries.xml', format='trecxml')
qrels = pt.io.read_qrels(f'{input_directory}/qrels.txt')

In [15]:
queries

Unnamed: 0,qid,query
0,1,detect health related queries
1,2,large language models for query expansion
2,3,datasets for web search
3,4,known item search for movies


In [16]:
qrels

Unnamed: 0,qid,docno,label
0,1,2016.fire_conference-2016w.51,0
1,1,2021.ipm_journal-ir0anthology0volumeA58A1.6,0
2,1,2011.spire_conference-2011.10,1
3,1,2019.cikm_conference-2019.346,0
4,1,2021.tist_journal-ir0anthology0volumeA12A2.4,0
5,1,2013.wwwconf_conference-2013c.302,0
6,2,2008.cikm_conference-2008.157,0
7,2,2018.ictir_conference-2018.30,1
8,2,2007.sigirconf_conference-2007.110,0
9,3,2013.wsdm_conference-2013.91,1


### Weighting Models

So far, we have been using the simple "`Tf`" as our ranking function for document retrieval in BatchRetrieve. However, we can use other models such as `"TF_IDF"` by simply changing the `wmodel="Tf"` keyword argument in the constructor of `BatchRetrieve`.


In [17]:
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
tfidf.search("large language models for query expansion")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,21104,2008.cikm_conference-2008.157,0,12.986092,large language models for query expansion
1,1,28411,2018.ictir_conference-2018.30,1,12.499025,large language models for query expansion
2,1,8715,2007.sigirconf_conference-2007.110,2,12.410436,large language models for query expansion
3,1,25015,2011.irfc_conference-2011.6,3,12.342333,large language models for query expansion
4,1,1973,2009.clef_workshop-2009.5,4,12.040447,large language models for query expansion
...,...,...,...,...,...,...
995,1,23638,2013.cikm_conference-2013.216,995,5.981613,large language models for query expansion
996,1,28403,2018.ictir_conference-2018.22,996,5.979334,large language models for query expansion
997,1,36720,2015.trec_conference-2015.44,997,5.979330,large language models for query expansion
998,1,32841,2013.wwwconf_conference-2013c.134,998,5.979132,large language models for query expansion


You will note that, as expected, the scores of documents ranked by `TF_IDF` are no longer integers. You can see the exact formula used by Terrier in [the GitHub repo](https://github.com/terrier-org/terrier-core/blob/5.x/modules/core/src/main/java/org/terrier/matching/models/TF_IDF.java#L79).
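To see why the scores stop being integers, consider a textbook TF-IDF variant, `tf * log(N / df)`. This is deliberately not Terrier's formula (see the link above for that); it is only a sketch on made-up documents showing how the logarithmic IDF factor produces real-valued scores.

```python
import math
from collections import Counter

# Made-up toy documents; the docnos are hypothetical.
toy_docs = {
    "d1": "query expansion with language models",
    "d2": "language models for retrieval",
    "d3": "web search datasets",
}

def textbook_tf_idf(query, docno, docs):
    """Textbook tf * log(N / df) score; NOT the variant implemented in Terrier."""
    n_docs = len(docs)
    counts = Counter(docs[docno].split())
    score = 0.0
    for term in query.split():
        # Document frequency: number of documents containing the term.
        df = sum(1 for text in docs.values() if term in text.split())
        if df == 0:
            continue  # the term does not occur in the collection
        score += counts[term] * math.log(n_docs / df)
    return score

score_d1 = textbook_tf_idf("query expansion", "d1", toy_docs)
score_d2 = textbook_tf_idf("query expansion", "d2", toy_docs)
print(score_d1, score_d2)
```

Here both query terms occur once in `d1` and in no other document, so its score is `2 * log(3)`, while `d2` scores zero.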

Terrier supports many weighting models; the documentation contains [a list of supported models](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html), some of which we will discover later in the tutorial.


### What is Success?

So far, we have been creating search engine models, but we haven't decided if any of them is actually any good. Let's investigate if we are getting a correct ("relevant") document at the first rank.

In [None]:
pt.Experiment(
    [tfidf],
    queries,
    qrels,
    eval_metrics=["map", "ndcg"])
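For intuition about the `map` measure requested above, here is a minimal sketch of average precision for a single query; MAP is the mean of this value over all queries. The ranking and the relevant set are made up, and documents without a judgment are treated as non-relevant, which is the usual convention in TREC-style evaluation.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list given the set of relevant docnos."""
    hits = 0
    precision_sum = 0.0
    for position, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            # Precision at the position of each relevant document.
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0

# Hypothetical ranking and judgments for one query.
ranking = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4"}

ap = average_precision(ranking, relevant)
print(ap)  # (1/2 + 2/4) / 2 = 0.5
```

With only one relevant document per query, as in the preliminary judgments used here, average precision reduces to the reciprocal of the rank at which that document is retrieved.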

Now, repeat the experiment with some more retrieval models :)


**Attention: The effectiveness scores that we see are primarily influenced by unjudged documents. Removing this bias and conducting more robust evaluations will be the main objective of milestone 3.**

In [22]:
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
dirichlet = pt.BatchRetrieve(index, wmodel="DirichletLM")
dph = pt.BatchRetrieve(index, wmodel="DPH")

pt.Experiment(
    [tfidf, bm25, dirichlet, dph],
    queries,
    qrels,
    names=['TF-IDF', 'BM25', 'Dirichlet', 'DPH'],
    eval_metrics=["map", "ndcg"]
)

Unnamed: 0,name,map,ndcg
0,TF-IDF,0.375,0.473197
1,BM25,0.375,0.473197
2,Dirichlet,0.007681,0.103093
3,DPH,0.256521,0.329604
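One simple way to gauge how much unjudged documents may affect such scores is to check which fraction of the top-ranked documents has a judgment at all. The sketch below uses a hypothetical helper `judged_at_k` and made-up data; it is not part of the `pt.Experiment` call above.

```python
def judged_at_k(ranking, judged, k):
    """Fraction of the top-k ranked docnos that have any relevance judgment."""
    top_k = ranking[:k]
    if not top_k:
        return 0.0
    return sum(1 for docno in top_k if docno in judged) / len(top_k)

# Hypothetical ranking and set of judged docnos for one query.
ranking = ["d7", "d2", "d9", "d4", "d1"]
judged = {"d2", "d4"}

fraction = judged_at_k(ranking, judged, k=5)
print(fraction)  # 2 of the top 5 documents are judged -> 0.4
```

A low value means that most retrieved documents are counted as non-relevant only because nobody has judged them yet.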


## That's all folks

The following parts of the PyTerrier documentation may be useful references for this notebook:
 * [PyTerrier datasets](https://pyterrier.readthedocs.io/en/latest/datasets.html)
 * [Using Terrier for indexing](https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html)
 * [Transformers in PyTerrier](https://pyterrier.readthedocs.io/en/latest/transformer.html)
 * [Transformer Operators](https://pyterrier.readthedocs.io/en/latest/operators.html)