# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can use to construct your information needs (topics).
We will use a corpus of scientific papers (titles and abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). Throughout the course, you will try to improve upon this baseline retrieval system.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to access the corpus. On top of both, we build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), a Python framework built on the open-source Terrier search engine.

In [None]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier

In [None]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client
import pyterrier as pt

# do not truncate text in the dataframe
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [None]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

### Step 2: Load the Dataset and the Index

The type of the index object that we load is `<class 'jnius.reflect.org.terrier.structures.Index'>`, in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped into Python. However, you do not need to worry about this: at this point, we will simply use the provided Index object to run procedures defined in Python.

In [None]:
# The dataset: the union of the IR Anthology and the ACL Anthology
# This line creates an IRDSDataset object and registers it under the name provided as an argument.
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240411-training')

# A (pre-built) PyTerrier index loaded from TIRA
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)

### Step 3: Define the Retrieval Pipeline

We will define a BM25 retrieval system that first retrieves the top-10 results and then adds the text of the documents to the retrieved document IDs.
The PyTerrier framework uses the `%` and `>>` operators to indicate rank cutoff and sequential execution of steps, respectively.
For details, see: [https://pyterrier.readthedocs.io/en/latest/operators.html](https://pyterrier.readthedocs.io/en/latest/operators.html)
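
To build intuition for how such operators can be layered onto ordinary Python objects, here is a minimal, self-contained sketch. The `Step` class and the two toy steps are hypothetical and are not PyTerrier's actual implementation; they only mimic the idea that `%` applies a rank cutoff and `>>` chains steps.

```python
# Toy illustration of operator overloading; the Step class is hypothetical
# and is NOT how PyTerrier implements its transformers.

class Step:
    """A pipeline step that maps a list of results to a new list of results."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, results):
        return self.fn(results)

    def __mod__(self, k):
        # `step % k`: run the step, then keep only the first k results.
        return Step(lambda results: self(results)[:k])

    def __rshift__(self, other):
        # `a >> b`: run a, then feed its output into b.
        return Step(lambda results: other(self(results)))


# A fake "retriever" that sorts (doc_id, score) pairs by descending score.
rank = Step(lambda results: sorted(results, key=lambda r: r[1], reverse=True))

# A fake "text loader" that attaches a placeholder text to every result.
add_text = Step(lambda results: [(doc, score, f"text of {doc}") for doc, score in results])

# In Python, `%` binds more tightly than `>>`, so this reads as (rank % 2) >> add_text.
pipeline = rank % 2 >> add_text
output = pipeline([("d1", 0.3), ("d2", 0.9), ("d3", 0.5)])
print(output)
```

Because `%` has higher precedence than `>>` in Python, the cutoff in this sketch is applied before the results are passed on to the next step, which matches the order described above.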

In [None]:
# Declarative pipeline:
# Step 1: Create a search engine based on the BM25 scoring function. Retrieve the top 10 results.
# Step 2: Add the document text for each retrieved result (from the dataset).
bm25 = pt.BatchRetrieve(index, wmodel="BM25") % 10 >> pt.text.get_text(pt_dataset, "text")
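
For readers who want a feel for what a BM25 scorer computes, the following is a rough, self-contained sketch of a common textbook variant of BM25 over a toy corpus. It is an assumption-laden illustration: the function `bm25_scores`, the toy documents, the whitespace tokenizer, and the parameter values `k1=1.2` and `b=0.75` are all chosen here for demonstration, and the exact formula used inside Terrier may differ in its details.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` (dict: doc_id -> text) against `query`.

    Textbook BM25 variant with whitespace tokenization (an assumption for
    this sketch, not necessarily what Terrier does).
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Hypothetical toy corpus, unrelated to the real index loaded above.
toy_docs = {
    "d1": "bm25 ranking for multiple fields",
    "d2": "pagerank for web graphs",
    "d3": "neural ranking models",
}
scores = bm25_scores("bm25 fields", toy_docs)

# Rank cutoff: keep at most the 10 highest-scoring document IDs.
top = sorted(scores, key=scores.get, reverse=True)[:10]
print(top[0], scores[top[0]])
```

Documents that share no term with the query receive a score of zero in this sketch, so only `d1` gets a positive score for the query `"bm25 fields"`.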

### Step 4: Do Some Searches to Refine Your Topic

You can search via `bm25.search("your query")`.
The following cells show some example queries:

In [None]:
bm25.search('how to combine bm25 for multiple fields')

In [None]:
bm25.search('how to estimate the size of a proprietary search index')

In [None]:
bm25.search('pagerank')

In [None]:
bm25.search('misinformation')