# PyTerrier Notebook for Full-Rank Submissions

This notebook serves as a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex) that builds a PyTerrier index and subsequently creates a run with BM25.

### Step 1: Ensure Libraries are Imported

In [1]:
import os

# Detect if we are in the TIRA sandbox
# Install the required dependencies if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install python-terrier tira==0.0.88 ir_datasets
else:
    print('We are in the TIRA sandbox.')

Collecting python-terrier
  Downloading python-terrier-0.10.0.tar.gz (107 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tira==0.0.88
  Downloading tira-0.0.88-py3-none-any.whl.metadata (4.4 kB)
Collecting ir_datasets
  Downloading ir_datasets-0.5.5-py3-none-any.whl.metadata (12 kB)
Collecting docker==6.*,>=6.0.0 (from tira==0.0.88)
  Downloading docker-6.1.3-py3-none-any.whl.metadata (3.5 kB)
Collecting wget (from python-terrier)
  Downloading wget-3.2.zip (10 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tqdm (from python-terrier)
  Downloading tqdm-4.66.1-py3-none-an

In [2]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# Load and start PyTerrier so that it also works inside the TIRA sandbox.
ensure_pyterrier_is_loaded()

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt




Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True
terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /home/codespace/.pyterrier...


### Step 2: Load the data

In [None]:
data = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/training-20231104-training')

In [None]:
print('See the first two queries:')
topics = data.get_topics('title')
print(topics.head(2))

See the first two queries:
         qid        query
0  q06223196  car shelter
1    q062228      airport


### Step 3: Build the Index

In [None]:
print('Build index:')
"""
Function: pt.IterDictIndexer
Purpose: Creates an indexer that builds an index from an iterable of document dicts.

Parameters:
    1. index_path (str): Directory in which the index is stored.
       - "/tmp/index": Here, the index is stored in the '/tmp/index' directory.

    2. meta (dict): Metadata fields stored alongside the index, with their maximum lengths.
       - {'docno': 100}: Stores the 'docno' field as metadata with a maximum length of 100 characters.
       - (test with text length differences)
       - Note: 'docno' typically represents a unique document identifier.

    3. verbose (bool): Controls the verbosity of the output during indexing.
       - True: Shows progress information during the indexing process.

    4. overwrite (bool): Determines whether an existing index at the specified path is overwritten.
       - True: An index that already exists at the specified path is replaced.

    5. stemmer (str): Specifies the stemmer used for term normalization.
       - 'EnglishSnowballStemmer': The English Snowball stemmer, a rule-based suffix stripper. Unlike a lemmatizer, it does not consider the context of a word.

Usage:
    - The line below initializes an indexer that creates an index at '/tmp/index'.
    - The indexer stores the document number as metadata, prints progress output, overwrites any existing index at that location, and stems terms with the English Snowball stemmer.
"""
iter_indexer = pt.IterDictIndexer("/tmp/index", meta={'docno': 100}, verbose=True, overwrite=True, stemmer='EnglishSnowballStemmer')
# Clear any leftover index directory before indexing (redundant with overwrite=True, kept as a safeguard).
!rm -Rf /tmp/index
indexref = iter_indexer.index(data.get_corpus_iter())

print('Done. Index is created')

Build index:


ir-lab-jena-leipzig-wise-2023/training-20231104-training documents: 100%|██████████| 47064/47064 [00:44<00:00, 1055.90it/s]


Done. Index is created
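
The `meta={'docno': 100}` setting reserves at most 100 characters per document identifier, so identifiers longer than that would not be stored intact. A minimal sketch of how one could check the limit before indexing is shown below. The helper `max_docno_length` and the sample documents are hypothetical and only assume that each document is a dict with a `docno` key, as the indexer expects.

```python
def max_docno_length(documents):
    """Return the length of the longest 'docno' in an iterable of document dicts."""
    return max((len(str(doc["docno"])) for doc in documents), default=0)


# Hypothetical documents, shaped like the dicts an IterDictIndexer consumes.
sample_docs = [
    {"docno": "doc-1", "text": "car shelter for winter"},
    {"docno": "doc-0000042", "text": "airport parking"},
]

META_DOCNO_LENGTH = 100  # the limit used in meta={'docno': 100}

longest = max_docno_length(sample_docs)
fits = longest <= META_DOCNO_LENGTH
print(f"longest docno: {longest} characters, fits into meta limit: {fits}")
```

On the real corpus, the same helper could be applied to `data.get_corpus_iter()`, at the cost of one additional pass over the documents.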


### Step 4: Create the Retrieval Pipeline

In [None]:
"""
Following tutorial 6, we take the BM25 parameters of the best run on the *validation* dataset: b = 0.8 and k_1 = 1.2.
"""
b = 0.8
k_1 = 1.2

configuration = {"bm25.b" : b, "bm25.k_1": k_1}
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25", verbose=True, controls=configuration)
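
To make the roles of `b` and `k_1` more tangible, the sketch below scores a single query term with a textbook BM25 formula in plain Python. It is an illustration only and not necessarily identical to Terrier's BM25 implementation. The collection size is taken from the indexing output above, while the document frequency and document lengths are made-up values.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, b=0.8, k_1=1.2):
    """Textbook BM25 score of one query term in one document (illustrative variant)."""
    if tf == 0:
        return 0.0
    # Inverse document frequency; the 1.0 inside the log keeps it non-negative.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k_1 controls how quickly repeated occurrences of the term saturate.
    return idf * (tf * (k_1 + 1.0)) / (tf + k_1 * norm)


# Same term frequency, but one document is much longer than the average.
short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, n_docs=47064, doc_freq=500)
long_doc = bm25_term_score(tf=2, doc_len=400, avg_doc_len=100, n_docs=47064, doc_freq=500)
print(f"short document: {short_doc:.3f}, long document: {long_doc:.3f}")
```

With `b = 0.8`, the longer document receives the lower score for the same term frequency. Setting `b = 0` would remove this length penalty entirely.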

### Step 5: Create the Run and Persist the Run

In [None]:
print('Create run')
run = bm25(topics)
print('Done, run was created')

Create run


BR(BM25): 100%|██████████| 672/672 [00:14<00:00, 46.86q/s]


Done, run was created


**TODO** once proper evaluation is implemented:
- Query Expansion ([documentation](http://terrier.org/docs/v4.1/javadoc/org/terrier/matching/models/queryexpansion/Bo1.html))
- Query Segmentation

Reason:  
Both could be implemented right away, but without a proper setup to **compare and understand** the results, we could not tell whether they actually improve the run.
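
As a rough idea of what such a comparison could look like, the sketch below computes nDCG@k for two rankings of the same query in plain Python. The relevance judgments, the document identifiers, and both rankings are hypothetical, and the linear-gain nDCG variant is just one common choice. Once relevance judgments are available, PyTerrier's `pt.Experiment` would be the more natural tool.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_docnos, relevance, k=10):
    """nDCG@k of one ranking, given a dict that maps docno to a graded relevance label."""
    gains = [relevance.get(docno, 0) for docno in ranked_docnos]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments and rankings for a single query.
relevance = {"d1": 2, "d2": 1, "d3": 0}
baseline_ranking = ["d3", "d2", "d1"]
variant_ranking = ["d1", "d2", "d3"]

baseline_ndcg = ndcg_at_k(baseline_ranking, relevance)
variant_ndcg = ndcg_at_k(variant_ranking, relevance)
print(f"baseline: {baseline_ndcg:.3f}, variant: {variant_ndcg:.3f}")
```

Averaging such a score over all queries would give one number per system, which is the kind of comparison the TODO items above depend on.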

In [None]:
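import datetime

# Timestamped file name, useful for keeping several local runs side by side.
run_time = datetime.datetime.now()
formatted_datetime = run_time.strftime("%Y-%m-%d_%H-%M-%S")
file_name = f'./runs_milestone3/{formatted_datetime}-run.txt'

# This assignment overrides the timestamped name above, so the run goes to ./run.txt.
# Comment it out to store the run under the timestamped name instead.
file_name = './run.txt'

persist_and_normalize_run(run, 'bm25-baseline', default_output=file_name)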

Done. run file is stored under "./runs_milestone3/2023-11-20_14-49-19-run.txt".
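
When the timestamped file name is used, the target directory has to exist before the run is written, in case the persisting helper does not create it. A minimal sketch of a helper for this is shown below. The function `timestamped_run_path` is hypothetical, and the demonstration writes into a temporary directory so that it does not touch the working directory.

```python
import datetime
import tempfile
from pathlib import Path


def timestamped_run_path(directory, now=None):
    """Create `directory` if needed and return a timestamped run file path inside it."""
    now = now or datetime.datetime.now()
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / f"{now:%Y-%m-%d_%H-%M-%S}-run.txt"


# Demonstration with a fixed timestamp inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fixed_time = datetime.datetime(2023, 11, 20, 14, 49, 19)
    path = timestamped_run_path(Path(tmp) / "runs_milestone3", now=fixed_time)
    dir_created = path.parent.is_dir()
    print(path.name)
```

The resulting path could then be passed as `default_output`, for example via `str(timestamped_run_path('./runs_milestone3'))`.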
