# Information Retrieval TIRA 

---
## Contributors
Team: Return of the Query
- [Moritz Raetz](mailto:moritz.raetz@uni.jena.de)
- [Leonard Teschner](mailto:leonard.teschner@uni-jena.de)

Text rewritten using [Deepl.com/write](https://www.deepl.com/de/write).

### Load Libraries

In [1]:
import click
import pyterrier as pt
from pyterrier_t5 import MonoT5ReRanker, DuoT5ReRanker
from pathlib import Path
from tirex_tracker import tracking, ExportFormat
from tira.third_party_integrations import ir_datasets, ensure_pyterrier_is_loaded
from tqdm import tqdm
ensure_pyterrier_is_loaded(is_offline=False)

  from .autonotebook import tqdm as notebook_tqdm
Java started and loaded: pyterrier.java.colab, pyterrier.java, pyterrier.java.24, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


#### Given: Text extraction of document

In [2]:
def extract_text_of_document(doc, field):
    # ToDo: here one can make modifications to the document representations
    if field == "default_text":
        return doc.default_text()
    elif field == "title":
        return doc.title
    elif field == "description":
        return doc.description
    # Fail loudly instead of silently returning None for an unknown field
    raise ValueError(f"Unknown field: {field}")

#### Given: Index building and loading

Added the Porter stemmer (`stemmer="PorterStemmer"`) to the indexer.

In [55]:
def get_index(dataset, field, output_path, index_dir=None):
    if index_dir is not None and index_dir.is_dir():
        return pt.IndexFactory.of(str(index_dir.absolute()))
    index_dir = output_path / "indexes" / f"{dataset}-on-{field}"
    if not index_dir.is_dir():
        print("Build new index")
        docs = []
        dataset = ir_datasets.load(f"ir-lab-wise-2025/{dataset}")

        for doc in tqdm(dataset.docs_iter(), "Pre-Process Documents"):
            docs.append({"docno": doc.doc_id, "text": extract_text_of_document(doc, field)})

        print("Index Documents")
        with tracking(export_file_path=index_dir / "index-metadata.yml", export_format=ExportFormat.IR_METADATA):
            pt.IterDictIndexer(str(index_dir.absolute()), meta={'docno' : 100}, verbose=True, stemmer="PorterStemmer").index(docs)

    return pt.IndexFactory.of(str(index_dir.absolute()))

#### Given: Retrieval

Added a re-ranking pipeline with MonoT5 (pointwise) and DuoT5 (pairwise).

In [59]:
def run_retrieval(output, index, dataset, retrieval_model, text_field_to_retrieve):
    print("Check if run exists")
    tag = f"pyterrier-{retrieval_model}-on-{text_field_to_retrieve}-3"
    target_dir = output / "runs" / dataset / tag
    target_file = target_dir / "run.txt.gz"

    if target_file.exists():
        print(f"Run {target_file.resolve()} exists, generate new name...")
        i = 2
        while True:
            new_tag = f"{tag}-v{i}"
            new_target_dir = output / "runs" / dataset / new_tag
            new_target_file = new_target_dir / "run.txt.gz"
            if not new_target_file.exists():
                target_dir = new_target_dir
                target_file = new_target_file
                tag = new_tag
                print(f"New run name: {target_file.resolve()}")
                break
            i += 1

    print(f"Run retrieval with {retrieval_model} on {text_field_to_retrieve}")
    dataset = pt.datasets.get_dataset(f"irds:ir-lab-wise-2025/{dataset}")
    topics = dataset.get_topics()
    
    # set up retrieval pipeline
    retriever = pt.terrier.Retriever(index, wmodel=retrieval_model, verbose=True)
    reranker_pointwise = MonoT5ReRanker(batch_size=8)
    reranker_pairwise = DuoT5ReRanker(batch_size=8)

    description = f"This is a PyTerrier pipeline that retrieves with the retrieval model {retrieval_model} on the {text_field_to_retrieve} text representation of the documents and then re-ranks the top results with monoT5 and duoT5."

    with tracking(export_file_path=target_dir / "retrieval-metadata.yml", export_format=ExportFormat.IR_METADATA, system_description=description, system_name=tag):
        # Steps promoted in the lecture: pointwise re-ranking on the top 100 (monoT5),
        # then pairwise re-ranking on the top 5 (duoT5).
        # This run uses a cutoff of 20 instead of 5 for the duoT5 stage.
        mono_pipeline = retriever % 100 >> reranker_pointwise
        duo_pipeline = (mono_pipeline % 20) >> reranker_pairwise
        run = duo_pipeline.transform(topics)

    pt.io.write_results(run, target_file)
    print(f"Run saved to {target_file}")
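
The versioned naming inside `run_retrieval` can also be read as a small standalone helper. The following is only a sketch using the standard library; the name `next_free_run_dir` and its parameters are hypothetical and not part of the notebook or of PyTerrier.

```python
from pathlib import Path
import tempfile


def next_free_run_dir(runs_root: Path, tag: str, run_file: str = "run.txt.gz"):
    """Return (directory, tag) for the first tag whose run file does not exist yet."""
    # Try the plain tag first, then tag-v2, tag-v3, ... like run_retrieval does.
    if not (runs_root / tag / run_file).exists():
        return runs_root / tag, tag
    i = 2
    while (runs_root / f"{tag}-v{i}" / run_file).exists():
        i += 1
    new_tag = f"{tag}-v{i}"
    return runs_root / new_tag, new_tag


# Small demo in a temporary directory (hypothetical tag name "demo").
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Simulate two earlier runs: the plain tag and its -v2 successor.
    for existing in ("demo", "demo-v2"):
        (root / existing).mkdir()
        (root / existing / "run.txt.gz").touch()
    target_dir, tag = next_free_run_dir(root, "demo")
    print(tag)
```

Keeping the lookup separate from the retrieval code would make the naming testable without loading an index or a model.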

#### Define parameters and required variables

In [None]:
dataset = "radboud-validation-20251114-training"  # or "spot-check-20251122-training"
text_field_to_retrieve = "default_text"  # "default_text", "title", "description"
index_dir = None  # Path to existing index or None
output = Path("Test")
retrieval_model = "BM25"

#### Build Index

In [57]:
index = get_index(dataset, text_field_to_retrieve, output)

Build new index


Pre-Process Documents: 63621it [00:15, 4168.92it/s]


Index Documents
20:26:50.081 [ForkJoinPool-2-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 140 empty documents


#### Run retrieval

In [60]:
run_retrieval(output, index, dataset, retrieval_model, text_field_to_retrieve)

Check if run exists
Run /workspaces/wows-code/ecir26/je-returnOfTheQuery-01/Test/runs/radboud-validation-20251114-training/pyterrier-BM25-on-title-3/run.txt.gz exists, generate new name...
New run name: /workspaces/wows-code/ecir26/je-returnOfTheQuery-01/Test/runs/radboud-validation-20251114-training/pyterrier-BM25-on-title-3-v11/run.txt.gz
Run retrieval with BM25 on title
There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.


TerrierRetr(BM25): 100%|██████████| 28/28 [00:00<00:00, 37.01q/s]
monoT5: 100%|██████████| 350/350 [01:10<00:00,  5.00batches/s]
duoT5: 100%|██████████| 28/28 [05:32<00:00, 11.88s/queries]


Run saved to Test/runs/radboud-validation-20251114-training/pyterrier-BM25-on-title-3-v11/run.txt.gz
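
To inspect a saved run, the file can be read back with the standard library. This is a minimal sketch, assuming the run is in the usual six-column TREC format (`qid Q0 docno rank score tag`), which is what `pt.io.write_results` writes by default as far as we know. The helper name `read_trec_run` and the demo rows are hypothetical.

```python
import gzip
import tempfile
from collections import defaultdict
from pathlib import Path


def read_trec_run(path):
    """Read a gzipped TREC run file into {qid: [(docno, rank, score), ...]}."""
    run = defaultdict(list)
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            # Assumed column order: qid, Q0, docno, rank, score, run tag
            qid, _q0, docno, rank, score, _tag = line.split()
            run[qid].append((docno, int(rank), float(score)))
    return dict(run)


# Demo with made-up rows written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "run.txt.gz"
    with gzip.open(demo_file, "wt", encoding="utf-8") as f:
        f.write("1 Q0 docA 0 12.5 demo-tag\n")
        f.write("1 Q0 docB 1 11.0 demo-tag\n")
    demo_run = read_trec_run(demo_file)
    print(len(demo_run["1"]))
```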


#### Different runs
- v3 -> duo on top 10
- v4 -> only mono on top 50
- v5 -> only mono on top 50 without text param
- v6 -> only mono on top 10
- v7 -> only duo on top 10
- v8 -> mono on top 10, duo on top 5
- v9 -> mono on top 10, duo on top 10
- v10 -> only duo on top 10, Porter stemmer
- v11 -> mono on top 100, duo on top 20, Porter stemmer
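
The run variants above differ mainly in the two cutoffs of the cascade. The toy sketch below illustrates that idea in plain Python. It does not use PyTerrier or the T5 models: `cascade`, `mono_k`, `duo_k` and the three scoring functions are hypothetical stand-ins, and the pairwise aggregation (summing preferences over all ordered pairs) is an assumption about how a duo-style re-ranker typically works.

```python
from itertools import permutations


def cascade(candidates, first_stage, pointwise, pairwise, mono_k, duo_k):
    """Rank with first_stage, re-rank the top mono_k pointwise, then the top duo_k pairwise."""
    # Stage 1: first-stage ranking (stand-in for BM25), cut to mono_k documents.
    ranked = sorted(candidates, key=first_stage, reverse=True)[:mono_k]
    # Stage 2: pointwise re-ranking (stand-in for monoT5), cut to duo_k documents.
    ranked = sorted(ranked, key=pointwise, reverse=True)[:duo_k]
    # Stage 3: pairwise re-ranking (stand-in for duoT5): sum each document's
    # preference over every other document, i.e. duo_k * (duo_k - 1) comparisons.
    totals = {d: 0.0 for d in ranked}
    for a, b in permutations(ranked, 2):
        totals[a] += pairwise(a, b)
    return sorted(ranked, key=totals.get, reverse=True)


# Toy documents are integers; the scoring functions are arbitrary examples.
result = cascade(
    candidates=[1, 2, 3, 4, 5, 6, 7, 8],
    first_stage=lambda d: d,
    pointwise=lambda d: -d,
    pairwise=lambda a, b: 1.0 if a > b else 0.0,
    mono_k=5,
    duo_k=3,
)
print(result)
```

Because the pairwise stage compares every ordered pair, its cost grows quadratically with `duo_k`, which is one reason to keep the second cutoff much smaller than the first.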