# ParallelIR

### Authors: Filippo Lucchesi, Francesco Pio Crispino, Martina Speciale

#### Pulp Fiction Group

---

## 🎯 Project Overview

This project implements a modular and parallelized **Information Retrieval (IR)** system, developed as part of an academic lab.

The main objectives include:
- Efficient **parallel construction** of the inverted index
- Comparison of ranking functions: **TF-IDF vs BM25**
- Use of **caching** to optimize repeated queries
- Implementation of a custom **Relevance Feedback** algorithm inspired by Rocchio

All experiments are run and benchmarked using the [`python-terrier`](https://github.com/terrier-org/pyterrier) framework and the [IR Datasets](https://ir-datasets.com/) library.
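
To make the first objective concrete, here is a minimal sketch of parallel index construction: the collection is split into chunks, each worker builds a partial index, and the partial indexes are merged. It is an illustration under stated assumptions, not the project's actual implementation. The names `tokenize`, `build_partial_index`, `merge_indexes`, and `build_index_parallel` are hypothetical, and a thread pool is used only so the sketch runs anywhere; for CPU-bound indexing, a process pool is the usual route to real speedups.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple stand-in)."""
    return text.lower().split()


def build_partial_index(docs):
    """Build term -> {doc_id: term frequency} for one chunk of documents."""
    partial = defaultdict(dict)
    for doc_id, text in docs:
        for term in tokenize(text):
            partial[term][doc_id] = partial[term].get(doc_id, 0) + 1
    return partial


def merge_indexes(partials):
    """Merge per-chunk indexes; each doc_id lives in exactly one chunk,
    so postings from different chunks never overwrite each other."""
    merged = defaultdict(dict)
    for partial in partials:
        for term, postings in partial.items():
            merged[term].update(postings)
    return dict(merged)


def build_index_parallel(docs, n_workers=4):
    """Split the collection into chunks, index each chunk in a worker, then merge."""
    docs = list(docs)
    chunk_size = max(1, -(-len(docs) // n_workers))  # ceiling division
    chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(build_partial_index, chunks))
    return merge_indexes(partials)


# Hypothetical toy collection, assuming unique document ids
toy_docs = [
    ("d1", "parallel index construction"),
    ("d2", "inverted index for retrieval"),
    ("d3", "parallel retrieval"),
]
toy_index = build_index_parallel(toy_docs, n_workers=2)
print(toy_index["index"])
```

The merge step stays simple because every document belongs to exactly one chunk, so each worker can build its postings independently.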


## 📦 Environment Setup

We install all required Python libraries and handle NLTK downloads. This notebook is designed to run on **Kaggle** (GPU optional).


In [5]:
# Install required packages (only needed once per environment)
!pip install -q ir_datasets ir-measures scikit-learn dill pybind11 tqdm pympler python-terrier

import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords

# Download required NLTK resources (only the first time)
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /home/martina/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/martina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 📚 Imports and PyTerrier Setup

We now import all core libraries for Information Retrieval, ranking, analysis, and visualization. PyTerrier is used for document indexing, ranking, and evaluation.


In [7]:
# IR and evaluation
import pyterrier as pt
import ir_datasets
import ir_measures
from ir_measures import *

# Data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Utility libraries
import os
import re
import math
import time
import heapq
import hashlib
import string
import array
import collections
from collections import defaultdict, Counter
from tqdm import tqdm


In [17]:
# ✅ Initialize PyTerrier (run once per session)
if not pt.java.started():
    pt.java.init()

## 📄 Dataset and Indexing

We now load the IR dataset using `ir_datasets` and prepare it for use with PyTerrier by indexing its documents.


In [46]:
# Load the dataset using ir_datasets
dataset = ir_datasets.load("antique/train")

# Print basic dataset info
print("Dataset loaded:", dataset)
print("Documents:", dataset.docs_count())
print("Queries:", dataset.queries_count())
print("Qrels (relevance judgments):", dataset.qrels_count())

Dataset loaded: Dataset(id='antique/train', provides=['docs', 'queries', 'qrels'])
Documents: 403666
Queries: 2426
Qrels (relevance judgments): 27422


In [47]:
# Create the directory one level above notebooks
import os
os.makedirs("../indexes", exist_ok=True)

# Set path for index
index_path = "../indexes/antique-index"

# Build the index if it doesn't already exist
if not os.path.exists(os.path.join(index_path, "data.properties")):
    indexer = pt.IterDictIndexer(index_path)
    indexref = indexer.index(
        ({"docno": doc.doc_id, "text": doc.text} for doc in dataset.docs_iter())
    )
else:
    indexref = pt.IndexRef.of(index_path)

# Load the index
index = pt.IndexFactory.of(indexref)


[INFO] Please confirm you agree to the authors' data usage agreement found at <https://ciir.cs.umass.edu/downloads/Antique/readme.txt>
[INFO] If you have a local copy of https://ciir.cs.umass.edu/downloads/Antique/antique-collection.txt, you can symlink it here to avoid downloading it again: /home/martina/.ir_datasets/downloads/684f7015aff377062a758e478476aac8
[INFO] [starting] https://ciir.cs.umass.edu/downloads/Antique/antique-collection.txt
[INFO] [finished] https://ciir.cs.umass.edu/downloads/Antique/antique-collection.txt: [00:29] [93.6MB] [3.17MB/s]
                                                                                               

15:11:39.044 [ForkJoinPool-3-worker-1] WARN org.terrier.structures.indexing.Indexer -- Indexed 2224 empty documents


## 🔍 Retrieval: TF-IDF and BM25

We now create two retrieval pipelines: one based on the TF-IDF weighting scheme and one on BM25. Both are evaluated on the ANTIQUE dataset (`antique/train`) loaded above, using standard IR metrics.


In [48]:
# Create BM25 and TF-IDF retrieval pipelines
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")

# Load queries and qrels (ground-truth relevance judgments)
topics = dataset.queries_iter()
qrels = dataset.qrels_iter()

# Convert queries and qrels to DataFrames for use with PyTerrier
topics_df = pd.DataFrame([t._asdict() for t in topics])
qrels_df = pd.DataFrame([q._asdict() for q in qrels])

# Preview query format
print("Sample queries:")
print(topics_df.head())
print("Sample qrels:")
print(qrels_df.head())

[INFO] [starting] https://ciir.cs.umass.edu/downloads/Antique/antique-train-queries.txt
[INFO] [finished] https://ciir.cs.umass.edu/downloads/Antique/antique-train-queries.txt: [00:00] [137kB] [373kB/s]
[INFO] [starting] https://ciir.cs.umass.edu/downloads/Antique/antique-train.qrel                
[INFO] [finished] https://ciir.cs.umass.edu/downloads/Antique/antique-train.qrel: [00:01] [626kB] [624kB/s]
                                                                                         

Sample queries:
  query_id                                               text
0  3097310  What causes severe swelling and pain in the kn...
1  3910705  why don't they put parachutes underneath airpl...
2   237390                how to clean alloy cylinder heads ?
3  2247892                          how do i get them whiter?
4  1078492                    What is Cloud 9 and 7th Heaven?
Sample qrels:
  query_id     doc_id  relevance iteration
0  2531329  2531329_0          4        U0
1  2531329  2531329_5          4        Q0
2  2531329  2531329_4          3        Q0
3  2531329  2531329_7          3        Q0
4  2531329  2531329_6          3        Q0
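
For intuition about the two weighting models, the sketch below implements textbook forms of TF-IDF and BM25 for a single term in a single document. These are common textbook formulations, not the exact formulas Terrier evaluates, and the collection statistics are hypothetical values chosen for illustration.

```python
import math


def tfidf_score(tf, df, n_docs):
    """Textbook TF-IDF: log-scaled term frequency times inverse document frequency."""
    if tf == 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(n_docs / df)


def bm25_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25: saturating, length-normalized tf times a smoothed idf."""
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


# Hypothetical statistics: 1000 documents, the term occurs in 50 of them,
# and the document has exactly the average length.
n_docs, df, avg_len = 1000, 50, 40.0
for tf in (1, 5, 50):
    print(
        tf,
        round(tfidf_score(tf, df, n_docs), 3),
        round(bm25_score(tf, df, n_docs, doc_len=40, avg_doc_len=avg_len), 3),
    )
```

In this formulation the TF-IDF score keeps growing with the logarithm of the term frequency, while the BM25 score saturates below `idf * (k1 + 1)` and is reduced for documents longer than average.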


### 🛠 Column Mapping for PyTerrier Compatibility

To use custom `topics` and `qrels` DataFrames with `pt.Experiment`, you must rename the columns to match what PyTerrier expects:

| Original Column | Renamed To | Reason                        |
|------------------|-------------|-------------------------------|
| `query_id`       | `qid`       | PyTerrier expects this field |
| `text`           | `query`     | PyTerrier expects this field |
| `doc_id`         | `docno`     | Matches index document field |
| `relevance`      | `label`     | Required as relevance score  |


In [53]:
# Convert iterators to DataFrames
topics_df = pd.DataFrame(dataset.queries_iter())
qrels_df = pd.DataFrame(dataset.qrels_iter())

# Rename only if necessary
topics_df = topics_df.rename(columns={k: v for k, v in {
    "query_id": "qid",
    "text": "query"
}.items() if k in topics_df.columns})

qrels_df = qrels_df.rename(columns={k: v for k, v in {
    "query_id": "qid",
    "doc_id": "docno",
    "relevance": "label"
}.items() if k in qrels_df.columns})


In [54]:
# Preprocess queries: strip punctuation that breaks Terrier parser
topics_df["query"] = topics_df["query"].str.replace(r"[^\w\s]", "", regex=True)

# Optional preview
print(topics_df.head(3))
print(qrels_df.head(3))

       qid                                              query
0  3097310  What causes severe swelling and pain in the knees
1  3910705  why dont they put parachutes underneath airpla...
2   237390                 how to clean alloy cylinder heads 
       qid      docno  label iteration
0  2531329  2531329_0      4        U0
1  2531329  2531329_5      4        Q0
2  2531329  2531329_4      3        Q0
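
The effect of the punctuation filter can be checked in isolation with the standard `re` module. The sketch below applies the same pattern and then collapses leftover whitespace; the collapse is an extra step that the cell above does not perform, and `clean_query` and the sample queries are hypothetical.

```python
import re

# Same pattern as above: drop every character that is neither a word character nor whitespace.
PUNCT = re.compile(r"[^\w\s]")


def clean_query(text):
    """Remove punctuation, then collapse runs of whitespace and trim the ends."""
    return " ".join(PUNCT.sub("", text).split())


# Hypothetical queries illustrating what the filter keeps and what it loses
print(clean_query("how to clean alloy cylinder heads ?"))
print(clean_query("what's the best way to learn C++?"))
```

Note that the filter is lossy: apostrophes vanish ("what's" becomes "whats") and symbolic tokens such as "C++" are reduced to "C", which can change which index terms a query matches.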


In [55]:
# Metric names are passed as strings below, so the measure objects imported here are optional
from pyterrier.measures import *

results = pt.Experiment(
    [tfidf, bm25],
    topics_df,
    qrels_df,
    eval_metrics=["map", "ndcg", "recall"],
    names=["TF-IDF", "BM25"]
)

results


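|   | name   | map      | ndcg     | R@5      | R@10     | R@15     | R@20     | R@30     | R@100    | R@200    | R@500    | R@1000   |
|---|--------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| 0 | TF-IDF | 0.13462  | 0.285444 | 0.124418 | 0.159724 | 0.182997 | 0.200185 | 0.222592 | 0.292359 | 0.336036 | 0.395777 | 0.439749 |
| 1 | BM25   | 0.134642 | 0.285791 | 0.124607 | 0.16015  | 0.183631 | 0.200644 | 0.223307 | 0.293343 | 0.336536 | 0.396537 | 0.441141 |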


## 📊 Retrieval Performance on ANTIQUE/train

We evaluated two classic retrieval models, **TF-IDF** and **BM25**, on the queries and relevance judgments of the `antique/train` split using PyTerrier. Both models were run against the full document collection, and their performance was measured using standard IR metrics.

### 🔍 Evaluation Metrics

| Model   | MAP      | nDCG     | R@5     | R@10    | R@15    | R@20    | R@30    | R@100   | R@200   | R@500   | R@1000  |
|---------|----------|----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| TF-IDF  | 0.134620 | 0.285444 | 0.12442 | 0.15972 | 0.18300 | 0.20018 | 0.22259 | 0.29236 | 0.33604 | 0.39578 | 0.43975 |
| BM25    | 0.134642 | 0.285791 | 0.12461 | 0.16015 | 0.18363 | 0.20064 | 0.22331 | 0.29334 | 0.33654 | 0.39654 | 0.44114 |
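
For reference, the three metric families in the table can be sketched for a single query as follows. This is a simplified illustration, not the evaluation code PyTerrier runs: it assumes that any label greater than zero counts as relevant, and that nDCG uses linear gain with a `log2(rank + 1)` discount. The ranking and labels are hypothetical.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Sum of precision at each rank holding a relevant document, divided by |relevant|."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg(ranked, gains):
    """nDCG with linear gain and log2(rank + 1) discount; `gains` maps doc -> graded label."""
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked, start=1))
    ideal = sorted(gains.values(), reverse=True)
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking and graded judgments for one query
ranked = ["d3", "d1", "d7", "d2"]
gains = {"d1": 3, "d2": 1, "d5": 2}
relevant = {doc for doc, gain in gains.items() if gain > 0}

print(recall_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
print(ndcg(ranked, gains))
```

The values in the table are means of such per-query scores over all queries, so MAP is the mean of the per-query average precision.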

### 📌 Observations

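- **BM25 slightly outperforms TF-IDF** on every metric reported; the largest gap is at deep recall (R@1000: 0.4411 vs. 0.4398).
- The difference is **marginal** (MAP differs by about 0.00002), so the two models behave almost identically on this dataset. This is consistent with Terrier's documentation of its `TF_IDF` model, which describes it as using Robertson's tf, the same length-normalized term-frequency component that BM25 uses.
- These scores reflect the difficulty of **open-ended natural language queries** in the ANTIQUE dataset; improvements would likely require semantic models (e.g., BERT-based retrieval).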

These baseline results provide a foundation for future comparison with more advanced neural or hybrid ranking approaches.
