# ParallelIR

### Authors: Filippo Lucchesi, Francesco Pio Crispino, Martina Speciale

#### Pulp Fiction Group

---

## 🎯 Project Overview

This project implements a modular and parallelized **Information Retrieval (IR)** system, developed as part of an academic lab.

The main objectives include:
- Efficient **parallel construction** of the inverted index
- Comparison of ranking functions: **TF-IDF vs BM25**
- Use of **caching** to optimize repeated queries
- Implementation of a custom **Relevance Feedback** algorithm inspired by Rocchio

All experiments are run and benchmarked using the [`python-terrier`](https://github.com/terrier-org/pyterrier) framework and the [IR Datasets](https://ir-datasets.com/) library.


### 🔁 Load previously saved data
##### 💾 Efficient Reproducibility

In [None]:
import joblib

# Load all objects from disk
docs = joblib.load("cache/docs.joblib")
doc_ids = joblib.load("cache/doc_ids.joblib")
topics_df = joblib.load("cache/topics_df.joblib")
qrels_df = joblib.load("cache/qrels_df.joblib")
tfidf_matrix = joblib.load("cache/tfidf_matrix.joblib")
tfidf_vectorizer = joblib.load("cache/tfidf_vectorizer.joblib")

# Try loading the cache, or fall back to empty
try:
    query_cache = joblib.load("cache/query_cache.joblib")
except FileNotFoundError:
    query_cache = {}
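
The loading cell above assumes every file already exists in `cache/`. A load-or-build pattern can make the first run and later runs share one code path. The sketch below is only an illustration: `load_or_build` and the temporary file path are hypothetical names introduced for this example, and the `build_fn` callback stands in for whatever expensive step would recreate the object.

```python
import os
import tempfile

import joblib


def load_or_build(path, build_fn):
    """Load a cached object from `path`, or build it with `build_fn` and save it."""
    if os.path.exists(path):
        return joblib.load(path)
    obj = build_fn()
    # Create the parent directory on first use
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    joblib.dump(obj, path)
    return obj


# Demonstration in a temporary directory with a cheap stand-in object
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache", "example.joblib")
    first = load_or_build(path, lambda: [i * i for i in range(5)])
    # The file now exists, so this build_fn is never called
    second = load_or_build(path, lambda: [])
    print(first == second)
```

In the notebook, the same helper could wrap the TF-IDF fitting step so the matrix is rebuilt only when its file is missing.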

## 📦 Environment Setup

We install all required Python libraries and handle NLTK downloads. This notebook is designed to run on **Kaggle** (GPU optional).


In [None]:
# Install required packages (only needed once per environment)
%pip install -q ir_datasets ir-measures scikit-learn dill pybind11 tqdm pympler python-terrier

import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords

# Download required NLTK resources (only the first time)
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /home/martina/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/martina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 📚 Imports and PyTerrier Setup

We now import all core libraries for Information Retrieval, ranking, analysis, and visualization. PyTerrier is used for document indexing, ranking, and evaluation.


In [3]:
# IR and evaluation
import pyterrier as pt
import ir_datasets
import ir_measures
from ir_measures import *

# Data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Utility libraries
import os
import re
import math
import time
import heapq
import hashlib
import string
import array
import collections
from collections import defaultdict, Counter
from tqdm import tqdm


In [4]:
# ✅ Initialize PyTerrier (run once per session)
if not pt.java.started():
    pt.java.init()

Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]


## 📄 Dataset and Indexing

We now load the IR dataset using `ir_datasets` and prepare it for use with PyTerrier by indexing its documents.


In [6]:
# Load the dataset using ir_datasets
dataset = ir_datasets.load("antique/train")

# Print basic dataset info
print("Dataset loaded:", dataset)
print("Documents:", dataset.docs_count())
print("Queries:", dataset.queries_count())
print("Qrels (relevance judgments):", dataset.qrels_count())

Dataset loaded: Dataset(id='antique/train', provides=['docs', 'queries', 'qrels'])


Documents: 403666
Queries: 2426
Qrels (relevance judgments): 27422


In [7]:
# Create the directory one level above notebooks
import os
os.makedirs("../indexes", exist_ok=True)

# Set path for index
index_path = "../indexes/antique-index"

# Build the index if it doesn't already exist
if not os.path.exists(os.path.join(index_path, "data.properties")):
    indexer = pt.IterDictIndexer(index_path)
    indexref = indexer.index(
        ({"docno": doc.doc_id, "text": doc.text} for doc in dataset.docs_iter())
    )
else:
    indexref = pt.IndexRef.of(index_path)

# Load the index
index = pt.IndexFactory.of(indexref)


## 🔍 Retrieval: TF-IDF and BM25

We now create two retrieval pipelines: one based on the TF-IDF weighting model and one on BM25. Both are evaluated on the ANTIQUE dataset (`antique/train`, loaded above) using standard IR metrics.
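
To make the contrast between the two weighting schemes concrete, the sketch below computes a per-term score under a textbook TF-IDF weight and under a common non-negative variant of the Okapi BM25 formula. It is an illustration under assumptions: the function names, the toy collection statistics, and the parameter values `k1=1.2` and `b=0.75` are chosen for this example, and Terrier's `TF_IDF` and `BM25` models use their own variants of these formulas.

```python
import math


def tfidf_weight(tf, df, n_docs):
    """Textbook TF-IDF: raw term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)


def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 weight for one query term in one document (non-negative IDF variant)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Length normalization: longer-than-average documents are penalized
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


# Toy statistics (made up for illustration): a term found in 100 of 10,000 documents
n_docs, df, avg_len = 10_000, 100, 50.0

for tf in (1, 2, 5, 20):
    print(
        tf,
        round(tfidf_weight(tf, df, n_docs), 3),
        round(bm25_weight(tf, df, n_docs, doc_len=50, avg_doc_len=avg_len), 3),
    )
```

The TF-IDF weight grows linearly with the term frequency, while the BM25 weight saturates as the term frequency increases and shrinks for documents longer than average.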


In [8]:
# Create BM25 and TF-IDF retrieval pipelines
# (pt.terrier.Retriever replaces the deprecated pt.BatchRetrieve)
bm25 = pt.terrier.Retriever(index, wmodel="BM25")
tfidf = pt.terrier.Retriever(index, wmodel="TF_IDF")

# Load queries and qrels (ground-truth relevance judgments)
topics = dataset.queries_iter()
qrels = dataset.qrels_iter()

# Convert queries and qrels to DataFrames for use with PyTerrier
topics_df = pd.DataFrame([t._asdict() for t in topics])
qrels_df = pd.DataFrame([q._asdict() for q in qrels])

# Preview query format
print("Sample queries:")
print(topics_df.head())
print("Sample qrels:")
print(qrels_df.head())


Sample queries:
  query_id                                               text
0  3097310  What causes severe swelling and pain in the kn...
1  3910705  why don't they put parachutes underneath airpl...
2   237390                how to clean alloy cylinder heads ?
3  2247892                          how do i get them whiter?
4  1078492                    What is Cloud 9 and 7th Heaven?
Sample qrels:
  query_id     doc_id  relevance iteration
0  2531329  2531329_0          4        U0
1  2531329  2531329_5          4        Q0
2  2531329  2531329_4          3        Q0
3  2531329  2531329_7          3        Q0
4  2531329  2531329_6          3        Q0


### 🛠 Column Mapping for PyTerrier Compatibility

To use custom `topics` and `qrels` DataFrames with `pt.Experiment`, you must rename the columns to match what PyTerrier expects:

| Original Column | Renamed To | Reason                        |
|------------------|-------------|-------------------------------|
| `query_id`       | `qid`       | PyTerrier expects this field |
| `text`           | `query`     | PyTerrier expects this field |
| `doc_id`         | `docno`     | Matches index document field |
| `relevance`      | `label`     | Required as relevance score  |


In [9]:
# Convert iterators to DataFrames
topics_df = pd.DataFrame(dataset.queries_iter())
qrels_df = pd.DataFrame(dataset.qrels_iter())

# Rename only if necessary
topics_df = topics_df.rename(columns={k: v for k, v in {
    "query_id": "qid",
    "text": "query"
}.items() if k in topics_df.columns})

qrels_df = qrels_df.rename(columns={k: v for k, v in {
    "query_id": "qid",
    "doc_id": "docno",
    "relevance": "label"
}.items() if k in qrels_df.columns})


In [10]:
# Preprocess queries: strip punctuation that breaks Terrier parser
topics_df["query"] = topics_df["query"].str.replace(r"[^\w\s]", "", regex=True)

# Optional preview
print(topics_df.head(3))
print(qrels_df.head(3))

       qid                                              query
0  3097310  What causes severe swelling and pain in the knees
1  3910705  why dont they put parachutes underneath airpla...
2   237390                 how to clean alloy cylinder heads 
       qid      docno  label iteration
0  2531329  2531329_0      4        U0
1  2531329  2531329_5      4        Q0
2  2531329  2531329_4      3        Q0


In [11]:
from pyterrier.measures import *

results = pt.Experiment(
    [tfidf, bm25],
    topics_df,
    qrels_df,
    eval_metrics=["map", "ndcg", "recall"],
    names=["TF-IDF", "BM25"]
)

results


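|   | name   | map      | ndcg     | R@5      | R@10     | R@15     | R@20     | R@30     | R@100    | R@200    | R@500    | R@1000   |
|---|--------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| 0 | TF-IDF | 0.13462  | 0.285444 | 0.124418 | 0.159724 | 0.182997 | 0.200185 | 0.222592 | 0.292359 | 0.336036 | 0.395777 | 0.439749 |
| 1 | BM25   | 0.134642 | 0.285791 | 0.124607 | 0.16015  | 0.183631 | 0.200644 | 0.223307 | 0.293343 | 0.336536 | 0.396537 | 0.441141 |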


## 📊 Retrieval Performance on ANTIQUE/train

We evaluated two classic retrieval models, **TF-IDF** and **BM25**, on the ANTIQUE/train dataset using PyTerrier. Both models were run against the full document collection, and their performance was measured using standard IR metrics.

### 🔍 Evaluation Metrics

| Model   | MAP      | nDCG     | R@5     | R@10    | R@15    | R@20    | R@30    | R@100   | R@200   | R@500   | R@1000  |
|---------|----------|----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| TF-IDF  | 0.134620 | 0.285444 | 0.12442 | 0.15972 | 0.18300 | 0.20018 | 0.22259 | 0.29236 | 0.33604 | 0.39578 | 0.43975 |
| BM25    | 0.134642 | 0.285791 | 0.12461 | 0.16015 | 0.18363 | 0.20064 | 0.22331 | 0.29334 | 0.33654 | 0.39654 | 0.44114 |
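
As a reminder of what two of these metrics measure, the sketch below computes Recall@k and average precision for a single toy query (MAP is the mean of average precision over all queries). The ranking and the relevant set are invented for illustration, and relevance is treated as binary here, which is a simplification of ANTIQUE's graded labels. The reported numbers themselves come from `pt.Experiment`, not from this code.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def average_precision(ranked, relevant):
    """Sum of precision at each rank holding a relevant document, divided by the number of relevant documents."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Toy data (hypothetical document IDs)
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

print(recall_at_k(ranked, relevant, k=2))
print(average_precision(ranked, relevant))
```

Because `d5` is never retrieved, it lowers both scores: recall can reach at most 2/3 on this ranking, and average precision divides by all three relevant documents.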

### 📌 Observations

- **BM25 edges out TF-IDF** on every metric, but the margins are tiny: MAP differs by about 0.00002, and the largest gap, at R@1000, is about 0.0014.
- The difference is **marginal**, indicating that both models behave almost identically on this dataset.
- The low absolute scores reflect the difficulty of **open-ended natural language queries** in the ANTIQUE dataset; improvements would likely require semantic models (e.g., BERT-based retrieval).

These baseline results provide a reference point for future comparison with more advanced neural or hybrid ranking approaches.


## 🧱 Manual Construction of TF-IDF Matrix and Cosine Similarity Ranking

In this section, we manually implement a basic information retrieval system based on:

- **TF-IDF vectorization** of the documents
- **Cosine similarity** calculation between a query and all documents
- Ranking documents by their similarity to the query

This approach helps us understand the core mechanics of term-based retrieval models, without relying on built-in PyTerrier weighting models such as BM25 or TF-IDF.

The implementation uses:
- `TfidfVectorizer` from `sklearn` to transform documents into vector space
- `cosine_similarity` to compute the similarity between the query and each document
- Result sorting to return the top-K most similar documents
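
The steps listed above can also be written out by hand on a toy corpus. The sketch below is a simplified illustration, not the notebook's implementation: the whitespace tokenizer, the helper names, and the three example documents are assumptions made for this example. The smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization follow the defaults documented for scikit-learn's `TfidfVectorizer`, but no stop-word removal is applied here.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplified whitespace tokenizer (an assumption; TfidfVectorizer uses a regex)
    return text.lower().split()


def vectorize(tokens, idf):
    """TF-IDF weights for known terms, scaled to unit length."""
    weights = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def build_tfidf(corpus):
    """Return the IDF table and one sparse unit-length vector (dict) per document."""
    n_docs = len(corpus)
    tokenized = [tokenize(doc) for doc in corpus]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]


def cosine(u, v):
    # Both vectors are unit length, so the dot product equals the cosine similarity
    return sum(w * v.get(t, 0.0) for t, w in u.items())


corpus = [
    "knee pain and swelling",
    "clean alloy cylinder heads",
    "pain in the knee after running",
]
idf, doc_vecs = build_tfidf(corpus)
query_vec = vectorize(tokenize("knee swelling"), idf)

# Rank document indices by descending similarity to the query
ranking = sorted(range(len(corpus)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
print(ranking)
```

The first document matches both query terms and ranks first, the third matches only `knee`, and the second shares no term with the query and scores zero.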

In [12]:
# Setup and Manual TF-IDF Construction

# Required libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Extract documents and document IDs in a single pass so both lists stay aligned
docs, doc_ids = [], []
for doc in dataset.docs_iter():
    docs.append(doc.text)
    doc_ids.append(doc.doc_id)

# Create the TF-IDF matrix for all documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")

TF-IDF matrix shape: (403666, 186364)


In [14]:
# Cosine Similarity Ranking (Single Query)

# Pick a query to test (e.g., first query in the dataset)
sample_query = topics_df.iloc[0]["query"]

# Transform the query using the same TF-IDF vectorizer
query_vec = tfidf_vectorizer.transform([sample_query])

# Compute cosine similarity between the query vector and document matrix
cos_sim = cosine_similarity(query_vec, tfidf_matrix).flatten()

# Get top 10 most similar documents
top_n = 10
top_doc_indices = np.argsort(cos_sim)[::-1][:top_n]

# Print ranked document results
print("Query:", sample_query)
for rank, idx in enumerate(top_doc_indices):
    print(f"{rank + 1}. doc_id: {doc_ids[idx]} - score: {cos_sim[idx]:.4f}")


Query: What causes severe swelling and pain in the knees
1. doc_id: 3786595_4 - score: 0.5125
2. doc_id: 2606613_7 - score: 0.4862
3. doc_id: 2859959_18 - score: 0.4793
4. doc_id: 2933555_0 - score: 0.4529
5. doc_id: 532973_9 - score: 0.4466
6. doc_id: 389820_27 - score: 0.4094
7. doc_id: 3786595_5 - score: 0.3985
8. doc_id: 637192_11 - score: 0.3899
9. doc_id: 3704893_18 - score: 0.3867
10. doc_id: 3301833_0 - score: 0.3852
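
The ranking step above sorts all similarity scores even though only the top 10 are needed. One possible alternative, sketched below, uses `np.argpartition` to isolate the k best scores first and then sorts only those. The helper name `top_k_indices` and the random scores are assumptions for illustration; the random array merely stands in for `cos_sim`.

```python
import numpy as np


def top_k_indices(scores, k):
    """Indices of the k largest scores, ordered from highest to lowest."""
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition puts the k largest values in the last k slots, in arbitrary order
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates by descending score
    return candidates[np.argsort(scores[candidates])[::-1]]


rng = np.random.default_rng(0)
scores = rng.random(100_000)  # stand-in for the cos_sim array

fast = top_k_indices(scores, 10)
slow = np.argsort(scores)[::-1][:10]
print(np.array_equal(fast, slow))
```

When several documents have exactly the same score, the two approaches may order the tied documents differently, so the comparison is only guaranteed to match for distinct scores.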


## ⚡ Optimization: Caching Cosine Similarity Results

To improve efficiency, we introduce a basic **caching mechanism**:

- Each query is saved in a dictionary (`query_cache`) along with its ranked results
- If the same query is submitted again, the results are fetched directly from the cache instead of recalculating them
- This significantly reduces execution time in interactive or repeated-query scenarios

The function `retrieve_with_cache()` encapsulates this logic:
- If the query exists in `query_cache`, cached results are returned
- Otherwise, similarity is computed and stored for future use

This optimization is especially useful in experiments with large document collections or when evaluating many repeated queries.

In [17]:
# ⚡ Caching Similarity Results

# Create a dictionary to store cached results
query_cache = {}

# Function that checks cache or computes cosine similarity
def retrieve_with_cache(query_text, top_k=10):
    # Key on both the query and top_k so a cached top-10 list is never
    # returned for a request that asks for a different number of results
    cache_key = (query_text, top_k)
    if cache_key in query_cache:
        return query_cache[cache_key]

    # Otherwise, compute similarity and cache the result
    query_vec = tfidf_vectorizer.transform([query_text])
    cos_sim = cosine_similarity(query_vec, tfidf_matrix).flatten()
    top_doc_indices = np.argsort(cos_sim)[::-1][:top_k]

    # Store plain Python floats so the cache serializes cleanly
    results = [(doc_ids[i], float(cos_sim[i])) for i in top_doc_indices]
    query_cache[cache_key] = results
    return results
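
One design consideration for a cache like this is what to use as the key. The sketch below is a self-contained illustration: `normalize_query`, `cached_retrieve`, `toy_retrieve`, and the `stats` counter are hypothetical names introduced for this example, and `toy_retrieve` returns placeholder results instead of a real cosine-similarity ranking. It normalizes case and whitespace before the lookup, so trivially different spellings of the same query share one entry.

```python
query_cache = {}
stats = {"hits": 0, "misses": 0}


def normalize_query(text):
    """Hypothetical helper: lowercase and collapse whitespace to build the cache key."""
    return " ".join(text.lower().split())


def cached_retrieve(query_text, top_k, retrieve_fn):
    """Look up (normalized query, top_k); call retrieve_fn only on a miss."""
    key = (normalize_query(query_text), top_k)
    if key in query_cache:
        stats["hits"] += 1
        return query_cache[key]
    stats["misses"] += 1
    results = retrieve_fn(query_text, top_k)
    query_cache[key] = results
    return results


def toy_retrieve(query_text, top_k):
    # Stand-in for the cosine-similarity ranking; returns placeholder results
    return [(f"doc_{i}", 1.0 / (i + 1)) for i in range(top_k)]


cached_retrieve("Knee pain", 3, toy_retrieve)        # miss: first time this query is seen
cached_retrieve("  knee   PAIN ", 3, toy_retrieve)   # hit: same key after normalization
cached_retrieve("knee pain", 5, toy_retrieve)        # miss: different top_k, separate entry
print(stats)
```

Lowercasing is safe for the TF-IDF setup used here because `TfidfVectorizer` lowercases its input by default, so both spellings would produce the same query vector anyway.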


### 💾 Save to Disk
##### Efficient Reproducibility: Caching Intermediate Objects

To improve reproducibility and avoid re-running expensive operations, we serialize and store key data structures using `joblib`. All serialized files are saved into a dedicated `cache/` directory.

#### 📦 What We Save

| Object            | Description                                      |
|-------------------|--------------------------------------------------|
| `docs`            | List of raw document texts from the dataset      |
| `doc_ids`         | Corresponding document identifiers               |
| `topics_df`       | DataFrame of queries (renamed for PyTerrier)     |
| `qrels_df`        | DataFrame of relevance judgments                 |
| `tfidf_matrix`    | The TF-IDF representation of the documents       |
| `tfidf_vectorizer`| The fitted `TfidfVectorizer` object              |
| `query_cache`     | (Optional) A dictionary caching query results    |



In [None]:
import os
import joblib

# Create the cache directory
os.makedirs("cache", exist_ok=True)

# Save base dataset objects
joblib.dump(docs, "cache/docs.joblib")
joblib.dump(doc_ids, "cache/doc_ids.joblib")
joblib.dump(topics_df, "cache/topics_df.joblib")
joblib.dump(qrels_df, "cache/qrels_df.joblib")

# Save TF-IDF matrix and vectorizer
joblib.dump(tfidf_matrix, "cache/tfidf_matrix.joblib")
joblib.dump(tfidf_vectorizer, "cache/tfidf_vectorizer.joblib")

# Save query cache (optional)
joblib.dump(query_cache, "cache/query_cache.joblib")

['cache/tfidf_vectorizer.joblib']