# ParallelIR

### Authors: Filippo Lucchesi, Francesco Pio Crispino, Martina Speciale

#### Pulp Fiction Group

---

## 🎯 Project Overview

This project implements a modular and parallelized **Information Retrieval (IR)** system, developed as part of an academic lab.

The main objectives include:
- Efficient **parallel construction** of the inverted index
- Comparison of ranking functions: **TF-IDF vs BM25**
- Use of **caching** to optimize repeated queries
- Implementation of a custom **Relevance Feedback** algorithm inspired by Rocchio

All experiments are run and benchmarked using the [`python-terrier`](https://github.com/terrier-org/pyterrier) framework and the [IR Datasets](https://ir-datasets.com/) library.
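
The Rocchio-inspired relevance feedback listed among the objectives is not shown in this part of the notebook. As a rough illustration only, the sketch below implements the textbook Rocchio update, `q' = alpha*q + beta*mean(relevant) - gamma*mean(non-relevant)`, on sparse term-weight dictionaries. The function name `rocchio_update`, the toy vectors, and the weights (`alpha=1.0`, `beta=0.75`, `gamma=0.15`) are assumptions chosen for the example; this is not the project's custom algorithm.

```python
from collections import defaultdict


def rocchio_update(query_vec, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Textbook Rocchio: q' = alpha*q + beta*mean(rel) - gamma*mean(nonrel).

    Vectors are sparse dicts mapping term -> weight. The default weights are
    common textbook values, assumed here for illustration only.
    """
    updated = defaultdict(float)
    for term, weight in query_vec.items():
        updated[term] += alpha * weight
    for docs, coeff in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        if not docs:
            continue  # no feedback on this side: skip to avoid dividing by zero
        for doc in docs:
            for term, weight in doc.items():
                updated[term] += coeff * weight / len(docs)
    # Terms pushed to a non-positive weight are dropped from the new query
    return {t: w for t, w in updated.items() if w > 0}


# Toy vectors (made up for this example)
query = {"dielectric": 1.0, "constant": 1.0}
relevant = [{"dielectric": 0.5, "liquids": 1.0},
            {"liquids": 0.5, "measurement": 1.0}]
nonrelevant = [{"constant": 0.2, "speed": 2.0}]

expanded = rocchio_update(query, relevant, nonrelevant)
print(sorted(expanded.items(), key=lambda kv: -kv[1]))
```

In this toy run the query gains the terms `liquids` and `measurement` from the relevant documents, while `speed`, which only occurs in the non-relevant document, ends up with a negative weight and is dropped.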


## 📦 Environment Setup

We install all required Python libraries and handle NLTK downloads. This notebook is designed to run on **Kaggle** (GPU optional).


In [5]:
# Install required packages (only needed once per environment)
!pip install -q ir_datasets ir-measures scikit-learn dill pybind11 tqdm pympler python-terrier

import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords

# Download required NLTK resources (only the first time)
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /home/martina/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/martina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 📚 Imports and PyTerrier Setup

We now import all core libraries for Information Retrieval, ranking, analysis, and visualization. PyTerrier is used for document indexing, ranking, and evaluation.


In [7]:
# IR and evaluation
import pyterrier as pt
import ir_datasets
import ir_measures
from ir_measures import *

# Data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Utility libraries
import os
import re
import math
import time
import heapq
import hashlib
import string
import array
import collections
from collections import defaultdict, Counter
from tqdm import tqdm


In [17]:
# ✅ Initialize PyTerrier (run once per session)
if not pt.java.started():
    pt.java.init()

## 📄 Dataset and Indexing

We now load the IR dataset using `ir_datasets` and prepare it for use with PyTerrier by indexing its documents.


In [None]:
import ir_datasets
ir_datasets.registry

<ir_datasets.util.registry.Registry at 0x7e5e6756de20>

In [38]:
# Load the Vaswani (NPL) collection; its ir_datasets identifier is "vaswani"
dataset = ir_datasets.load("vaswani")

# Print basic dataset info
print("Dataset loaded:", dataset)
print("Documents:", dataset.docs_count())
print("Queries:", dataset.queries_count())
print("Qrels (relevance judgments):", dataset.qrels_count())

In [None]:
# Create the directory one level above notebooks
import os
os.makedirs("../indexes", exist_ok=True)

# Set path for index
index_path = "../indexes/vaswani-index"

# Build the index if it doesn't already exist
if not os.path.exists(os.path.join(index_path, "data.properties")):
    indexer = pt.IterDictIndexer(index_path)
    indexref = indexer.index(
        ({"docno": doc.doc_id, "text": doc.text} for doc in dataset.docs_iter())
    )
else:
    indexref = pt.IndexRef.of(index_path)

# Load the index
index = pt.IndexFactory.of(indexref)
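
The cell above delegates index construction to PyTerrier's `IterDictIndexer`. For intuition about the parallel construction named in the project objectives, here is a minimal sketch under simplifying assumptions: documents are split into chunks, each chunk is turned into a partial inverted index concurrently, and the partial indexes are merged. The helpers `index_chunk` and `build_index_parallel`, the whitespace tokenizer, and the toy documents are hypothetical and not taken from the project's code.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_chunk(chunk):
    """Build a partial inverted index {term: {docno: tf}} for one chunk."""
    partial = defaultdict(dict)
    for docno, text in chunk:
        for term in text.lower().split():
            partial[term][docno] = partial[term].get(docno, 0) + 1
    return partial


def build_index_parallel(docs, n_workers=2):
    """Split docs into chunks, index each chunk concurrently, then merge."""
    docs = list(docs)
    size = max(1, -(-len(docs) // n_workers))  # ceiling division
    chunks = [docs[i:i + size] for i in range(0, len(docs), size)]
    # Threads keep the sketch portable; for CPU-bound tokenization a
    # ProcessPoolExecutor would normally be swapped in.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(index_chunk, chunks))
    merged = defaultdict(dict)
    for partial in partials:
        for term, postings in partial.items():
            merged[term].update(postings)  # chunks hold disjoint docnos
    return dict(merged)


# Toy documents (made up for this example)
toy_docs = [
    ("d1", "dielectric constant of liquids"),
    ("d2", "design of digital computers"),
    ("d3", "digital data coding"),
]
toy_index = build_index_parallel(toy_docs)
print(toy_index["digital"])
```

Because every document belongs to exactly one chunk, the merge step never has to reconcile two postings for the same `(term, docno)` pair, which keeps the reduce phase a plain dictionary update.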


## 🔍 Retrieval: TF-IDF and BM25

We now create two retrieval pipelines: one based on the TF-IDF weighting scheme, and one on BM25. These are evaluated on the Vaswani dataset using standard IR metrics.
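
To make the difference between the two weighting schemes concrete, the sketch below scores a single query term with textbook forms of TF-IDF and BM25. These are illustrative formulas only and may differ from the exact variants Terrier implements for `TF_IDF` and `BM25`; the function names, the `k1` and `b` values, and the toy collection statistics are assumptions.

```python
import math


def tfidf_score(tf, df, n_docs):
    """Textbook TF-IDF: log-scaled term frequency times inverse document frequency."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / df)


def bm25_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term (k1 and b are conventional values)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = 1 - b + b * doc_len / avg_doc_len  # document-length normalization
    return idf * tf * (k1 + 1) / (tf + k1 * norm)


# Toy statistics: a term found in 50 of 10,000 documents, average-length document
for tf in (1, 2, 4, 8, 16):
    print(tf,
          round(tfidf_score(tf, 50, 10000), 3),
          round(bm25_score(tf, 50, 10000, 40, 40), 3))
```

In these textbook forms, the TF-IDF score keeps growing with the logarithm of the term frequency, whereas the BM25 score saturates below `idf * (k1 + 1)` and is additionally reduced for documents longer than average.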


In [22]:
# Create BM25 and TF-IDF retrieval pipelines
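# pt.terrier.Retriever replaces the deprecated pt.BatchRetrieve in recent PyTerrier releases
bm25 = pt.terrier.Retriever(index, wmodel="BM25")
tfidf = pt.terrier.Retriever(index, wmodel="TF_IDF")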

# Load queries and qrels (ground-truth relevance judgments)
topics = dataset.queries_iter()
qrels = dataset.qrels_iter()

# Convert queries and qrels to DataFrames for use with PyTerrier
topics_df = pd.DataFrame([t._asdict() for t in topics])
qrels_df = pd.DataFrame([q._asdict() for q in qrels])

# Preview query format
print("Sample queries:")
print(topics_df.head())
print("Sample qrels:")
print(qrels_df.head())

Sample queries:
  query_id                                               text
0        1  MEASUREMENT OF DIELECTRIC CONSTANT OF LIQUIDS ...
1        2  MATHEMATICAL ANALYSIS AND DESIGN DETAILS OF WA...
2        3  USE OF DIGITAL COMPUTERS IN THE DESIGN OF BAND...
3        4  SYSTEMS OF DATA CODING FOR INFORMATION TRANSFER\n
4        5  USE OF PROGRAMS IN ENGINEERING TESTING OF COMP...
Sample qrels:
  query_id doc_id  relevance iteration
0        1   1239          1         0
1        1   1502          1         0
2        1   4462          1         0
3        1   4569          1         0
4        1   5472          1         0




### 🛠 Column Mapping for PyTerrier Compatibility

To use custom `topics` and `qrels` DataFrames with `pt.Experiment`, you must rename the columns to match what PyTerrier expects:

| Original Column | Renamed To | Reason                                           |
|-----------------|------------|--------------------------------------------------|
| `query_id`      | `qid`      | Query identifier column in topics and qrels      |
| `text`          | `query`    | Query text column in topics                      |
| `doc_id`        | `docno`    | Matches the `docno` field used at indexing time  |
| `relevance`     | `label`    | Relevance grade column in qrels                  |


In [36]:
# Rename columns to match PyTerrier requirements
topics_df = topics_df.rename(columns={"query_id": "qid", "text": "query"})
qrels_df = qrels_df.rename(columns={"query_id": "qid", "doc_id": "docno", "relevance": "label"})

In [37]:
from pyterrier.measures import *

results = pt.Experiment(
    [tfidf, bm25],
    topics_df,
    qrels_df,
    eval_metrics=["map", "ndcg", "recall"],
    names=["TF-IDF", "BM25"]
)

results


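| name   | map      | ndcg    | R@5      | R@10     | R@15     | R@20     | R@30     | R@100    | R@200   | R@500    | R@1000   |
|--------|----------|---------|----------|----------|----------|----------|----------|----------|---------|----------|----------|
| TF-IDF | 0.290907 | 0.61537 | 0.16233  | 0.220588 | 0.266475 | 0.299181 | 0.36871  | 0.597182 | 0.73724 | 0.866759 | 0.934214 |
| BM25   | 0.296519 | 0.6212  | 0.162592 | 0.218513 | 0.260196 | 0.300136 | 0.374032 | 0.599046 | 0.73208 | 0.867562 | 0.934607 |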
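
To make the `map` and `R@k` columns in these results concrete, the sketch below computes average precision and recall at a cutoff by hand for a single toy query. `map` is the mean of the per-query average precision, and `R@k` is the fraction of relevant documents found in the top `k` results. The function names and the toy ranking are hypothetical; the notebook itself relies on `pt.Experiment` for the actual evaluation.

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for docno in ranking[:k] if docno in relevant)
    return hits / len(relevant)


def average_precision(ranking, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant docs."""
    hits, precision_sum = 0, 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Toy ranking for one query (made up for this example); "d8" is never retrieved
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

print(recall_at_k(ranking, relevant, 2))     # 1 of 3 relevant docs in the top 2
print(average_precision(ranking, relevant))  # (1/1 + 2/3) / 3
```

Dividing by the total number of relevant documents, rather than by the number retrieved, is what penalizes the missing `d8` in both measures.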
