# ParallelIR

### Authors: Filippo Lucchesi, Francesco Pio Crispino, Martina Speciale

#### Pulp Fiction Group

---

## 🎯 Project Overview

This project implements a modular and parallelized **Information Retrieval (IR)** system, developed as part of an academic lab.

The main objectives include:
- Efficient **parallel construction** of the inverted index
- Comparison of ranking functions: **TF-IDF vs BM25**
- Use of **caching** to optimize repeated queries
- Implementation of a custom **Relevance Feedback** algorithm inspired by Rocchio

All experiments are run and benchmarked using the [`python-terrier`](https://github.com/terrier-org/pyterrier) framework and the [IR Datasets](https://ir-datasets.com/) library.
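
To make the TF-IDF vs BM25 comparison concrete, here is a minimal sketch on a toy corpus. It is not the project's implementation: the corpus, the function names `tfidf_score` and `bm25_score`, and the parameter values `k1=1.2` and `b=0.75` are illustrative assumptions, and the formulas are one common variant of each model.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data, not taken from the project's collection)
docs = {
    "d1": "parallel index construction for retrieval".split(),
    "d2": "index index index".split(),
    "d3": "ranking functions for retrieval systems".split(),
}

N = len(docs)                                        # number of documents
avgdl = sum(len(t) for t in docs.values()) / N       # average document length
df = Counter(term for tokens in docs.values() for term in set(tokens))

def tfidf_score(query, tokens):
    """Sum of tf * log(N / df) over query terms (one common TF-IDF variant)."""
    tf = Counter(tokens)
    return sum(tf[t] * math.log(N / df[t]) for t in query if t in tf)

def bm25_score(query, tokens, k1=1.2, b=0.75):
    """Okapi BM25 with a smoothed, non-negative IDF (k1 and b are assumed values)."""
    tf = Counter(tokens)
    score = 0.0
    for t in query:
        if t not in tf:
            continue
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        # Length-normalised denominator: long documents are penalised via b
        norm = tf[t] + k1 * (1 - b + b * len(tokens) / avgdl)
        score += idf * tf[t] * (k1 + 1) / norm
    return score

query = ["index"]
for docno, tokens in docs.items():
    print(docno, round(tfidf_score(query, tokens), 3), round(bm25_score(query, tokens), 3))
```

In this sketch the TF-IDF score grows linearly with term frequency, while the BM25 score saturates: tripling the occurrences of `index` in `d2` triples its TF-IDF score but raises its BM25 score by less than a factor of three.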


## 📦 Environment Setup

This cell installs the required Python packages and downloads the NLTK resources (WordNet and the stopword list). The notebook is designed to run on **Kaggle** (GPU optional).


In [None]:
# Install required packages (only needed once per environment)
!pip install -q ir_datasets ir-measures scikit-learn dill pybind11 tqdm pympler python-terrier

import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords

# Download required NLTK resources (only the first time)
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /home/martina/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/martina/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 📚 Imports and PyTerrier Setup

We now import all core libraries for Information Retrieval, ranking, analysis, and visualization. PyTerrier is used for document indexing, ranking, and evaluation.


In [7]:
# IR and evaluation
import pyterrier as pt
import ir_datasets
import ir_measures
# Wildcard import so measures can be referenced by bare name (e.g. nDCG@10, AP)
from ir_measures import *

# Data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Utility libraries
import os
import re
import math
import time
import heapq
import hashlib
import string
import array
import collections
from collections import defaultdict, Counter
from tqdm import tqdm


In [9]:
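# ✅ Initialize PyTerrier (run once per session)
# Note: recent PyTerrier releases deprecate pt.started() and pt.init(); Java now
# starts automatically, and pt.java.init() can force early initialisation
# (see the warning printed below).
if not pt.started():
    pt.init()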

  if not pt.started():
Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
java is now started automatically with default settings. To force initialisation early, run:
pt.java.init() # optional, forces java initialisation
  pt.init()


## 📄 Dataset and Indexing

We load the Vaswani collection with `ir_datasets` and index its documents with PyTerrier. The index is built once and reused on later runs.


In [14]:
# Load the dataset using ir_datasets
dataset = ir_datasets.load("vaswani")

# Print basic dataset info
print("Dataset loaded:", dataset)
print("Documents:", dataset.docs_count())
print("Queries:", dataset.queries_count())
print("Qrels (relevance judgments):", dataset.qrels_count())

Dataset loaded: Dataset(id='vaswani', provides=['docs', 'queries', 'qrels'])
Documents: 11429
Queries: 93
Qrels (relevance judgments): 2083
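
The relevance judgments counted above are easier to work with once grouped by query. The following is a minimal sketch, assuming each qrel record exposes `query_id`, `doc_id`, and `relevance` attributes; `ToyQrel`, `group_qrels`, and the sample records are hypothetical stand-ins so the example runs without downloading the collection.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the records yielded by dataset.qrels_iter()
ToyQrel = namedtuple("ToyQrel", ["query_id", "doc_id", "relevance"])

def group_qrels(qrels):
    """Group judgments into {query_id: {doc_id: relevance}}."""
    grouped = defaultdict(dict)
    for q in qrels:
        grouped[q.query_id][q.doc_id] = q.relevance
    return dict(grouped)

# Made-up judgments for illustration only
toy_qrels = [
    ToyQrel("1", "d10", 1),
    ToyQrel("1", "d42", 1),
    ToyQrel("2", "d7", 1),
]
qrels_by_query = group_qrels(toy_qrels)
print(qrels_by_query)
```

With the real collection, the same function could be called as `group_qrels(dataset.qrels_iter())`, provided the records carry the attributes assumed here.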


In [None]:
# Create the index directory one level above the notebook's working directory
import os
os.makedirs("../indexes", exist_ok=True)

# Set path for index
index_path = "../indexes/vaswani-index"

# Build the index if it doesn't already exist
if not os.path.exists(os.path.join(index_path, "data.properties")):
    indexer = pt.IterDictIndexer(index_path)
    indexref = indexer.index(
        ({"docno": doc.doc_id, "text": doc.text} for doc in dataset.docs_iter())
    )
else:
    indexref = pt.IndexRef.of(index_path)

# Load the index
index = pt.IndexFactory.of(indexref)
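
To illustrate what an inverted index contains and how its construction can be split across workers, here is a minimal pure-Python sketch. It is not how Terrier builds its index, nor necessarily the project's own parallel indexer: the function names, the whitespace tokenizer, the toy documents, and the choice of a thread pool are all assumptions made for illustration. Documents use the same `{"docno": ..., "text": ...}` shape passed to the indexer above.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def build_partial_index(docs):
    """Build an inverted index (term -> {docno: term frequency}) for one chunk."""
    partial = defaultdict(dict)
    for doc in docs:
        for term in doc["text"].lower().split():   # naive whitespace tokenizer
            postings = partial[term]
            postings[doc["docno"]] = postings.get(doc["docno"], 0) + 1
    return partial

def merge_indexes(partials):
    """Merge per-chunk indexes; chunks hold disjoint documents, so postings never clash."""
    merged = defaultdict(dict)
    for partial in partials:
        for term, postings in partial.items():
            merged[term].update(postings)
    return dict(merged)

def build_index_parallel(docs, n_workers=2):
    """Split the documents into chunks, index each chunk in a worker, then merge."""
    docs = list(docs)
    chunk_size = max(1, -(-len(docs) // n_workers))  # ceiling division
    chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(build_partial_index, chunks))
    return merge_indexes(partials)

# Made-up documents for illustration only
toy_docs = [
    {"docno": "1", "text": "parallel index construction"},
    {"docno": "2", "text": "index compression and index merging"},
    {"docno": "3", "text": "query caching"},
]
toy_index = build_index_parallel(toy_docs)
print(toy_index["index"])
```

Threads keep the sketch simple and portable inside a notebook. For CPU-bound tokenization in pure Python, a process pool would be the more likely route to a real speedup, at the cost of pickling the chunks and partial indexes.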
