# Classic IR using sparse document vectors

This notebook uses the `pyserini` Python package to explore classic information retrieval (IR) methods that represent documents as sparse vectors and rank them with the BM25 algorithm. It is based on:
* https://github.com/castorini/pyserini/blob/master/docs/usage-search.md
* https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_msmarco_passage_demo.ipynb#scrollTo=YacoQ28AZtQx

In [None]:
# Set needed Java environment variables
import os

os.environ['JVM_PATH'] = '/ix/cs2731_2025f/class_env/lib/jvm/lib/server/libjvm.so'
os.environ['JVM_PATH']

# Build and query an IR system
This section covers:
1. Load the preprocessed dataset (MS MARCO). Here's what the textbook says about MS MARCO:

> The MS MARCO (Microsoft Machine Reading Comprehension) collection of datasets includes 1 million real anonymized English questions from Microsoft Bing query logs together with a human generated answer and 9 million passages (Bajaj et al., 2016), that can be used both to test retrieval ranking and question answering.

2. Build a search engine on that dataset using BM25
3. Query the search engine

In [None]:
from pyserini.search.lucene import LuceneSearcher

# This loads a pre-built "index", which is a corpus modified to be more easily searchable.
# It also loads a searcher based on modified tf-idf representations of documents (using the BM25 algorithm)
lucene_bm25_searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage-full')
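
To give a rough sense of what an "index" is, here is a minimal sketch of a toy inverted index in plain Python. The three-passage corpus, the whitespace tokenizer, and the function name `build_inverted_index` are all made up for illustration; this is not how Lucene actually stores the MS MARCO index.

```python
from collections import defaultdict

# Hypothetical three-passage corpus, for illustration only
toy_corpus = {
    'd1': 'lobster roll with butter',
    'd2': 'maine lobster season',
    'd3': 'butter chicken recipe',
}

def build_inverted_index(corpus):
    """Map each term to a dict of {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for term in text.lower().split():  # naive whitespace tokenizer (an assumption)
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

toy_index = build_inverted_index(toy_corpus)
print(sorted(toy_index['lobster']))  # ids of the passages that contain "lobster"
```

Looking up a query term in such a structure returns only the documents that contain it, so a searcher does not have to scan every passage in the corpus.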

<span style="color:red">Fill in the `query` variable before running the following cell of code:</span>

In [None]:
query = '' # FILL IN your own query here or uncomment the example one below
# query = 'what is a lobster roll?'

hits = lucene_bm25_searcher.search(query)

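# Print the rank, document id, BM25 score, and first 400 characters of each hit
for i in range(min(10, len(hits))):  # guard against queries that return fewer than 10 hits
    print(f'{i+1:2} id {hits[i].docid} score {hits[i].score:.5f} {hits[i].lucene_document.get("raw"):.{400}}')
    print()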

FILL IN any observations from trying different queries here:

## View document vectors

Now choose one of the documents returned for the previous query. Let's take a look at its raw term frequency vector and its BM25-transformed vector.
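
To make the difference between a raw term frequency and a BM25 weight more concrete, here is a minimal sketch of one textbook-style BM25 term weight: $\log(N/df_t)$ multiplied by a length-normalized, saturating function of the term frequency. The function name `bm25_term_weight`, the parameter values `k=1.2` and `b=0.75`, and the corpus statistics are assumptions for illustration. Pyserini's Lucene backend uses its own BM25 variant and parameter settings, so these numbers should not be expected to match `compute_bm25_term_weight`.

```python
import math

def bm25_term_weight(tf, df, n_docs, doc_len, avg_doc_len, k=1.2, b=0.75):
    """BM25 weight of one term in one document (textbook-style formulation)."""
    idf = math.log(n_docs / df)                             # inverse document frequency
    length_norm = k * (1 - b + b * doc_len / avg_doc_len)   # document length normalization
    return idf * tf / (length_norm + tf)                    # tf is damped by length_norm

# Hypothetical statistics: 1,000 documents, a document of exactly average length
for tf in (1, 2, 10):
    weight = bm25_term_weight(tf, df=10, n_docs=1000, doc_len=100, avg_doc_len=100)
    print(tf, round(weight, 3))
```

Try varying `tf` and `df` in this sketch and compare the trend with what you see in the real vectors.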

<span style="color:red">Fill in the document's "id" as the `doc_id` variable in the next cell:</span>

In [None]:
from pyserini.index.lucene import LuceneIndexReader

index_reader = LuceneIndexReader.from_prebuilt_index('msmarco-v1-passage-full')

doc_id = '' # FILL IN the id of one of the retrieved documents
tf = index_reader.get_document_vector(doc_id)
print('******* Raw term frequency vector *******')
print(sorted(tf.items(), key=lambda x: x[0]))
print()

bm25_vector = {term: index_reader.compute_bm25_term_weight(doc_id, term, analyzer=None) for term in tf.keys()}
print('******* BM25-weighted vector *******')
print(sorted(bm25_vector.items(), key=lambda x: x[0]))

**What kinds of words are weighted more highly in BM25?**

# Evaluate your model

For ranked results from a search engine, one common metric is **mean reciprocal rank (MRR)**. For each test query, the system's score is the reciprocal of the rank of the first correct answer. For example, the score is 1/4 if the highest-ranked correct answer is at rank 4. The scores are then averaged over a test set $Q$ of queries, where $rank_i$ is the rank of the first correct answer for the $i$-th query:

$$ MRR = \frac{1}{|Q|} \sum^{|Q|}_{i=1} \frac{1}{rank_i}$$
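
The formula can be sketched in a few lines of Python. The function name `mean_reciprocal_rank`, the toy rankings, and the toy relevance judgments are made up for illustration. The sketch also assumes a common convention: a query with no correct answer among its results contributes 0.

```python
def mean_reciprocal_rank(rankings, relevant):
    """rankings: {query_id: [doc ids, best first]}
    relevant: {query_id: set of correct doc ids}"""
    total = 0.0
    for qid, ranked_docs in rankings.items():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(rankings)

# Hypothetical results for three queries
toy_rankings = {'q1': ['a', 'b', 'c', 'd'], 'q2': ['e', 'f'], 'q3': ['g', 'h']}
toy_relevant = {'q1': {'d'}, 'q2': {'e'}, 'q3': {'z'}}

# Reciprocal ranks are 1/4, 1/1, and 0, so the result is (1/4 + 1 + 0) / 3
print(mean_reciprocal_rank(toy_rankings, toy_relevant))
```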

In [None]:
# Load example queries ("topics")
from pyserini.search import get_topics

topics = get_topics('msmarco-passage-dev-subset')
print(f'{len(topics)} queries total')
topics[1102400]['title'] # An example query

In [None]:
# Run all test queries on MS MARCO corpus with your BM25 system
# Takes 1-5 minutes

from pyserini.search.lucene import LuceneSearcher
from tqdm import tqdm

def run_all_queries(file, topics, searcher):
    """Search every query in `topics` and write one tab-separated line per hit: query id, document id, rank."""
    with open(file, 'w') as runfile:
        print(f'Running {len(topics)} queries in total')
        for qid in tqdm(topics):
            query = topics[qid]['title']
            hits = searcher.search(query, 100) # only return the top 100 results
            for i in range(0, len(hits)):
                # Output format from https://github.com/castorini/pyserini/blob/master/pyserini/output_writer.py
                runfile.write(f'{qid}\t{hits[i].docid}\t{i+1}\n')

lucene_bm25_searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')

run_all_queries('run-msmarco-passage-bm25.txt', topics, lucene_bm25_searcher)
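
The run file written above has one tab-separated line per hit: query id, document id, and rank. If you want to inspect the results yourself, here is a minimal sketch of reading lines in that layout back into per-query rankings. The function name `parse_run_lines` and the sample ids are hypothetical.

```python
from collections import defaultdict

def parse_run_lines(lines):
    """Group 'query_id<TAB>doc_id<TAB>rank' lines into {query_id: [doc ids ordered by rank]}."""
    by_query = defaultdict(list)
    for line in lines:
        qid, doc_id, rank = line.rstrip('\n').split('\t')
        by_query[qid].append((int(rank), doc_id))
    # Sort each query's hits by rank and keep only the document ids
    return {qid: [doc_id for _, doc_id in sorted(pairs)] for qid, pairs in by_query.items()}

# Hypothetical lines in the same layout that run_all_queries writes
sample_lines = ['1102400\t7001\t1\n', '1102400\t7002\t2\n', '42\t9\t1\n']
print(parse_run_lines(sample_lines))
```

Since a file object iterates over its lines, the same function should also work when passed an open run file instead of a list of strings.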

In [None]:
from pyserini import search

qrels_path = search.get_qrels_file('msmarco-passage-dev-subset')
qrels_path

In [None]:
# Run evaluation to calculate mean reciprocal rank (MRR)
! python -m pyserini.eval.msmarco_passage_eval \
   {qrels_path} \
   run-msmarco-passage-bm25.txt

# (Optional) Try loading and processing your own corpus

**Explore the `session24_rag.ipynb` notebook before doing this.**

Possibilities:
* [This class's textbook as PDF](https://web.stanford.edu/~jurafsky/slp3/ed3bookaug20_2024.pdf)
* ACL Anthology of NLP papers: [full-text](https://huggingface.co/datasets/WINGNUS/ACL-OCL) or [BibTeX with abstracts](https://aclanthology.org/anthology+abstracts.bib.gz)
* Enron email corpus: [tar.gz](https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz) or [Kaggle download](https://www.kaggle.com/datasets/wcukierski/enron-email-dataset)
* Another website or blog of your choosing!