<a href="https://colab.research.google.com/github/liadmagen/NLP-Course/blob/master/IR_Ex1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
%pip install --upgrade -q ir_datasets rank_bm25

Welcome to the first Information Retrieval Lab!

In this lab, we take our first steps in information retrieval using BM25, a ranking function that builds on TF-IDF.

We will use the `ir_datasets` library to pull the ANTIQUE collection. It contains ~400k documents (answers) and ~2.5k queries.

In [41]:
import ir_datasets
from collections import defaultdict, Counter
import math

In [33]:
# Load the dataset (first time will download it)
dataset = ir_datasets.load("antique/train")

In [43]:
print("# of documents: ", dataset.docs_count())
print("# of queries: ", dataset.queries_count())

# of documents:  403666
# of queries:  2426


In [36]:
from itertools import islice
docs = list(islice(dataset.docs_iter(), 5000)) # First 5,000 docs only (for speed in class); avoids materializing all ~400k
queries = list(dataset.queries_iter())

In [37]:
for query in queries[:10]:
    print(query.text) # namedtuple<query_id, text>

What causes severe swelling and pain in the knees?
why don't they put parachutes underneath airplane seats?
how to clean alloy cylinder heads ?
how do i get them whiter?
What is Cloud 9 and 7th Heaven?
How do you like your eggs?
What is a conscience?
How do I get college money?
how can u tell when a person is tellin u a lie?
How do you transfer voicemail messages onto tape?


In [38]:
print(docs[:1])

[GenericDoc(doc_id='2020338_0', text="A small group of politicians believed strongly that the fact that Saddam Hussien remained in power after the first Gulf War was a signal of weakness to the rest of the world, one that invited attacks and terrorism. Shortly after taking power with George Bush in 2000 and after the attack on 9/11, they were able to use the terrorist attacks to justify war with Iraq on this basis and exaggerated threats of the development of weapons of mass destruction. The military strength of the U.S. and the brutality of Saddam's regime led them to imagine that the military and political victory would be relatively easy.")]


In [39]:
for doc in docs[:10]:
    print(doc.doc_id, doc.text)

2020338_0 A small group of politicians believed strongly that the fact that Saddam Hussien remained in power after the first Gulf War was a signal of weakness to the rest of the world, one that invited attacks and terrorism. Shortly after taking power with George Bush in 2000 and after the attack on 9/11, they were able to use the terrorist attacks to justify war with Iraq on this basis and exaggerated threats of the development of weapons of mass destruction. The military strength of the U.S. and the brutality of Saddam's regime led them to imagine that the military and political victory would be relatively easy.
2020338_1 Because there is a lot of oil in Iraq.
2020338_2 It is tempting to say that the US invaded Iraq because it has lots of oil, but the US is not a country in a deep economic problem that capturing other country’s oil is an actual need for survival. It is more likely that the Iraq invading Kuwait scenario would fall under that assumption.. I think that the US governme

## Task A: Building the Inverted Index
In this task, you will implement a simple preprocessing pipeline (lowercasing + whitespace splitting) and build an inverted index that maps each **term** to the documents it appears in.

In [42]:
def simple_tokenize(text):
    return text.lower().split()

inverted_index = defaultdict(list)
doc_store = {} # To retrieve full text later

for doc in docs:
    doc_store[doc.doc_id] = doc.text
    tokens = set(simple_tokenize(doc.text)) # Set ensures we only count doc once per term
    for token in tokens:
        inverted_index[token].append(doc.doc_id)

print(f"Index built with {len(inverted_index)} unique terms.")

Index built with 29733 unique terms.
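
To make the structure of the index concrete, the same construction can be run on a tiny corpus. This is a minimal sketch: the three documents and their IDs (`d1` to `d3`) are made up for illustration and are not taken from ANTIQUE.

```python
from collections import defaultdict

def simple_tokenize(text):
    return text.lower().split()

# Hypothetical toy corpus (not from ANTIQUE)
toy_docs = {
    "d1": "Cats purr when happy",
    "d2": "Dogs bark and cats hiss",
    "d3": "Happy dogs wag tails",
}

toy_index = defaultdict(list)
for doc_id, text in toy_docs.items():
    # set() so each document is listed at most once per term
    for token in set(simple_tokenize(text)):
        toy_index[token].append(doc_id)

for term in ("cats", "dogs", "happy", "purr"):
    print(term, "->", sorted(toy_index[term]))
```

Each term maps to its posting list, the IDs of the documents that contain it. Note that whitespace splitting leaves punctuation attached to tokens, so `knees?` and `knees` would be indexed as two different terms.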


## Task B: Boolean Retrieval
Implement a "matching" function that returns documents containing all words in a query.

In [None]:
def boolean_search(query, index):
    query_tokens = simple_tokenize(query)
    if not query_tokens:
        return []

    # Start with the doc list for the first word
    results = set(index.get(query_tokens[0], []))

    # Intersect with subsequent words (The "AND" logic)
    for token in query_tokens[1:]:
        results &= set(index.get(token, []))

    return list(results)

# Test it
sample_query = "how to cook"
hits = boolean_search(sample_query, inverted_index)
print(f"Boolean hits for '{sample_query}': {len(hits)}")
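
A common refinement of AND-matching is to intersect the shortest posting lists first, so the intermediate result shrinks as early as possible. The sketch below assumes the index maps each term to a list of doc IDs, as in Task A; `boolean_search_rarest_first` and the toy index are hypothetical names used only for illustration.

```python
def simple_tokenize(text):
    return text.lower().split()

def boolean_search_rarest_first(query, index):
    """AND-search that intersects posting lists from shortest to longest."""
    tokens = simple_tokenize(query)
    if not tokens:
        return []
    # A term missing from the index makes the whole AND result empty
    if any(t not in index for t in tokens):
        return []
    postings = sorted((set(index[t]) for t in set(tokens)), key=len)
    results = postings[0]
    for p in postings[1:]:
        results = results & p
        if not results:
            break  # nothing left to intersect
    return sorted(results)

# Hypothetical toy index for illustration
toy_index = {
    "how": ["d1", "d2", "d3", "d4"],
    "to": ["d1", "d2", "d4"],
    "cook": ["d2"],
}
print(boolean_search_rarest_first("how to cook", toy_index))
```

On a 5,000-document subset the difference is negligible, but with long posting lists for frequent words such as "how" and "to", starting from the rarest term avoids building large intermediate sets.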

## Task C: The BM25 Baseline
Now we use a library to see how statistical term weighting (BM25, a derivative of TF-IDF) improves on the binary "yes/no" of Boolean search by ranking the matching documents.
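
For reference, a common textbook form of the Okapi BM25 score is given below. Implementations differ in the exact IDF variant and in default parameter values, so treat this as a sketch and not as the precise formula used by `rank_bm25`.

$$
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}
$$

Here $f(q, D)$ is the number of times term $q$ occurs in document $D$, $|D|$ is the length of $D$ in tokens, $\mathrm{avgdl}$ is the average document length in the collection, and $k_1$ and $b$ are free parameters that control term-frequency saturation and length normalization. One classic choice for the IDF term, with $N$ documents in total and $n(q)$ of them containing $q$, is

$$
\mathrm{IDF}(q) = \ln \frac{N - n(q) + 0.5}{n(q) + 0.5}
$$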

In [None]:
from rank_bm25 import BM25Okapi

# Prepare corpus for BM25
tokenized_corpus = [simple_tokenize(doc.text) for doc in docs]
bm25 = BM25Okapi(tokenized_corpus)

def bm25_search(query, n=5):
    tokenized_query = simple_tokenize(query)
    # Get the top-n documents (n defaults to 5)
    top_docs = bm25.get_top_n(tokenized_query, docs, n=n)
    return top_docs

# Compare results
results = bm25_search("why do cats purr")
for i, res in enumerate(results):
    print(f"Rank {i+1}: {res.text[:100]}...")
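
To see what the library computes conceptually, here is a minimal from-scratch sketch of BM25 scoring on a toy corpus. It is not the `rank_bm25` implementation: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the "+1" IDF variant, and the three toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def simple_tokenize(text):
    return text.lower().split()

def bm25_scores(query, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query (textbook-style BM25)."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in corpus_tokens for term in set(d))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in simple_tokenize(query):
            if term not in tf:
                continue
            # "+ 1" inside the log keeps the IDF non-negative (assumed variant)
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus for illustration
toy_corpus = [
    "cats purr when they are content",
    "dogs bark at strangers",
    "why do cats purr so loudly",
]
toy_tokens = [simple_tokenize(t) for t in toy_corpus]
scores = bm25_scores("why do cats purr", toy_tokens)
best = max(range(len(scores)), key=scores.__getitem__)
print("best doc:", toy_corpus[best])
print([round(s, 3) for s in scores])
```

Unlike Boolean search, every document receives a score: the second document shares no term with the query and scores zero, while the first still receives partial credit for "cats" and "purr".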

## Lab Analysis Questions
- The Vocabulary Problem: Try searching for "stomach ache".  
  Does the Boolean search find documents that use the word "gastritis"? Why or why not?

- The Saturation Effect: In the BM25 formula, the score does not keep growing in proportion to term frequency: a word that appears 1000 times scores only slightly higher than one that appears 100 times.  
  Why is this important for web documents?

- Efficiency: How would our `inverted_index` dictionary perform if we had 100 million documents instead of 5,000?
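
As a starting point for the saturation question, the term-frequency factor of BM25 can be evaluated in isolation. This sketch assumes `k1 = 1.5` and switches length normalization off (`b = 0`); `tf_component` is a hypothetical helper, not part of `rank_bm25`.

```python
def tf_component(tf, k1=1.5):
    """Term-frequency factor of BM25 with length normalization off (b = 0)."""
    return tf * (k1 + 1) / (tf + k1)

for tf in (1, 10, 100, 1000):
    print(tf, round(tf_component(tf), 3))
# 1 1.0
# 10 2.174
# 100 2.463
# 1000 2.496
```

The factor approaches but never reaches `k1 + 1 = 2.5`, so repeating a word a thousand times gains almost nothing over repeating it a hundred times.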