# Notebook 1 (Updated): Preprocessing the Dataset

This notebook covers **Step 1** of the information retrieval project using the `.jsonl` and `.tsv` file formats. We will load the data, apply a series of preprocessing steps, and save the cleaned data for the retrieval models.

**Preprocessing Pipeline:**
1.  **Load Data**: Read `corpus.jsonl`, `queries.jsonl`, and the `qrels/*.tsv` files.
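2.  **Lowercasing**: Convert all text to lowercase.
3.  **Tokenization**: Split the lowercased text into individual words (tokens).
4.  **POS Tagging**: Tag each token with its part of speech.
5.  **Stopword & Punctuation Removal**: Remove common English words, punctuation, and any token that is not purely alphabetic.
6.  **Lemmatization**: Reduce the remaining words to their base form (lemma) using the part-of-speech tags.
7.  **Inverted Index**: Build a term-to-document index from the processed corpus.
8.  **Save Processed Data**: Store the cleaned data and the index for use in the next notebook.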

## 1. Setup and Installation

First, we'll install and import the necessary libraries. We'll use `pandas` for data handling, `nltk` for natural language processing, and `tqdm` for progress bars.

In [1]:
!pip install pandas nltk tqdm

# Download the necessary NLTK data packages
import nltk
nltk.download('punkt')                      # tokenizer
nltk.download('stopwords')                  # stopword list
nltk.download('wordnet')                    # lemmatizer lexicon
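nltk.download('omw-1.4')                    # Open Multilingual WordNet data (some NLTK versions need it for lemmatization)
nltk.download('averaged_perceptron_tagger') # POS tagger model used by nltk.pos_tag (newer NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead)
nltk.download('punkt_tab')                  # tokenizer tables required by newer NLTK releases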
print("\n✅ NLTK resources downloaded successfully!")

Defaulting to user installation because normal site-packages is not writeable

✅ NLTK resources downloaded successfully!


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/moorateeahtashil/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/moorateeahtashil/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/moorateeahtashil/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/moorateeahtashil/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/moorateeahtashil/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/moorateeahtashil/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## 2. Load the Dataset

This section loads the dataset from its two file formats:
- **`corpus.jsonl` & `queries.jsonl`**: These are JSON Lines files, where each line is a separate JSON object. We read them line-by-line.
- **`qrels/*.tsv`**: These are Tab-Separated Value files. We'll load them and combine them into a single DataFrame for our purposes.
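
To make the two formats concrete, here is a minimal sketch that parses both from in-memory strings using only the standard library. The sample records are made up for illustration; only the key names (`_id`, `title`, `text`, `query-id`, `corpus-id`, `score`) follow what the loading cell expects.

```python
import csv
import io
import json

# Hypothetical two-line JSONL sample with the keys the loader reads.
jsonl_sample = (
    '{"_id": "3", "title": "", "text": "First document."}\n'
    '{"_id": "31", "title": "A title", "text": "Second document."}\n'
)

# Hypothetical qrels sample: a header row, then tab-separated values.
tsv_sample = "query-id\tcorpus-id\tscore\n0\t3\t1\n4\t31\t0\n"

# JSON Lines: parse each non-empty line as its own JSON object.
docs = [json.loads(line) for line in io.StringIO(jsonl_sample) if line.strip()]

# TSV: DictReader uses the header row as keys; every value is read as a string.
qrels = list(csv.DictReader(io.StringIO(tsv_sample), delimiter="\t"))

print(docs[1]["title"])       # A title
print(qrels[0]["corpus-id"])  # 3
```

Because each JSONL line is independent, the file can be streamed without loading it all at once, which is what the line-by-line loop in the next cell relies on.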

In [2]:
import pandas as pd
import json
import os
from glob import glob

# --- Configuration ---
# IMPORTANT: Adjust these paths if your files are in a different location.
CORPUS_FILE = '../fiqa/corpus.jsonl'
QUERIES_FILE = '../fiqa/queries.jsonl'
QRELS_DIR = '../fiqa/qrels/'

# --- Load Corpus --- 
print(f"Loading corpus from {CORPUS_FILE}...")
corpus_data = []
with open(CORPUS_FILE, 'r', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)
        # Best practice: combine title and text for a richer document representation
        full_text = item.get('title', '') + ' ' + item.get('text', '')
        corpus_data.append({
            'doc_id': str(item['_id']), # Ensure IDs are strings
            'text': full_text.strip()
        })
corpus_df = pd.DataFrame(corpus_data)
print(f"Loaded {len(corpus_df)} documents.")
display(corpus_df.head())

# --- Load Queries ---
print(f"\nLoading queries from {QUERIES_FILE}...")
queries_data = []
with open(QUERIES_FILE, 'r', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)
        queries_data.append({
            'query_id': str(item['_id']), # Ensure IDs are strings
            'text': item['text']
        })
queries_df = pd.DataFrame(queries_data)
print(f"Loaded {len(queries_df)} queries.")
display(queries_df.head())

# --- Load and Combine Qrels ---
print(f"\nLoading qrels from {QRELS_DIR}...")
qrels_files = sorted(glob(os.path.join(QRELS_DIR, '*.tsv')))  # sort for a reproducible file order
qrels_df_list = []
for file_path in qrels_files:
    df = pd.read_csv(file_path, sep='\t')
    # Standardize column names
    df.rename(columns={'query-id': 'query_id', 'corpus-id': 'doc_id'}, inplace=True)
    # Ensure IDs are strings for consistency
    df['query_id'] = df['query_id'].astype(str)
    df['doc_id'] = df['doc_id'].astype(str)
    qrels_df_list.append(df)

qrels_df = pd.concat(qrels_df_list, ignore_index=True)
# Filter for only positive relevance scores
qrels_df = qrels_df[qrels_df['score'] > 0]
print(f"Loaded and combined {len(qrels_df)} relevance judgments from {len(qrels_files)} files.")
display(qrels_df.head())

Loading corpus from ../fiqa/corpus.jsonl...
Loaded 57638 documents.


|   | doc_id | text |
|---|--------|------|
| 0 | 3 | I'm not saying I don't like the idea of on-the... |
| 1 | 31 | So nothing preventing false ratings besides ad... |
| 2 | 56 | You can never use a health FSA for individual ... |
| 3 | 59 | Samsung created the LCD and other flat screen ... |
| 4 | 63 | Here are the SEC requirements: The federal sec... |



Loading queries from ../fiqa/queries.jsonl...
Loaded 6648 queries.


|   | query_id | text |
|---|----------|------|
| 0 | 0 | What is considered a business expense on a bus... |
| 1 | 4 | Business Expense - Car Insurance Deductible Fo... |
| 2 | 5 | Starting a new online business |
| 3 | 6 | “Business day” and “due date” for bills |
| 4 | 7 | New business owner - How do taxes work for the... |



Loading qrels from ../fiqa/qrels/...
Loaded and combined 17110 relevance judgments from 3 files.


|   | query_id | doc_id | score |
|---|----------|--------|-------|
| 0 | 0 | 18850 | 1 |
| 1 | 4 | 196463 | 1 |
| 2 | 5 | 69306 | 1 |
| 3 | 6 | 560251 | 1 |
| 4 | 6 | 188530 | 1 |
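
Relevance judgments are usually looked up per query. The sketch below is not part of this notebook's pipeline; it only illustrates, on hand-written tuples, how the positive-score filter above combines with grouping by `query_id`. The name `relevant_docs` and the non-relevant pair are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical judgments as (query_id, doc_id, score) tuples; IDs kept as strings.
judgments = [
    ("0", "18850", 1),
    ("6", "560251", 1),
    ("6", "188530", 1),
    ("6", "999", 0),  # made-up non-relevant pair, dropped by the filter
]

relevant_docs = defaultdict(set)
for query_id, doc_id, score in judgments:
    if score > 0:  # same positive-score filter as in the loading cell
        relevant_docs[query_id].add(doc_id)

print(sorted(relevant_docs["6"]))  # ['188530', '560251']
```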


## 3. Preprocessing Pipeline

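The `preprocess_text` function lowercases and tokenizes the input, POS-tags the tokens, drops stopwords, punctuation, and any token that is not purely alphabetic, and lemmatizes what remains. The same function is applied to both documents and queries so that their terms stay comparable.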

In [3]:
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def get_wordnet_pos(tag):
    """Map NLTK POS tag to a format WordNetLemmatizer can understand."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # Default to noun

def preprocess_text(text):
    """Applies the full preprocessing pipeline to a single string of text."""
    # 1. Tokenize and lowercase
    tokens = word_tokenize(text.lower())
    
    # 2. Part-of-Speech (POS) tagging
    pos_tags = nltk.pos_tag(tokens)
    
    # 3. Lemmatize with POS tags and remove stopwords/punctuation
    lemmas = []
    for word, tag in pos_tags:
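        # Skip stopwords, punctuation, and any token that is not purely alphabetic
        # (isalpha() also drops numbers and hyphenated or contracted tokens)
        if word not in stop_words and word not in punctuation and word.isalpha():
            # Map the Penn Treebank tag to a WordNet POS for the lemmatizer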
            wnet_pos = get_wordnet_pos(tag)
            lemma = lemmatizer.lemmatize(word, pos=wnet_pos)
            lemmas.append(lemma)
            
    return lemmas

# --- Example of preprocessing ---
sample_text = "Investing in stocks can be a rewarding but risky endeavor. What are the best strategies for beginners?"
processed_sample = preprocess_text(sample_text)
print(f"Original: {sample_text}")
print(f"Processed: {processed_sample}")

Original: Investing in stocks can be a rewarding but risky endeavor. What are the best strategies for beginners?
Processed: ['invest', 'stock', 'rewarding', 'risky', 'endeavor', 'best', 'strategy', 'beginner']
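
The `isalpha()` check in `preprocess_text` removes more than punctuation. The sketch below applies the same condition to a hand-written token list to show what survives. The tokens and the one-word stopword set are stand-ins chosen for illustration, not output from NLTK; the `ca` / `n't` pair mimics how Treebank-style tokenizers typically split "can't".

```python
# Hand-written tokens; the stopword set is a tiny stand-in for NLTK's English list.
tokens = ["ca", "n't", "on-the-job", "401k", "3.5", "%", "stocks", "the"]
tiny_stop_words = {"the"}

kept = [t for t in tokens if t not in tiny_stop_words and t.isalpha()]
dropped = [t for t in tokens if t not in kept]

print(kept)     # ['ca', 'stocks']
print(dropped)  # ["n't", 'on-the-job', '401k', '3.5', '%', 'the']
```

Numbers and mixed tokens such as `401k` are discarded along with punctuation, which may matter for financial text, while contraction fragments such as `ca` pass the filter.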


## 4. Apply Preprocessing to Corpus and Queries

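Now, we apply our function to the `text` column of both the corpus and queries DataFrames. This is the slowest step in the notebook: in the run recorded here, the 57,638-document corpus took about 3 minutes 15 seconds and the 6,648 queries about 2 seconds.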

In [4]:
from tqdm import tqdm
tqdm.pandas()

print("Preprocessing corpus...")
# Using progress_apply to see a progress bar
corpus_df['processed_text'] = corpus_df['text'].progress_apply(preprocess_text)

print("\nPreprocessing queries...")
queries_df['processed_text'] = queries_df['text'].progress_apply(preprocess_text)

print("\n--- Processed Corpus Sample ---")
display(corpus_df.head())

print("\n--- Processed Queries Sample ---")
display(queries_df.head())

Preprocessing corpus...


100%|██████████| 57638/57638 [03:15<00:00, 294.78it/s]



Preprocessing queries...


100%|██████████| 6648/6648 [00:02<00:00, 3011.48it/s]


--- Processed Corpus Sample ---





|   | doc_id | text | processed_text |
|---|--------|------|----------------|
| 0 | 3 | I'm not saying I don't like the idea of on-the... | [say, like, idea, training, ca, expect, compan... |
| 1 | 31 | So nothing preventing false ratings besides ad... | [nothing, prevent, false, rating, besides, add... |
| 2 | 56 | You can never use a health FSA for individual ... | [never, use, health, fsa, individual, health, ... |
| 3 | 59 | Samsung created the LCD and other flat screen ... | [samsung, create, lcd, flat, screen, technolog... |
| 4 | 63 | Here are the SEC requirements: The federal sec... | [sec, requirement, federal, security, law, def... |



--- Processed Queries Sample ---


|   | query_id | text | processed_text |
|---|----------|------|----------------|
| 0 | 0 | What is considered a business expense on a bus... | [consider, business, expense, business, trip] |
| 1 | 4 | Business Expense - Car Insurance Deductible Fo... | [business, expense, car, insurance, deductible... |
| 2 | 5 | Starting a new online business | [start, new, online, business] |
| 3 | 6 | “Business day” and “due date” for bills | [business, day, due, date, bill] |
| 4 | 7 | New business owner - How do taxes work for the... | [new, business, owner, tax, work, business, v,... |


## 5. Create an Inverted Index

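An **inverted index** is a data structure that maps each term to the documents containing it. A query then only needs to read the entries for its own terms instead of scanning every document, which is what makes candidate retrieval efficient. We build it from our processed corpus; this version stores only document IDs, not term frequencies or positions.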
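
As a minimal sketch of the idea, the example below builds an index over three made-up documents and answers a boolean AND query by intersecting posting sets. The names `toy_docs` and `docs_with_all` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical processed documents: doc_id -> list of lemmas.
toy_docs = {
    "d1": ["stock", "market", "risk"],
    "d2": ["stock", "dividend"],
    "d3": ["bond", "market"],
}

index = defaultdict(set)
for doc_id, terms in toy_docs.items():
    for term in terms:
        index[term].add(doc_id)

def docs_with_all(query_terms):
    """Return doc_ids containing every query term (boolean AND)."""
    # .get avoids creating empty entries for unseen terms in the defaultdict
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()

print(sorted(docs_with_all(["stock", "market"])))  # ['d1']
print(sorted(docs_with_all(["stock", "gold"])))    # []
```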

In [5]:
from collections import defaultdict

def create_inverted_index(corpus_df):
    """Map each term to the list of doc_ids whose processed text contains it."""
    inverted_index = defaultdict(set) # Use a set to avoid duplicate doc_ids
    for _, row in tqdm(corpus_df.iterrows(), total=corpus_df.shape[0], desc="Building Index"):
        doc_id = row['doc_id']
        terms = row['processed_text']
        for term in terms:
            inverted_index[term].add(doc_id)
    # Convert sets to lists for JSON serialization (the order within each list is arbitrary)
    for term in inverted_index:
        inverted_index[term] = list(inverted_index[term])
    return inverted_index

inverted_index = create_inverted_index(corpus_df[['doc_id', 'processed_text']])

# --- Example lookup in the inverted index ---
if 'stock' in inverted_index:
    print("\nDocuments containing the word 'stock':")
    print(inverted_index['stock'][:10]) # Show first 10

Building Index: 100%|██████████| 57638/57638 [00:02<00:00, 26615.32it/s]



Documents containing the word 'stock':
['260094', '368079', '57960', '98654', '539263', '434838', '305128', '513734', '217286', '354815']


## 6. Save Processed Data

Finally, we save our processed DataFrames and the inverted index to disk. We will use these files in the next notebook (`2_Models_and_Evaluation.ipynb`) to build and evaluate our IR models.

In [6]:
OUTPUT_DIR = '../fiqa/processed_data'
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Save the DataFrames with pickle so the list-valued 'processed_text' column
# is restored as Python lists rather than as strings
corpus_df.to_pickle(os.path.join(OUTPUT_DIR, 'corpus_processed.pkl'))
queries_df.to_pickle(os.path.join(OUTPUT_DIR, 'queries_processed.pkl'))
qrels_df.to_pickle(os.path.join(OUTPUT_DIR, 'qrels.pkl'))

# Save inverted index as a JSON file
with open(os.path.join(OUTPUT_DIR, 'inverted_index.json'), 'w', encoding='utf-8') as f:
    json.dump(inverted_index, f)

print(f"\n✅ Preprocessing complete. All processed data saved to the '{OUTPUT_DIR}' directory.")


✅ Preprocessing complete. All processed data saved to the '../fiqa/processed_data' directory.
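
When the index is read back from JSON, each posting list is a plain list again. The sketch below uses a made-up two-term index (no file I/O) to show why the sets had to be converted before saving and how a consumer could rebuild them; `toy_index` and `restored_sets` are hypothetical names.

```python
import json

# Hypothetical two-term index in the same shape as the saved file.
toy_index = {"stock": ["57960", "98654"], "bond": ["3"]}

# Sets are not JSON serializable, which is why the notebook converts them to lists.
try:
    json.dumps({"stock": {"57960"}})
except TypeError as exc:
    print(f"TypeError: {exc}")

# Round trip: what json.dump writes comes back as a dict of lists.
restored = json.loads(json.dumps(toy_index))

# Rebuild sets if the consumer needs fast membership tests or intersections.
restored_sets = {term: set(doc_ids) for term, doc_ids in restored.items()}
print("57960" in restored_sets["stock"])  # True
```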
