# DX 704 Week 10 Project

In this project, you will implement document search within a question and answer database and assess its performance.


The full project description and a template notebook are available on GitHub: [Project 10 Materials](https://github.com/bu-cds-dx704/dx704-project-10).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download the SQuAD-explorer Data Set

You may use the code provided below.

In [1]:
!git clone https://github.com/rajpurkar/SQuAD-explorer

Cloning into 'SQuAD-explorer'...


remote: Enumerating objects: 5563, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 5563 (delta 11), reused 17 (delta 6), pack-reused 5539 (from 1)
Receiving objects: 100% (5563/5563), 52.26 MiB | 29.15 MiB/s, done.
Resolving deltas: 100% (3563/3563), done.


In [2]:
import json

In [3]:
with open("SQuAD-explorer/dataset/train-v1.1.json") as fp:
    train_data = json.load(fp)

In [4]:
type(train_data)

dict

In [5]:
list(train_data.keys())

['data', 'version']

In [6]:
type(train_data["data"])

list

In [7]:
len(train_data["data"])

442

In [8]:
type(train_data["data"][0])

dict

In [9]:
train_data["data"][0].keys()

dict_keys(['title', 'paragraphs'])

In [10]:
train_data["data"][0]["title"]

'University_of_Notre_Dame'

In [11]:
len(train_data["data"][0]["paragraphs"])

55

In [12]:
train_data["data"][0]["paragraphs"][0]

{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'qas': [{'answers': [{'answer_start': 515,
     'text': 'Saint Bernadette Soubirous'}],
   'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
   'id': '5733be284776f41900661182'},
  {'answers': [{'answer_start': 188, 'text': 'a copper statue of Christ

In [13]:
sum(len(doc["paragraphs"]) for doc in train_data["data"])

18896

## Part 2: Restructure JSON Data for Processing

Parse the file "SQuAD-explorer/dataset/train-v1.1.json" above to produce a file "parsed.tsv" with columns document_title, paragraph_index, and paragraph_context.
The paragraph_index column should be zero-indexed, so zero for the first paragraph of each document.
Use the pandas `to_csv` method to write the file, since the paragraph text contains many quotes and other characters that would otherwise need manual handling.
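
The reason for relying on `to_csv` can be seen in a small round trip. The sketch below uses made-up rows (the titles and text are illustrative, not taken from the data set) and an in-memory buffer instead of a file; it only suggests how embedded quotes and tabs are handled, assuming the default quoting settings.

```python
import io

import pandas as pd

# Toy rows (made up for illustration) with characters that break naive TSV writing.
toy = pd.DataFrame(
    {
        "document_title": ["Doc_A", "Doc_A"],
        "paragraph_index": [0, 1],
        "paragraph_context": [
            'The legend reads "Venite Ad Me Omnes".',
            "A tab\tinside the text",
        ],
    }
)

buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)  # pandas quotes fields where needed
buffer.seek(0)
round_trip = pd.read_csv(buffer, sep="\t")

# The text should survive unchanged, including the embedded quotes and tab.
contexts_match = (
    round_trip["paragraph_context"].tolist() == toy["paragraph_context"].tolist()
)
print(contexts_match)
```

Writing the same rows by joining fields with `"\t"` by hand would split the second context into two columns when read back.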

In [None]:
# YOUR CHANGES HERE
import pandas as pd

parsed_rows = []

for article in train_data['data']:
    document_title = article['title']
    
    for paragraph_index, paragraph in enumerate(article['paragraphs']):
        paragraph_context = paragraph['context']
        
        parsed_rows.append({
            'document_title': document_title,
            'paragraph_index': paragraph_index,
            'paragraph_context': paragraph_context
        })

df = pd.DataFrame(parsed_rows)

df.to_csv('parsed.tsv', sep='\t', index=False, encoding='utf-8')



Submit "parsed.tsv" in Gradescope.

## Part 3: Prepare Suitable Paragraph Vectors for Document Search

Design and implement a vector of length 1024 for each paragraph, based on the paragraph's text.
Note that this will be much smaller than the number of distinct words in the training data.

Hint: you can base your vectors on any techniques covered in this module so far.
Beware that they will be automatically assessed (along with the question vectors of part 4) to make sure they retain useful information.
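
To illustrate the gap between vector length and vocabulary size, here is a minimal sketch on three made-up sentences. It assumes a TF-IDF design with scikit-learn's `TfidfVectorizer`, which is only one of the possible techniques, and uses a cap of 4 features in place of 1024 so the effect is visible on toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences standing in for paragraphs.
toy_paragraphs = [
    "The school has a Catholic character.",
    "A genome sequence lists the order of every DNA base.",
    "The airfield was constructed during World War II.",
]

# Without a cap, every distinct token becomes a feature.
full = TfidfVectorizer().fit(toy_paragraphs)
vocabulary_size = len(full.vocabulary_)

# With max_features, only that many terms are kept as columns.
capped = TfidfVectorizer(max_features=4)
capped_vectors = capped.fit_transform(toy_paragraphs)

print(vocabulary_size, capped_vectors.shape)
```

With `max_features` set, scikit-learn keeps the terms with the highest term frequency across the corpus, so rarer words are dropped from the vectors.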

In [None]:
import json
import pickle

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the parsed data
df = pd.read_csv('parsed.tsv', sep='\t', encoding='utf-8')
print(f"Loaded {len(df)} paragraphs")

# Create TF-IDF vectors capped at 1024 features
# (the vocabulary is far larger, so this yields exactly 1024 columns)
print("Creating TF-IDF vectors...")
tfidf_vectorizer = TfidfVectorizer(
    max_features=1024,
    stop_words='english'
)

# Fit and transform the paragraph contexts
paragraph_vectors = tfidf_vectorizer.fit_transform(df['paragraph_context']).toarray()

print(f"Paragraph vectors shape: {paragraph_vectors.shape}")

# Encode each vector as a JSON list
df['paragraph_vector_json'] = [json.dumps(vec.tolist()) for vec in paragraph_vectors]

# Select only the required columns
output_df = df[['document_title', 'paragraph_index', 'paragraph_vector_json']]

# Save to compressed TSV (gzip is inferred from the ".gz" extension)
output_df.to_csv('paragraph-vectors.tsv.gz', sep='\t', index=False)

# Save the vectorizer for later use with questions
with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

print("\nSaved:")
print(f"- paragraph-vectors.tsv.gz (shape: {output_df.shape})")
print("- tfidf_vectorizer.pkl")



Loaded 18896 paragraphs
Creating TF-IDF vectors...
Paragraph vectors shape: (18896, 1024)

Saved:
- paragraph-vectors.tsv.gz (shape: (18896, 3))
- tfidf_vectorizer.pkl


Save your paragraph vectors in a file "paragraph-vectors.tsv.gz" with columns document_title, paragraph_index, and paragraph_vector_json where paragraph_vector_json is a JSON encoded list.

Hint: don't forget the ".gz" extension indicating gzip compression.
The Pandas `.to_csv` method will automatically add the compression if you save data with a filename ending in ".gz", so you just need to pass it the right filename.
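
The file layout described above can be checked on a toy example. This sketch uses made-up vectors and a hypothetical file name in a temporary directory; it suggests one way to confirm that the output really is gzip-compressed and that the JSON column decodes back to the original numbers.

```python
import gzip
import json
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up vectors standing in for real paragraph vectors.
toy_vectors = np.array([[0.0, 0.5, 0.25], [1.0, 0.0, 0.125]])
toy = pd.DataFrame(
    {
        "document_title": ["Doc_A", "Doc_A"],
        "paragraph_index": [0, 1],
        "paragraph_vector_json": [json.dumps(v.tolist()) for v in toy_vectors],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-vectors.tsv.gz")  # hypothetical file name
    toy.to_csv(path, sep="\t", index=False)  # compression inferred from ".gz"

    # gzip.open only reads real gzip data, so this confirms the compression.
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        header = fp.readline().rstrip("\n")

    loaded = pd.read_csv(path, sep="\t")  # decompression is inferred the same way

# Decode the JSON lists back into a 2-D array.
decoded = np.array([json.loads(s) for s in loaded["paragraph_vector_json"]])
print(header.split("\t"), decoded.shape)
```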

Submit "paragraph-vectors.tsv.gz" in Gradescope.

## Part 4: Encode Question Vectors with the Same Design

Read the questions in "questions.tsv" and encode them in the same way that you encoded the paragraph vectors.
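
"The same way" matters because question and paragraph vectors are only comparable if their columns mean the same thing. A minimal sketch, assuming a TF-IDF design and using made-up sentences: the vectorizer is fitted on the paragraphs only, and the questions are passed through `transform` so they reuse the paragraph vocabulary and IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up paragraphs and questions for illustration.
toy_paragraphs = [
    "The golden dome sits atop the main building.",
    "The genome contains all genetic material of an organism.",
    "Football is played with a round ball.",
]
toy_questions = [
    "What sits atop the main building?",
    "What does a genome contain?",
]

vectorizer = TfidfVectorizer(max_features=1024, stop_words="english")
paragraph_vecs = vectorizer.fit_transform(toy_paragraphs)  # learns vocabulary and IDF
question_vecs = vectorizer.transform(toy_questions)  # reuses them unchanged

# Dot products between every question and every paragraph.
similarity = (question_vecs @ paragraph_vecs.T).toarray()
best_paragraph = similarity.argmax(axis=1)
print(question_vecs.shape[1] == paragraph_vecs.shape[1], best_paragraph.tolist())
```

Calling `fit_transform` on the questions instead would build a separate vocabulary, so column 0 of a question vector would no longer correspond to column 0 of a paragraph vector.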

In [None]:
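# YOUR CHANGES HERE
questions_df = pd.read_csv('questions.tsv', sep='\t', encoding='utf-8')
print(f"Loaded {len(questions_df)} questions")

# Reuse the vectorizer fitted on the paragraphs so both share one vocabulary
with open('tfidf_vectorizer.pkl', 'rb') as f:
    tfidf_vectorizer = pickle.load(f)

# transform (not fit_transform) keeps the paragraph vocabulary and IDF weights
question_vectors = tfidf_vectorizer.transform(questions_df['question']).toarray()
print(f"Question vectors shape: {question_vectors.shape}")

questions_df['question_vector_json'] = [json.dumps(vec.tolist()) for vec in question_vectors]

output_df = questions_df[['question_id', 'question_vector_json']]

Loaded 100 questions
Question vectors shape: (100, 1024)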

Save your question vectors in "question-vectors.tsv" with columns question_id and question_vector_json.

In [None]:
# YOUR CHANGES HERE
output_df.to_csv('question-vectors.tsv', sep='\t', index=False)

print("Saved:")
print(f"- question-vectors.tsv (shape: {output_df.shape})")

Saved:
- question-vectors.tsv (shape: (100, 2))

Submit "question-vectors.tsv" in Gradescope.

## Part 5: Match Questions to Paragraphs using Nearest Neighbors

Match your question vectors to paragraph vectors and identify the top 5 paragraph vectors for each question using nearest neighbors.
Specifically, use the Euclidean distance between the vectors.
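
As a small illustration of the matching step, the sketch below ranks made-up 2-D vectors by Euclidean distance to a single query. The helper name `top_k_euclidean` is hypothetical, and the brute-force approach is just one way to do a nearest-neighbor search.

```python
import numpy as np


def top_k_euclidean(paragraph_vectors, question_vector, k=5):
    """Return indices of the k paragraph vectors closest to question_vector."""
    # Broadcasting subtracts the question from every row before taking norms.
    distances = np.linalg.norm(paragraph_vectors - question_vector, axis=1)
    # A stable sort keeps tied paragraphs in their original order.
    return np.argsort(distances, kind="stable")[:k]


# Made-up vectors; real ones would have 1024 dimensions.
toy_paragraph_vectors = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [1.0, 1.0]]
)
toy_question_vector = np.array([1.0, 0.2])

nearest = top_k_euclidean(toy_paragraph_vectors, toy_question_vector, k=3)
print(nearest.tolist())  # [1, 4, 0]
```

The distances here are about 1.02, 0.2, 2.06, 3.44, and 0.8, so rows 1, 4, and 0 are the three nearest, in that order.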


In [28]:
# YOUR CHANGES HERE


para_df = pd.read_csv('paragraph-vectors.tsv.gz', sep='\t', encoding='utf-8')
para_vectors = np.array([json.loads(vec) for vec in para_df['paragraph_vector_json']])

quest_df = pd.read_csv('question-vectors.tsv', sep='\t', encoding='utf-8')
quest_vectors = np.array([json.loads(vec) for vec in quest_df['question_vector_json']])

matches = []

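for q_idx, question_id in enumerate(quest_df['question_id']):
    # Euclidean distance from this question to every paragraph vector
    distances = np.linalg.norm(para_vectors - quest_vectors[q_idx], axis=1)
    
    # Indices of the 5 closest paragraphs, nearest first
    top_5_indices = np.argsort(distances)[:5]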
    
    for rank, para_idx in enumerate(top_5_indices):
        matches.append({
            'question_id': question_id,
            'question_rank': rank,
            'document_title': para_df.iloc[para_idx]['document_title'],
            'paragraph_index': para_df.iloc[para_idx]['paragraph_index']
        })




Save your top matches in a file "question-matches.tsv" with columns question_id, question_rank, document_title, and paragraph_index.


In [29]:
# YOUR CHANGES HERE
matches_df = pd.DataFrame(matches)
matches_df.to_csv('question-matches.tsv', sep='\t', index=False)

Submit "question-matches.tsv" in Gradescope.

## Part 6: Spot Check Question and Paragraph Matches

Review the paragraphs matched to the first 5 questions (sorted by question_id ascending).
Which paragraph was the worst match for each question?


In [32]:
import pandas as pd

# Load the data
questions_df = pd.read_csv('questions.tsv', sep='\t', encoding='utf-8')
matches_df = pd.read_csv('question-matches.tsv', sep='\t', encoding='utf-8')
parsed_df = pd.read_csv('parsed.tsv', sep='\t', encoding='utf-8')

# Get first 5 questions sorted by question_id
first_5_questions = questions_df.sort_values('question_id').head(5)

# Review matches for each question
for _, question_row in first_5_questions.iterrows():
    q_id = question_row['question_id']
    question_text = question_row['question']
    
    print(f"\n{'='*80}")
    print(f"QUESTION ID: {q_id}")
    print(f"Question: {question_text}")
    print(f"{'='*80}\n")
    
    # Get top 5 matches for this question
    question_matches = matches_df[matches_df['question_id'] == q_id].sort_values('question_rank')
    
    for _, match in question_matches.iterrows():
        rank = match['question_rank']
        doc_title = match['document_title']
        para_idx = match['paragraph_index']
        
        # Get the paragraph text
        paragraph = parsed_df[
            (parsed_df['document_title'] == doc_title) & 
            (parsed_df['paragraph_index'] == para_idx)
        ]['paragraph_context'].values[0]
        
        print(f"Rank {rank}: {doc_title} (paragraph {para_idx})")
        print(f"{paragraph[:300]}...")
        print(f"{'-'*80}\n")

# After reviewing above, manually fill in the worst matches
worst_paragraphs = [
    {'question_id': '1', 'document_title': 'Genome', 'paragraph_index': 9},
    {'question_id': '4', 'document_title': 'Association_football', 'paragraph_index': 2},
    {'question_id': '7', 'document_title': 'High-definition_television', 'paragraph_index': 11},
    {'question_id': '10', 'document_title': 'Education', 'paragraph_index': 26},
    {'question_id': '13', 'document_title': 'American_Idol', 'paragraph_index': 8},
]

worst_df = pd.DataFrame(worst_paragraphs)
worst_df.to_csv('worst-paragraphs.tsv', sep='\t', index=False)


QUESTION ID: 1
Question: What was the goal of the abuse of region project?

Rank 0: Tuvalu (paragraph 49)
The eastern shoreline of Funafuti Lagoon was modified during World War II when the airfield (what is now Funafuti International Airport) was constructed. The coral base of the atoll was used as fill to create the runway. The resulting borrow pits impacted the fresh-water aquifer. In the low areas of...
--------------------------------------------------------------------------------

Rank 1: Bill_%26_Melinda_Gates_Foundation (paragraph 6)
The IJM used the grant money to found "Project Lantern" and established an office in the Philippines city of Cebu. In 2010 the results of the project were published, in which the IJM stated that Project Lantern had led to "an increase in law enforcement activity in sex trafficking cases, an increase...
--------------------------------------------------------------------------------

Rank 2: Genome (paragraph 8)
Whereas a genome sequence lists the 

Write a file "worst-paragraphs.tsv" with three columns: question_id, document_title, and paragraph_index.

Submit "worst-paragraphs.tsv" in Gradescope.
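
Because the rows in this file are filled in by hand, a quick consistency check may help catch typos in titles or indices. The sketch below uses made-up data and assumes each worst paragraph should be one of the matches already recorded for that question; it flags any row that does not appear among the matches.

```python
import pandas as pd

# Made-up matches and hand-picked worst paragraphs for illustration.
toy_matches = pd.DataFrame(
    {
        "question_id": [1, 1, 4, 4],
        "question_rank": [0, 1, 0, 1],
        "document_title": ["Tuvalu", "Genome", "Education", "Association_football"],
        "paragraph_index": [49, 8, 26, 2],
    }
)
toy_worst = pd.DataFrame(
    {
        "question_id": [1, 4],
        "document_title": ["Genome", "American_Idol"],
        "paragraph_index": [8, 8],
    }
)

keys = ["question_id", "document_title", "paragraph_index"]
# A left merge with indicator=True marks rows that have no partner in the matches.
checked = toy_worst.merge(toy_matches[keys], on=keys, how="left", indicator=True)
not_in_matches = checked[checked["_merge"] == "left_only"]
print(not_in_matches[keys].to_dict("records"))
```

For this check to work on real files, the key columns need the same types on both sides, for example integer question ids in both data frames.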

## Part 7: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

## Part 8: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.