# DX 704 Week 10 Project

In this project, you will implement document search within a question and answer database and assess its performance.


The full project description and a template notebook are available on GitHub: [Project 10 Materials](https://github.com/bu-cds-dx704/dx704-project-10).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Download the SQuAD-explorer Data Set

You may use the code provided below.

In [1]:
!git clone https://github.com/rajpurkar/SQuAD-explorer

fatal: destination path 'SQuAD-explorer' already exists and is not an empty directory.


In [2]:
import json

In [3]:
with open("SQuAD-explorer/dataset/train-v1.1.json") as fp:
    train_data = json.load(fp)

In [4]:
type(train_data)

dict

In [5]:
list(train_data.keys())

['data', 'version']

In [6]:
type(train_data["data"])

list

In [7]:
len(train_data["data"])

442

In [8]:
type(train_data["data"][0])

dict

In [9]:
train_data["data"][0].keys()

dict_keys(['title', 'paragraphs'])

In [10]:
train_data["data"][0]["title"]

'University_of_Notre_Dame'

In [11]:
len(train_data["data"][0]["paragraphs"])

55

In [12]:
train_data["data"][0]["paragraphs"][0]

{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'qas': [{'answers': [{'answer_start': 515,
     'text': 'Saint Bernadette Soubirous'}],
   'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
   'id': '5733be284776f41900661182'},
  {'answers': [{'answer_start': 188, 'text': 'a copper statue of Christ

In [13]:
sum(len(doc["paragraphs"]) for doc in train_data["data"])

18896

## Part 2: Restructure JSON Data for Processing

Parse the file "SQuAD-explorer/dataset/train-v1.1.json" above to produce a file "parsed.tsv" with columns document_title, paragraph_index, and paragraph_context.
The paragraph_index column should be zero-indexed, so zero for the first paragraph of each document.
Use the pandas `to_csv` method to write the file, since the paragraph text contains many quotes and other characters that are hard to escape correctly by hand.
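As a rough illustration of why `to_csv` is recommended here, the sketch below writes a few made-up rows containing quotes, a tab, and a newline to an in-memory buffer and reads them back. The rows and the `Doc_A` title are hypothetical and are not taken from the data set.

```python
import io

import pandas as pd

# Hypothetical rows containing characters that break naive tab-joining.
rows = pd.DataFrame({
    "document_title": ["Doc_A", "Doc_A"],
    "paragraph_index": [0, 1],
    "paragraph_context": ['He said "hello"\tand left.', "Line one\nline two"],
})

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
rows.to_csv(buffer, sep="\t", index=False)

# Read the text back and confirm that every paragraph survived unchanged.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
round_trip_ok = (
    restored["paragraph_context"].tolist() == rows["paragraph_context"].tolist()
)
print(round_trip_ok)
```

With the default settings, pandas quotes any field that contains the separator, a quote character, or a line break, which is what lets the round trip succeed.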

In [14]:
with open('pretty_json.json', 'w') as f:
    json.dump(train_data, f, indent=4)

In [15]:
# Parse the contents of train_data to create a new file called "parsed.tsv"
import pandas as pd

parsed_file = []

# Loop through the JSON to collect every paragraph along with its document title
for doc in train_data['data']:
    title = doc['title']
    for idx, paragraph in enumerate(doc['paragraphs']):
        context = paragraph['context']
        parsed_file.append({
            'document_title': title,
            'paragraph_index': idx,
            'paragraph_context': context
        })

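# convert to dataframe
df = pd.DataFrame(parsed_file)

# make sure the output directory exists before writing
import os
os.makedirs('submission', exist_ok=True)
df.to_csv('submission/parsed.tsv', sep='\t', index=False)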

# print preview of the data
print(f"Created parsed.tsv with {len(df)} rows.")
print("Sample of the first couple rows:")
print(df.head())

Created parsed.tsv with 18896 rows.
Sample of the first couple rows:
             document_title  paragraph_index  \
0  University_of_Notre_Dame                0   
1  University_of_Notre_Dame                1   
2  University_of_Notre_Dame                2   
3  University_of_Notre_Dame                3   
4  University_of_Notre_Dame                4   

                                   paragraph_context  
0  Architecturally, the school has a Catholic cha...  
1  As at most other universities, Notre Dame's st...  
2  The university is the major seat of the Congre...  
3  The College of Engineering was established in ...  
4  All of Notre Dame's undergraduate students are...  


Submit "parsed.tsv" in Gradescope.

## Part 3: Prepare Suitable Paragraph Vectors for Document Search

Design and implement paragraph vectors of length 1024 based on each paragraph's text.
Note that this will be much smaller than the number of distinct words in the training data.

Hint: you can base your vectors on any techniques covered in this module so far.
Beware that they will be automatically assessed (along with the question vectors of part 4) to make sure they retain useful information.

In [16]:
# Build TF-IDF paragraph vectors of length 1024 from the paragraph text
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1024)
X = vectorizer.fit_transform(df['paragraph_context'])

print(f"Paragraph vectors shape: {X.shape}")
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Number of features: {X.shape[1]}")

Paragraph vectors shape: (18896, 1024)
Vocabulary size: 1024
Number of features: 1024


Save your paragraph vectors in a file "paragraph-vectors.tsv.gz" with columns document_title, paragraph_index, and paragraph_vector_json where paragraph_vector_json is a JSON encoded list.

Hint: don't forget the ".gz" extension indicating gzip compression.
The Pandas `.to_csv` method will automatically add the compression if you save data with a filename ending in ".gz", so you just need to pass it the right filename.
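One possible way to produce a JSON encoded list column and confirm the gzip compression is sketched below. The tiny 2 x 4 matrix, the `Doc_A` title, and the `example-vectors.tsv.gz` filename are placeholders for illustration, not the real 18896 x 1024 vectors.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical 2 x 4 stand-in for the real paragraph vector matrix.
vectors = np.array([[0.0, 0.5, 0.0, 0.25], [1.0, 0.0, 0.0, 0.0]])

frame = pd.DataFrame({
    "document_title": ["Doc_A", "Doc_A"],
    "paragraph_index": [0, 1],
    # json.dumps turns each row into a JSON array string.
    "paragraph_vector_json": [json.dumps(row.tolist()) for row in vectors],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example-vectors.tsv.gz")
    # compression defaults to "infer", so the ".gz" suffix selects gzip.
    frame.to_csv(path, sep="\t", index=False)

    # Every gzip file starts with the two bytes 0x1f 0x8b.
    with open(path, "rb") as fp:
        is_gzip = fp.read(2) == b"\x1f\x8b"

    restored = pd.read_csv(path, sep="\t")

# Decode the JSON strings back into a numeric matrix.
decoded = np.array([json.loads(s) for s in restored["paragraph_vector_json"]])
vectors_match = np.allclose(decoded, vectors)
print(is_gzip, vectors_match)
```

Encoding each row explicitly with `json.dumps` makes the column format independent of how pandas happens to print Python lists.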

In [17]:
# Save the vectors in a new file
df_X = pd.DataFrame(X.toarray())
df_vectors = pd.DataFrame({'document_title': df['document_title'], 'paragraph_index': df['paragraph_index'], 'paragraph_vector_json': df_X.values.tolist()})
df_vectors.to_csv('submission/paragraph-vectors.tsv.gz', sep='\t', index=False)

Submit "paragraph-vectors.tsv.gz" in Gradescope.

## Part 4: Encode Question Vectors with the Same Design

Read the questions in "questions.tsv" and encode them in the same way that you encoded the paragraph vectors.

In [18]:
# Read the questions file and encode using the SAME vectorizer from paragraphs
df_questions = pd.read_csv('questions.tsv', sep='\t')

# Use transform() NOT fit_transform() - we want to use the same vocabulary as paragraphs
questions_json = vectorizer.transform(df_questions['question'])

print(f"Question vectors shape: {questions_json.shape}")
print(f"Paragraph vectors shape: {X.shape}")
print(f"Vectors have same number of features: {questions_json.shape[1] == X.shape[1]}")
df_questions.head()

Question vectors shape: (100, 1024)
Paragraph vectors shape: (18896, 1024)
Vectors have same number of features: True


   question_id                                           question
0            1   What was the goal of the abuse of region project?
1            4  How many satellites in the Beidou-1 constellat...
2            7  When did Beyoncé  receive ten nominations for ...
3           10  With which goddess did Sulla, Pompey, and Juli...
4           13  What area is considered to have a desert clima...


Save your question vectors in "question-vectors.tsv" with columns question_id and question_vector_json.

In [19]:
# Save the question vectors in a new file
df_ques_vec = pd.DataFrame({'question_id': df_questions['question_id'], 'question_vector_json': questions_json.toarray().tolist()})
df_ques_vec.to_csv('submission/question-vectors.tsv', sep='\t', index=False)

Submit "question-vectors.tsv" in Gradescope.

## Part 5: Match Questions to Paragraphs using Nearest Neighbors

Match your question vectors to paragraph vectors and identify the top 5 paragraph vectors for each question using nearest neighbors.
Specifically, use the Euclidean distance between the vectors.
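To make the distance computation concrete, here is a minimal brute-force sketch in NumPy using random stand-in vectors (not the real data). It assumes every vector has unit L2 norm, which is the default output of `TfidfVectorizer` (`norm="l2"`) for any text that contains at least one vocabulary term. Under that assumption, $\lVert q - p \rVert^2 = 2 - 2\, q \cdot p$, so ranking by Euclidean distance matches ranking by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 50 "paragraph" vectors and 3 "question" vectors in 8 dimensions.
paragraphs = rng.random((50, 8))
questions = rng.random((3, 8))

# Scale every row to unit L2 norm.
paragraphs /= np.linalg.norm(paragraphs, axis=1, keepdims=True)
questions /= np.linalg.norm(questions, axis=1, keepdims=True)

# Pairwise Euclidean distances via broadcasting: shape (3, 50).
diffs = questions[:, None, :] - paragraphs[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=2))

# Indices of the 5 closest paragraphs per question, nearest first.
top5_euclidean = np.argsort(distances, axis=1)[:, :5]

# For unit vectors, sorting by descending dot product gives the same order.
similarities = questions @ paragraphs.T
top5_cosine = np.argsort(-similarities, axis=1)[:, :5]

identity_holds = np.allclose(distances ** 2, 2 - 2 * similarities)
same_ranking = np.array_equal(top5_euclidean, top5_cosine)
print(identity_holds, same_ranking)
```

A question with no words in the vocabulary becomes an all-zero vector, and the identity above does not apply to it, so such questions are worth checking separately.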


In [20]:
# Match question vectors to paragraph vectors using nearest neighbors
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Convert vector lists to numpy arrays
paragraph_vectors = np.array(df_vectors['paragraph_vector_json'].tolist())
question_vectors = np.array(df_ques_vec['question_vector_json'].tolist())

print(f"Paragraph vectors shape: {paragraph_vectors.shape}")
print(f"Question vectors shape: {question_vectors.shape}")

# Check if dimensions match
if paragraph_vectors.shape[1] != question_vectors.shape[1]:
    raise ValueError(f"Dimension mismatch! Paragraphs have {paragraph_vectors.shape[1]} features, "
                     f"but questions have {question_vectors.shape[1]} features. "
                     f"Make sure to re-run Part 3 before Part 4!")

# Initialize and fit the model on PARAGRAPH vectors (the search space)
model = NearestNeighbors(n_neighbors=5, metric='euclidean')
model.fit(paragraph_vectors)

# Find nearest paragraphs for each QUESTION vector
distances, indices = model.kneighbors(question_vectors)

# Print preview of the predictions
print("\nSample of the nearest neighbors predictions:")
print(f"Total questions: {len(question_vectors)}")
print(f"Total paragraphs: {len(paragraph_vectors)}")
print("\nFirst 5 questions with their top 5 paragraph matches:")
for i in range(min(5, len(df_ques_vec))):
    print(f"\nQuestion ID: {df_ques_vec['question_id'].iloc[i]}")
    print(f"  Top 5 Nearest Paragraph Indices: {indices[i]}")
    print(f"  Distances: {distances[i]}")
    print(f"  Document Titles: {[df_vectors['document_title'].iloc[idx] for idx in indices[i]]}")  

Paragraph vectors shape: (18896, 1024)
Question vectors shape: (100, 1024)

Sample of the nearest neighbors predictions:
Total questions: 100
Total paragraphs: 18896

First 5 questions with their top 5 paragraph matches:

Question ID: 1
  Top 5 Nearest Paragraph Indices: [17565  1682 15050  8514 13052]
  Distances: [0.96707749 1.06718706 1.06821745 1.08058451 1.09104342]
  Document Titles: ['Tuvalu', 'Genome', 'Rajasthan', 'Bill_%26_Melinda_Gates_Foundation', 'Tibet']

Question ID: 4
  Top 5 Nearest Paragraph Indices: [ 1402  8005 15809 13021  4128]
  Distances: [0.97116657 1.01456967 1.08084652 1.08354636 1.11257042]
  Document Titles: ['Dog', 'Multiracial_American', 'The_Blitz', 'Rule_of_law', 'Classical_music']

Question ID: 7
  Top 5 Nearest Paragraph Indices: [ 1166 15803  1267  6242  5036]
  Distances: [1.09428625 1.1887239  1.19491097 1.20351572 1.20422867]
  Document Titles: ['Buddhism', 'The_Blitz', 'American_Idol', 'Gymnastics', 'High-definition_television']

Question ID: 10


Save your top matches in a file "question-matches.tsv" with columns question_id, question_rank, document_title, and paragraph_index.


In [28]:
# Create the question matches file
matches = []

for q_idx, question_id in enumerate(df_ques_vec['question_id']):
    # Get the 5 nearest paragraph indices for this question
    for rank, para_idx in enumerate(indices[q_idx]):
        matches.append({
            'question_id': question_id,
            'question_rank': rank,  # 0-4 for top 5 matches
            'document_title': df_vectors['document_title'].iloc[para_idx],
            'paragraph_index': df_vectors['paragraph_index'].iloc[para_idx]
        })

# Create DataFrame and save
df_matches = pd.DataFrame(matches)
df_matches.to_csv('submission/question-matches.tsv', sep='\t', index=False)

print(f"Created question-matches.tsv with {len(df_matches)} rows")
print(f"Expected: {len(df_ques_vec)} questions × 5 matches = {len(df_ques_vec) * 5} rows")
print("\nSample of first question's matches:")
print(df_matches[df_matches['question_id'] == df_ques_vec['question_id'].iloc[0]])

Created question-matches.tsv with 500 rows
Expected: 100 questions × 5 matches = 500 rows

Sample of first question's matches:
   question_id  question_rank                     document_title  \
0            1              0                             Tuvalu   
1            1              1                             Genome   
2            1              2                          Rajasthan   
3            1              3  Bill_%26_Melinda_Gates_Foundation   
4            1              4                              Tibet   

   paragraph_index  
0               49  
1                8  
2                9  
3                6  
4               10  


Submit "question-matches.tsv" in Gradescope.

## Part 6: Spot Check Question and Paragraph Matches

Review the paragraphs matched to the first 5 questions (sorted by question_id ascending).
Which paragraph was the worst match for each question?


In [None]:
# Review the first 5 questions and their matched paragraphs
import textwrap

# Get the first 5 questions, sorted by question_id ascending
first_5_questions = df_ques_vec.sort_values('question_id').head(5)

for q_idx in range(5):
    question_id = first_5_questions['question_id'].iloc[q_idx]
    question_text = df_questions[df_questions['question_id'] == question_id]['question'].iloc[0]
    
    print("=" * 80)
    print(f"QUESTION ID: {question_id}")
    print(f"Question: {question_text}")
    print("=" * 80)
    
    # Find the matches for this question in our matches dataframe
    question_matches = df_matches[df_matches['question_id'] == question_id].sort_values('question_rank')
    
    for _, match in question_matches.iterrows():
        rank = match['question_rank']
        doc_title = match['document_title']
        para_idx = match['paragraph_index']
        
        # Get the actual paragraph text
        paragraph = df[(df['document_title'] == doc_title) & 
                      (df['paragraph_index'] == para_idx)]['paragraph_context'].iloc[0]
        
        print(f"\n--- RANK {rank + 1} ---")
        print(f"Document: {doc_title}")
        print(f"Paragraph Index: {para_idx}")
        print(f"\nParagraph Text:")
        # Wrap text to 80 characters for readability
        wrapped_text = textwrap.fill(paragraph, width=80)
        print(wrapped_text)
        print()
    
    print("\n")


QUESTION ID: 1
Question: What was the goal of the abuse of region project?

--- RANK 1 ---
Document: Tuvalu
Paragraph Index: 49

Paragraph Text:
The eastern shoreline of Funafuti Lagoon was modified during World War II when
the airfield (what is now Funafuti International Airport) was constructed. The
coral base of the atoll was used as fill to create the runway. The resulting
borrow pits impacted the fresh-water aquifer. In the low areas of Funafuti the
sea water can be seen bubbling up through the porous coral rock to form pools
with each high tide. Since 1994 a project has been in development to assess the
environmental impact of transporting sand from the lagoon to fill all the borrow
pits and low-lying areas on Fongafale. In 2014 the Tuvalu Borrow Pits
Remediation (BPR) project was approved in order to fill 10 borrow pits, leaving
Tafua Pond, which is a natural pond. The New Zealand Government funded the BPR
project. The project was carried out in 2015 with 365,000 sqm of sand bei

Write a file "worst-paragraphs.tsv" with three columns: question_id, document_title, and paragraph_index.

Submit "worst-paragraphs.tsv" in Gradescope.

In [33]:
# create worst paragraphs dataframe
question_ids = [1, 4, 7, 10, 13]
worst_paragraph_idx = [9, 5, 12, 50, 9]
document_titles = ['Rajasthan', 'Classical_music', 'Gymnastics', 'Computer', 'Southeast_Asia']

df_worst = pd.DataFrame({'question_id': question_ids, 'document_title': document_titles, 'paragraph_index': worst_paragraph_idx})
df_worst.to_csv('submission/worst-paragraphs.tsv', sep='\t', index=False)
print("worst-paragraphs.tsv file created")

worst-paragraphs.tsv file created


## Part 7: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.
You do not need to provide code for data collection if you did that manually.

## Part 8: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.