## Instructions
* Read each cell and implement the **TODOs** sequentially. The markdown/text cells also contain instructions which you need to follow to get the whole notebook working.
* Do not change the variable names unless the instructor allows you to.
* Do not delete the **TODO** comment blocks.
* Aside from the TODOs, there will be questions embedded in the notebook and a cell for you to provide your answer (denoted with A:). Answer all the markdown/text cells with **"A: "** on them. 
* You are expected to look up how some functions work on the Internet or in the docs.
* You may add new cells for "scrap work".
* The notebooks will undergo a "Restart and Run All" command, so make sure that your code is working properly.
* You are expected to understand the data set loading and processing separately from this class.
* You may not reproduce this notebook or share it with anyone.

# Assignment 2.1 - Natural Language Processing Notebook 
In this notebook, you will be asked to implement a simple n-gram based spell checker and a simple tf-idf based document retrieval.

In [229]:
import numpy as np
import requests
import re
import os
from tqdm import tqdm
import glob
import pandas as pd
import nltk
from nltk.metrics import distance



# Simple N-gram based Spell Checker

The assignment will help you understand how language models and probabilistic methods can be applied to the task of spelling correction.

To implement a spell checker, you will need to first identify and suggest corrections for misspelled words. 
When a word is flagged as potentially misspelled, your program will compare it against the n-gram model, generate a list of possible corrections based on the highest n-gram similarity scores, and suggest the most probable corrections to the user. 

### Creating the N-gram model

As a simplification, you will not be creating the n-gram model from scratch. Instead, we will use publicly available n-gram frequency counts for the English language.

Source: Multi-Lex (https://analytics.huma-num.fr/popr-ngram/Multi-LEX/index.html)

Let's load the 2-gram csv.

In [230]:
df_bigram = pd.read_csv("ENG2_million.csv", delimiter="\t")
df_bigram

Unnamed: 0,NGRAM,GRAM1,GRAM2,POS1,POS2,OCCURRENCE,FPM,ZFI,POS_PC
0,to be,to,be,prt,verb,47783417,3.113632,7.255341,100.000000
1,it is,it,is,pron,verb,36860809,3.000920,7.104395,100.000000
2,it was,it,was,pron,verb,35600591,2.985812,7.084163,100.000000
3,do not,do,not,verb,adv,27042969,2.866409,6.924258,100.000000
4,did not,did,not,verb,adv,25077663,2.833642,6.880376,99.997472
...,...,...,...,...,...,...,...,...,...
999995,to surname,to,surname,prt,verb,125,-2.468735,-0.220617,91.200000
999996,familiar analogy,familiar,analogy,adj,noun,226,-2.211537,0.123825,100.000000
999997,technical arrangements,technical,arrangements,adj,noun,879,-1.621656,0.913799,100.000000
999998,winch were,winch,were,noun,verb,50,-2.866675,-0.753542,100.000000


This csv provides a million 2-gram pairs, so you will need to handle words that are missing.

The columns are as follows: <br />
- NGRAM: The pair of words <br />
- GRAM1 and GRAM2: Each of the words separated <br />
- POS1 and POS2: Part of speech tags of each word<br />
- OCCURRENCE: Number of times the sequence was read in the corpus. <br/>
- FPM: Frequency Per Million occurrence. <br/>
- ZFI: Standardized Frequency Index. <br/>

For convenience, we can use a dictionary data structure to store the 2-gram frequency counts.

In [231]:
# Since the data is very large,
# I am testing my logic first with a smaller self-created dataframe.
# Create a sample dataframe 
data = { 
    'GRAM1': ['hello', 'hello', 'good', 'good', 'nice'], 
    'GRAM2': ['world', 'there', 'morning', 'evening', 'day'], 
    'OCCURRENCE': [10, 5, 20, 15, 8] 
    } 

# Convert the data dictionary into a pandas dataframe 
df_test = pd.DataFrame(data) 

# Display the dataframe 
df_test

Unnamed: 0,GRAM1,GRAM2,OCCURRENCE
0,hello,world,10
1,hello,there,5
2,good,morning,20
3,good,evening,15
4,nice,day,8


In [232]:
# Initialize the dictionary
bi_gram_dict = {}

# Iterate over rows of the dataframe
for _, row in df_test.iterrows():
    word1 = row['GRAM1']
    word2 = row['GRAM2']
    occurrence = row['OCCURRENCE']
    
    # If the first word is not in the dictionary, add it
    if word1 not in bi_gram_dict:
        bi_gram_dict[word1] = {}
    
    # Add the second word and its frequency count to the nested dictionary
    bi_gram_dict[word1][word2] = occurrence
bi_gram_dict

{'hello': {'world': 10, 'there': 5},
 'good': {'morning': 20, 'evening': 15},
 'nice': {'day': 8}}

In [233]:
bi_gram_dict = {}
############################################################
# TODO-01: Convert the n-gram counts from the pandas       #
# dataframe to a dictionary where                          #
# bi_gram_dict[word1][word2] gives the frequency count of  #
# the pair of words.                                       #
############################################################

# Iterate over the three needed columns in parallel
# (same result as iterrows, but much faster on a million rows)
for word1, word2, occurrence in zip(df_bigram['GRAM1'], df_bigram['GRAM2'], df_bigram['OCCURRENCE']):
    # If the first word is not in the dictionary, add it
    if word1 not in bi_gram_dict:
        bi_gram_dict[word1] = {}

    # Add the second word and its frequency count to the nested dictionary
    bi_gram_dict[word1][word2] = occurrence


############################################################
#                    End of your code.                     #
############################################################

Next, we want to get a list of English words to serve as our vocabulary. But since the n-gram data provided above is not complete, we will get the vocabulary list from the Brown corpus instead.

In [234]:
from nltk.corpus import brown
nltk.download('brown')
print(brown.readme())

BROWN CORPUS

A Standard Corpus of Present-Day Edited American
English, for use with Digital Computers.

by W. N. Francis and H. Kucera (1964)
Department of Linguistics, Brown University
Providence, Rhode Island, USA

Revised 1971, Revised and Amplified 1979

http://www.hit.uib.no/icame/brown/bcm.html

Distributed with the permission of the copyright holder,
redistribution permitted.



[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\ningw\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


We can get all the words through the function ``.words()``.

In [235]:
print("List of words", brown.words())
print("Number of words", len(brown.words()))

List of words ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
Number of words 1161192


As you can see, the vocabulary (the list of words) contains both uppercase and lowercase letters.
But we only want lowercase letters. Therefore, we first need to convert the vocabulary to lowercase letters.

Hint: the ``set()`` function of python can be useful for this.

In [236]:
############################################################
# TODO-02: Convert the words to lowercase and get all the  #
# unique words into the variable vocabulary_lowercased.    #
############################################################


# Extract words, lowercase them, and find unique words
vocabulary_lowercased = set(word.lower() for word in brown.words())


############################################################
#                    End of your code.                     #
############################################################
N_vocab = len(vocabulary_lowercased)
print("Number of unique words in the vocabulary", N_vocab)

Number of unique words in the vocabulary 49815


Implement the function that computes the 2-gram probability with Laplace smoothing.

Recall that the formula for the standard 2-gram is:
$$P(w_n | w_{n-1}) = \frac{\text{count}(w_{n-1}w_n)}{\sum_{w^\prime} \text{count}(w_{n-1}w^\prime)} = \frac{\text{count}(w_{n-1}w_n)}{\text{count}(w_{n-1})}$$

To apply Laplace smoothing, we simply add one to each count, which results in the following formula, where $V$ is the vocabulary size:
$$P_{\text{Laplace}}(w_n | w_{n-1}) = \frac{\text{count}(w_{n-1}w_n) + 1}{\sum_{w^\prime} \left( \text{count}(w_{n-1}w^\prime) + 1 \right)} = \frac{\text{count}(w_{n-1}w_n) + 1}{\text{count}(w_{n-1}) + V}$$
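
As a quick numeric check of the smoothed formula, here is a minimal sketch on made-up counts. The names `toy_counts`, `toy_vocab`, and `laplace_probability` are illustrative assumptions and are not part of the Multi-Lex data or the graded functions. It shows that adding $V$ to the denominator keeps the probabilities over the whole vocabulary summing to one.

```python
# Toy counts for bi-grams that start with "the" (made-up numbers).
toy_counts = {"the": {"cat": 5, "dog": 2}}
toy_vocab = ["the", "cat", "dog", "sat", "barked"]  # V = 5


def laplace_probability(previous_word, word, counts, vocab):
    """Add-one smoothed P(word | previous_word) over a fixed vocabulary."""
    following = counts.get(previous_word, {})
    pair_count = following.get(word, 0)
    previous_count = sum(following.values())
    return (pair_count + 1) / (previous_count + len(vocab))


p_seen = laplace_probability("the", "cat", toy_counts, toy_vocab)    # (5 + 1) / (7 + 5)
p_unseen = laplace_probability("the", "sat", toy_counts, toy_vocab)  # (0 + 1) / (7 + 5)
total = sum(laplace_probability("the", w, toy_counts, toy_vocab) for w in toy_vocab)

print(p_seen, p_unseen, total)
```

The unseen pair still receives a small non-zero probability, which is the point of the smoothing.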


In [237]:
############################################################
# TODO-03: Implement the function that computes for the    #
# laplace smoothed bi-gram probabilities. Since the bi-    #
# gram model is incomplete (only a million rows), you      #
# would need to handle the words that do not appear in the #
# bi-gram model at all. For simplicity, you can just       #
# assign it a very low probability (e.g. 1e-10)            #
############################################################
def bi_gram_probability(word1, word2, bi_gram_dict, vocabulary_size=None):
    """
    Computes the bi-gram probability using a nested dictionary and Laplace smoothing.

    Args:
    - word1: The first word in the bi-gram.
    - word2: The second word in the bi-gram.
    - bi_gram_dict: Nested dictionary where bi_gram_dict[word1][word2] gives the count.
    - vocabulary_size: Optional V for the smoothing term (e.g. N_vocab). If omitted,
      V is the number of distinct words that appear anywhere in bi_gram_dict.

    Returns:
    - Smoothed probability of the bi-gram (word1, word2).
    """
    # Define a low probability for first words that the model has never seen
    low_probability = 1e-10

    # If word1 is not found, return the low probability
    if word1 not in bi_gram_dict:
        return low_probability

    # V is the number of distinct words, not the number of stored bi-gram pairs
    if vocabulary_size is None:
        distinct_words = set(bi_gram_dict)
        for following_words in bi_gram_dict.values():
            distinct_words.update(following_words)
        vocabulary_size = len(distinct_words)

    # Get the count for the bi-gram (word1, word2)
    bigram_count = bi_gram_dict[word1].get(word2, 0)

    # Compute the total count of bi-grams starting with word1
    unigram_count = sum(bi_gram_dict[word1].values())

    # Apply Laplace smoothing
    return (bigram_count + 1) / (unigram_count + vocabulary_size)


############################################################
#                    End of your code.                     #
############################################################

In [238]:
# Example nested bi-gram dictionary (kept under its own name so that it does not
# overwrite the bi_gram_dict built from df_bigram above)
toy_bi_gram_dict = {
    "the": {"cat": 5, "dog": 2},
    "cat": {"sat": 3, "barked": 1},
    "dog": {"barked": 4}
}

# Compute probabilities
prob1 = bi_gram_probability("the", "cat", toy_bi_gram_dict)
prob2 = bi_gram_probability("cat", "barked", toy_bi_gram_dict)
prob3 = bi_gram_probability("dog", "meowed", toy_bi_gram_dict)  # Unseen bi-gram
prob4 = bi_gram_probability("bird", "flew", toy_bi_gram_dict)  # Unseen first word

print("Probability of ('the', 'cat'):", prob1)
print("Probability of ('cat', 'barked'):", prob2)
print("Probability of ('dog', 'meowed'):", prob3)
print("Probability of ('bird', 'flew'):", prob4)


Probability of ('the', 'cat'): 0.5
Probability of ('cat', 'barked'): 0.2222222222222222
Probability of ('dog', 'meowed'): 0.1111111111111111
Probability of ('bird', 'flew'): 1e-10


### Implementing the Spell Checker components

For the spell checker, we would need to first pre-process the text and identify the potentially misspelled words.

In [239]:
###############################################################
# TODO-04: Implement the function that pre-processes the text #
# such as converting to lowercase and removing all symbols    #
# or punctuation marks. The function returns the              #
# preprocessed string.                                        #
###############################################################

def preprocess_text(text):
    """
    Preprocesses the text by converting to lowercase and removing symbols/punctuation.

    Args:
    - text: The input string.

    Returns:
    - The preprocessed string.
    """
    # Convert text to lowercase
    text = text.lower()

    # Use regular expressions to remove symbols and punctuation
    text = re.sub(r'[^\w\s]', '', text)  # Removes anything that's not a word character or whitespace

    # Strip leading and trailing whitespace (inner spacing is left as is)
    text = text.strip()

    return text

sample_text = "Hello, World! This is a test, right? Let's preprocess this: text."
clean_text = preprocess_text(sample_text)

print("Original Text:", sample_text)
print("Preprocessed Text:", clean_text)

##############################################################
#                    End of your code.                       #
##############################################################

Original Text: Hello, World! This is a test, right? Let's preprocess this: text.
Preprocessed Text: hello world this is a test right lets preprocess this text


In [240]:
###############################################################
# TODO-05: Implement the function that checks if the word is  #
# potentially misspelled. It should return True if            #
# misspelled and False otherwise.                             #
###############################################################
def is_misspelled(word, vocab):
    pass

###############################################################
#                    End of your code.                        #
###############################################################

For convenience of computing the n-grams later on, we also want to get the surrounding k words before and after the misspelled word. A possible way to structure this is shown below.
```
[
    {
        "word": ...,
        "before": ...,
        "after": ...,
    },
    ...
]
```
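
One way the structure above could be produced is sketched below. The function name `find_misspelled_with_context`, the `window` parameter, and the toy vocabulary are hypothetical; the sketch assumes the text is already preprocessed and that a word counts as misspelled when it is not in the vocabulary.

```python
def find_misspelled_with_context(text, vocab, window=1):
    """Return one record per out-of-vocabulary token with its neighbours."""
    tokens = text.split()
    records = []
    for index, token in enumerate(tokens):
        if token in vocab:
            continue
        records.append({
            "word": token,
            # Slices shrink automatically at the start and end of the text.
            "before": tokens[max(0, index - window):index],
            "after": tokens[index + 1:index + 1 + window],
        })
    return records


toy_vocab = {"the", "quick", "fox", "jumps"}  # hypothetical vocabulary
records = find_misspelled_with_context("the quick brwn fox jumps", toy_vocab)
print(records)
```

A misspelled word at the very start or end of the text gets an empty `before` or `after` list, which the later probability step has to handle.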

In [241]:
############################################################
# TODO-06: Implement the function that gets all the        #
# misspelled words in the provided text. Returns a list of #
# {"word":..., "before":..., "after":...}                  #
############################################################

def get_all_misspelled_words(text, vocab):
    pass

    
############################################################
#                    End of your code.                     #
############################################################

For each misspelled word, we need to generate a list of candidate words that could potentially be the correct word. To do this, we want to get all the words within an edit distance of at most $k$ from the misspelled word.

Look up the function: ``distance.edit_distance`` 

https://www.nltk.org/api/nltk.metrics.distance.html#nltk.metrics.distance.edit_distance

For now, let's use $k=2$.
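
To make the idea concrete, here is a minimal sketch of candidate generation on a toy vocabulary. A plain Levenshtein distance written by hand stands in for `distance.edit_distance` so that the block has no dependencies; `levenshtein`, `candidates_within`, and `toy_vocab` are hypothetical names.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def candidates_within(word, vocab, k):
    """All vocabulary words whose edit distance to `word` is at most k."""
    return sorted(v for v in vocab if levenshtein(word, v) <= k)


toy_vocab = {"brown", "born", "bran", "quick", "fox"}  # made-up vocabulary
print(candidates_within("brwn", toy_vocab, 1))
print(candidates_within("brwn", toy_vocab, 2))
```

A larger $k$ admits more candidates, so the ranking step that follows has more work to do. Scanning the whole vocabulary for every misspelled word is simple but slow on large vocabularies.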

In [242]:
############################################################
# TODO-07: Implement the function that gets all candidate  #
# words that are at most k edit distance from the given    #
# word. Return a list of candidate words.                  #
############################################################
def get_candidate_words(word, vocab, k):
    pass

############################################################
#                    End of your code.                     #
############################################################

Lastly, we want to get the top $k$ suggested corrections ranked by their n-gram probabilities. For this, you can compute two kinds of probabilities based on the context or surrounding words: the probability of the corrected word coming after the previous word, and the probability of the next word coming after the corrected word. For simplicity, you can add the two probabilities.
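
The ranking step can be sketched as follows. The probability table is made up for illustration, and `rank_candidates` and `toy_probability` are hypothetical helpers; in the notebook the probabilities would come from the smoothed bi-gram function instead.

```python
def rank_candidates(candidates, before, after, probability, k):
    """Score each candidate by P(candidate | before) + P(after | candidate)."""
    scored = []
    for candidate in candidates:
        score = probability(before, candidate) + probability(candidate, after)
        scored.append((candidate, score))
    # Highest combined score first, then keep only the top k.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical bi-gram probabilities, made up for illustration only.
toy_probabilities = {
    ("quick", "brown"): 0.30, ("brown", "fox"): 0.40,
    ("quick", "bran"): 0.01, ("bran", "fox"): 0.02,
    ("quick", "born"): 0.05, ("born", "fox"): 0.01,
}


def toy_probability(first, second):
    """Look up a made-up probability; unseen pairs get a tiny value."""
    return toy_probabilities.get((first, second), 1e-10)


top2 = rank_candidates(["born", "bran", "brown"], "quick", "fox", toy_probability, k=2)
print(top2)
```

Adding the two probabilities is the simplification the text suggests; multiplying them would be another option, with different behavior when one side is close to zero.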

In [243]:
############################################################
# TODO-08: Implement the function that gets the top k      #
# suggested corrections based on the n-gram probabilities  #
# with the surrounding words. Return the word and its      #
# corresponding probability.                               #
############################################################

def topk_suggested_corrections(word, vocab, k):
    pass

############################################################
#                    End of your code.                     #
############################################################

Tying it all up, we now want to implement the spelling correction function that outputs the list of suggested corrections for each of the misspelled words.

We want this to be in the format shown below:
```
[
    {
        "misspelled_word": ...,
        "suggested_corrections": [...],
        "probabilities": [...],
    }
]
```

In [244]:
############################################################
# TODO-09: Implement a function that outputs the list of   #
# misspelled words and the suggested corrections together  #
# with its probabilities.                                  #
############################################################

def spelling_suggestions(sentence):
    pass
    
############################################################
#                    End of your code.                     #
############################################################

In [245]:
###############################################################
# TODO-10: Implement a function that selects the most likely  #
# correction of the misspelled words                          #
###############################################################

def spelling_corrector(sentence):
    pass
    
############################################################
#                    End of your code.                     #
############################################################

In [246]:
sample_sentence_1 = "With news pshd to smart phnes in real time and sociaal media reactions spreading aross the glbe in seconds, the public dicussion can appear acelerated and temprally framented"

In [247]:
sample_sentence_2 = "tommato is a frit, not a vegtable"

In [248]:
sample_sentence = "The quick brwn fox jumps oer the lay dog"

Now we want to test your spell checker.

In [249]:
############################################################
# TODO-11: Print the result of your spelling suggestions   #
# function to the sample sentences above.                  #
############################################################


############################################################
#                    End of your code.                     #
############################################################

In [250]:
############################################################
# TODO-12: Print the result of your spelling correction    #
# function to the sample sentences above.                  #
############################################################


############################################################
#                    End of your code.                     #
############################################################

<span style='color:red'>**Question 01:**</span> What are the potential limitations / failure cases of this model? Give at least three and explain why for each.

<span style='color:red'>**A01:**</span> 

<span style='color:red'>**Question 02:**</span> Is there a benefit to having longer n-grams? Explain your answer.

<span style='color:red'>**A02:**</span> 

<span style='color:red'>**Question 03:**</span> If we used a character level n-gram instead, what would be the advantages and disadvantages over the word level n-grams? Explain your answer.

<span style='color:red'>**A03:**</span> 

# Simplified Document Retrieval

TF-IDF (term frequency, inverse document frequency) is one of the simplest ways to retrieve relevant documents given a query. It can be seen as an extension of the bag-of-words model with the added benefit of weighting each word based not only on how often it appears but also on how many documents use it. Intuitively, if a word is used in all documents, then it is not useful, since it cannot discriminate between the documents.
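
The intuition about words that appear everywhere can be checked with a small sketch. It assumes the variant $\text{idf} = \log(N / \text{df})$, which is one common choice; the notebook does not fix a particular formula, and the toy documents and the function name are made up.

```python
import math

# Three tiny tokenized "documents" (made up for illustration).
toy_docs = [
    ["olympic", "swimming", "final"],
    ["olympic", "fencing", "final"],
    ["olympic", "marathon", "route"],
]


def inverse_document_frequency(token, docs):
    """idf = log(N / df); assumes the token occurs in at least one document."""
    df = sum(1 for doc in docs if token in doc)
    return math.log(len(docs) / df)


idf_olympic = inverse_document_frequency("olympic", toy_docs)    # in every document
idf_final = inverse_document_frequency("final", toy_docs)        # in two documents
idf_marathon = inverse_document_frequency("marathon", toy_docs)  # in one document
print(idf_olympic, idf_final, idf_marathon)
```

Under this variant, a token that occurs in every document gets a weight of exactly zero, and rarer tokens get larger weights.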

Before we implement the TF-IDF document retrieval, let us first create the dataset that we will be working on. 

``paris_olympics_news_html.txt`` contains an html document that consists of several urls of articles from the Paris 2024 olympics. 

In [251]:
with open("paris_olympics_news_html.txt","r", encoding="utf-8") as f:
    html = f.read()
html



As you can see in the wall of text above, the links are not in an easy-to-extract format.
Fortunately, we can use regex to extract all the urls we want.

Look at this link to learn how to use regex in python <br />
Link: https://www.w3schools.com/python/python_regex.asp
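
As a warm-up, here is a minimal sketch of `re.findall`-style extraction on a made-up HTML fragment. The `example.com` URLs and the pattern are assumptions for illustration; the real file may need a different pattern.

```python
import re

# A made-up HTML fragment; the example.com URLs are placeholders.
toy_html = (
    '<a href="https://example.com/news/opening-ceremony">Opening</a>'
    '<img src="https://example.com/images/torch.jpg">'
    '<a href="https://example.com/news/swimming-finals">Swimming</a>'
)

# Match "http" or "https", then everything up to whitespace, a double quote, or an angle bracket.
url_pattern = re.compile(r'https?://[^\s"<>]+')
toy_links = url_pattern.findall(toy_html)
print(toy_links)
```

Note that the image link is matched as well, since the pattern only looks at the shape of a URL.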



In [252]:
############################################################
# TODO-13: Write code to extract all the urls from the html#
# text using regex (regular expressions). See the tutorial #
# link above.                                              #
############################################################

all_links = None

############################################################
#                    End of your code.                     #
############################################################

print(all_links)

None


You will see that the links also contain links to images. We only want the links to articles so we have to modify the regex to only get the article links and not the image links.
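
One possible approach, sketched here on made-up links, is to drop anything that ends in an image file extension and then remove duplicates while keeping the original order. The assumption that image links can be recognized by their extension may not hold for the real data, where tightening the extraction regex itself could work better.

```python
import re

toy_links = [
    "https://example.com/news/opening-ceremony",
    "https://example.com/images/torch.jpg",
    "https://example.com/news/swimming-finals",
    "https://example.com/news/opening-ceremony",  # duplicate
    "https://example.com/images/podium.PNG",
]

# Assumption: image links can be recognized by their file extension.
image_pattern = re.compile(r"\.(?:jpe?g|png|gif|webp|svg)$", re.IGNORECASE)
article_links = [link for link in toy_links if not image_pattern.search(link)]

# dict.fromkeys keeps the first occurrence of each link and preserves order.
unique_article_links = list(dict.fromkeys(article_links))
print(unique_article_links)
```

Using `set()` would also remove duplicates, but it does not keep the links in their original order.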

In [253]:
###############################################################
# TODO-14: Modify the regex to capture only the article links #
# and not the image links. There might also be duplicate      #
# links, so make sure to remove the duplicates.               #
###############################################################

unique_links = None
###############################################################
#                    End of your code.                        #
###############################################################

The code below will get the text in the articles and save it in a text file. Each of these articles will serve as one of the "documents" that we want to retrieve from.

**Note:** This code segment to download the articles takes quite a long time to run. For convenience, you can use the pre-downloaded one found in ``paris_olympics_articles.zip`` for the rest of the assignment. 

In [254]:
# optional if you want to download it yourself.

# from bs4 import BeautifulSoup
# HEADERS = {
#     "Accept": "application/json, text/plain, */*", 
#     "Accept-Encoding": "gzip, deflate, br, zstd", 
#     "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
#     "Connection": "keep-alive", 
#     "Referer": "",
#     "Sec-Fetch-Dest": "empty", 
#     "Sec-Fetch-Mode": "cors", 
#     "Sec-Fetch-Site":"same-site",  
#     "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
# }
# output_folder = "paris_olympics_articles"
# os.makedirs(output_folder,exist_ok=True)

# for i, link in enumerate(tqdm(unique_links)):
#     page = requests.get(link,headers=HEADERS)
#     soup = BeautifulSoup(page.content, "html.parser")
#     div_content = soup.find(id="globalTracking")
#     if div_content is not None:
#         page_text = soup.find(id="globalTracking").text
    
#         with open("{}/{:05d}.txt".format(output_folder,i),"w", encoding="utf-8") as f:
#             f.write(page_text)

Note: If this code segment fails to run for some reason, you can use the pre-downloaded one found in ``paris_olympics_articles.zip``.

In [255]:
raw_doc_texts = []
output_folder = "paris_olympics_articles"
for doc in glob.glob("{}/*.txt".format(output_folder)):
    with open(doc, 'r', encoding="utf-8") as f:
        raw_doc_texts.append(f.read().strip())

In [256]:
print("There are {} documents".format(len(raw_doc_texts)))

There are 0 documents


We need to pre-process the raw text to get it in a useful format. 

``nltk`` is a useful library for this. It provides several tools such as a list of stopwords, tokenizer, and stemmer.
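
To show the overall shape of such a pipeline without any downloads, here is a dependency-free sketch. The tiny `toy_stop_words` set is a stand-in for `stopwords.words('english')`, whitespace splitting stands in for `word_tokenize`, and stemming is left out entirely; `simple_preprocess` is a hypothetical name.

```python
# The same symbol list the notebook defines, with the backslash escaped.
symbols = "\n!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~"

# Tiny stand-in for stopwords.words('english'); the real list is much longer.
toy_stop_words = {"the", "is", "a", "in", "of", "and"}


def simple_preprocess(text):
    """Lowercase, replace symbols with spaces, split on whitespace, drop stop words."""
    lowered = text.lower()
    without_symbols = lowered.translate(str.maketrans(symbols, " " * len(symbols)))
    tokens = without_symbols.split()
    return [token for token in tokens if token not in toy_stop_words]


tokens = simple_preprocess("The 100m final is in Paris!")
print(tokens)
```

Replacing symbols with spaces rather than deleting them keeps words such as "first-ever" from being glued together.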

In [257]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize # https://www.nltk.org/api/nltk.tokenize.word_tokenize.html
from nltk.stem import PorterStemmer # https://www.nltk.org/howto/stem.html
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ningw\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ningw\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [258]:
stop_words = stopwords.words('english')
symbols = "\n!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~"

In [259]:
############################################################
# TODO-15: Implement a function that converts the text to  #
# lowercase letters.                                       #
############################################################

def to_lowercase(text):
    pass

############################################################
#                    End of your code.                     #
############################################################

In [260]:
#############################################################
# TODO-16: Implement a function that removes the stopwords. #
#############################################################

def remove_stopwords(text):
    pass


#############################################################
#                    End of your code.                      #
#############################################################

In [261]:
###############################################################
# TODO-17: Implement a function that removes the symbols and  #
# punctuation marks.                                          #
###############################################################

def remove_symbols(text):
    symbols = "\n!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~"
    pass


##############################################################
#                    End of your code.                       #
##############################################################

In [262]:
############################################################
# TODO-18: Implement a function that uses the stemmer to   #
# convert all the words into its stem.                     #
############################################################
def stemming(text):
    pass

############################################################
#                    End of your code.                     #
############################################################

In [263]:
############################################################
# TODO-19: Implement a function that combines all the      #
# preprocessing steps and outputs a word tokenized version #
# of the text.                                             #
############################################################
def preprocess(text):
    pass

############################################################
#                    End of your code.                     #
############################################################

The code below will preprocess all the documents and store them in a list named ``processed_text``.

In [264]:
processed_text = []
for doc in raw_doc_texts:
    processed_text.append(preprocess(doc))

TF-IDF, as the name implies, is a combination of two terms: term frequency and inverse document frequency.
Let's start by computing the document frequency.

Document frequency counts how many documents a token appears in. We can conveniently store this in a dictionary for efficient look-up later on.

**Note:** Some functions (particularly those iterating through all the documents) can take a few minutes to run. It's a good idea to test on a small subset first when debugging your code before running it on all the documents.
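
Document frequency can be sketched on toy data as follows. The token lists are made up, and `toy_doc_frequency` is a hypothetical name used here so that it does not clash with the `doc_frequency` dictionary the assignment asks for.

```python
from collections import Counter

# Made-up, already preprocessed documents.
toy_processed = [
    ["olympic", "swim", "final", "swim"],
    ["olympic", "fenc", "final"],
    ["olympic", "marathon"],
]

toy_doc_frequency = Counter()
for tokens in toy_processed:
    # set() makes a token count once per document, however often it repeats.
    toy_doc_frequency.update(set(tokens))

print(dict(toy_doc_frequency))
```

Note that "swim" appears twice in the first document but still has a document frequency of one.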

In [265]:
############################################################
# TODO-20: Compute for the document frequency of all the   #
# unique tokens in the entire dataset. Store it in a       #
# dictionary called doc_frequency, where                   #
# doc_frequency[token] contains the document counts.       #
############################################################
doc_frequency = {}

############################################################
#                    End of your code.                     #
############################################################

In [266]:
vocab = list(doc_frequency.keys())
n_vocab = len(vocab)
n_docs = len(processed_text)

The next step is to convert the document to a vector representation ($\mathbb{R}^V$). Specifically, we will represent the document as a bag-of-words vector in which each element is the tf-idf weight of the corresponding word (or token) in the vocabulary.
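
A minimal sketch of such a vector is shown below, using plain Python lists. It assumes $\text{tf}$ is the token count divided by the document length and $\text{idf} = \log(N / \text{df})$; other weightings exist, and the notebook does not prescribe one. The name `tfidf_vector` and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_frequency, vocab, n_docs):
    """Return a list of tf-idf weights aligned with the order of vocab."""
    counts = Counter(tokens)
    vector = []
    for token in vocab:
        tf = counts[token] / len(tokens) if tokens else 0.0
        idf = math.log(n_docs / doc_frequency[token])
        vector.append(tf * idf)
    return vector


toy_vocab = ["olympic", "final", "swim", "marathon"]
toy_doc_frequency = {"olympic": 3, "final": 2, "swim": 1, "marathon": 1}
vector = tfidf_vector(["olympic", "swim", "final", "swim"], toy_doc_frequency, toy_vocab, n_docs=3)
print(vector)
```

The entry for "olympic" is zero because it occurs in every document, and the entry for "marathon" is zero because it does not occur in this document.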



In [267]:
###############################################################
# TODO-21: Implement a function to convert a document to its  #
# vector representation where each element contains the       #
# tf-idf of the corresponding word or token.                  #
###############################################################

def document_to_tfidf_vector(doc, doc_frequency, vocab, n_docs):
    pass


############################################################
#                    End of your code.                     #
############################################################

For convenience of lookup later on, we compile all these document vectors into a matrix $\mathbb{R}^{N \times V}$, where each row is a document and the columns are the tokens / words.

In [268]:
doc_vectors = np.zeros((n_docs, n_vocab))
for doc_idx, doc in enumerate(processed_text):
    doc_vectors[doc_idx] = document_to_tfidf_vector(doc, doc_frequency, vocab, n_docs)

Lastly, we need to compute the similarity of a query vector to all the documents and retrieve the most similar document. For this, we use the cosine similarity, the formula of which is shown below.

$$\text{cosine\_similarity}(a, b) = \frac{a^Tb}{\lVert a \rVert_2 \lVert b \rVert_2 } = \left ( \frac{a}{\lVert a \rVert_2} \right )^T \left( \frac{b}{\lVert b \rVert_2 } \right ) $$


In practice, we implement this computation in batch, meaning we compute the cosine similarities to all documents at once. Specifically, let `doc_vectors` be a matrix in $\mathbb{R}^{N \times V}$ and `query_vector` be a vector in $\mathbb{R}^{V}$; then the output of the `cosine_similarity` function is a vector in $\mathbb{R}^{N}$ that contains the cosine similarity of the `query_vector` to each of the $N$ documents.
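
The batched computation can be sketched with NumPy as follows. The function name `batch_cosine_similarity` and the `eps` guard against all-zero vectors are assumptions added for this example, not requirements of the assignment.

```python
import numpy as np


def batch_cosine_similarity(doc_vectors, query_vector, eps=1e-12):
    """Cosine similarity of one query of shape (V,) against N rows of shape (N, V)."""
    doc_norms = np.linalg.norm(doc_vectors, axis=1)   # shape (N,)
    query_norm = np.linalg.norm(query_vector)         # scalar
    # eps guards against division by zero for all-zero vectors (an added assumption).
    return (doc_vectors @ query_vector) / np.maximum(doc_norms * query_norm, eps)


toy_doc_vectors = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
])
toy_query = np.array([1.0, 0.0, 0.0])

similarities = batch_cosine_similarity(toy_doc_vectors, toy_query)
best_doc_index = int(np.argmax(similarities))
print(similarities, best_doc_index)
```

A single matrix-vector product replaces a Python loop over documents, and `np.argmax` then gives the index of the most similar document.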


In [269]:
############################################################
# TODO-22: Implement a function that computes the cosine   #
# similarity of a query vector with shape (V,) to all      #
# document vectors with shape (N,  V). The output will be  #
# a vector of shape (N,) that represents the cosine        #
# similarity for each of the documents.                    #
############################################################
def cosine_similarity(doc_vectors, query_vector):
    pass

############################################################
#                    End of your code.                     #
############################################################

In [270]:
query1 = "who is michael phelps?"
query2 = "rules for swimming"
query3 = "when will swimming event happen?"
query4 = "most decorated american woman in olympic history."
query5 = "Athletes representing Saint Lucia and Dominica won their countries' first-ever Olympic medals."

In [271]:
############################################################
# TODO-23: Write a piece of code that retrieves the most   #
# similar document given a query. Print the query text and #
# the raw text of the corresponding document (hint: there  #
# should be a raw_doc_texts above that contains the list   #
# of raw document text)                                    #
#                                                          #
# Format the print as follows                              #
#                                                          #
# Query text: ....                                         #
# Retrieved document text: ...                             #
############################################################




############################################################
#                    End of your code.                     #
############################################################

<span style='color:red'>**Question 04:**</span> Based on your results above, is there a difference between short queries and longer queries in terms of the quality of the retrievals? Explain why this might or might not be the case.

<span style='color:red'>**A04:**</span> 

<span style='color:red'>**Question 05:**</span> Does this implementation capture grammar or sentence structure? Why or why not?

<span style='color:red'>**A05:**</span> 

<span style='color:red'>**Question:**</span> How much time did it take you to answer this notebook?

<span style='color:red'>**A:**</span>

<span style='color:red'>**Question:**</span> What parts of the assignment did you like and what parts did you not like?

<span style='color:red'>**A:**</span>

<span style='color:red'>**Question:**</span> How do you think it could be improved?

<span style='color:red'>**A:**</span>

<span style='color:red'>**Question:**</span> Do you have any case studies in mind that would be nice to suggest / include in the assignment?

<span style='color:red'>**A:**</span>