# Programming Assignment 5: Embeddings! 

-----------------------------------------

## Installing new packages

--------------------------------------------------
**READ/SKIM THIS WHOLE CELL BEFORE RUNNING COMMANDS** :)

We need to install pytorch in our current environment. We also need to download the [BERT](https://huggingface.co/google-bert/bert-large-uncased) model, and we will do so from [HuggingFace](https://huggingface.co/models/). 

##### **Recommended**: install a new environment for this assignment. 

- If you're on an [Apple Silicon](https://support.apple.com/en-us/116943) (M1/M2/etc.) Mac, we recommend using the ARM64-optimized packages for best performance.
    ```
    CONDA_SUBDIR=osx-arm64 conda env create -f environment_pa5.yml
    conda activate cs124_pa5
    conda config --env --set subdir osx-arm64
    ```
 
    
- If you are on an Intel Mac, use the following commands instead
    ```
    CONDA_SUBDIR=osx-64 conda env create -f environment_pa5.yml
    conda activate cs124_pa5
    ```
    
- Otherwise, run the following (Windows/Linux): 
    ```
    conda env create -f environment_pa5.yml
    conda activate cs124_pa5
    ```

To use these new packages, change your kernel to use this new package version set. (Kernel -> Change Kernel)

Both of these environments now contain pytorch and huggingface in addition to the existing packages we had. To verify they installed, run the following cell:


##### **Less Recommended**: install into existing environment

To install these packages into our existing python environment, you can run the following commands:

```
conda activate cs124
conda install -c pytorch pytorch
conda install -c huggingface transformers
```
To use these new packages, you may need to restart your kernel (Kernel > Restart)

- This method is not as recommended because of the general principle "If it ain't broke, don't fix it". And specifically on M1+ macs, you will likely run into an issue of pytorch not working because your existing cs124 environment is running through the Rosetta 2 translation layer, meaning pytorch can't detect your hardware. But if you know how to troubleshoot issues you may come across, this method should work just fine.

-------------------------------------------------------

In [1]:
# This BERT model uses about 440MB of storage, but is deletable after the assignment
from transformers import BertTokenizer, BertModel, file_utils

In [2]:
import numpy as np
try:
    import torch
except:
    print("Error occurred. Did pytorch install correctly? Reach out to us on Ed for help.")

In [3]:
# Do not modify this cell, please just run it!
import quizlet

# Your Mission
The assignment consists of five parts. The first half deals with static embeddings, while the later half will use contextual embeddings. You don't have to worry if you haven't learned about transformers or BERT just yet, this assignment will walk you through the basics on how to use these models.

## The Static Embeddings

----------------------------------

You’ll be using subset of ~4k 50-dimensional GloVe embeddings trained on Wikipedia articles. The GloVe (Global Vectors) model learns vector representations for words by looking at global word-word co-occurrence statistics in a body of text and learning vectors such that their dot product is proportional to the probability of the corresponding words co-occuring in a piece of text. The GloVe model was developed right here at Stanford, and if you’re curious you can read more about it [here](https://nlp.stanford.edu/projects/glove/)!

# Part 1: Synonyms
For this section, your goal is to answer questions of the form:

- What is a synonym for `warrior`?  
  - soldier
  - sailor
  - pirate
  - spy  

You are given as input a word and a list of candidate choices. Your goal is to return the choice you think is the synonym. You’ll first implement two similarity metrics - euclidean distance and cosine similarity - then leverage them to answer the multiple choice questions!

Specifically, you will implement the following 4 functions:

* **euclidean_distance()**: calculate the euclidean distance between two vectors. Note: you’ll only use this metric in Part 1. For the rest of the assignment, you'll only use cosine similarity.
* **cosine_similarity()**: calculate the cosine similarity between two vectors. You’ll be using this helper function throughout the other parts of the assignment as well, so you’ll want to get it right!
* **find_synonym()**: given a word, a list of 4 candidate choices, and which similarity metric to use, return which word you think is the synonym! The function takes in `comparison_metric` as a parameter: if its value is `euc_dist`, you'll use Euclidean distance as the similarity metric; if its value is `cosine_sim`, you'll use cosine similarity as the metric.
* **part1_written()**: you’ll find that finding synonyms with word embeddings works quite well, especially when using cosine similarity as the metric. However, it’s not perfect. In this function, you’ll look at a question that your `find_synonyms()` function (using cosine similarity) gets wrong, and answer why you think this might be the case. Please return your answer as a string in this function.

Note: for the rest of the assignment, you'll only use cosine similarity as the comparison metric. You won't use the euclidean distance function anymore.



In [4]:
def cosine_similarity(v1, v2):
    '''
    Calculates and returns the cosine similarity between vectors v1 and v2
    Arguments:
        v1 (np.array), v2 (np.array): vectors
    Returns:
        cosine_sim (float): the cosine similarity between v1, v2
    '''
    cosine_sim = 0
    #########################################################
    ## TODO: calculate cosine similarity between v1, v2    ##
    #########################################################
    cosine_sim = np.dot(v1, v2) / \
        np.prod([np.sqrt(np.sum(np.square(item))) for item in [v1, v2]])  
    #########################################################
    ## End TODO                                            ##
    #########################################################
    return cosine_sim   

cosine_similarity(np.array([442, 8 ,2]), np.array([5, 3982, 3325]))

def euclidean_distance(v1, v2):
    '''
    Calculates and returns the euclidean distance between v1 and v2

    Arguments:
        v1 (np.array), v2 (np.array): vectors

    Returns:
        euclidean_dist (float): the euclidean distance between v1, v2
    '''
    euclidean_dist = 0
    #########################################################
    ## TODO: calculate euclidean distance between v1, v2   ##
    #########################################################
    euclidean_dist = np.sqrt(
        np.sum([(item[0] - item[1])*(item[0] - item[1]) \
                for item in np.column_stack((v1, v2))])        
    )
    #########################################################
    ## End TODO                                           ##
    #########################################################
    return euclidean_dist                 

def find_synonym(word, choices, embeddings, comparison_metric):
    '''
    Answer a multiple choice synonym question! Namely, given a word w 
    and list of candidate answers, find the word that is most similar to w.
    Similarity will be determined by either euclidean distance or cosine
    similarity, depending on what is passed in as the comparison_metric.

    Arguments:
        word (str): word
        choices (List[str]): list of candidate answers
        embeddings (Dict[str, np.array]): map of words to their embeddings
        comparison_metric (str): either 'euc_dist' or 'cosine_sim'. 
            This indicates which metric to use - either euclidean distance or cosine similarity.
            With euclidean distance, we want the word with the lowest euclidean distance.
            With cosine similarity, we want the word with the highest cosine similarity.

    Returns:
        answer (str): the word in choices most similar to the given word
    '''
    answer = None
    #########################################################
    ## TODO: find synonym                                  ##
    #########################################################
    score = None
    word_embedding = embeddings[word]
    for item in choices:
        candidate_embedding = embeddings[item]
        if  comparison_metric == 'cosine_sim':
            similarity = cosine_similarity(word_embedding, candidate_embedding)
            if score is None or score < similarity:
                score = similarity
                answer = item
                pass
            pass
        else:
            similarity = euclidean_distance(word_embedding, candidate_embedding)
            if score is None or score > similarity:
                score = similarity
                answer = item
                pass                
            pass
        pass
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

def part1_written():
    '''
    Finding synonyms using cosine similarity on word embeddings does fairly well!
    However, it's not perfect. In particular, you should see that it gets the last
    synonym quiz question wrong (the true answer would be positive):

    30. What is a synonym for sanguine?
        a) pessimistic
        b) unsure
        c) sad
        d) positive

    What word does it choose instead? In 1-2 sentences, explain why you think 
    it got the question wrong.
    
    See the cell below for the code to run for this part
    '''
    #########################################################
    ## TODO: replace string with your answer               ##
    ######################################################### 
    answer = 'pessimistic' 
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

In [5]:
"""This will create a class to test the functions you implemented above. If you are curious, 
you can see the code for this in quizlet.py but it is not required. If you run this cell,
we will load the test data for you and run it on your functions to test your implementation.

You should get an accuracy of 66% with euclidean distance and 83% with cosine distance
"""

part1 = quizlet.Part1_Runner(find_synonym, part1_written)
part1.evaluate(True)  # To only print the scores, pass in False as an argument

Part 1: Synonyms!
-----------------
Answering part 1 using euclidean distance as the comparison metric...
1. What is a synonym for gullible?
    a) unrealistic
    b) naive
    c) complicated
    d) wary
you answered: naive 

2. What is a synonym for counter?
    a) parry
    b) agree
    c) hold
    d) run
you answered: hold 

3. What is a synonym for feeble?
    a) reinforced
    b) weak
    c) damage
    d) break
you answered: weak 

4. What is a synonym for administer?
    a) give
    b) steal
    c) spray
    d) box
you answered: give 

5. What is a synonym for betray?
    a) trust
    b) inform
    c) table
    d) deceive
you answered: deceive 

6. What is a synonym for scour?
    a) search
    b) allow
    c) gaze
    d) gather
you answered: search 

7. What is a synonym for clean?
    a) bare
    b) tidy
    c) rummage
    d) pop
you answered: bare 

8. What is a synonym for abscond?
    a) escape
    b) rally
    c) relinquish
    d) flash
you answered: relinquish 

9. What is

(0.6666666666666666, 0.8333333333333334)

# Part 2: Exploration
In this section, you'll do an exploration question. Specifically, you'll implement the following 2 functions:

* **occupation_exploration()**: given a list of occupations, find the top 5 occupations with the highest cosine similarity to the word "man", and the top 5 occupations with the highest cosine similarity to the word "woman".
* **part2_written()**: look at your results from the previous exploration task. What do you observe, and why do you think this might be the case? Write your answer within the function by returning a string.


In [6]:
def occupation_exploration(occupations, embeddings):
    '''
    Given a list of occupations, return the 5 occupations that are closest
    to 'man', and the 5 closest to 'woman', using cosine similarity between
    corresponding word embeddings as a measure of similarity.

    Arguments:
        occupations (List[str]): list of occupations
        embeddings (Dict[str, np.array]): map of words (strings) to their embeddings (np.array)

    Returns:
        top_man_occs (List[str]): list of 5 occupations closest to 'man'
        top_woman_occs (List[str]): list of 5 occuptions closest to 'woman'
            note: both lists should be sorted, with the occupation with highest
                  cosine similarity first in the list
    '''
    top_man_occs = []
    top_woman_occs = []
    #########################################################
    ## TODO: get 5 occupations closest to 'man' & 'woman'  ##
    #########################################################
    f_dict = {}
    for word in ['man', 'woman']:
        word_embedding = embeddings[word]                 
        a_dict = {}
        for occupation in occupations:
            a_dict[occupation] = \
                cosine_similarity(word_embedding, embeddings[occupation])
            pass
        n_dict = dict(sorted(a_dict.items(), 
                             key = lambda item: item[1], reverse=True)[:5])
        f_dict[word] = n_dict.keys()        
        pass
    top_man_occs = f_dict['man']        
    top_woman_occs = f_dict['woman']
    #########################################################
    ## End TODO                                            ##
    #########################################################
    return top_man_occs, top_woman_occs

def part2_written():
    '''
    Take a look at what occupations you found are closest to 'man' and
    closest to 'woman'. Do you notice anything curious? In 1-2 sentences,
    describe what you find, and why you think this occurs.
    '''
    #########################################################
    ## TODO: replace string with your answer               ##
    ######################################################### 
    answer = """
    I have not  run it
    """
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return answer

In [7]:
part2 = quizlet.Part2_Runner(occupation_exploration, part2_written)
part2.evaluate() 

Part 2: Exploration!
--------------------
occupations closest to "man" - you answered:
 1. teacher
 2. actor
 3. worker
 4. lawyer
 5. warrior
occupations closest to "woman" - you answered:
 1. nurse
 2. teacher
 3. worker
 4. maid
 5. waitress
 


(dict_keys(['teacher', 'actor', 'worker', 'lawyer', 'warrior']),
 dict_keys(['nurse', 'teacher', 'worker', 'maid', 'waitress']))

# Part 3: Contextual embeddings

For this section, your goal is to understand contextual embeddings, which are more powerful than static embeddings. In a static embedding, we just have one vector for each word. In a contextual embedding, such as those produced by the BERT algorithm, the vector for the word is influenced by all its neighbors.  That means the  embedding for the same word is different when it appears in different sentences!   We won't study the transformer that is the core mechansim of BERT until later in the quarter, so in this assignment you are just exploring BERT as a [black box](https://en.wikipedia.org/wiki/Black_box).

In [8]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased') # About 440MB large

# Feel free to ignore deprecation/unused weight warnings

In [9]:
# We want to run the model on our GPU if possible, but if not, we can use a CPU
if torch.backends.mps.is_available(): # Available on Macs with Apple silicon or AMD GPUs
    device = torch.device("mps")
    model.to(device)
elif torch.cuda.is_available(): # Available on computers with NVIDIA GPUs
    device = torch.device("cuda")
    model.to(device)
else:
    device = torch.device("cpu")
print("Model is on device: ", device)

Model is on device:  cpu


### Part 3.1: Contextual embedding with BERT

In [10]:
def get_bert_word_embedding(sentence, target_word):
    '''
    We implemented this function for you. It runs a sentence through BERT, and
    returns the embedding for that word. (shape (768,))
    '''
    inputs = tokenizer(sentence, return_tensors='pt')
    word_id = tokenizer.encode(target_word, add_special_tokens=False)
    
    if len(word_id) != 1:
        raise ValueError(f"'{target_word}' is split into multiple tokens by BERT. Please choose a simpler (~1 syllable) word.")
    word_id = word_id[0]

    word_position = torch.where(inputs['input_ids'][0] == word_id)[0]
    if len(word_position) == 0:
        raise ValueError(f"'{target_word}' not found in the sentence.")
    
    with torch.no_grad():
        if device is not torch.device("cpu"):
            inputs = {key: val.to(device) for key, val in inputs.items()}
        outputs = model(**inputs)
    embedding = outputs.last_hidden_state[0, word_position[0]]
    return embedding.cpu().numpy()

- Your task is to use BERT to study word polysemy (the fact that words can have multiple senses that are different from each other in meaning, like "bat" to mean both the flying mammal and the baseball instrument).  Your job is to find a maximally ambiguous word. We have provided an example in code below.

In [11]:
example_word = "bank"
example_sentence1 = f"I went to the {example_word} to deposit my money."
example_sentence2 = f"I went down by the river {example_word} to see the ducks."

In [12]:
def get_polyseme_similarity(word, sentence1, sentence2, return_score=False):
    embedding1 = get_bert_word_embedding(sentence1, word)
    embedding2 = get_bert_word_embedding(sentence2, word)
    similarity = cosine_similarity(embedding1, embedding2)
    if return_score:
        return similarity
    else:
        print(f"This word is {similarity*100:.2f}% similar in the two sentences.")

get_polyseme_similarity(example_word, example_sentence1, example_sentence2)

This word is 54.65% similar in the two sentences.


- Now it's your turn! Try to find a ~1 syllable [polyseme](https://prepedu.com/en/blog/polysemy-in-english) that can be used in very different contexts. You will get full points for getting it under 54% similarity. We'll have a leaderboard on gradescope for lowest similarity score achieved (In our testing, we achieved 38%).


In [13]:
def part3():
    '''
    Returns
        word (str): the word used in both sentences
        sentence1 (str): the first sentence
        sentence2 (str): the second sentence

    HINT: This word should be a polyseme, meaning it has
    multiple meanings, and each sentence should use a different definiton.
    '''
    #########################################################
    ## TODO: replace strings with your answers             ##
    ######################################################### 
    word = "record"
    sentence1 = f"I have a {word}"
    sentence2 = f"I {word} a video"
    #########################################################
    ## End TODO                                            ##
    ######################################################### 
    return word, sentence1, sentence2

In [14]:
part3 = quizlet.Part3_Runner(part3, get_bert_word_embedding, cosine_similarity)
part3.evaluate()

Part 3: Contextual embeddings with BERT
---------------------------------------
Polyseme disambiguation: 
Word: record
Sentence 1: I have a record
Sentence 2: I record a video


'This word is 40.32% similar in the two sentences.'

# Part 4: Sentence Similarity with BERT

For this section, your goal is to answer questions of the form:

- How semantically similar are the following two sentences?:

    - he later learned that the incident was caused by the concorde's sonic boom

    - he later found out the alarming incident had been caused by concorde's powerful sonic boom

## Part 4.1: Sentence-level embeddings with BERT

In this section, we will be leveraging the BERT model for a sentence classification task. In the real world, many applications of semantic understanding are done with fine-tuned transformer models, and we will be using a simple BERT model that was trained by Google on [BookCorpus](https://en.wikipedia.org/wiki/BookCorpus). To efficiently get the embeddings for multiple sentences, we will implement `get_bert_sentence_embeddings()`

Our `get_bert_sentence_embeddings()` function takes in two parameters besides our inputs. The `batch_size` parameter exists to limit memory usage, which is necessary if you wanted to use this function to compute embeddings on an even larger dataset. (Feel free to try it yourself)! The boolean `use_CLS` explains which of the two following methods we will use for classifying a document:
* **Use the final [CLS] token embedding**: The first token represents the combined context of the full sentence, so we will simply compare this one token across sentences
* **Mean pooling over all sentence tokens**: We will average the token embeddings in the last hidden layer of our BERT outputs. Calculating this is a bit complex, so we've done a lot of the steps for you already. Each step is explained with comments, but for each sentence, we are summing the outputs for each token, but only where the token is not a padding token.

Some (hopefully) helpful hints!
- For extracting the [CLS] token embeddings:
  - We want an output of shape (n_sentences, 768), and the shape of `outputs.last_hidden_state` is  (n_sentences, sequence_length, 768). The reason it is length 768 is because at the last hidden layer of the output in BERT, each token is represented by a vector of this length. If you do not know how multi-dimensional slicing works in NumPy/PyTorch, this guide may be helpful: [Python Slice Indexing](https://www.geeksforgeeks.org/python-slicing-multi-dimensional-arrays/) 
- For getting the mean of all tokens in the output
  - We have implemented the hard part of mean pooling already, and we documented it in the comments. Since we are passing in sentences of varying lenghts, all sentences are padded with [PAD] tokens that we wish to ignore. We use a mask to zero-out the embeddings for the [PAD] tokens, and we also ignore them in our total count by summing the mask. 
  - What is left to do is take the mean of these filtered embeddings. To do so, we sum along the sequence axis, then divide by the provided sum_mask variable.
- Want more visualization for the outputs of BERT? This [Illustrated Guide to BERT](http://jalammar.github.io/illustrated-bert/) is a recommended reading from [CS224n](https://web.stanford.edu/class/cs224n/index.html#schedule). The section in the linked YouTube video at the timestamp 2:52 is a good example of passing one sentence through BERT.


In [15]:
def get_bert_sentence_embeddings(sentences, use_CLS=True, batch_size=8):
    '''
    Generate embeddings for sentences using BERT's CLS token.
    
    Arguments:
        sentences (List[str]): Input sentences.
        use_CLS (bool): Whether to use the CLS token as the sentence embedding.
                        If it is false, we use mean pooling over sentence tokens.
        batch_size (int): Batch size for processing the sentences.
    
    Returns:
        np.ndarray: Sentence embeddings of shape (n_sentences, 768).
    '''
    all_embeddings = []
    # We process the sentences in batches to avoid running out of memory. 
    # Feel free to experiment with the batch size, 8 or 16 are likely best.
    for i in range(0, len(sentences), batch_size):
        batch_sentences = sentences[i:i+batch_size]
        inputs = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors='pt')
        if device is not torch.device("cpu"):
            inputs = {k: v.to(device) for k, v in inputs.items()}  # Move inputs to your GPU
        with torch.no_grad(): # Runs the model without calculating gradients
            outputs = model(**inputs) # shape: (batch_size, max_sentence_length, 768)
        embeddings = None
        if use_CLS:
            #########################################################
            ### TODO: Extract each CLS token embedding from the     #
            #         output of each sentence                       #
            #########################################################
            ### BEGIN CODE HERE (~1 line) ###
            embeddings = outputs[:, 0, :]
            ### END CODE HERE ###
        else:
            # We first create a mask to distinguish real tokens from padding tokens
            attention_mask = inputs['attention_mask']
            # We then expand the mask to the same shape as the embeddings
            mask_expanded = attention_mask.unsqueeze(-1).expand(outputs.last_hidden_state.size()).float()
            # Sum the 1s in the mask to get the number of non-padding tokens for each sentence
            sum_mask = mask_expanded.sum(dim=1)
            # Clamp the sum to 1e-9 to avoid division by zero
            sum_mask = torch.clamp(sum_mask, min=1e-9)
            # Apply the mask to the embeddings, so that the padding tokens are ignored
            masked_embeddings = outputs.last_hidden_state * mask_expanded
            #########################################################
            ### TODO: Extract the mean of all token embeddings      #
            #         by summing the embeddings along the sequence  #
            #         dimension and dividing by the sum_mask        #
            #########################################################
            ### BEGIN CODE HERE (~1-2 lines) ###

            embeddings = None
            ### END CODE HERE ###
        all_embeddings.append(embeddings)
    
    embeddings = torch.cat(all_embeddings, dim=0)
    return embeddings.cpu().numpy() # Numpy does not support GPU tensors, so we move it to the CPU

## 4.2 Sanity check
Test out your implementations!
You should find that sentences 2 and 3 are quite similar to each other (>97% on CLS similarity, >90% on mean pooling)

In [16]:
sentences = ["To be or not to be, that is the question.", 
             "The feline sat on the rug.", "The cat sat on the mat."]

embeddings = get_bert_sentence_embeddings(sentences, use_CLS=True)

cos_sim_12 = cosine_similarity(embeddings[0], embeddings[1])
cos_sim_13 = cosine_similarity(embeddings[0], embeddings[2])
cos_sim_23 = cosine_similarity(embeddings[1], embeddings[2])
print(f"Similarity between s1 and s2: {cos_sim_12*100:.2f}%")
print(f"Similarity between s1 and s3: {cos_sim_13*100:.2f}%")
print(f"Similarity between s2 and s3: {cos_sim_23*100:.2f}%")

TypeError: tuple indices must be integers or slices, not tuple

## 4.3: Implementing a simple search engine with BERT

Now that we can compare the similarity of documents, we can implement a very rudimetary search engine that compares the embedding of a search query to the embedding of documents in our corpus. While modern web retrieval is much more complicated and has a million more optimizations, fine-tuned versions of BERT empower many things you may use in your daily life, such as search engines.

### The task:
We will be using StanfordAI's web_questions dataset (from [this paper](https://cs.stanford.edu/~pliang/papers/freebase-emnlp2013.pdf)). It is comprised of many question/answer pairs that you can look at further in depth on [HuggingFace](https://huggingface.co/datasets/Stanford/web_questions).

Our task is simple. Given a query, we want to find the highest cosine similarities between other queries that exist, and then return the resulting answer to that query. To simulate a Google search page, we will return the top_k results, where k is some number we specify.

This will be done in about 5 steps
- In the \_\_init\_\_() function, you will first get the matrix embedding for all of the questions in our dataset. Currently, the dataset is a dictionary of { question : List[answer] } key/value pairs. You need to pass the questions as a list of sentences into get_bert_sentence_embeddings(). Ignore the answers for now.
- In the web_search() function, we want to first get the embedding for our query. You may need to pass it into your model function as a list.
- We then want to compute the cosine similarity between our query embedding and our question embeddings.


In [None]:
# Unlike your previous cosine sim function, this one works with matrices
from sklearn.metrics.pairwise import cosine_similarity as cosine_similarity_2d

class BertWebSearch():
    def __init__(self):
        # This is a dict with ~3k { question : answer } key pairs.
        self.question_answer_dict = quizlet.load_stanford_web_questions()

        #######################################################
        ### TODO: Get the embeddings for all the questions  ###
        #         in the question_answer_dict.                #
        #######################################################
        ### BEGIN CODE HERE (~3 lines) ###
        self.questions = None
        self.question_embeddings_CLS = None
        self.question_embeddings_pooling = None
        #######################################################
        ### END CODE HERE                                   ###
        #######################################################
        n_questions = len(self.questions)
        assert self.question_embeddings_CLS.shape == (n_questions, 768)

    def search_web(self, query, k=5, use_CLS=True):
        """
        Returns the top K question/answer pairs most similar to the input query, based on BERT embeddings.

        Arguments:
            query (str): The search query.

        Returns:
            top_questions (List[str]): The top K questions most similar to the query.
            top_answers (List[str]): The corresponding answers to the top K questions.
        """
        #########################################################
        ### TODO: Get the embedding for the query and find the  #
        #         cosine similarity between the query and all   #
        #         questions. Return the top k questions and     #
        #         their answers                                 #
        #########################################################
        ### BEGIN CODE HERE (~6-8 lines) ###
        ### HINT: use the cosine_similarity_2d function between your query and the question embeddings you stored above.
        
        top_questions, top_answers = [], [] # you'll likely want to delete this line
        ### END CODE HERE ###
        return top_questions, top_answers
    
    def test_web_search(self, query, k, use_CLS=True):
        top_questions, top_answers = self.search_web(query, k, use_CLS)
        print(f"Top {k} questions similar to '{query}':")
        for question, answer in zip(top_questions, top_answers):
            print(f"- Did you mean.... {question}")
            print(f"    Answer(s):  {answer}")

search_engine = BertWebSearch()

In [None]:
query = "What do people speak in Spain"
search_engine.test_web_search(query, 5, use_CLS=True)

We will test our web search function on three queries. You get full points if it runs, but you should see some semantic similarity between your query and the returned questions. Try testing it on some queries of your own! And experiment with the `use_CLS` boolean, see if you notice any qualitative differences in the output.

In [None]:
part4 = quizlet.Part4_Runner(search_engine.search_web, get_bert_sentence_embeddings).evaluate()

### Congrats on finishing!

As a parting thought, we hope that these past several assignments have got you thinking about how large scale algorithms on text function and how we can improve them at scale. We only implemented small parts at a time, but hopefully these foundations are helpful in thinking about how language modeling algorithms shape how information is stored and retrieved online.

### If you collaborated with a partner, describe below

In [None]:
def collaboration():
    '''
    Returns:
        answer (str): what you and your partner did each / together
    '''
    return ""

## Submission and Cleanup

Once you're ready to submit, you can run the cell below to prepare and zip up your solution:

If you're running on Google Colab, see the README for instructions on how to submit.


In [None]:
%%bash

if [[ ! -f "./pa5.ipynb" ]]
then
    echo "WARNING: Did not find notebook in Jupyter working directory. This probably means you're running on Google Colab. You'll need to go to File->Download .ipynb to download your notebok and other files, then zip them locally. See the README for more information."
else
    echo "Found notebook file, creating submission zip..."
    zip -r submission.zip pa5.ipynb deps/
fi

__Some reminders for submission:__
 * Make sure you didn't accidentally change the name of your notebook file, (it should be `pa5.ipynb`) as that is required for the autograder to work.
* Go to Gradescope (gradescope.com), find the PA5 Quizlet assignment and upload your zip file (`submission.zip`) as your solution.
* Wait for the autograder to run and check that your submission was graded successfully! If the autograder fails, or you get an unexpected score it may be a sign that your zip file was incorrect.

How to delete BERT from your computer:

On MacOS, navigate in Finder to your huggingface cache (likely `cd ~/.cache/huggingface/`)

use `ls` to see what files are there. It may be inside another folder, e.g., `hub`

find and delete the model `rm -rf models--bert-base-uncased`

In [None]:
# Can't find your cache? This prints the path to the cache directory where the HuggingFace models are stored
print(file_utils.default_cache_path)
# Within this directory, it may be in a folder named 'transformers' or 'models--bert-base-uncased'

This directory is hidden by default on MacOS. With Finder open, you can use the 'Go' menu (or the Shift+Command+G macro) and type in the path for the folder you want to reach. You should also know that to toggle the Show Hidden Files option in Finder, the keybind/macro is: (Shift+Command+Period). You can delete it with the graphical interface of Finder, or in a terminal window / jupyter code cell using the aforementioned command `rm -rf <file/folder>`. Fun fact! on MacOS, you can open a new terminal session at a specific directory by dragging that folder from Finder onto the terminal icon