# NLP 2025
# Lab 2: Word Vectors and Information Retrieval

During the first few weeks, we discussed various ways to represent text 📝. One key question was: What should be the basic unit of representation? Words are the fundamental building blocks 🧱.

In this lab, we will explore different text representation models, such as Bag-of-Words (BoW), TF-IDF and word embeddings 🔤➡️🔢. Among these, word embeddings are the most effective in terms of performance. They represent each word as a vector of numbers, where each vector captures the meaning of the word 🧠📊.

These numerical representations (or weights) are learned using machine learning models 🤖. We’ll dive deeper into how these vectors are learned in the next lecture 📚.

For now, we’ll focus on how different representation methods affect performance in an information retrieval task 🔍.

By the end of this lab, you should be able to:

+ 🧼🔁 Implement and/or use built-in functions to preprocess your data (once again!)
+ 🧱👜 Build a Bag-of-Words representation of the dataset
+ 📊✨ Implement TF-IDF
+ 📥🔤 Load pre-trained word embeddings
+ 🔍🧠 Inspect and test word embedding properties
+ 🗣️➡️📐 Use word embeddings to get sentence representations (aka sentence embeddings)
+ 🧩🔎 Use sentence embeddings to solve more complex tasks like information retrieval
+ 🧪📏 Design evaluation frameworks for specific NLP tasks and assess their difficulty

### Score breakdown

| Exercise            | Points |
|---------------------|--------|
| [Exercise 1](#e1)   | 1      |
| [Exercise 2](#e2)   | 1      |
| [Exercise 3](#e3)   | 1      |
| [Exercise 4](#e4)   | 1      |
| [Exercise 5](#e5)   | 1      |
| [Exercise 6](#e6)   | 2      |
| [Exercise 7](#e7)   | 10     |
| [Exercise 8](#e8)   | 5      |
| [Exercise 9](#e9)   | 15     |
| [Exercise 10](#e10) | 10     |
| [Exercise 11](#e11) | 10     |
| [Exercise 12](#e12) | 5      |
| [Exercise 13](#e13) | 15     |
| [Exercise 14](#e14) | 3      |
| [Exercise 15](#e15) | 10     |
| [Exercise 16](#e16) | 10     |
| Total               | 100    |

This score will be scaled down to 1 and that will be your final lab score.

### 📌 **Instructions for Delivery** (📅 **Deadline: 18/Apr 18:00**, 🎭 *wildcards possible*)

✅ **Submission Requirements**
+ 📄 You need to submit a **PDF of your report** (use the templates provided in **LaTeX** 🖋️ (*preferred*) or **Word** 📑) and a **copy of your notebook** 📓 with the code.
+ ⚡ Make sure that **all cells are executed properly** ⚙️ and that **all figures/results/plots** 📊 you include in the report are also visible in your **executed notebook**.

✅ **Collaboration & Integrity**
+ 🗣️ While you may **discuss** the lab with others, you must **write your solutions with your group only**. If you **discuss specific tasks** with others, please **include their names** in the appendix of the report.
+ 📜 **Honor Code applies** to this lab. For more details, check **Syllabus §7.2** ⚖️.
+ 📢 **Mandatory Disclosure**:
   - Any **websites** 🌐 (e.g., **Stack Overflow** 💡) or **other resources** used must be **listed and disclosed**.
   - Any **GenAI tools** 🤖 (e.g., **ChatGPT**) used must be **explicitly mentioned**.
   - 🚨 **Failure to disclose these resources is a violation of academic integrity**. See **Syllabus §7.3** for details.

## 0. Setup

As in the last lab, we will be using huggingface datasets library ([https://huggingface.co/datasets](https://huggingface.co/datasets)). You can find the detailed documentation and tutorials here: [https://huggingface.co/docs/datasets/en/index](https://huggingface.co/docs/datasets/en/index)

If you don't have it installed you can run the code below or install it via `pip` in your terminal. If you are using Google Colab, you can uncomment and run the code below in a code cell. Restarting of the runtime may be required after installation (Runtime/Restart session).

In [1]:
# ! pip install -U datasets~=3.5.0
# ! pip install -U gensim
# ! python -m pip install -U matplotlib
# ! pip install nltk
# ! pip install -U scikit-learn

Previously installed datasets library version of 3.2.0 had an error when combined with numpy version >2. If you encounter an error at some point it might require to update the datasets library to the newer version. You can do that by running the code below. If you are using Google Colab, you can run the code below in a code cell. If you are using Jupyter Notebook, you can run the code below in a code cell or in a terminal.

In [2]:
# ! pip install --upgrade --force-reinstall datasets

As usual, we start by importing some essential Python libraries and we will be using. Apart from `gensim` (which is going to be used for word embeddings), we have already seen the others.

In [1]:
import re

import numpy as np
import matplotlib.pyplot as plt
import datasets
import tqdm
import gensim
import ssl
import string

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('word_tokenize')
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer


try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context



[nltk_data] Downloading package stopwords to
[nltk_data]     /home/moonshrike/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Error loading word_tokenize: Package 'word_tokenize' not
[nltk_data]     found in index
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/moonshrike/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 1. Load and Preprocess Data

*Sentence compression* involves rephrasing sentences to make them shorter while still retaining the original meaning. A reliable compression system would be valuable for mobile devices and could also serve as a component in an extractive summarization system.

The dataset we are going to use can be found on [Huggingface](https://huggingface.co/datasets/embedding-data/sentence-compression). It concerns a set of 180,000 pairs of sentences, aka it is a parallel corpus of sentences and their equivalent compressions. It has been collected by harvesting news articles from the Internet where the headline appears to be similar to the first sentence and that property is used to find an "extractive" compression of the sentence.

For example, for the sentence

`"Regulators Friday shut down a small Florida bank, bringing to 119 the number of US bank failures this year amid mounting loan defaults"`

the compressed equivalent (based on the dataset) is:

`"Regulators shut down small Florida bank"`.


For more information you can read the original paper (from Google) [here](https://aclanthology.org/D13-1155.pdf). We strongly recommend going over the paper to gain further insights. Notice that the paper is from 2013, therefore word embeddings have not been widely introduced yet in NLP tasks, meaning that the methods applied were based on the traditional NLP pipeline (feature extraction + ML).

### 1.1 Loading the Dataset

The dataset will be loaded as a Pandas DataFrame. This may take a few minutes because of the large size of the data.

Make sure to inspect the dataset and make sure it is imported properly.

In [2]:
ds = datasets.load_dataset('embedding-data/sentence-compression')
print(ds)

DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 180000
    })
})


In [3]:
for i in range(10):
    print(ds['train'][i])

{'set': ["The USHL completed an expansion draft on Monday as 10 players who were on the rosters of USHL teams during the 2009-10 season were selected by the League's two newest entries, the Muskegon Lumberjacks and Dubuque Fighting Saints.", 'USHL completes expansion draft']}
{'set': ['Major League Baseball Commissioner Bud Selig will be speaking at St. Norbert College next month.', 'Bud Selig to speak at St. Norbert College']}
{'set': ["It's fresh cherry time in Michigan and the best time to enjoy this delicious and nutritious fruit.", "It's cherry time"]}
{'set': ['An Evesham man is facing charges in Pennsylvania after he allegedly dragged his girlfriend from the side of his pickup truck on the campus of Kutztown University in the early morning hours of Dec. 5, police said.', 'Evesham man faces charges for Pa.']}
{'set': ["NRT LLC, one of the nation's largest residential real estate brokerage companies, announced several executive appointments within its Coldwell Banker Residential B

### 

The dataset comes with only the `train` split so we will have to split it ourselves.

In [4]:
split_ds = ds['train'].train_test_split(test_size=0.2)
print(split_ds)

DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 144000
    })
    test: Dataset({
        features: ['set'],
        num_rows: 36000
    })
})


### 1.2 Preprocessing the dataset
In this section we will prepare the dataset, aka clean the sentences and tokenize.

First, let's write the function to clean the text. It can be similar to the one from the previous lab (Lab1) but make sure that it makes sense for this dataset and task.

More specifically, think about lower-casing, punctuation, stop-words and lemmatization/stemming and the impact it might have on the dataset. Also reflect on the fact that with word embeddings we want to uncover semantic relationships between words, whereas with bag-of-words we were trying to capture different morphological variations.

<a name='e1'></a>
### Exercise 1: Clean function
(1p) Fill in the following function ot clean the dataset. Implement at least 3 different steps.

In [5]:
# initialize once for more efficiency
stop_words = stopwords.words('english')
lemmatizer = lemmantizer = WordNetLemmatizer()

def clean(text):
    """
    Cleans the given text
    Args:
        text: a str with the text to clean

    Returns: a str with the cleaned text

    """

    # Empty text
    if text == '':
        return text

    # 'text' from the example can be of type numpy.str_, let's convert it to a python str
    text = str(text)

    #you might need more
    #add them here

    ### YOUR CODE HERE
    
    # make all letters lowercase
    text = text.lower()

    # remove puncation
    for char in text:
        if (char in string.punctuation ):
            text = text.replace(char,"")
        
    # split text into words
    text_words = text.split()
    
    # remove stop words
    cleared_words = [] # cleared text as an array
    lemmantizer = WordNetLemmatizer() # create a lemmantizer
    for word in text_words:
        # add only non-stop words and lemantise them before adding
        if(word not in stop_words):
            cleared_words.append(lemmantizer.lemmatize(word))
        
    text = ' '.join(cleared_words) # cleared text as a string

    ### YOUR CODE ENDS HERE

    text = text.strip()

    # Update the example with the cleaned text
    return text



In [6]:

#print(clean("I was running with my 2 dogs and I saw a cat."))

The following function will apply the function (sic) you just wrote to the whole dataset. More specifically, it takes the first entry (`sentence`) from the set of uncompressed/compressed pairs, applies the `clean` function and saves the processed sentence in the field `clean_sentence`. The same is dome for the compressed version of the sentence (saved as `clean_compressed`).

In [7]:
def clean_dataset(example):
    """
    Cleans the sentence and compressed sentence in the example from the Dataset
    Args:
        example: an example from the Dataset

    Returns: updated example with 'clean_sentence' and 'clean_compressed' cleaned

    """
    sentence, compressed = example['set']
    clean_sentence = clean(sentence)
    clean_compressed = clean(compressed)
    example['clean_sentence'] = clean_sentence
    example['clean_compressed'] = clean_compressed
    return example



Below we apply the function to the whole dataset (using `map`) and we can also inspect the result.

In [8]:
split_ds = split_ds.map(clean_dataset)
print(split_ds)



Map:   0%|          | 0/144000 [00:00<?, ? examples/s]

Map:   0%|          | 0/36000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['set', 'clean_sentence', 'clean_compressed'],
        num_rows: 144000
    })
    test: Dataset({
        features: ['set', 'clean_sentence', 'clean_compressed'],
        num_rows: 36000
    })
})


Let's examine some examples from the dataset and make sure that we got the results we wanted. At this step, it might be necessary to revisit some pre-processing steps if you are not happy with the results.

In [9]:
for i in range(10):
    print(split_ds['train'][i])

{'set': ['Jeremy Hellickson of the Tampa Bay Rays has been named the American League rookie of the year by the Baseball Writers Association of America.', 'Jeremy Hellickson named American league rookie of the year'], 'clean_sentence': 'jeremy hellickson tampa bay ray named american league rookie year baseball writer association america', 'clean_compressed': 'jeremy hellickson named american league rookie year'}
{'set': ['A US delegation on Monday met Tibetan spiritual leader, the Dalai Lama, in this north Indian town to finalise details of his proposed US visit next month, a Tibetan official said.', 'US delegation meets Dalai Lama'], 'clean_sentence': 'u delegation monday met tibetan spiritual leader dalai lama north indian town finalise detail proposed u visit next month tibetan official said', 'clean_compressed': 'u delegation meet dalai lama'}
{'set': ['Foreign Minister Shah Mehmood Qureshi on Sunday said that Faisal Shahzad, the man involved in failed attack at Times Squre, is not 

<a name='e2'></a>
### Exercise 2: Tokenize function

(1p) As always, we will need to tokenize the dataset in order to create bat-of-words and TF-IDF representations in the next sections. We will use the [Natural Language Toolkit (NLTK) library]([https://www.nltk.org/]) (https://www.nltk.org/). Complete the following function to split the text into tokens using the `word_tokenize()` function. Check the [documentation](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html?highlight=word_tokenize) first.
Note that there are different tokenizers e.g. `RegexpTokenizer` where you can enter your own regexp, `WhitespaceTokenizer` (similar to Python's string.split()) and `BlanklineTokenizer`.

In [10]:
def tokenize(text):
    """
    Tokenizes the `text` parameter using nltk library
    Args:
        text: a string representing a sentence to be tokenized

    Returns: a list of tokens (strings)

    """

    ### YOUR CODE HERE

    tokens = word_tokenize(text)

    ### YOUR CODE ENDS HERE
    return tokens

Next, the function will be applied to the whole dataset (as we did with the pre-processing) and `sentence_tokens` field will be created to store the result.

In [11]:
def tokenize_dataset(example):
    """
    Tokenizes 'clean_sentence' columns in the example from the Dataset
    Args:
        example: an example from the Dataset

    Returns: updated example with 'sentence_tokens' columns

    """
    example['sentence_tokens'] = tokenize(example['clean_sentence'])
    example['compressed_tokens'] = tokenize(example['clean_compressed'])
    return example

In [12]:
split_ds = split_ds.map(tokenize_dataset)

Map:   0%|          | 0/144000 [00:00<?, ? examples/s]

Map:   0%|          | 0/36000 [00:00<?, ? examples/s]

In [13]:
for i in range(10):
    print(split_ds['train'][i])

{'set': ['Jeremy Hellickson of the Tampa Bay Rays has been named the American League rookie of the year by the Baseball Writers Association of America.', 'Jeremy Hellickson named American league rookie of the year'], 'clean_sentence': 'jeremy hellickson tampa bay ray named american league rookie year baseball writer association america', 'clean_compressed': 'jeremy hellickson named american league rookie year', 'sentence_tokens': ['jeremy', 'hellickson', 'tampa', 'bay', 'ray', 'named', 'american', 'league', 'rookie', 'year', 'baseball', 'writer', 'association', 'america'], 'compressed_tokens': ['jeremy', 'hellickson', 'named', 'american', 'league', 'rookie', 'year']}
{'set': ['A US delegation on Monday met Tibetan spiritual leader, the Dalai Lama, in this north Indian town to finalise details of his proposed US visit next month, a Tibetan official said.', 'US delegation meets Dalai Lama'], 'clean_sentence': 'u delegation monday met tibetan spiritual leader dalai lama north indian town 

Since we will need the tokenized sentences, we can use the following statement to extract them from the `train` split of our dataset.

In [14]:
tokenized_sentences = split_ds['train']['sentence_tokens']
print(len(tokenized_sentences))
print(tokenized_sentences[:10])

144000
[['jeremy', 'hellickson', 'tampa', 'bay', 'ray', 'named', 'american', 'league', 'rookie', 'year', 'baseball', 'writer', 'association', 'america'], ['u', 'delegation', 'monday', 'met', 'tibetan', 'spiritual', 'leader', 'dalai', 'lama', 'north', 'indian', 'town', 'finalise', 'detail', 'proposed', 'u', 'visit', 'next', 'month', 'tibetan', 'official', 'said'], ['foreign', 'minister', 'shah', 'mehmood', 'qureshi', 'sunday', 'said', 'faisal', 'shahzad', 'man', 'involved', 'failed', 'attack', 'time', 'squre', 'pakistani', 'naturalized', 'american', 'citizen'], ['prime', 'minister', 'ukraine', 'yulia', 'tymoshenko', 'say', 'corruption', 'team'], ['russian', 'president', 'dmitry', 'medvedev', 'said', 'wednesday', 'turkey', 'russia', 'real', 'strategic', 'partner'], ['lawyer', 'iraqi', 'interpreter', 'worked', 'british', 'army', 'lost', 'court', 'battle', 'asylum', 'condition', 'relaxed', 'yesterday'], ['manchester', 'england', 'brazil', 'striker', 'robinho', 'said', 'tuesday', 'dispute',

In [15]:
tokenized_compressed = split_ds['train']['compressed_tokens']
print(len(tokenized_compressed))
print(tokenized_compressed[:10])

144000
[['jeremy', 'hellickson', 'named', 'american', 'league', 'rookie', 'year'], ['u', 'delegation', 'meet', 'dalai', 'lama'], ['faisal', 'shahzad', 'pakistani'], ['corruption', 'team'], ['medvedev', 'say', 'turkey', 'russia', 'real', 'strategic', 'partner'], ['iraqi', 'interpreter', 'lose', 'court', 'battle'], ['robinho', 'say', 'dispute', 'man', 'city'], ['european', 'stock', 'take', 'breather'], ['shammi', 'kapoor'], ['arsenal', 'offer', 'arsene', 'wenger', 'job', 'life']]


Notice the difference in the types of the different structures we use. Run the following cell to check the types. Do they make sense to you?

In [16]:
#type of original dataset
print(type(split_ds))
print("--")
#type of original sentence
print(split_ds['train'][1])
print(type(split_ds['train'][1]))
print("--")
#type of pre-proceesed sentence
print(split_ds['train']['clean_sentence'][1])
print(type(split_ds['train']['clean_sentence'][1]))
print("--")
#type of tokenized sentence
print(split_ds['train']['sentence_tokens'][1])
print(type(split_ds['train']['sentence_tokens'][1]))
print("--")

<class 'datasets.dataset_dict.DatasetDict'>
--
{'set': ['A US delegation on Monday met Tibetan spiritual leader, the Dalai Lama, in this north Indian town to finalise details of his proposed US visit next month, a Tibetan official said.', 'US delegation meets Dalai Lama'], 'clean_sentence': 'u delegation monday met tibetan spiritual leader dalai lama north indian town finalise detail proposed u visit next month tibetan official said', 'clean_compressed': 'u delegation meet dalai lama', 'sentence_tokens': ['u', 'delegation', 'monday', 'met', 'tibetan', 'spiritual', 'leader', 'dalai', 'lama', 'north', 'indian', 'town', 'finalise', 'detail', 'proposed', 'u', 'visit', 'next', 'month', 'tibetan', 'official', 'said'], 'compressed_tokens': ['u', 'delegation', 'meet', 'dalai', 'lama']}
<class 'dict'>
--
u delegation monday met tibetan spiritual leader dalai lama north indian town finalise detail proposed u visit next month tibetan official said
<class 'str'>
--
['u', 'delegation', 'monday', 'm

## 2. Bag of Words
In this section you will built a bag-of-words representation of the dataset. We will use numpy arrays to store the results. The bag-of-words representation is a simple and effective way to represent text data. It involves creating a vocabulary of unique words from the dataset and representing each sentence as a vector of word counts. We first need the vocabulary, which we will build from both the full sentences and the compressed sentences. Similar to the first lab, the vocabulary will be a list of unique words from the dataset.

<a name='e3'></a>
### Exercise 3: Extracting vocabulary counts

(1p) In the following cell, you will implement a function that takes a list of tokenized sentences and returns a dictionary with the counts of each word in the vocabulary. The dictionary should be of the form {word: count}. As in previous lab, you will use the `Counter` class from the `collections` module to do this.

In [17]:
from collections import Counter


def extract_vocabulary_counts(tokenized_sentences):
    """
    Extracts the vocabulary from the tokenized sentences
    Args:
        tokenized_sentences: a list of lists of tokens

    Returns: a Counter object with the counts of each word in the vocabulary
    """

    ### YOUR CODE HERE

    # retun token count for all sentences in tokenized_sentences
    return Counter(token for sentence in tokenized_sentences for token in sentence)

    ### YOUR CODE ENDS HERE

In [18]:
vocab_counter = extract_vocabulary_counts(tokenized_sentences + tokenized_compressed)
print(len(vocab_counter))
print(vocab_counter.most_common(10))

111524
[('new', 20259), ('said', 19924), ('year', 12255), ('man', 12153), ('u', 11607), ('today', 10136), ('police', 9782), ('two', 9317), ('state', 8957), ('say', 8843)]


As you can see the size of the vocabulary is quite large. Like the last time, we will limit the vocabulary to the most frequent words. The next cell will create a dictionary that maps each word to an index in the vocabulary. This will be used to create the bag-of-words representation of the sentences.

In [19]:
vocab_size = 10_000
vocab = vocab_counter.most_common(vocab_size)
token_to_id = {word: i for i, (word, _) in enumerate(vocab)}
#print(token_to_id)

<a name='e4'></a>
### Exercise 4: Bag of Words
(1p) Here we will create the bag-of-words representation of the sentences. The function will take a single sentence (list of tokens) and return an array of size `vocab_size` with the counts of each word in the vocabulary. The
`vocab_size` is calculated as the length of the passed `token_to_id` dictionary. The resulting array should have zeros everywhere but the indices corresponding to the words in the vocabulary where it should have the counts of the words in the sentence. For example, if the sentence is `['fox', 'and', 'deer']` and the vocabulary is `{'fox': 0, 'and': 1, 'deer': 2}`, the resulting array should be `[1, 1, 1]`. If the sentence is `['fox', 'and', 'fox', 'deer']`, the resulting array should be `[2, 1, 1]`.

In [20]:
def bag_of_words(sentence, token_to_id):
    """
    Creates a bag-of-words representation of the sentence
    Args:
        sentence: a list of tokens
        token_to_id: a dictionary mapping each word to an index in the vocabulary

    Returns: a numpy array of size vocab_size with the counts of each word in the vocabulary

    """
    vocab_size = len(token_to_id)
    bow = np.zeros(vocab_size, dtype=int)

    ### YOUR CODE HERE

    # iterate through words
    for word in sentence:
        if word in token_to_id:
            bow[token_to_id[word]] += 1

    ### YOUR CODE ENDS HERE

    return bow

def bag_of_words_boost(sentence, token_to_id):
    vocab_size = len(token_to_id)
    bow = np.zeros(vocab_size, dtype=int)
    
    for word in sentence:
            # if a word is in vocabulary
            if(word in token_to_id):
                idx = token_to_id[word]
                if vocab_counts:
                    # boost rare words using inverse frequency
                    word_freq = vocab_counts.get(word, 1)
                    boost = 1 + np.log10(1 + 1 / word_freq)
                    bow[idx] += boost
                else:
                    # regular count
                    bow[idx] += 1
    

    return bow


Let's see how the function works on a single sentence. The output should be a numpy array of size `vocab_size` with the counts of each word in the vocabulary.

In [21]:
print('Tokenized sentence:')
print(tokenized_sentences[0])
sentence_bow = bag_of_words(tokenized_sentences[0], token_to_id)

print('Bag of words:')
print(sentence_bow)
print('Type of bag of words:')
print(type(sentence_bow))
print('Shape of bag of words:')
print(sentence_bow.shape)
print('Non-zero elements in bag of words:')
print(np.nonzero(sentence_bow)[0])

Tokenized sentence:
['jeremy', 'hellickson', 'tampa', 'bay', 'ray', 'named', 'american', 'league', 'rookie', 'year', 'baseball', 'writer', 'association', 'america']
Bag of words:
[0 0 1 ... 0 0 0]
Type of bag of words:
<class 'numpy.ndarray'>
Shape of bag of words:
(10000,)
Non-zero elements in bag of words:
[   2   89  125  359  466  612  798  962 1130 2031 2400 2884 3366]


We can also check in detail what words and their counts are in the bag-of-words representation.

In [22]:
sentence_non_zero_bow = np.nonzero(sentence_bow)[0]
print('Non-zero elements in bag of words:')
print(sentence_non_zero_bow)
for i in sentence_non_zero_bow:
    print(vocab[i][0], ':', sentence_bow[i])

Non-zero elements in bag of words:
[   2   89  125  359  466  612  798  962 1130 2031 2400 2884 3366]
year : 1
american : 1
league : 1
america : 1
named : 1
association : 1
bay : 1
baseball : 1
writer : 1
ray : 1
tampa : 1
rookie : 1
jeremy : 1


The following function will apply all the steps we implemented to a single sentence. It returns a bag of words representation that we will use to calculate the similarity between different sentences.

In [23]:
def embed_text(text, clean_fn, tokenize_fn, embed_fn):
    cleaned = clean_fn(text)
    tokens = tokenize_fn(cleaned)
    embedding = embed_fn(tokens)
    return embedding

<a name='e5'></a>
### Exercise 5: Cosine Similarity between two vectors

(1p) Complete the following function that given any two vectors will compute the cosine similarity. If you don't remember the formula for the cosine similarity, revisit the course material. Notice that the function receives numpy arrays and recall that you can express cosine similarity as a dot product. Use numpy functions to write an efficient implementation.

In [24]:
def cosine_similarity1(vector1, vector2):
    """
    Computes the cosine similarity between two vectors
    Args:
        vector1: numpy array of the first vector
        vector2: numpy array of the second vector

    Returns: cosine similarity

    """
    ### YOUR CODE HERE
    
    # calculate the numenator and the denumenator of the formula
    numerator = np.dot(vector1, vector2)
    denom = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    # return calculated cosine similarities between vectors using the formula if the denominator is non zero, and 0 otherwise
    similarity = np.divide(numerator, denom, out=np.zeros_like(numerator), where=denom!=0)[0]

    return similarity
    ### YOUR CODE ENDS HERE

In [25]:
def cosine_similarity(vector1, vector2):
    """
    Computes the cosine similarity between two vectors
    Args:
        vector1: numpy array of the first vector
        vector2: numpy array of the second vector

    Returns: cosine similarity
    """
    numerator = np.dot(vector1, vector2)
    denom = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    
    if denom != 0:
        return numerator / denom
    else:
        return 0.0


In [26]:
cosine_similarity(np.array([0, 1, 2]), np.array([0, 2, 4]))

0.9999999999999998

In [27]:
sentences = [
    'The quick brown fox jumps over the lazy dog.',
    'Some interesting document containin sentences.',
    'The quick brown fox jumps over the lazy cat and some other stuff.',
    'Fox and deer are not friends.',
    'Fox and deer are not friends. But this document is a lot longer than the previous one. We can add sentence by sentence and see how the embeddings change.',
]
embedded_sentences = [
    embed_text(sentence, clean, tokenize, lambda x: bag_of_words(x, token_to_id))
    for sentence in sentences
]

query = 'fox and deer'
embedded_query = embed_text(query, clean, tokenize, lambda x: bag_of_words(x, token_to_id))

cosine_similarities = [
    cosine_similarity(embedded_query, embedded_sentence)
    for embedded_sentence in embedded_sentences
]
print(f'Query: {query}')
for sent, cos_sim in zip(sentences, cosine_similarities):
    print(f'Cosine Similarity: {cos_sim:.4f} - Sentence: {sent}')

Query: fox and deer
Cosine Similarity: 0.3162 - Sentence: The quick brown fox jumps over the lazy dog.
Cosine Similarity: 0.0000 - Sentence: Some interesting document containin sentences.
Cosine Similarity: 0.2887 - Sentence: The quick brown fox jumps over the lazy cat and some other stuff.
Cosine Similarity: 0.8165 - Sentence: Fox and deer are not friends.
Cosine Similarity: 0.3651 - Sentence: Fox and deer are not friends. But this document is a lot longer than the previous one. We can add sentence by sentence and see how the embeddings change.


We will apply the function to the whole dataset. This might take a while, so be patient. The result will be stored in the `sentence_bow` and `compressed_bow` fields of the dataset.

In [28]:
def bag_of_words_dataset(example):
    """
    Creates a bag-of-words representation of the sentence and compressed sentence in the example from the Dataset
    Args:
        example: an example from the Dataset

    Returns: updated example with 'sentence_bow' and 'compressed_bow' columns

    """
    sentence_tokens = example['sentence_tokens']
    compressed_tokens = example['compressed_tokens']

    sentence_bow = bag_of_words(sentence_tokens, token_to_id)
    compressed_bow = bag_of_words(compressed_tokens, token_to_id)

    example['sentence_bow'] = sentence_bow
    example['compressed_bow'] = compressed_bow
    return example

The following cell will apply the function to the whole dataset. The result will be stored in the `sentence_bow` and `compressed_bow` fields of the dataset. We will also convert the dataset's fields `sentence_bow` and `compressed_bow` to numpy format for easier manipulation.

In [29]:
test_ds = split_ds['test'].map(bag_of_words_dataset)
test_ds = test_ds.with_format('np', columns=['sentence_bow', 'compressed_bow'], dtype=float)
print(test_ds)

Map:   0%|          | 0/36000 [00:00<?, ? examples/s]

Dataset({
    features: ['set', 'clean_sentence', 'clean_compressed', 'sentence_tokens', 'compressed_tokens', 'sentence_bow', 'compressed_bow'],
    num_rows: 36000
})


Let's check the results. The `sentence_bow` and `compressed_bow` fields should contain the bag-of-words representation of the sentences and compressed sentences, respectively.

In [30]:
print(test_ds[0])
sentence_non_zero_bow = np.nonzero(test_ds[0]['sentence_bow'])[0]
print('Non-zero elements in bag of words:')
print(sentence_non_zero_bow)
for i in sentence_non_zero_bow:
    print(vocab[i][0], ':', sentence_bow[i])

{'sentence_bow': array([0., 1., 0., ..., 0., 0., 0.]), 'compressed_bow': array([0., 0., 0., ..., 0., 0., 0.])}
Non-zero elements in bag of words:
[   1    3   29   60   66   85  328  503 1603]
said : 0
man : 0
service : 0
found : 0
fire : 0
house : 0
body : 0
emergency : 0
severe : 0


In [31]:
sentences_bows = test_ds['sentence_bow']
print(sentences_bows.shape)

(36000, 10000)


Now we can start building a retriever based on the bag of words representation. The first step is to calculate the cosine similarity between two vectors.

<a name='e6'></a>
### Exercise 6: Cosine Similarity between a vector and an array of vectors

(2p) The next step in our retrieval system, would be to calculate the proximity of a query to our retrieval corpus (in our case that is all the sentences).

Complete the following function to calculate the cosine similarity between a vector (first parameter `vector`, that will usually be the query vector) and all other vectors (second parameter `other_vectors`, that will be the sentence embeddings in our case). Note that the `other_vectors` parameter is a single numpy array of size `N x D`, where $N$ is the number of vectors and $D$ is the dimension of each vector.

For maximum efficiency (we will need it) do not use loops. Try to write the implementation with numpy functions. Hint: matrix multiplication can be seen as calculating the dot product between rows and columns of the multiplied matrices.

In [32]:
def cosine_similarity_1_to_n(vector, other_vectors):
    """
    Calculates the cosine similarity between a single vector and other vectors.
    Args:
        vector: a numpy array representing a vector of D dimensions
        other_vectors: a 2D numpy array representing other vectors (of the size NxD, where N is the number of vectors and D is their dimension)

    Returns: a 1D numpy array of size N containing the cosine similarity between the vector and all the other vectors

    """

    #### YOUR CODE HERE

    # calculate the numenator and the denumenator of the formula
    numerator = np.dot(other_vectors, vector)
    denom = np.linalg.norm(vector) * np.linalg.norm(other_vectors, axis=1)
    # return calculated cosine similarities between vectors using the formula if the denominator is non zero, and 0 otherwise
    return np.divide(numerator, denom, out=np.zeros_like(numerator), where=denom!=0)

    ### YOUR CODE ENDS HERE

We will use the function to calculate the similarity of all sentences in the dataset to our query.

In [33]:
query = 'fox and deer'
embedded_query = embed_text(query, clean, tokenize, lambda x: bag_of_words(x, token_to_id))

In [34]:
query_similarity = cosine_similarity_1_to_n(embedded_query, sentences_bows)
print(query_similarity.shape)
for notnan in query_similarity:
    if notnan is np.nan:
        print(notnan)

(36000,)


The following cell will select the most similar sentence.

In [35]:
most_similar = int(np.argmax(query_similarity))
print(most_similar)
print(query_similarity[most_similar])
print(split_ds['test'][most_similar]['set'][0])

18943
0.4472135954999579
Information is being sought in a deer poaching case east of Deer Lodge this week.


The following function will return the indices of the top-k elements in the array.

In [36]:
def top_k_indices(array, k, sorted=True):
    """
    Returns top-k indices from the 1D array. If `sorted` is `True` the returned indices are sorted in the descending order
    Args:
        array: a 1D numpy array
        k: a number of top indices to return
        sorted: if True, the returned indices are sorted in descending order

    Returns: a 1D array containing top-k indices

    """
    top_k = np.argpartition(array, -k)[-k:]
    if sorted:
        selected = array[top_k]
        sorted_selected = (-selected).argsort()
        top_k = top_k[sorted_selected]
    return top_k

In [37]:
top_indices = top_k_indices(query_similarity, k=10).tolist()
for idx in top_indices:
    print(split_ds['test'][idx]['set'][0])
    print(f'similarity: {query_similarity[idx]}')

Information is being sought in a deer poaching case east of Deer Lodge this week.
similarity: 0.4472135954999579
Fox News announced Friday that it has hired Herman Cain, the spirited former presidential candidate and ex-pizza CEO, as a contributor for Fox News Channel and Fox Business Network.
similarity: 0.41602514716892186
Deer ticks are apparently on the rise.
similarity: 0.40824829046386296
A Deer Lodge man was shot and killed in a weekend hunting accident at a campground southeast of Deer Lodge.
similarity: 0.36514837167011066
Large-scale power upgrades being eyed for the Red Deer region fail to look at renewable alternatives, says a Red Deer city councillor.
similarity: 0.32444284226152503
The Red Sox today claimed Twins pitcher Matt Fox on waivers, less than a week after Fox made a strong start for Minnesota in his major league debut.
similarity: 0.3086066999241838
-Megan Fox is to play Catwoman in the next Batman film.
similarity: 0.2886751345948129
In his broadcast pilot direc

<a name='e7'></a>
### Exercise 7: Analyzing and improving BOW search results

Experiment with different queries (taking into account the nature of the dataset and your insights from the analysis so far).
Answer the following questions:
- (5p) Does the search perform well? When does it fail? Discuss several examples that are we get an expected but also unexpected results (find at least 3 from each category). Provide reasons for the good/bad result in each case (e.g. is there some error in the data, is there some linguistic phenomenon that we don't capture, is something wrong with our modeling with average embeddings, ...)
- (5p) If you see problems with search, how could you improve your implementation? Change the functions above, if you think there is room for improvement. Describe your changes and how they made the search better or (in case you made no changes) explain what made the search robust enough to work well.

In [38]:
vocab_counts = Counter(word for sentence in tokenized_sentences for word in sentence)

In [39]:
#### YOUR CODE HERE

queries = [
    "police arrest",
    "house fire",
    "dog barking",
    "man died",
    "a woman reading a book",     
    "a dog is chasing a cat",
    "he discussed neutrinos",
    "she saw a bat",
    "money bank",
    "river bank",
   "shoots suspect",
    "suspect shoots"
]

for query in queries:
    print(f"\nQuery: '{query}'")

    embedded_query = embed_text(query, clean, tokenize, lambda x: bag_of_words(x, token_to_id))

    similarity_scores = cosine_similarity_1_to_n(embedded_query, test_ds["sentence_bow"])

    top_indices = top_k_indices(similarity_scores, k=3)

    for rank, idx in enumerate(top_indices):
        idx = int(idx)
        original_sentence = split_ds["test"][idx]["set"][0]
        compressed_version = split_ds["test"][idx]["set"][1]
        score = similarity_scores[idx]
        print(f"Rank {rank + 1}:")
        print(f"Original:   {original_sentence}")
        print(f"Compressed: {compressed_version}")
        print(f"Similarity: {score:.4f}\n")


### YOUR CODE ENDS HERE


Query: 'police arrest'
Rank 1:
Original:   A man is under arrest Tuesday morning after leading police on a chase, ramming a police cruiser, and then trying to run from officers.
Compressed: Man arrested after ramming police cruiser
Similarity: 0.5669

Rank 2:
Original:   The man who claims Wilkes-Barre police used excessive force to arrest him was formally charged Thursday with resisting arrest and disorderly conduct.
Compressed: Man charged with resisting arrest
Similarity: 0.5477

Rank 3:
Original:   A Honolulu Police officer lost part of his finger during an arrest.
Compressed: Officer loses part of finger during arrest
Similarity: 0.5345


Query: 'house fire'
Rank 1:
Original:   A Clarksville house caught fire Friday afternoon after a back-porch grill exploded, igniting the house in flames, authorities said.
Compressed: Clarksville house catches fire
Similarity: 0.6124

Rank 2:
Original:   A house in the Huntley area surrounded by floodwaters caught fire over the weekend.
Compress

// your comments

## 3. Term Frequency - Inverse Document Frequency (TF-IDF)

In this section we will implement the TF-IDF algorithm. While BOW is a simple way to represent the documents, it has some limitations. For example, it does not take into account the importance of each word in the document. TF-IDF representation takes into account the frequency of each word in the document and the frequency of the word in the whole dataset. It is a widely used technique in information retrieval and text mining. Refer to the lecture slides for more details.

<a name='e8'></a>
### Exercise 8: Inverse Document Frequency (IDF)
(5p) In this exercise, you will implement the TF-IDF algorithm. First, calculate Inverse Document Frequency (IDF) for each word in the vocabulary. Intuitively, it is a measure of how informative a word is based on the whole dataset. Consult the lecture slides for the details. The IDF is calculated as follows:
$$
IDF(t) = log_{10}(N/df(t))$$
where $N$ is the total number of documents (sentences) in the dataset and $df(t)$ is the number of documents containing the word $t$.


In [40]:
def calculate_idf(bows):
    """
    Calculates the IDF for each word in the vocabulary
    Args:
        bows: numpty array of size (N x D) where N is the number of documents and D is the vocabulary size

    Returns: a numpy array of size D with IDF values for each token
    """

    ### YOUR CODE HERE

    N = bows.shape[0] # number of documents
    df = np.sum(bows > 0, axis = 0) # document frequency = how many documents contain each word
    idf = np.log10(N / (df + 1)) # smoothing to avoid division by zero

    return idf

    ### YOUR CODE ENDS HERE

idf = calculate_idf(sentences_bows)

<a name='e9'></a>
### Exercise 9: TF-IDF
- (5p) Calculate TF-IDF on the `test` subset of the dataset.
- (5p) Analyze the search results based on your implemented TF-IDF. Does the search perform well? When does it fail? Discuss several examples that are we get an expected but also unexpected results (find at least 3 from each category). Provide reasons for the good/bad result in each case (e.g. is there some error in the data, is there some linguistic phenomenon that we don't capture, is something wrong with our modeling with average embeddings, ...)
- (5p) Compare the results with the ones you got with the bag-of-words representation. Discuss the differences and similarities. Do you think TF-IDF is a better representation for this task? Why or why not? Provide examples to support your arguments.


In [41]:
### YOUR CODE HERE

def calculate_tfidf(bows, idf):
    """
    Calculates the TF- IDF matrix from BoW and IDF values
    Args:
        bows: numpy array of size (N x D) where N is the number of documents and D is the vocabulary size
        idf: numpy array of size D with IDF values for each token
    Returns: 
        tfidf_matrix: numpy array of size (N x D) with TF-IDF values
    """

    # TF formula
    tf = np.log10(1 + bows)

    # multiply TF x IDF
    tfidf = tf * idf

    return tfidf


### YOUR CODE ENDS HERE

### YOU CAN ADD MORE CELLS

In [42]:
# calculate TF-IDF on the test subset 
test_tfidf_sentence = calculate_tfidf(test_ds["sentence_bow"], idf).astype(np.float32)
test_tfidf_compressed = calculate_tfidf(test_ds["compressed_bow"], idf).astype(np.float32)
print("TF-IDF shape:", test_tfidf_compressed.shape)


TF-IDF shape: (36000, 10000)


In [45]:
queries = [
    #"house explosion", 
    "volcano eruption",
    #"child kidnapped",
   # "child abducted",
    #"earthquake damage",
   # "wildfire",
  #  "ebola",
   # "woman killed",
   # "female dies",
    
    #"police arrest",
   #"house fire",
   #"dog barking",
   #"man died",
   #"a woman reading a book",     
   #"a dog is chasing a cat",
    #"he discussed neutrinos",
   # "she saw a bat",
   #"money bank",
  # "river bank",
  # "shoots suspect",
  # "suspect shoots"
]

for query in queries:
    print(f"\nQuery: '{query}'")

    embedded_query = embed_text(query, clean, tokenize, lambda x: bag_of_words(x, token_to_id))

    query_tfidf = np.log10(1 + embedded_query) * idf 

    similarity_scores = cosine_similarity_1_to_n(query_tfidf, test_tfidf_sentence)

    top_indices = top_k_indices(similarity_scores, k=3)

    #print(f"Query BoW: {query_bow}")
    print(f"Query TF-IDF: {query_tfidf}")
    print(f"Sum TF-IDF: {np.sum(query_tfidf)}")

    for rank, idx in enumerate(top_indices):
        idx = int(idx)
        original_sentence = split_ds["test"][idx]["set"][0]
        compressed_version = split_ds["test"][idx]["set"][1]
        score = similarity_scores[idx]
        print(f"Rank {rank + 1}:")
        print(f"Original:   {original_sentence}")
        print(f"Compressed: {compressed_version}")
        print(f"Similarity: {score:.4f}\n")


Query: 'volcano eruption'
Query TF-IDF: [0. 0. 0. ... 0. 0. 0.]
Sum TF-IDF: 1.0843281035708567
Rank 1:
Original:   Big Island police are investigating two shooting incidents in Volcano late Tuesday.
Compressed: Big Island police investigate two shootings
Similarity: 0.5148

Rank 2:
Original:   Katy Perry has climbed a volcano, taken up meditation and got ``s**t out of her chakras''.
Compressed: Katy Perry climbs volcano
Similarity: 0.4774

Rank 3:
Original:   Philippines authorities warned that Mayon, one of the country's most active volcanoes, is showing signs of life and could erupt again soon.
Compressed: Philippines warns Mayon volcano may erupt soon
Similarity: 0.4414



## 4. Word Embeddings

In this section you will load the pre-trained word embeddings model - Glove. You can read more about it [here](https://aclanthology.org/D14-1162/) ([https://aclanthology.org/D14-1162/](https://aclanthology.org/D14-1162/)). The embeddings are trained on a large corpus of text and are available in different dimensions. We will start with the dimension of 100, but later you will be asked to experiment with other dimensions.
Gensim library maintains a storage containing some pre-trained models. You can read more about it [here](https://github.com/piskvorky/gensim-data) ([https://github.com/piskvorky/gensim-data](https://github.com/piskvorky/gensim-data)). Be sure to read the README of this repository.

Let's first load the info of what models are available.

In [47]:
import json
import gensim.downloader as api

info = api.info()  # show info about available models/datasets
print(json.dumps(info['models'], indent=2))

{
  "fasttext-wiki-news-subwords-300": {
    "num_records": 999999,
    "file_size": 1005007116,
    "base_dataset": "Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens)",
    "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/fasttext-wiki-news-subwords-300/__init__.py",
    "license": "https://creativecommons.org/licenses/by-sa/3.0/",
    "parameters": {
      "dimension": 300
    },
    "description": "1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).",
    "read_more": [
      "https://fasttext.cc/docs/en/english-vectors.html",
      "https://arxiv.org/abs/1712.09405",
      "https://arxiv.org/abs/1607.01759"
    ],
    "checksum": "de2bb3a20c46ce65c9c131e1ad9a77af",
    "file_name": "fasttext-wiki-news-subwords-300.gz",
    "parts": 1
  },
  "conceptnet-numberbatch-17-06-300": {
    "num_records": 1917247,
    "file_size": 1225497562,
    "base_dataset": "ConceptN

In [48]:
glove_model = api.load("glove-wiki-gigaword-100")

We can use the loaded model's `key_to_index` attribute to retrieve the whole vocabulary (aka for how many words we learned embeddings for).

In [49]:
vocab = list(glove_model.key_to_index)
print(len(vocab))

400000


Let's explore a bit further the embeddings. In the following cells, the embedding of a single word is returned. Double-check the dimensions (as sanity check). This is like inspecting the `W` matrix (weights) that we discussed in the lecture.

In [49]:
# vector of a particular model. note that it is 100 dimensional as specified.
glove_model['what']

array([-1.5180e-01,  3.8409e-01,  8.9340e-01, -4.2421e-01, -9.2161e-01,
        3.7988e-02, -3.2026e-01,  3.4119e-03,  2.2101e-01, -2.2045e-01,
        1.6661e-01,  2.1956e-01,  2.5325e-01, -2.9267e-01,  1.0171e-01,
       -7.5491e-02, -6.0406e-02,  2.8194e-01, -5.8519e-01,  4.8271e-01,
        1.7504e-02, -1.2086e-01, -1.0990e-01, -6.9554e-01,  1.5600e-01,
        7.0558e-02, -1.5058e-01, -8.1811e-01, -1.8535e-01, -3.6863e-01,
        3.1650e-02,  7.6616e-01,  8.4041e-02,  2.6928e-03, -2.7440e-01,
        2.1815e-01, -3.5157e-02,  3.2569e-01,  1.0032e-01, -6.0932e-01,
       -7.0316e-01,  1.8299e-01,  3.3134e-01, -1.2416e-01, -9.0542e-01,
       -3.9157e-02,  4.4719e-01, -5.7338e-01, -4.0172e-01, -8.2234e-01,
        5.5740e-01,  1.5101e-01,  2.4598e-01,  1.0113e+00, -4.6626e-01,
       -2.7133e+00,  4.3273e-01, -1.6314e-01,  1.5828e+00,  5.5081e-01,
       -2.4738e-01,  1.4184e+00, -1.6867e-02, -1.9368e-01,  1.0090e+00,
       -5.9864e-02,  9.1853e-01,  4.3022e-01, -2.0624e-01,  7.61

Gensim objects offers different methods to easily run very common tasks. For example, there are different functions to find the most similar words.

Check the documentation on how [`most_similar`](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html) and [`similar_by_word`](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.similar_by_word.html) can be used.

In [50]:
# most similar words to a given word
print(glove_model.most_similar('what', topn=10))

# also u can use
print(glove_model.similar_by_word('miss', topn=5))

[('how', 0.9303215742111206), ('why', 0.9196363091468811), ('fact', 0.906943678855896), ('know', 0.8876389265060425), ('that', 0.8810365796089172), ('think', 0.8772969841957092), ('so', 0.8753098249435425), ('even', 0.8751895427703857), ('something', 0.874744176864624), ('if', 0.8702542781829834)]
[('play', 0.6266524791717529), ('missed', 0.608065128326416), ('she', 0.596325695514679), ('chance', 0.5839369297027588), ('tournament', 0.572258710861206)]


In [51]:
print(glove_model.most_similar('why', topn=10))

[('know', 0.944094181060791), ('what', 0.9196362495422363), ('think', 0.9086559414863586), ('how', 0.9020735621452332), ('tell', 0.8923122882843018), ("n't", 0.8890628814697266), ('sure', 0.8870969414710999), ('thought', 0.8747684955596924), ('believe', 0.8745115399360657), ('say', 0.8730075359344482)]


In [52]:
print(glove_model.similar_by_word('who', topn=5))

[('whom', 0.8642492890357971), ('he', 0.8201969861984253), ('whose', 0.8143677711486816), ('had', 0.8035843968391418), ('others', 0.7708418965339661)]


We can now compare our implementation with the one in the pre-trained model and confirm what we already expected.

In [None]:
# simalarity between two words
word1 = 'alive'
word2 = 'biology'
print(glove_model.similarity(word1, word2))
print(cosine_similarity(glove_model[word1], glove_model[word2]))

In [None]:
# simalarity between two words. similar words
word1 = 'alive'
word2 = 'life'
print(glove_model.similarity(word1, word2))
print(cosine_similarity(glove_model[word1], glove_model[word2]))

In [None]:
# simalarity between two words. dissimilar words
word1 = 'alive'
word2 = 'dead'
print(glove_model.similarity(word1, word2))
print(cosine_similarity(glove_model[word1], glove_model[word2]))

In [None]:
# simalarity between two words. unrelated words
word1 = 'alive'
word2 = 'horse'
print(glove_model.similarity(word1, word2))
print(cosine_similarity(glove_model[word1], glove_model[word2]))

In [None]:
# simalarity between two SAME words
glove_model.similarity('equal', 'equal')
word1 = 'equal'
word2 = 'equal'
print(glove_model.similarity(word1, word2))
print(cosine_similarity(glove_model[word1], glove_model[word2]))

The next function contains the code to plot a similarity matrix between multiple words (e.g. if we want to compare 10 words and their pair-wise similarities). It requires a matrix with similarities (as input) and labels (aka the words) to display in the final figure.

In [None]:
def plot_similarity_matrix(matrix, labels):
    """
    Displays a plot of the `matrix` of size (N x N) with the labels specified as a list of size N
    Args:
        matrix: a square-sized (N x N) numpy array
        labels: a list of strings of hte size N
    """

    fig, ax = plt.subplots()
    im = ax.imshow(matrix)

    # Show all ticks and label them with the respective list entries
    ax.set_xticks(np.arange(len(labels)), labels=labels)
    ax.set_yticks(np.arange(len(labels)), labels=labels)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    for i in range(len(labels)):
        for j in range(len(labels)):
            text = ax.text(j, i, f'{matrix[i, j]:.2f}',
                           ha="center", va="center", color="w")

    # ax.set_title("Give a title if you want")
    fig.tight_layout()
    plt.show()

<a name='e10'></a>
### Exercise 10: Plotting similarities between words

(10p) In the following, we will explore some properties of word embeddings through some examples. We will use 6 example words for this purpose but experiment with other set of words as well. Fill in the next cell to create a similarity matrix between a list of words.

Experiment with different words and their similarities plotted. Try at least 3 different sets of words of at least 6 words each. Use the `plot_similarity_matrix` function to visualize the results.
Comment on the results. Do they make sense? Why some words are closer to each other than others? What does it mean?

In [None]:
list_of_words = ['love', 'hate', 'life', 'equal', 'alive', 'dead']

similarity_matrix = np.zeros((len(list_of_words), len(list_of_words)), dtype=float)

### YOUR CODE HERE

embeddings = [glove_model[word] for word in list_of_words] # get embeddings from the glove_model

for i in range (len(list_of_words)):
    for j in range (len(list_of_words)):
        similarity_matrix[i, j] = glove_model.similarity(list_of_words[i], list_of_words[j])
        

### YOUR CODE ENDS HERE

plot_similarity_matrix(similarity_matrix, list_of_words)

In [None]:
def get_similarity_matrix(word_list, model):
    """
    Computes the cosine similarity matrix for a list of words using a given Gensim model.
    
    Args:
        word_list (list): List of words (strings) to compare.
        model (KeyedVectors): Pretrained word embedding model.
    
    Returns:
        np.ndarray: Square similarity matrix (len(word_list) x len(word_list)).
    """
    size = len(word_list)
    sim_matrix = np.zeros((size, size))

    for i in range(size):
        for j in range(size):
            try:
                sim_matrix[i, j] = model.similarity(word_list[i], word_list[j])
            except KeyError:
                sim_matrix[i, j] = np.nan  # in case word is not in vocabulary

    return sim_matrix


In [None]:
#### YOUR CODE HERE

# countries and capitals
list_of_words1 = ['france', 'paris', 'italy', 'rome', 'germany', 'berlin']
matrix1 = get_similarity_matrix(list_of_words1, glove_model)
plot_similarity_matrix(matrix1, list_of_words1)

# technology
list_of_words2 = ['computer', 'keyboard', 'internet', 'email', 'phone', 'screen']
matrix2 = get_similarity_matrix(list_of_words2, glove_model)
plot_similarity_matrix(matrix2, list_of_words2)

# fruit and vegetables
list_of_words3 = ['apple', 'banana', 'orange', 'fruit', 'vegetable', 'cucumber', 'carrot']
matrix3 = get_similarity_matrix(list_of_words3, glove_model)
plot_similarity_matrix(matrix3, list_of_words3)

# multiple meanings of bank 
list_of_words4 = ['bank', 'river', 'loan', 'water', 'money', 'shore']
matrix4 = get_similarity_matrix(list_of_words4, glove_model)
plot_similarity_matrix(matrix4, list_of_words4)

# animals
list_of_words5 = ['dog', 'cat', 'mouse', 'lion', 'tiger', 'wolf']
matrix5 = get_similarity_matrix(list_of_words5, glove_model)
plot_similarity_matrix(matrix5, list_of_words5)

# emotions
list_of_words6 = ['happy', 'sad', 'angry', 'joyful', 'depressed', 'excited']
matrix6 = get_similarity_matrix(list_of_words6, glove_model)
plot_similarity_matrix(matrix6, list_of_words6)

# animals
list_of_words7 = ['doctor', 'nurse', 'teacher', 'housekeeper', 'scientist', 'engineer', 'man', 'woman']
matrix7 = get_similarity_matrix(list_of_words7, glove_model)
plot_similarity_matrix(matrix7, list_of_words7)


### YOUR CODE ENDS HERE

<a name='e11'></a>
### Exercise 11: Other pre-trained word embeddings
(10p) For this exercise, experiment with at least one different word embedding model. You can choose Glove with different dimensions or other pre-trained models. Use the gensim library to download and use the models.
Plot similarity matrices between sets of words you used in the previous exercise and compare the results. Are there noticeable differences? Why (not)?

In [None]:
#### YOUR CODE HERE
# Glove with 200 dimensions
glove_model_200 = api.load("glove-wiki-gigaword-200")

# PLOT SIMILARITY MATRICES
# emotions and cocepts
list_of_words0 = ['love', 'hate', 'life', 'equal', 'alive', 'dead']
matrix0 = get_similarity_matrix(list_of_words0, glove_model_200)
plot_similarity_matrix(matrix0, list_of_words0)

# countries and capitals
list_of_words1 = ['france', 'paris', 'italy', 'rome', 'germany', 'berlin']
matrix1 = get_similarity_matrix(list_of_words1, glove_model_200)
plot_similarity_matrix(matrix1, list_of_words1)

# technology
list_of_words2 = ['computer', 'keyboard', 'internet', 'email', 'phone', 'screen']
matrix2 = get_similarity_matrix(list_of_words2, glove_model_200)
plot_similarity_matrix(matrix2, list_of_words2)

# fruit and vegetables
list_of_words3 = ['apple', 'banana', 'orange', 'fruit', 'vegetable', 'cucumber', 'carrot']
matrix3 = get_similarity_matrix(list_of_words3, glove_model_200)
plot_similarity_matrix(matrix3, list_of_words3)

# multiple meanings of bank 
list_of_words4 = ['bank', 'river', 'loan', 'water', 'money', 'shore']
matrix4 = get_similarity_matrix(list_of_words4, glove_model_200)
plot_similarity_matrix(matrix4, list_of_words4)

# animals
list_of_words5 = ['dog', 'cat', 'mouse', 'lion', 'tiger', 'wolf']
matrix5 = get_similarity_matrix(list_of_words5, glove_model_200)
plot_similarity_matrix(matrix5, list_of_words5)

# emotions
list_of_words6 = ['happy', 'sad', 'angry', 'joyful', 'depressed', 'excited']
matrix6 = get_similarity_matrix(list_of_words6, glove_model_200)
plot_similarity_matrix(matrix6, list_of_words6)

# animals
list_of_words7 = ['doctor', 'nurse', 'teacher', 'housekeeper', 'scientist', 'engineer', 'man', 'woman']
matrix7 = get_similarity_matrix(list_of_words7, glove_model_200)
plot_similarity_matrix(matrix7, list_of_words7)


### YOUR CODE ENDS HERE

In [None]:
#### YOUR CODE HERE
# Glove with 200 dimensions
glove_model_300 = api.load("glove-wiki-gigaword-300")

# PLOT SIMILARITY MATRICES
# emotions and cocepts
list_of_words0 = ['love', 'hate', 'life', 'equal', 'alive', 'dead']
matrix0 = get_similarity_matrix(list_of_words0, glove_model_300)
plot_similarity_matrix(matrix0, list_of_words0)

# countries and capitals
list_of_words1 = ['france', 'paris', 'italy', 'rome', 'germany', 'berlin']
matrix1 = get_similarity_matrix(list_of_words1, glove_model_300)
plot_similarity_matrix(matrix1, list_of_words1)

# technology
list_of_words2 = ['computer', 'keyboard', 'internet', 'email', 'phone', 'screen']
matrix2 = get_similarity_matrix(list_of_words2, glove_model_300)
plot_similarity_matrix(matrix2, list_of_words2)

# fruit and vegetables
list_of_words3 = ['apple', 'banana', 'orange', 'fruit', 'vegetable', 'cucumber', 'carrot']
matrix3 = get_similarity_matrix(list_of_words3, glove_model_300)
plot_similarity_matrix(matrix3, list_of_words3)

# multiple meanings of bank 
list_of_words4 = ['bank', 'river', 'loan', 'water', 'money', 'shore']
matrix4 = get_similarity_matrix(list_of_words4, glove_model_300)
plot_similarity_matrix(matrix4, list_of_words4)

# animals
list_of_words5 = ['dog', 'cat', 'mouse', 'lion', 'tiger', 'wolf']
matrix5 = get_similarity_matrix(list_of_words5, glove_model_300)
plot_similarity_matrix(matrix5, list_of_words5)

# emotions
list_of_words6 = ['happy', 'sad', 'angry', 'joyful', 'depressed', 'excited']
matrix6 = get_similarity_matrix(list_of_words6, glove_model_300)
plot_similarity_matrix(matrix6, list_of_words6)

# animals
list_of_words7 = ['doctor', 'nurse', 'teacher', 'housekeeper', 'scientist', 'engineer', 'man', 'woman']
matrix7 = get_similarity_matrix(list_of_words7, glove_model_300)
plot_similarity_matrix(matrix7, list_of_words7)


### YOUR CODE ENDS HERE

// your comments

## 5. Sentence Embeddings by Averaging Word Embeddings

Word embeddings are a powerful model for representing words and their meaning (in terms of distributional similarity). As we discussed in class, we can use them in a wide variety of tasks with more complex architectures. Word vectors offer a dense vector for each word. What if we wanted to represent a sentence (or a document) based on word vectors. How can we do that?

In the course, we will see different architectures that take into account the sequence of words (by combining their vectors). A first naive but simple and sometimes (as we are going to see) quite effective approach would be to represent a sentence with an embedding vector that is the average of the word vectors that form the sentence.

So formally, this is what we are aiming for:

$
\text{Sentence_Embedding} = \frac{1}{N} \sum_{i=1}^{N} \text{Word_Embedding}_i
$

where:
* $N$ is the number of words in a sentence
* $\text{Word_Embedding}_i$ is the word vector for the $i$-th in the sentence.

Things to note:
* The embedding vector for the sentence will obviously have the same dimension as the word embedding.
* This representation ignores the word order (like bag-of-words). During the course we will see how we can overcome this limitation by using sequence models.

<a name='e12'></a>
### Exercise 12: Sentence Embedding

(10p) Complete the function below that takes as input the sentence in the form of tokens (so it's a list of words) and calculates the sentence embedding vector. First, we would need to retrieve the word embeddings for each word from our loaded model and then average the vectors.

Note: There can be cases where all tokens from a sentence are out-of-vocabulary words (OOV). Think what to do in this case and make sure to discuss it in the report.

In [43]:
def embed_sentence_word_model(tokens, model):
    """
    Calculates the sentence embedding by averaging the embeddings of the tokens
    Args:
        tokens: a list of words from the sentence
        model: a trained word embeddings model

    Returns: a numpy array of the sentence embedding

    """
    #### YOUR CODE HERE
    #### CAUTION: be sure to cover the case where all tokens are out-of-vocabulary!!!
    
    embeddings = []

    for token in tokens:
        if token in model:
            embeddings.append(model[token])
    
    if embeddings:
        return np.mean(embeddings, axis=0).astype(np.float32)
    else:
        return np.zeros(model.vector_size, dtype=np.float32)


    ### YOUR CODE ENDS HERE

Now we can apply the function to the whole dataset. Here we do it both for the sentence and the compressed version. You should know it by now, but this operation might take some time. The next cells will apply your function to the whole dataset.

In [44]:
def embed_sentence_word_model_dataset(batch, model):
    """
    Embeds the sentence and the compressed sentence in the example from the Dataset
    Args:
        example: an example from the Dataset
        model: a trained word embeddings model

    Returns: updated example with 'sentence_embedding' and 'compressed_embedding' columns

    """
    sentence_embeddings = []
    compressed_embeddings = []

    for sentence_tokens, clean_compressed in zip(batch['sentence_tokens'], batch['clean_compressed']):
        compressed_tokens = tokenize(clean_compressed)

        sentence_emb = embed_sentence_word_model(sentence_tokens, model)
        compressed_emb = embed_sentence_word_model(compressed_tokens, model)

        sentence_embeddings.append(sentence_emb)
        compressed_embeddings.append(compressed_emb)

    return {
        'sentence_embedding': sentence_embeddings,
        'compressed_embedding': compressed_embeddings
    }

In [45]:
# print(embed_sentence_word_model(tokenize(test_ds["clean_compressed"][9000]), glove_model))

In [50]:
# print(test_ds)
# test_ds = test_ds.drop(labels=[3998,3999,4000], axis=0)
test_ds = test_ds.map(embed_sentence_word_model_dataset, batched=True, fn_kwargs={'model': glove_model})
print(test_ds)

Map:   0%|          | 0/36000 [00:00<?, ? examples/s]

Dataset({
    features: ['set', 'clean_sentence', 'clean_compressed', 'sentence_tokens', 'compressed_tokens', 'sentence_bow', 'compressed_bow', 'sentence_embedding', 'compressed_embedding'],
    num_rows: 36000
})


In [51]:
for i in range(10):
    print(test_ds[i])

{'sentence_bow': array([0., 1., 0., ..., 0., 0., 0.]), 'compressed_bow': array([0., 0., 0., ..., 0., 0., 0.]), 'sentence_embedding': array([-0.14741662,  0.07123321,  0.10402681, -0.10834732, -0.15147848,
        0.205651  , -0.099011  ,  0.54496443, -0.0609332 ,  0.21596679,
       -0.07181107, -0.00627701,  0.45264497,  0.21881101,  0.1735857 ,
       -0.071009  ,  0.1037539 , -0.08429476, -0.54376632, -0.041152  ,
        0.136438  ,  0.0103625 ,  0.24306491, -0.21950999,  0.06047898,
        0.09458999, -0.26778138, -0.2930851 , -0.27239969,  0.19758835,
        0.25323242,  0.11666069, -0.13734141,  0.13306598, -0.10340102,
        0.104436  , -0.0991879 ,  0.10467128,  0.36001641,  0.16786169,
       -0.54087913, -0.25011152,  0.28711182, -0.26517931,  0.393758  ,
        0.20817521, -0.10303351, -0.04896061,  0.0264765 , -0.42847919,
        0.1686606 , -0.30473787,  0.20146599,  0.88145906, -0.060067  ,
       -1.99929404, -0.05896598, -0.1532636 ,  1.46415293,  0.68316501,
   

Here you can see that the new dataset returned a single numpy array containing all sentence embeddings in our dataset. This is a lot more efficient than returning a list of arrays (which is the default behaviour). Below we check the type and the dimensionality.

We will be using `text` subset from our dataset to not use too much RAM.

In [52]:
sent_embedding = test_ds['sentence_embedding']
compr_embedding = test_ds['compressed_embedding']
print(type(sent_embedding))
print(sent_embedding.shape)
print(type(compr_embedding))
print(compr_embedding.shape)

<class 'numpy.ndarray'>
(36000, 100)
<class 'numpy.ndarray'>
(36000, 100)


Next we try the condensed representatin based on a simple query. Feel free to try different queries with different words. What happens if we have OOV words in a query?

In [53]:
query = 'fox and deer'
print(query)

query_embedding = embed_text(query, clean, tokenize, lambda x: embed_sentence_word_model(x, glove_model))
print(query_embedding.shape)
print(query_embedding)

fox and deer
(100,)
[-0.083385   -0.63719     0.254075   -0.65408504 -0.12458149 -0.57695
  0.1151895   0.47937998 -0.05304    -1.16922    -0.479265   -0.117604
  0.67665505 -0.2181845   0.86132     0.46395     0.149115    0.352275
 -0.00642499  0.42123    -0.108593    0.51331     0.1666895  -0.19212
  0.552625    0.777605   -0.17123    -0.00441501 -0.62225     0.180445
  0.263865    0.50187397  0.22722     0.504755    0.49651998  0.29571
 -0.273479    0.186705    0.66161     0.42644    -0.10568     0.00512
 -0.715195    0.4271345  -0.19899699  0.007035    0.217155   -0.270865
 -0.20389499 -0.205845   -0.847275   -0.370445    0.102978    0.56117
 -0.4170775  -1.50664    -0.1817145   0.170645    0.528605    0.030635
  0.0378895   0.82403004  0.55395997 -0.18147999  1.257       0.539876
  0.069414   -0.21291    -0.002545    1.04205    -0.26154    -0.20421065
 -0.1738185  -0.096082   -0.14954099 -0.209345   -0.749255    0.131181
 -0.7742      0.45635498  0.2506     -0.2636805  -0.01942   

<a name='e13'></a>
### Exercise 13: Analyze sentence embeddings
- (5p) Calculate similarity between the word embeddings representations of the selected queries and the dataset sentences.
- (5p) Analyze the search results. Does the search work as expected? Discuss the results.
- (5p) Compare the results with the ones you got with the bag-of-words and TF-IDF representation. Discuss the differences and similarities.

In [54]:
### YOUR CODE HERE
queries = [
    "police arrest",
    "house fire",
    "dog barking",
    "man died",
    "a woman reading a book",     
    "a dog is chasing a cat",
    "he discussed neutrinos",
    "she saw a bat",
    "money bank",
    "river bank",
    "shoots suspect",
    "suspect shoots"
]

for query in queries:
    print(f"\nQuery: '{query}'")

    query_embedding = embed_text(query, clean, tokenize, lambda x: embed_sentence_word_model(x, glove_model))

    similarity_scores_sentence = cosine_similarity_1_to_n(query_embedding, test_ds['sentence_embedding'])
    similarity_scores_compressed = cosine_similarity_1_to_n(query_embedding, test_ds['compressed_embedding'])


    top_indices = top_k_indices(similarity_scores_sentence, k=3)

    for rank, idx in enumerate(top_indices):
        idx = int(idx)
        original_sentence = split_ds["test"][idx]["set"][0]
        compressed_version = split_ds["test"][idx]["set"][1]
        score_sent = similarity_scores_sentence[idx]
        scire_comp = similarity_scores_compressed[idx]
        print(f"Rank {rank + 1}:")
        print(f"Original:   {original_sentence}")
        print(f"Compressed: {compressed_version}")
        print(f"Similarity Original: {score_sent:.4f}\n")
        print(f"Similarity Compressed: {scire_comp:.4f}\n")



# To be discussed in the report

### YOUR CODE ENDS HERE


Query: 'police arrest'
Rank 1:
Original:   A Cumberland veteran is in custody after threatening to shoot police officers if they tried to arrest him.
Compressed: Cumberland veteran 'threatened to shoot police'
Similarity Original: 0.9005

Similarity Compressed: 0.7945

Rank 2:
Original:   Easley police are investigating an attempted armed robbery of a cash-advance store, according to police.
Compressed: Easley police investigating attempted robbery
Similarity Original: 0.8968

Similarity Compressed: 0.8383

Rank 3:
Original:   A man, who was arrested on the charge of pilferage of copper wire, allegedly committed suicide in police custody at Dehri police station in Bihar's Rohtas district on Saturday, police said.
Compressed: Man allegedly commits suicide in police custody
Similarity Original: 0.8863

Similarity Compressed: 0.8690


Query: 'house fire'
Rank 1:
Original:   A Clarksville house caught fire Friday afternoon after a back-porch grill exploded, igniting the house in flames, a

## 6. Evaluating Retrieval

In this last section we will try to evaluate how good our sentence retrieval system is. To keep the computational resources manageable, we will use the test set for that as its size is more manageable.

Recall from the lecture in IR that there are several metrics to evaluate retrieval performance by taking into account the relevance of the retrieved results to the query. We will use Recall@K here (for more metrics and more details refer to the lecture slides and the textbooks).

RRecall@K is a metric used to measure the effectiveness of a search system in retrieving relevant documents within the top $K$ retrieved documents. It calculates the proportion of relevant documents retrieved within the top-$K$ results, compared to the total number of relevant documents in the collection.

$
\text{Recall@K} = \frac{\text{Number of relevant documents retrieved in the top }-K}{\text{Total number of relevant documents}}
$

In our case, we have a sentence, and it's compressed version. To test our system, we will treat compressed sentences as the queries. Each query will have only a single relevant sentence - the corresponding uncompressed sentence.

Therefore, for the calculation of Recall@K we will take into account whether the correct retrieved result is contained within the first $K$ retrieved results. For example, if for a query (i.e. a compressed sentence) we retrieve 10 results and within these we see the relevant one (i.e. the full sentence), then Recall@10 = 1.

<a name='e14'></a>
### Exercise 14: Cosine similarity between two sets of vectors

(3p) In this exercise you will revisit your implementation of the cosine similarity. Generalize it so that it can accept two arrays containing two sets of vectors (first one containing $M$ vectors and the second one $N$ vectors). Compute the cosine similarity between each pair of vectors coming from the two sets. The result should be an array of size $M x N$.

Once again, try to write an efficient code. This means no loops. Remember the relation between matrix multiplication and dot product. (Depending on your implementation of the previous function calculating cosine similarity, this one can be almost the same)

In [55]:
def cosine_similarity_m_to_n(vectors, other_vectors):
    """
    Calculates the cosine similarity between a multiple vectors and other vectors.
    Args:
        vectors: a numpy array representing M number of vectors of D dimensions (of the size MxD)
        other_vectors: a 2D numpy array representing other vectors (of the size NxD, where N is the number of vectors and D is their dimension)

    Returns: a numpy array of cosine similarity between all the vectors and all the other vectors

    """

    #### YOUR CODE HERE
    
    dot_product = np.dot(vectors, other_vectors.T)  # shape: (M, N)

    # Compute norms
    vectors_norm = np.linalg.norm(vectors, axis=1, keepdims=True)  # shape: (M, 1)
    other_vectors_norm = np.linalg.norm(other_vectors, axis=1, keepdims=True).T  # shape: (1, N)

    # Compute cosine similarity
    denom = vectors_norm * other_vectors_norm  # shape: (M, N)
    cosine_sim = np.divide(dot_product, denom, out=np.zeros_like(dot_product), where=denom!=0)

    return cosine_sim

    ### YOUR CODE ENDS HERE

The following function will use your implementation to calculate Recall@K based on the similarity matrix.

In [56]:
def calculate_recall(queries, sentences, k, batch_size=1000):
    """
    Calculates recall@k given the embeddings of the queries and sentences.
    Assumes that only a single sentence with the same index as query is relevant.
    Batching is implemented to avoid high memory usage.
    Args:
        queries: a numpy array with the embeddings of N queries
        sentences: a numpy array with the embeddings of N sentences available for retrieval
        k: number of top results to search for the relevant sentence
        batch_size: number of queries to process at a time

    Returns: calculated recall@k

    """
    n_queries = queries.shape[0]
    correct = np.zeros(n_queries, dtype=bool)

    with tqdm.tqdm(total=n_queries) as pbar:
        for batch_start in range(0, n_queries, batch_size):
            batch_end = min(batch_start + batch_size, n_queries)
            queries_batch = queries[batch_start:batch_end]
            batch_similarity = cosine_similarity_m_to_n(queries_batch, sentences)

            for i, similarity_row in enumerate(batch_similarity):
                query_index = batch_start + i
                top_k = top_k_indices(similarity_row, k=k, sorted=False)

                if query_index in top_k:
                    correct[query_index] = True

                pbar.update(1)

    recall = np.sum(correct) / n_queries
    return recall

You can use it like so:

In [None]:
recall_at_1 = calculate_recall(compr_embedding, sent_embedding, k=3, batch_size=100)
print(f'\n{recall_at_1 * 100:.2f}%')

<a name='e15'></a>
### Exercise 15: Evaluating retrieval methods

(10p) Calculate recall for different values of $K$ for all methods:
- BOW,
- TF-IDF,
- Pre-trained embeddings.

Discuss the results.
Comment on how recall changes based on the value of $K$. Are the results expected or surprising?

In [57]:
print(test_ds)

Dataset({
    features: ['set', 'clean_sentence', 'clean_compressed', 'sentence_tokens', 'compressed_tokens', 'sentence_bow', 'compressed_bow', 'sentence_embedding', 'compressed_embedding'],
    num_rows: 36000
})


In [58]:
#### YOUR CODE HERE

methods = {
    "BOW" : (test_ds["sentence_bow"], test_ds["compressed_bow"]),
    "TFIDF" : (test_tfidf_sentence, test_tfidf_compressed),
    "model" : (test_ds["sentence_embedding"], test_ds["compressed_embedding"])
}

k_vals = [1, 3, 5, 10]

# results = {}

for method, (sentences, queries) in methods.items():
    print(f"\nEvaluating method: {method}")
    # results[method] = []
    for k in k_vals:
        recall = calculate_recall(queries, sentences, k=k, batch_size=1000)
        print(f"Recall@{k}: {recall * 100:.2f}%")
        # results[method].append((k, recall))


### YOUR CODE ENDS HERE


Evaluating method: BOW


100%|███████████████████████████████████████████████| 36000/36000 [06:14<00:00, 96.18it/s]


Recall@1: 78.97%


100%|███████████████████████████████████████████████| 36000/36000 [06:18<00:00, 95.14it/s]


Recall@3: 88.73%


100%|███████████████████████████████████████████████| 36000/36000 [06:46<00:00, 88.52it/s]


Recall@5: 91.31%


100%|███████████████████████████████████████████████| 36000/36000 [07:14<00:00, 82.93it/s]


Recall@10: 93.86%

Evaluating method: TFIDF


100%|██████████████████████████████████████████████| 36000/36000 [03:53<00:00, 153.92it/s]


Recall@1: 80.43%


100%|██████████████████████████████████████████████| 36000/36000 [02:43<00:00, 219.57it/s]


Recall@3: 89.71%


100%|██████████████████████████████████████████████| 36000/36000 [02:28<00:00, 241.62it/s]


Recall@5: 92.28%


100%|██████████████████████████████████████████████| 36000/36000 [02:31<00:00, 237.90it/s]


Recall@10: 94.59%

Evaluating method: model


100%|█████████████████████████████████████████████| 36000/36000 [00:14<00:00, 2507.00it/s]


Recall@1: 62.64%


100%|█████████████████████████████████████████████| 36000/36000 [00:23<00:00, 1512.66it/s]


Recall@3: 73.52%


100%|█████████████████████████████████████████████| 36000/36000 [00:23<00:00, 1531.26it/s]


Recall@5: 77.58%


100%|█████████████████████████████████████████████| 36000/36000 [00:22<00:00, 1622.58it/s]

Recall@10: 82.41%





In [None]:
# discussion in the report

<a name='e16'></a>
### Exercise 16: Improving retrieval

(10p) Imagine that you work at a company and are tasked with delivering the best retrieval method. Select the most promising one and try to improve the scores (e.g. by changing the vocab size, loading different model, etc.).
Discuss the results you achieve, even if you didn't manage to improve the scores.

In [None]:
#### YOUR CODE HERE
ftmodel = api.load("conceptnet-numberbatch-17-06-300")


[====----------------------------------------------] 9.7% 113.0/1168.7MB downloaded

IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)





IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)





In [None]:
def embed_sentence_word_ftmodel_dataset(batch, model):
    """
    Embeds the sentence and the compressed sentence in the example from the Dataset
    Args:
        example: an example from the Dataset
        model: a trained word embeddings model

    Returns: updated example with 'sentence_embedding' and 'compressed_embedding' columns

    """
    sentence_embeddings = []
    compressed_embeddings = []

    for sentence_tokens, clean_compressed in zip(batch['sentence_tokens'], batch['clean_compressed']):
        compressed_tokens = tokenize(clean_compressed)

        sentence_emb = embed_sentence_word_model(sentence_tokens, model)
        compressed_emb = embed_sentence_word_model(compressed_tokens, model)

        sentence_embeddings.append(sentence_emb)
        compressed_embeddings.append(compressed_emb)

    return {
        'sentence_ft': sentence_embeddings,
        'compressed_ft': compressed_embeddings
    }

In [None]:
test_ds = test_ds.map(embed_sentence_word_w2vmodel_dataset, batched=True, fn_kwargs={'model': ftmodel})
print(test_ds)

In [None]:
methods = {
    "ft" : (test_ds['sentence_ft'], test_ds['compressed_ft'])
}

k_vals = [1, 3, 5, 10]


for method, (sentences, queries) in methods.items():
    print(f"\nEvaluating method: {method}")
    for k in k_vals:
        recall = calculate_recall(queries, sentences, k=k, batch_size=1000)
        print(f"Recall@{k}: {recall * 100:.2f}%")


### YOUR CODE ENDS HERE

// your comments