# CS549 Machine Learning
# Assignment 10: Word embedding

**Total points: 10**

In this assignment, you will practice using pre-trained word embeddings for word-semantics tasks.

## Task 1. Cosine similarity and Euclidean distance

**Points: 2**

First, load the pre-trained embeddings using the `torchtext` package.

In [58]:
import torchtext
import torch
import os

In [59]:
# Create a folder to save cached vectors.
cache_dir = os.path.join(os.getcwd(), '.vector_cache')
if not os.path.exists(cache_dir):
    os.mkdir(cache_dir)

glove = torchtext.vocab.GloVe(name="6B", # trained on Wikipedia 2014 + Gigaword 5 (6B tokens)
                              dim=50, # embedding size = 50
                              cache=cache_dir # You can change it to a different directory where you wish to save the vectors
                              )

fasttext = torchtext.vocab.FastText(language='en',
                                    cache=cache_dir) # The fasttext cache file is ~6GB

Examine the loaded pre-trained embeddings. You can use the `.vectors` attribute to access the embedding matrix, which is a `torch.Tensor` of shape (vocab_size, embedding_size).

In [60]:
print(type(glove.vectors))
print(glove.vectors.shape)

print(type(fasttext.vectors))
print(fasttext.vectors.shape)

<class 'torch.Tensor'>
torch.Size([400000, 50])
<class 'torch.Tensor'>
torch.Size([2519370, 300])


---
Next, inspect the vectors of individual words. Indexing an embedding object with a word returns that word's vector.

In [61]:
# print(glove['cat'])
print(glove['cat'].shape)

# print(fasttext['cat'])
print(fasttext['cat'].shape)

torch.Size([50])
torch.Size([300])


---

Now implement a function that computes the Euclidean distance using `torch.norm()`. The function should return a scalar `torch.Tensor`, so that the test code can call `.item()` to retrieve its value.

The Euclidean distance between two vectors $x=[x_1,x_2,\dots,x_n]$ and $y=[y_1,y_2,\dots,y_n]$ is the 2-norm of their difference $x-y$:

$$\text{Euclidean distance} = \sqrt{\sum_i(x_i - y_i)^2}$$

A larger Euclidean distance between two word vectors means that the words are further apart semantically.
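To make the formula concrete, here is a minimal plain-Python sketch on two made-up 3-dimensional vectors. It does not use `torch` and is not the assignment solution; the helper name `euclid_dist_plain` is hypothetical and only mirrors the arithmetic that a 2-norm of the difference performs.

```python
import math

def euclid_dist_plain(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Made-up toy vectors, for illustration only.
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

# The differences are (-3, -4, 0), so the squared sum is 9 + 16 + 0 = 25.
print(euclid_dist_plain(x, y))  # 5.0
```

The distance of a vector to itself is 0, which is why a word is always its own nearest neighbour under this metric.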

In [62]:
### START YOUR CODE ###
def euclid_dist(vec1, vec2) -> torch.Tensor:
    return torch.norm(vec1 - vec2)  # 2-norm of the difference, returned as a scalar tensor
### END YOUR CODE ###

# Do not change the code below
print('GloVe distance:')
print('cat <-> dog: {:.4f}'.format(euclid_dist(glove['cat'], glove['dog']).item()))
print('cat <-> tree: {:.4f}'.format(euclid_dist(glove['cat'], glove['tree']).item()))

print('Fasttext distance:')
print('cat <-> dog: {:.4f}'.format(euclid_dist(fasttext['cat'], fasttext['dog']).item()))
print('cat <-> tree: {:.4f}'.format(euclid_dist(fasttext['cat'], fasttext['tree']).item()))

GloVe distance:
cat <-> dog: 1.8846
cat <-> tree: 4.5569
Fasttext distance:
cat <-> dog: 3.4784
cat <-> tree: 4.9178


**Expected output**

GloVe distance:\
cat <-> dog: 1.8846\
cat <-> tree: 4.5569

Fasttext distance:\
cat <-> dog: 3.4784\
cat <-> tree: 4.9178

---

The other metric, cosine similarity, measures similarity rather than distance. Thus, a larger cosine similarity score between two words indicates closer meanings.
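For reference, the cosine similarity of two vectors $x$ and $y$ is their dot product divided by the product of their 2-norms:

$$\text{cosine similarity} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$

The plain-Python sketch below, on made-up 2-dimensional vectors, illustrates one practical difference between the two metrics: cosine similarity depends only on direction, while Euclidean distance also depends on length. The helper names `cosine_plain` and `euclid_plain` are hypothetical and are not part of `torch`.

```python
import math

def cosine_plain(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

def euclid_plain(x, y):
    """Euclidean distance, for comparison."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Made-up toy vectors, for illustration only.
a = [1.0, 2.0]
b = [2.0, 4.0]    # same direction as a, twice the length
c = [-2.0, 1.0]   # orthogonal to a

print(cosine_plain(a, b))  # very close to 1: same direction
print(euclid_plain(a, b))  # greater than 0: the lengths differ
print(cosine_plain(a, c))  # 0.0: orthogonal vectors
```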

In [63]:
# GloVe
v1 = glove['dog'].unsqueeze(0)
v2 = glove['cat'].unsqueeze(0)
v3 = glove['tree'].unsqueeze(0)

### START YOUR CODE ###
s1 = torch.cosine_similarity(v1, v2) # Compute the cosine similarity between v1 and v2 using torch.cosine_similarity()
s2 = torch.cosine_similarity(v2, v3) # between v2 and v3
### END YOUR CODE ###

print('GloVe distance:')
print('dog <-> cat: {:.4f}'.format(s1.item()))
print('cat <-> tree: {:.4f}'.format(s2.item()))

GloVe distance:
dog <-> cat: 0.9218
cat <-> tree: 0.5661


**Expected output**

GloVe distance:\
dog <-> cat: 0.9218\
cat <-> tree: 0.5661

---

In [64]:
# Fasttext
v1 = fasttext['dog'].unsqueeze(0)
v2 = fasttext['cat'].unsqueeze(0)
v3 = fasttext['tree'].unsqueeze(0)

### START YOUR CODE ###
s1 = torch.cosine_similarity(v1, v2) # Compute the cosine similarity between v1 and v2 using torch.cosine_similarity()
s2 = torch.cosine_similarity(v2, v3) # between v2 and v3
### END YOUR CODE ###

print('Fasttext distance:')
print('dog <-> cat: {:.4f}'.format(s1.item()))
print('cat <-> tree: {:.4f}'.format(s2.item()))

Fasttext distance:
dog <-> cat: 0.6381
cat <-> tree: 0.3314


**Expected output**

Fasttext distance:\
dog <-> cat: 0.6381\
cat <-> tree: 0.3314

---

## Task 2. Nearest words in embedding space

**Points: 4**

Search the entire vocabulary for the words that are closest to a given point in the embedding space.

You can use `embeddings.vectors` to access the embedding matrix for all the words. `torch.cosine_similarity()` can be applied to the target embedding and the entire embedding matrix, and it returns the similarity score between the target word and every word in the vocabulary (including the target word itself!). So, when you use `topk()` to pick the `n` most similar words, you need to ask for `n+1` results, because the top result is always the target word itself.

When using Euclidean distance, the `n` words with the smallest distances should be picked instead, which you can do by setting `largest=False` in `torch.topk()`.

`torch.topk()` returns the top *k* **values** of a tensor together with their **indices**. Refer to the documentation here: <https://pytorch.org/docs/stable/generated/torch.topk.html>
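The `n+1` detail can be sketched in plain Python on a made-up four-word vocabulary. This is only an illustration of the selection logic, not the `torch` solution: `heapq.nlargest` stands in for `torch.topk()`, and the names `toy`, `cosine_plain`, and `nearest_words_plain` are hypothetical.

```python
import heapq
import math

def cosine_plain(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Made-up 2-d "embeddings", for illustration only.
toy = {
    'cat':  [1.0, 0.1],
    'dog':  [0.9, 0.2],
    'tree': [0.1, 1.0],
    'car':  [-1.0, 0.0],
}

def nearest_words_plain(vectors, target_word, n=2):
    target = vectors[target_word]
    # Score every word in the vocabulary, including the target word itself.
    scored = [(cosine_plain(target, vec), word) for word, vec in vectors.items()]
    # Ask for n+1 results: the target word scores highest against itself.
    top = heapq.nlargest(n + 1, scored)
    # Drop the first hit (the target word) and keep the remaining n.
    return [word for _, word in top[1:]]

print(nearest_words_plain(toy, 'cat', n=2))  # ['dog', 'tree']
```

For Euclidean distance the same pattern applies with the smallest scores instead, for example with `heapq.nsmallest` in this sketch.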

In [67]:
def nearest_words(embeddings, target_word, method='cosine', n=5):
    assert method in ['cosine', 'euclidean']

    ### START YOUR CODE ###
    target_emb = embeddings[target_word] # Get the embedding of target word
    ### END YOUR CODE ###

    if method == 'cosine':
        ### START YOUR CODE ###
        scores = torch.cosine_similarity(target_emb.unsqueeze(0), embeddings.vectors, dim=1) # Compute similarity scores between target word and all the words in vocabulary; unsqueeze makes the (1, emb) vs (vocab, emb) broadcast explicit
        values, indices = torch.topk(scores, n+1) # Hint: use torch.topk(), with n+1
        ### END YOUR CODE ###
    else:
        ### START YOUR CODE ###
        scores = torch.norm(target_emb - embeddings.vectors, dim=1) # Compute Euclidean distances between target word and all the words in vocabulary
        values, indices = torch.topk(scores, n+1, largest=False) # Hint: use torch.topk(), with n+1
        ### END YOUR CODE ###

    for val, idx in zip(values[1:], indices[1:]):
        print('{}: {:.4f}'.format(embeddings.itos[idx], val.item()))

What is the closest word to "cat", according to cosine similarity?

In [68]:
# Do not change the test code here
# Glove
print('Glove')
nearest_words(glove, 'cat', n=5)

# Fasttext
print()
print('Fasttext:')
nearest_words(fasttext, 'cat', method='euclidean', n=5)

Glove
dog: 0.9218
rabbit: 0.8488
monkey: 0.8041
rat: 0.7892
cats: 0.7865

Fasttext:
cats: 3.2251
dog: 3.4784
kitten: 3.5513
kittens: 3.6621
fluffykittens: 3.7981


**Expected output**

Glove\
dog: 0.9218\
rabbit: 0.8488\
monkey: 0.8041\
rat: 0.7892\
cats: 0.7865

Fasttext:\
cats: 3.2251\
dog: 3.4784\
kitten: 3.5513\
kittens: 3.6621\
fluffykittens: 3.7981

---

## Task 3. Word analogies

**Points: 4**

Implement the classical word analogy task.

You are given three words, for instance, w1 = *boy*, w2 = *girl*, w3 = *brother*, and you are to find the word that is to *brother* as *girl* is to *boy*. In this example, the most likely answer is *sister*.

Assume the embeddings for w1, w2, and w3 are e1, e2, and e3, respectively. Compute the offset $d = e2 - e1$ and add it to e3 to obtain the target embedding $e3 + d$. Then compute the cosine similarity scores between this target embedding and all the word embeddings in the vocabulary, and find the top *n* candidate words using `torch.topk()`.

**Note** that you should NOT use the `nearest_words()` function, because the target embedding does not belong to a specific word. Similar internal code can still be adopted, including the detail of using `n+1` in `topk()`: even though the target embedding differs from $e3$ by the offset $d$, its closest neighbour is still very likely to be w3, so the first result should be skipped.
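The procedure above can be sketched in plain Python on a made-up vocabulary. The vectors below are invented so that *king* remains the nearest neighbour of the target embedding, which is the situation the `n+1` detail assumes; real embeddings will not be this tidy. The names `toy`, `cosine_plain`, and `word_analogy_plain` are hypothetical, and `heapq.nlargest` stands in for `torch.topk()`.

```python
import heapq
import math

def cosine_plain(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

# Made-up 3-d "embeddings", for illustration only.
toy = {
    'man':   [1.0, 0.0, 0.0],
    'woman': [1.0, 0.3, 0.0],
    'king':  [1.0, 0.0, 3.0],
    'queen': [0.5, 1.0, 3.0],
}

def word_analogy_plain(vectors, w1, w2, w3, n=1):
    e1, e2, e3 = vectors[w1], vectors[w2], vectors[w3]
    # Target embedding: e3 + d, where d = e2 - e1.
    target = [c + (b - a) for a, b, c in zip(e1, e2, e3)]
    scored = [(cosine_plain(target, vec), word) for word, vec in vectors.items()]
    # n+1 results, then skip the first one, which is expected to be w3.
    top = heapq.nlargest(n + 1, scored)
    return [word for _, word in top[1:]]

print(word_analogy_plain(toy, 'man', 'woman', 'king', n=1))  # ['queen']
```

Skipping the first result is a heuristic: it relies on w3 actually ranking first, which holds for these toy vectors but is not guaranteed in general.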

In [71]:
def word_analogy(embeddings, w1, w2, w3, n=5):
    ### START YOUR CODE ###
    e1 = embeddings[w1]
    e2 = embeddings[w2]
    e3 = embeddings[w3]
    d = e2-e1
    target_emb = e3+d

    # cosine similarity
    scores = torch.cosine_similarity(target_emb.unsqueeze(0), embeddings.vectors, dim=1)  # explicit (1, emb) vs (vocab, emb) broadcast
    values, indices = torch.topk(scores, n+1)
    ### END YOUR CODE ###

    for val, idx in zip(values[1:], indices[1:]):
        print('{}: {:.4f}'.format(embeddings.itos[idx], val.item()))

In [72]:
# Do not change the test code here
word_analogy(glove, 'man', 'woman', 'king', n=1)

word_analogy(fasttext, 'man', 'woman', 'king', n=1)

word_analogy(fasttext, 'boy', 'girl', 'dad', n=1)

word_analogy(fasttext, 'future', 'tomorrow', 'past', n=1)

queen: 0.8610
queen: 0.6803
mom: 0.6897
tomorrow,: 0.6179


**Expected output**

queen: 0.8610\
queen: 0.6803\
mom: 0.6897\
tomorrow,: 0.6179