## Assignment: week 4

#### Task

With the pretrained GloVe embeddings, find the word vectors for the three words "man", "woman", and "king". With these, calculate the vector obtained from the expression

vec("woman") - vec("man") + vec("king")

and find the nearest vector(s) to it, using cosine similarity as the similarity measure. You can use the code in the weekly material as a starting point.

Can you explain your result?

#### Use Torch for Faster Calculations

In [14]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#### Download Data

In [15]:
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("anmolkumar/glove-embeddings")

print("Path to dataset files:", path)

glove_file = os.path.join(path, 'glove.6B.100d.txt')

Path to dataset files: C:\Users\M_Hin\.cache\kagglehub\datasets\anmolkumar\glove-embeddings\versions\1


In [16]:
import numpy as np

embeddings = {}

with open(glove_file, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype='float32')
        embeddings[word] = vector
        
print(f'Loaded {len(embeddings)} word vectors')

Loaded 400000 word vectors


#### Model

In [17]:
# Convert every embedding to a torch tensor on the selected device
embeddings_torch = {word: torch.tensor(vec, device=device) for word, vec in embeddings.items()}

# Get the vectors for the words
man_vector = embeddings_torch.get('man')
woman_vector = embeddings_torch.get('woman')
king_vector = embeddings_torch.get('king')

if man_vector is None or woman_vector is None or king_vector is None:
    print("Warning: One or more words not found in the embeddings")
else:
    result_vector = woman_vector - man_vector + king_vector
    
    all_words = list(embeddings_torch.keys())
    all_vectors = torch.stack([embeddings_torch[word] for word in all_words])
    
    # Compute cosine similarity
    result_vector_norm = result_vector / result_vector.norm()
    all_vectors_norm = all_vectors / all_vectors.norm(dim=1, keepdim=True)
    similarities = torch.matmul(all_vectors_norm, result_vector_norm)

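    # Top 10 most similar words, skipping the query words themselves
    # (the raw top hit is usually an input word such as "king")
    query_words = {'man', 'woman', 'king'}
    topk = torch.topk(similarities, 10 + len(query_words))
    hits = [(all_words[i], s) for i, s in zip(topk.indices.tolist(), topk.values.tolist())
            if all_words[i] not in query_words]
    for word, score in hits[:10]:
        print(f"{word}: {score:.4f}")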

queen: 0.7834
monarch: 0.6934
throne: 0.6833
daughter: 0.6809
prince: 0.6713
princess: 0.6644
mother: 0.6579
elizabeth: 0.6563
father: 0.6392
wife: 0.6352
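
The nearest-neighbour lookup can also be wrapped in a reusable function. The following is a minimal sketch in NumPy, not the notebook's own code: the function name `nearest_words`, its parameters, and the 2-d toy vectors are assumptions made up for illustration. It normalises all vectors to unit length so that a dot product equals the cosine similarity, then returns the best matches while skipping any excluded words.

```python
import numpy as np

def nearest_words(query, words, matrix, exclude=(), k=10):
    """Return up to k (word, cosine similarity) pairs closest to `query`."""
    # Unit-length rows and query: their dot product is the cosine similarity.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    sims = unit_rows @ unit_query

    results = []
    # Walk through the indices from most to least similar.
    for i in np.argsort(-sims):
        if words[i] in exclude:
            continue
        results.append((words[i], float(sims[i])))
        if len(results) == k:
            break
    return results

# Hypothetical toy vectors: axis 0 stands for "royalty", axis 1 for "gender".
toy = {
    "man":   np.array([0.1, -1.0]),
    "woman": np.array([0.1,  1.0]),
    "king":  np.array([1.0, -1.0]),
    "queen": np.array([1.0,  1.0]),
    "apple": np.array([-1.0, 0.1]),
}
toy_words = list(toy)
toy_matrix = np.stack([toy[w] for w in toy_words])
toy_query = toy["woman"] - toy["man"] + toy["king"]

toy_top = nearest_words(toy_query, toy_words, toy_matrix,
                        exclude={"man", "woman", "king"}, k=2)
print(toy_top)
```

With the real data, the same function could be called with `list(embeddings)` as `words` and the stacked GloVe vectors as `matrix`. Excluding the query words by name avoids relying on their position in the ranking.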


#### Explanation

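GloVe learns word vectors from word co-occurrence statistics, so words that appear in similar contexts end up close together, and some relationships between words show up as roughly constant vector offsets.

When we calculate **vec("woman") - vec("man") + vec("king")**, the nearest word in the list above is **"queen"**, with a cosine similarity of 0.7834. The difference vec("woman") - vec("man") approximates a "male to female" direction. Adding it to vec("king") keeps the royalty component and shifts the gender component, which corresponds to the analogy "man" is to "woman" as "king" is to "queen". The remaining neighbours fit this picture: they are either royalty terms (monarch, throne, prince, princess, elizabeth) or family terms (daughter, mother, father, wife). The match is approximate rather than exact, since the similarity is well below 1.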
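
One way to probe this explanation is to compare the two offsets directly: if the analogy holds, vec("woman") - vec("man") and vec("queen") - vec("king") should point in a similar direction. The sketch below is an illustration under stated assumptions, not part of the assignment code: `cosine`, `offset_similarity`, and the toy vectors are hypothetical names and values.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def offset_similarity(emb, a, b, c, d):
    """Cosine similarity between the offsets emb[b] - emb[a] and emb[d] - emb[c]."""
    return cosine(emb[b] - emb[a], emb[d] - emb[c])

# Hypothetical toy vectors; "queen" is deliberately slightly off the ideal position.
toy_emb = {
    "man":   np.array([0.1, -1.0]),
    "woman": np.array([0.1,  1.0]),
    "king":  np.array([1.0, -1.0]),
    "queen": np.array([1.1,  0.9]),
}

toy_score = offset_similarity(toy_emb, "man", "woman", "king", "queen")
print(f"offset similarity: {toy_score:.4f}")
```

With the loaded GloVe vectors, the call would be `offset_similarity(embeddings, "man", "woman", "king", "queen")`. A value close to 1 would support the claim that both word pairs share a common direction.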