## Assignment: week 4

#### Task

With the pretrained GloVe embeddings, find the word vectors for the three words "man", "woman", and "king". With these, calculate the vector obtained from the expression

vec("woman") - vec("man") + vec("king")

and find the nearest vector(s) to it, using cosine similarity as the similarity measure. You can use the code in the weekly material as the starting point.

Can you explain your result?
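
As a quick reminder of the measure the task asks for, here is a minimal sketch of cosine similarity in NumPy. The helper name `cosine_sim` and the toy 2-D vectors are illustrative assumptions, not part of the weekly material.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b (assumes neither is all zeros)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D vectors (illustrative values, not GloVe vectors)
u = np.array([1.0, 0.0])
v = np.array([2.0, 0.0])  # same direction as u, different length
w = np.array([0.0, 3.0])  # orthogonal to u

print(cosine_sim(u, v))  # same direction: similarity is 1
print(cosine_sim(u, w))  # orthogonal: similarity is 0
```

Because the measure depends only on direction, vector length does not affect the ranking of nearest words.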

#### Download dataset

In [1]:
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("anmolkumar/glove-embeddings")

print("Path to dataset files:", path)

glove_file = os.path.join(path, 'glove.6B.100d.txt')



Path to dataset files: C:\Users\M_Hin\.cache\kagglehub\datasets\anmolkumar\glove-embeddings\versions\1


In [2]:
import numpy as np

embeddings = {}

with open(glove_file, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype='float32')
        embeddings[word] = vector
        
print(f'Loaded {len(embeddings)} word vectors')

Loaded 400000 word vectors
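
The loading loop relies on the GloVe text layout, where each line holds a word followed by its space-separated vector components. A minimal sketch of that parsing step on an in-memory sample follows; the helper `load_glove` and the made-up 3-dimensional sample lines are assumptions for illustration, not the real file contents.

```python
import io
import numpy as np

def load_glove(handle):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict of float32 arrays."""
    table = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return table

# Made-up 3-dimensional sample standing in for glove.6B.100d.txt
sample = io.StringIO("king 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
toy = load_glove(sample)
print(len(toy), toy['king'].shape)
```

With the real file, checking that every vector has the same length (100 for `glove.6B.100d.txt`) is a cheap way to catch parsing problems early.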


In [12]:
from sklearn.metrics.pairwise import cosine_similarity

# Get the vectors for the words
man_vector = embeddings.get('man')
woman_vector = embeddings.get('woman')
king_vector = embeddings.get('king')

if man_vector is None or woman_vector is None or king_vector is None:
    print("Warning: One or more words not found in the embeddings")
else:
    # Calculate the vector: woman - man + king
    result_vector = woman_vector - man_vector + king_vector
    
    # Find the nearest word(s) using cosine similarity
    similarities = {}
    for word, vector in embeddings.items():
        # Calculate cosine similarity between result_vector and each word's vector
        # We reshape to get the right dimensions for cosine_similarity
        sim = cosine_similarity(result_vector.reshape(1, -1), vector.reshape(1, -1))[0][0]
        similarities[word] = sim
    
    # Sort by similarity
    sorted_words = sorted(similarities.items(), key=lambda x: x[1], reverse=True)
    
    print("Top 10 words closest to (woman - man + king):")
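    # Skip index 0: the top hit is expected to be one of the input words
    # (typically 'king' itself), so it is not a new answer
    for word, sim in sorted_words[1:10+1]: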
        print(f"{word}: {sim:.4f}")

Top 10 words closest to (woman - man + king):
queen: 0.7834
monarch: 0.6934
throne: 0.6833
daughter: 0.6809
prince: 0.6713
princess: 0.6644
mother: 0.6579
elizabeth: 0.6563
father: 0.6392
wife: 0.6352
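
Calling `cosine_similarity` once per word is slow for a 400000-word vocabulary. A possible alternative, sketched here on a toy table, stacks all vectors into one matrix and computes every similarity with a single matrix product. The function `nearest_words`, its parameters, and the hand-made 3-D vectors are assumptions for illustration; the sketch also assumes no vector is all zeros.

```python
import numpy as np

def nearest_words(query, table, exclude=(), top_n=3):
    """Rank words in `table` by cosine similarity to `query`, skipping `exclude`."""
    words = [w for w in table if w not in exclude]
    matrix = np.stack([table[w] for w in words])
    # Normalise rows and the query so a dot product equals cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    sims = matrix_unit @ query_unit
    order = np.argsort(-sims)[:top_n]  # indices of the highest similarities
    return [(words[i], float(sims[i])) for i in order]

# Hand-made toy vectors (not GloVe values), chosen so the analogy works exactly
toy = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, 0.0, -1.0]),
}
query = toy["woman"] - toy["man"] + toy["king"]
result = nearest_words(query, toy, exclude={"man", "woman", "king"}, top_n=2)
for word, sim in result:
    print(f"{word}: {sim:.4f}")
```

Excluding the input words by name makes the intent explicit, instead of relying on the first sorted entry being an input word.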


In [None]:

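def find_similar_words(target_word, top_n=10):
    """Return the top_n (word, similarity) pairs closest to target_word."""
    # Use .get() so a missing word yields None instead of raising KeyError
    target_vector = embeddings.get(target_word)

    if target_vector is None:
        print(f'Word "{target_word}" not in vocabulary')
        return []

    similarities = {}
    for word, vector in embeddings.items():
        if word == target_word:
            continue  # the target trivially matches itself
        # cosine_similarity returns a 1x1 array; [0][0] extracts the scalar
        sim = cosine_similarity(target_vector.reshape(1, -1), vector.reshape(1, -1))[0][0]
        similarities[word] = sim

    sorted_words = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

    # Return a list of the top_n best matches, not a single entry
    return sorted_words[:top_n]

target = "code"
similar_words = find_similar_words(target)
print(f"Words similar to {target}:")
for word, sim in similar_words:
    print(f'{word}: {sim:.4f}')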

In [16]:
# Requires the previous cell to have been run; otherwise similar_words is undefined
print(similar_words)

NameError: name 'similar_words' is not defined