# Word Embedding Evaluation: Analogy Task

## Objective
In this notebook, we evaluate word embeddings by solving an analogy task. The goal is to find a word `d` that satisfies the analogy "a : b = c : d," using the vector representations (embeddings) of the words.

## Methodology
We follow the approach proposed by [Mikolov et al. (2013)](https://arxiv.org/pdf/1301.3781), which uses simple vector arithmetic to solve word analogies. Given the pre-trained embeddings of three words `(a, b, c)`, we seek the word `d` whose embedding maximizes the following expression:

$$
d = \arg \max_{w \in V \setminus \{a, b, c\}} \cos\left(\mathbf{v}_w, \mathbf{v}_b - \mathbf{v}_a + \mathbf{v}_c \right)
$$

where:
- $\mathbf{v}_a, \mathbf{v}_b, \mathbf{v}_c, \mathbf{v}_d$ are the embeddings (vector representations) of the words `a`, `b`, `c`, and `d`.
- The set `V` is the vocabulary; the words `a`, `b`, and `c` are excluded from consideration.
- The cosine similarity $\cos(\cdot, \cdot)$ measures the closeness between word vectors.




We aim to find the word `d` whose vector embedding satisfies this relationship as closely as possible.
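
To make the arithmetic concrete, here is a minimal sketch on invented 2-D vectors. The `toy` dictionary and the helper names `cos` and `solve` are assumptions made for illustration; they are not part of the notebook or of any real model.

```python
import numpy as np

# Toy 2-D embeddings, invented purely for illustration (not from a real model).
toy = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.2]),
    "queen": np.array([3.0, 1.2]),
    "apple": np.array([-1.0, 0.5]),
}

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve(a, b, c, emb):
    """Return the word w (excluding a, b, c) whose vector is closest to v_b - v_a + v_c."""
    target = emb[b] - emb[a] + emb[c]
    candidates = (w for w in emb if w not in {a, b, c})
    return max(candidates, key=lambda w: cos(emb[w], target))

print(solve("man", "woman", "king", toy))  # prints "queen"
```

In this toy space the offset from `man` to `woman` is identical to the offset from `king` to `queen`, so the target vector lands exactly on `queen`. Real embeddings only satisfy such relations approximately.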

## Code

### Import necessary libraries

In [2]:
import gensim
import os
import numpy as np
from typing import Dict, List, Tuple
import re
import random
from scipy.spatial.distance import cosine


### Loading the pre-trained models

In this notebook, we will be using pre-trained embeddings for our analogy task. 

#### 1. Pretrained Word2Vec 

Google published pre-trained vectors trained on part of the Google News dataset (about 100 billion words in the corpus). For this project, we will be using the 300-dimensional embedding model for evaluation. 

Model Configuration: 
- ~100B-word corpus
- 3M words vocab 
- 300 dims vector
- 10 ~ context window
- used negative sampling for training

Article can be found [here](https://code.google.com/archive/p/word2vec/)

Model can be downloaded [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g)

In [3]:
# Path to the binary Word2Vec model
current_directory = os.getcwd()
dataset_folder = os.path.join(current_directory, 'dataset/')
model_path = dataset_folder + 'GoogleNews-vectors-negative300.bin'

# Load the model
word2vec_word_embeddings = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)
words = word2vec_word_embeddings.index_to_key
word2vec_word_embeddings_dict = {}
for word in words:
    word2vec_word_embeddings_dict[word] = word2vec_word_embeddings[word]
print("Embeddings loaded successfully!")

Embeddings loaded successfully!


#### 2. Pretrained GloVe Model

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Model Configuration: 
- Trained on Wikipedia 2014 + Gigaword 5
- 6B tokens
- 400K vocab
- uncased
- 50d, 100d, 200d, & 300d vectors

Article link and model can be downloaded [here](https://nlp.stanford.edu/projects/glove/)

In [4]:

def get_glove_embeddings(path: str) -> Dict[str, np.ndarray]:
    """Parse a GloVe .txt file into a dict mapping each word to its embedding vector"""
    word_embeddings_dict: Dict[str, np.ndarray] = {}
    # GloVe files are UTF-8 encoded; being explicit avoids platform-dependent decoding errors
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            individual_dims = line.strip().split()
            word = individual_dims[0]
            word_embedding = np.array(individual_dims[1:], dtype=np.float32)
            word_embeddings_dict[word] = word_embedding

    return word_embeddings_dict

# GloVe requires some preprocessing as the embeddings come in a .txt file
model_path = dataset_folder + 'glove/glove.6B.50d.txt'
glove_word_embeddings_dict = get_glove_embeddings(model_path)
print("Embeddings loaded successfully!")

Embeddings loaded successfully!


In this notebook, we work with only a subset (a few groups) of Mikolov's analogy dataset.

Dataset can be found [here](https://www.fit.vut.cz/person/imikolov/public/rnnlm/word-test.v1.txt)


In [5]:
def preprocess_dataset(path: str):
    processed_dataset_dict: Dict[str, List[Tuple]] = {}
    # Goal is to pre-process into a dict of valid groups {'group name': [(a, b, c, d), ....]}
    valid_groups = ['capital-world', 'currency', 'city-in-state', 
                    'family', 'gram1-adjective-to-adverb', 'gram2-opposite', 
                    'gram3-comparative', 'gram6-nationality-adjective']
    
    group_name = ''
    with open(path, 'r') as file:
        for line in file:
            line = line.strip()
            if line.startswith('//'): 
                continue # Ignore the first line

            if line.startswith(': '): # check if line begins with ':'
                match = re.match(r':\s*(\S+)', line)  # Match a group name after ': ' using RegEx
                group_name = match.group(1)
                if group_name in valid_groups:
                    processed_dataset_dict[group_name] = []
                    continue
                else: 
                    group_name = ''

            if group_name in processed_dataset_dict:
                words = line.split()
                if len(words) != 4:
                    continue # Skip blank or malformed lines

                # Arrange to expected format (a, b, c, d)
                processed_dataset_dict[group_name].append(tuple(words)) 
            
    return processed_dataset_dict

dataset_path = dataset_folder + 'word-test.v1.txt'
dataset = preprocess_dataset(dataset_path)
print('Dataset preprocessed successfully!')

Dataset preprocessed successfully!
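
For readers who have not seen the file, the layout that `preprocess_dataset` expects can be illustrated with a small in-memory sample. This is a sketch only: the `sample` text is hand-written in the same layout (a `: group` header followed by four-word question lines) and the variable names are assumptions, not part of the notebook.

```python
import io
from collections import defaultdict

# A tiny hand-written sample in the same layout as word-test.v1.txt:
# a ': <group>' header line followed by four-word question lines.
sample = io.StringIO(
    ": family\n"
    "boy girl brother sister\n"
    "king queen man woman\n"
    ": currency\n"
    "Japan yen Russia ruble\n"
)

groups = defaultdict(list)
current = None
for raw in sample:
    line = raw.strip()
    if not line or line.startswith("//"):
        continue                      # skip blank lines and comment lines
    if line.startswith(":"):
        current = line[1:].strip()    # a header starts a new group
        continue
    groups[current].append(tuple(line.split()))

print(dict(groups))
```

Each group maps to a list of `(a, b, c, d)` tuples, which is the structure the evaluation code below iterates over.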


## Evaluation

Let's write the functions that solve an analogy using cosine similarity and measure the accuracy of the pre-trained embeddings on Mikolov's analogy dataset.

In [9]:
def compute_analogy(a, b, c, embeddings):
    """Computes the best fit word `d` in (a, b, c, d) using cosine similarity"""
    if a not in embeddings or b not in embeddings or c not in embeddings:
        return None
    
    a_vec = embeddings[a]
    b_vec = embeddings[b]
    c_vec = embeddings[c]
    
    result_vec = b_vec - a_vec + c_vec
    
    # Find the closest word, i.e. the word with the smallest cosine distance
    # (scipy's `cosine` returns a distance: 1 - cosine similarity)
    best_distance = float('inf')
    best_word = None
    
    for word, vec in embeddings.items():
        if word in {a, b, c}: 
            continue

        distance = cosine(result_vec, vec)
        if distance < best_distance:
            best_distance = distance
            best_word = word
    
    # Guard against an empty candidate set before lowercasing
    return best_word.lower() if best_word is not None else None
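
The Python loop above visits every vocabulary entry for every question, which is the main reason the evaluation is slow. One possible alternative, sketched here under assumptions, is to stack the vectors into a matrix once and score all candidates with a single matrix product. The names `build_index` and `compute_analogy_fast` are hypothetical and not part of the notebook.

```python
import numpy as np

def build_index(embeddings):
    """Stack a {word: vector} dict into a matrix plus row norms and a lookup table."""
    words = list(embeddings)
    matrix = np.stack([np.asarray(embeddings[w], dtype=np.float32) for w in words])
    # Row norms, clamped so an all-zero vector cannot cause a division by zero
    norms = np.maximum(np.linalg.norm(matrix, axis=1), 1e-12)
    word_to_row = {w: i for i, w in enumerate(words)}
    return words, matrix, norms, word_to_row

def compute_analogy_fast(a, b, c, words, matrix, norms, word_to_row):
    """Vectorised analogy lookup: argmax over w of cos(v_w, v_b - v_a + v_c)."""
    if any(w not in word_to_row for w in (a, b, c)):
        return None
    ia, ib, ic = (word_to_row[w] for w in (a, b, c))
    target = matrix[ib] - matrix[ia] + matrix[ic]
    # Dividing by the target norm is unnecessary: it is the same for every candidate
    scores = (matrix @ target) / norms
    scores[[ia, ib, ic]] = -np.inf      # never return one of the question words
    return words[int(np.argmax(scores))]

# Toy vectors, invented for illustration only.
toy = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.2], "queen": [3.0, 1.2], "apple": [-1.0, 0.5],
}
index = build_index(toy)
print(compute_analogy_fast("man", "woman", "king", *index))  # prints "queen"
```

The index is built once per model and reused for every question. For the full Google News vocabulary (3M words, 300 dimensions, float32) the matrix alone needs roughly 3.6 GB of memory, so this trades memory for speed.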

In [10]:
def measure_accuracy(embeddings, dataset, rand = True):
    """Computes the accuracy of the embeddings passed w.r.t the dataset questions
    Also, prints the accuracy obtained on individual groups in the dataset.
    """
    total_correct = 0
    cumulative_total = 0
    for group, quads in dataset.items():
        correct = 0
        total = 0
        # Sample at most 25 questions so that groups with fewer questions do not raise a ValueError
        sampled = random.sample(quads, min(25, len(quads))) if rand else quads
        for (a, b, c, d) in sampled:
            word_pred = compute_analogy(a.lower(), b.lower(), c.lower(), embeddings)
            
            if word_pred == d.lower():
                correct += 1
                total_correct += 1
            
            total += 1
            cumulative_total += 1

        # Accuracy is calculated by comparing `d` with the most similar word in the embedding space,
        # i.e. performing a similarity check over every single embedding available in the pre-trained embeddings.
        print(f'Accuracy on group {group}: {correct / total}')

    return total_correct / cumulative_total

**NOTE**

Computing the accuracy on every analogy question in every group of the dataset is very computationally expensive (up to ~5000 questions per group), so I pick 25 random analogy questions from each group. 

PS: this required an hour of computation
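
Because the 25 questions are drawn at random on each call, two runs (or the two models) may be scored on different questions. A minimal sketch of seeded sampling, assuming a hypothetical helper `sample_questions` that is not part of the notebook:

```python
import random

def sample_questions(quads, k=25, seed=0):
    """Draw up to `k` questions reproducibly; `seed` fixes the selection."""
    rng = random.Random(seed)   # local generator, leaves the global random state untouched
    return rng.sample(quads, min(k, len(quads)))

# Toy questions for illustration only.
toy_quads = [("a", "b", "c", str(i)) for i in range(100)]
first = sample_questions(toy_quads)
second = sample_questions(toy_quads)
print(first == second)  # prints True
```

Sampling once per group and passing the same questions to both models would make their accuracies directly comparable.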



#### Comparing accuracies on both sets of pretrained embeddings

In [11]:
print('------------- Evaluating the Word2Vec Model --------------\n')
word2vec_acc = measure_accuracy(word2vec_word_embeddings_dict, dataset)
print(f'Overall accuracy across all groups: {word2vec_acc}\n\n')


print('-------------- Evaluating the GloVe Model ----------------')
glove_acc = measure_accuracy(glove_word_embeddings_dict, dataset)
print(f'Overall accuracy across all groups: {glove_acc}')

------------- Evaluating the Word2Vec Model --------------

Accuracy on group capital-world: 0.04
Accuracy on group currency: 0.08
Accuracy on group city-in-state: 0.04
Accuracy on group family: 0.88
Accuracy on group gram1-adjective-to-adverb: 0.24
Accuracy on group gram2-opposite: 0.2
Accuracy on group gram3-comparative: 0.96
Accuracy on group gram6-nationality-adjective: 0.28
Overall accuracy across all groups: 0.34


-------------- Evaluating the GloVe Model ----------------
Accuracy on group capital-world: 0.68
Accuracy on group currency: 0.12
Accuracy on group city-in-state: 0.16
Accuracy on group family: 0.68
Accuracy on group gram1-adjective-to-adverb: 0.0
Accuracy on group gram2-opposite: 0.12
Accuracy on group gram3-comparative: 0.44
Accuracy on group gram6-nationality-adjective: 0.84
Overall accuracy across all groups: 0.38


**Observation**

Since we sample only 25 random questions from each group, we cannot reliably say which set of embeddings is better for this dataset. Measuring the accuracy on the entire dataset would give a better idea of how the pre-trained embeddings perform in our use case.

Overall, each set of pre-trained embeddings does well on a few groups (Word2Vec on `family` and `gram3-comparative`, GloVe on `capital-world` and `gram6-nationality-adjective`) and poorly on others. One likely contributor to Word2Vec's low scores on proper-noun groups such as `capital-world` is that `measure_accuracy` lowercases every query word, while the Google News vocabulary is case-sensitive and the GloVe vocabulary is uncased. The uneven results also suggest that training embeddings for a specific use case could yield higher-quality output, at the cost of additional computation. 
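
To put a rough number on how noisy a 25-question estimate is, the sketch below computes a normal-approximation 95% interval for an accuracy. The helper `accuracy_interval` is an assumption introduced for illustration and is not part of the notebook.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation 95% interval for an accuracy estimated from `total` questions."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# 17 of 25 correct corresponds to the 0.68 reported for GloVe on capital-world.
low, high = accuracy_interval(17, 25)
print(f"{low:.2f} to {high:.2f}")
```

With 25 questions the interval spans roughly 18 percentage points on either side of 0.68, so differences of that size between the two models are within sampling noise. The normal approximation is crude for accuracies near 0 or 1; a Wilson interval would be a more robust choice there.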

### Embeddings and Antonyms

A major problem with word embeddings is that antonyms (words with meanings considered to be opposites) often have SIMILAR embeddings. 

Let's verify this in practice!

In [12]:
sample_words = ['increase', 'exit', 'simplify', 'clean']

# NOTE: gensim's `most_similar()` computes cosine similarity between a simple mean of the 
# projection weight vectors of the given keys and the vectors for each key in the model.
# That method corresponds to the `word-analogy` and `distance` scripts in the original
# word2vec implementation. The function below is a simplified stand-in for a single
# query word, built on scipy's cosine distance.

def compute_most_similar(embeddings, term, topn = 10):
    word_score_dict = {}
    a_vec = embeddings[term]
    
    for word, vec in embeddings.items():
        similarity = cosine(vec, a_vec)
        word_score_dict[word] = similarity
    
    # Sort in ascending order and keep the lowest scores, because `cosine` returns a cosine distance here.
    
    # Cosine distance is 1 minus the cosine similarity of the two vectors: a value closer to 0
    # indicates greater similarity and a value closer to 2 indicates greater dissimilarity.

    sorted_word_scores = sorted(word_score_dict.items(), key=lambda item: item[1])
    most_similar_words = sorted_word_scores[:topn]
    
    return most_similar_words

for word in sample_words:
    similar = compute_most_similar(word2vec_word_embeddings_dict, word, topn=10)
    print(f"Words similar to '{word}':\n{similar}\n")

Words similar to 'increase':
[('increase', 0.0), ('decrease', 0.16296813227921048), ('increases', 0.22906224667508612), ('increased', 0.2421957607942684), ('reduction', 0.30917797221922527), ('increasing', 0.3128383733584331), ('decreases', 0.3183826885550045), ('rise', 0.36473526109661814), ('decreasing', 0.37813741818797175), ('decline', 0.38713580180755414)]

Words similar to 'exit':
[('exit', 0.0), ('exits', 0.3061424059722557), ('exiting', 0.35319962207434497), ('Rockhouse_stumbled_toward', 0.4604811824901355), ('Exit', 0.46161186317846514), ('departure', 0.49619647976589976), ('Exiting', 0.49877576433953585), ('entrance', 0.5159069269023668), ('Inya_Lake_Hotel', 0.5164870685719852), ('exited', 0.5175610373713373)]

Words similar to 'simplify':
[('simplify', 0.0), ('streamline', 0.2007646530841164), ('simplifying', 0.20776002332833066), ('simplified', 0.2956007112766764), ('Simplifying', 0.3483464454665547), ('simplifies', 0.3527117110292194), ('automate', 0.3580227096056031), ('s
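
The neighbour lists above report cosine distances and always begin with the query word itself at distance 0.0. A small sketch of post-processing such scores, assuming a hypothetical helper `nearest` (not part of the notebook) that drops the query word and converts distances to similarities:

```python
import heapq

def nearest(distances, term, topn=3):
    """Return the `topn` closest words to `term` as (word, cosine similarity) pairs."""
    others = ((w, d) for w, d in distances.items() if w != term)
    closest = heapq.nsmallest(topn, others, key=lambda item: item[1])
    return [(w, 1.0 - d) for w, d in closest]   # similarity = 1 - distance

# Distances rounded from the 'increase' output above.
toy_distances = {"increase": 0.0, "decrease": 0.163, "increases": 0.229,
                 "increased": 0.242, "reduction": 0.309}
print(nearest(toy_distances, "increase", topn=2))
```

Here the antonym `decrease` comes out as the closest neighbour of `increase`, ahead of its own inflections.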

**Why does this happen?**

If we look at how embeddings are trained with approaches such as Word2Vec and GloVe, the main idea is the same: a word's embedding is constructed from the contexts in which the word appears. Because a word and its antonym often appear in nearly identical contexts, the model can end up treating them as semantically similar. 

For example, let's consider the below two sentences:
- 'I love Tom and his family'
- 'I hate Tom and his family'

In these two sentences, the only difference is the word love <-> hate; everything around it stays the same. As a result, the model constructs similar embeddings for a word and its antonym.
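
The effect can be seen by counting context words directly. The sketch below is a simplification: it only counts neighbours within a fixed window, which is the raw signal that context-based training methods build on, and the helper `context_counts` is an assumption introduced for illustration.

```python
from collections import Counter

sentences = [
    "I love Tom and his family",
    "I hate Tom and his family",
]

def context_counts(target, corpus, window=2):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts

print(context_counts("love", sentences))
print(context_counts("hate", sentences))
```

Both calls produce the same counts (`i`, `tom`, and `and` once each), so a model that sees only these contexts has no signal to separate the two words.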

Some resources to tackle this issue:
- [Paper](https://arxiv.org/pdf/2004.12835#:~:text=Modern%20word%20embeddings%2C%20such%20as,antonyms.%20%5B)
- [Paper](https://link.springer.com/chapter/10.1007/978-3-319-99501-4_6)

### Alternate Analogy Tests

Let's try coming up with our own analogies. 

**1. New Analogy Test 1: capital-currency**

Here, we follow an approach similar to Mikolov's analogy dataset and relate the capital of a country to that country's currency. In other words, we test whether the embeddings capture that a capital belongs to a particular country and whether they can connect it to that country's currency.

Example:

The capital of India is Delhi and its currency is the rupee, so a question takes the form:

('Delhi', 'rupee', 'Abu Dhabi', 'dirhams')


**2. New Analogy Test 2: sport-type_of_sport**

Here, we relate a sport to whether it is a team sport or a solo sport. We want to evaluate whether the embeddings not only identify a sport but also capture how it is played. 

Example: 

Cricket is a team sport, so a question takes the form: 

('cricket', 'team', 'swimming', 'solo')
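
Questions of this shape can also be generated from a lookup table instead of being written by hand. The sketch below assumes a hypothetical `capital_currency` table and helper `build_questions`; neither is part of the notebook, and the entries are illustrative rather than exhaustive.

```python
from itertools import permutations

# Hypothetical lookup table; the entries are illustrative, not an exhaustive dataset.
capital_currency = {
    "Delhi": "rupee",
    "Tokyo": "yen",
    "Moscow": "ruble",
    "Bangkok": "baht",
}

def build_questions(pairs):
    """Combine every ordered pair of entries into (capital1, currency1, capital2, currency2)."""
    return [(a, pairs[a], c, pairs[c]) for a, c in permutations(pairs, 2)]

questions_sketch = build_questions(capital_currency)
print(len(questions_sketch))   # prints 12
print(questions_sketch[0])     # prints ('Delhi', 'rupee', 'Tokyo', 'yen')
```

Generating questions this way keeps the expected answers consistent with a single source table and yields more questions per group than a handful of hand-written tuples.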


In [13]:
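questions = {
    'capital-currency': [
        # NOTE: Iran's currency is the rial; 'riel' is the Cambodian currency,
        # so this question cannot be answered correctly as written
        ('Delhi', 'rupee', 'Tehran', 'riel'),
        ('Bangkok', 'baht', 'Luanda', 'kwanza'),
        ('Moscow', 'ruble', 'Tokyo', 'yen'),
    ],
    'sport-type_of_sport': [
        ('cricket', 'team', 'tennis', 'solo'),
        ('football', 'group', 'badminton', 'single'),
        ('baseball', 'team', 'pickleball', 'duo'),
        # NOTE: the expected answer 'solo' equals `b`, which compute_analogy excludes from its candidates
        ('racing', 'solo', 'athletics', 'solo'),
    ],
}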

In [14]:
print('------------- Evaluating the Word2Vec Model --------------\n')
word2vec_acc = measure_accuracy(word2vec_word_embeddings_dict, questions, rand=False)
print(f'Overall accuracy from both groups: {word2vec_acc}\n\n')


print('-------------- Evaluating the GloVe Model ----------------')
glove_acc = measure_accuracy(glove_word_embeddings_dict, questions, rand=False)
print(f'Overall accuracy from both groups: {glove_acc}')

------------- Evaluating the Word2Vec Model --------------

Accuracy on group capital-currency: 0.0
Accuracy on group sport-type_of_sport: 0.0
Overall accuracy from both groups: 0.0


-------------- Evaluating the GloVe Model ----------------
Accuracy on group capital-currency: 0.0
Accuracy on group sport-type_of_sport: 0.0
Overall accuracy from both groups: 0.0


**Observation**

Neither set of pre-trained embeddings works well on the two analogy tasks above. This suggests that the embeddings fail to capture indirect relationships between words: even if an embedding encodes which country a capital belongs to, the vector arithmetic does not carry that over to the country's currency. With only seven hand-written questions, however, this is an indication rather than a firm conclusion. 