# Analogy function


Let us implement our own function to perform the analogy task. We will use the same distance metric as in  Mikolov, Tomas, et al. "Efficient Estimation of Word Representations in Vector Space" Advances in neural information processing systems. 2013. With this function, we want to be able to answer whether an analogy like: "a king is to a queen as a man is to a woman" (e_king - e_queen + e_man is e_woman) is true. In a perfect scenario, we would like that this analogy (e_king - e_queen + e_man) results in the embedding of the word "man". However, it does not always result in exactly the same word embedding. In this context, we will call "man" the true or the actual word. We want to find the word W in the vocabulary, where the embedding of W i.e e_W is the closest to the predicted embedding (i.e., the result of the formula). Then, we can check if W is the same word as the true word.

In [3]:
import numpy as np
import pandas as pd

from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec
from tensorflow.keras.models import load_model
from scipy.spatial.distance import cosine, cdist
import keras.backend as K

In [4]:
# Skipgram Model
skipgram = load_model('models/sgm.h5')

# CBOW Model
cbow = load_model('models/cbm.h5')

# Fasttext model
fasttext = Word2Vec.load('models/fasttext.model')

# LSA
df_svd = pd.read_csv("models/lsa.csv")
df_svd = df_svd.set_index("words")
vocab = list(df_svd.index)

# Glove
from gensim.models import KeyedVectors
glove = KeyedVectors.load('models/glove-wiki-gigaword-300.model')

In [6]:
df = pd.read_csv('bbc-text.csv')
articles = list(df['text'])

sentences = []

for i in articles[:200]:
    sentences += i.split('.')

# Remove sentences with fewer than 3 words
corpus = [sentence for sentence in sentences if sentence.count(" ") >= 12]

# Remove punctuation in text and fit tokenizer on entire corpus
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~\t\n'+"'")
tokenizer.fit_on_texts(corpus)

# Convert text to sequence of integer values
corpus = tokenizer.texts_to_sequences(corpus)
n_samples = sum(len(s) for s in corpus) # Total number of words in the corpus
V = len(tokenizer.word_index) + 1 # Total number of unique words in the corpus

In [7]:
def get_skipgram_embeddings():
    word_vectors = {}
    i=0
    with open("models/vectors_skipgram_300.txt", "r", encoding="utf-8") as file:
        for line in file:
            i+=1
            if i == 1:
                continue
            parts = line.strip().split()
            word = parts[0]
            vector = np.array([float(x) for x in parts[1:]])
            word_vectors[word] = vector
    return word_vectors


def get_cbow_embeddings():
    word_vectors = {}
    i=0
    with open("models/vectors_cbow_300.txt", "r", encoding="utf-8") as file:
        for line in file:
            i+=1
            if i == 1:
                continue
            parts = line.strip().split()
            word = parts[0]
            vector = np.array([float(x) for x in parts[1:]])
            word_vectors[word] = vector
    return word_vectors


def get_fasttext_embeddings():
    fasttext_model = Word2Vec.load('models/fasttext.model')
    word_vectors = fasttext_model.wv
    return word_vectors

def get_lsa_embeddings():
    word_vectors = {}
    i=0
    with open("models/vectors_lsa_300.txt", "r", encoding="utf-8") as file:
        for line in file:
            i+=1
            if i == 1:
                continue
            parts = line.strip().split()
            word = parts[0]
            vector = np.array([float(x) for x in parts[1:]])
            word_vectors[word] = vector
    return word_vectors
    

In [8]:
def get_k_nearest(k, target_word, word_vectors, model):
    if target_word not in word_vectors:
        print(f"'{input_word}' is not present in the vocabulary.")
        return -1

    # for fasttext
    if model == 'fasttext':
        embedding_vector = word_vectors[target_word]
        similar_words = fasttext_model.wv.similar_by_word(target_word)

        print(f"\nThe {k} nearest words to '{target_word}' are: ")
        nearest_words = [(word,e) for word, e in similar_words[:k]]
        for i in (nearest_words):
            print(i)
        return nearest_words        
        
    # for skipgram and cbow
    if model in ['skipgram', 'cbow', 'lsa']: 
        # Calculate cosine similarities with all words in the vocabulary
        similarities = {}
        target_vector = word_vectors[target_word]
        for word, vector in word_vectors.items():
            if word != target_word:
                cosine_sim = cosine_similarity([target_vector], [vector])
                similarities[word] = cosine_sim[0][0]

        # Sort the words by their cosine similarity scores in descending order
        sorted_similarities = sorted(similarities.items(), key=lambda x: x[1], reverse=True)

        # Select the top-k words as the k-nearest words
        nearest_words = [(word, round(e,4)) for word, e in sorted_similarities[:k]]

        # Print the k-nearest words
        print(f"\nThe {k} nearest words to '{target_word}' are: ")
        for i in (nearest_words):
            print(i)
        return nearest_words

    
def get_k_nearest_using_embedding(k, embed_prediction, word_vectors, model, fw):   
    
    if model == 'fasttext':
        
        vectors = fw.wv.vectors
        words = fw.wv.index_to_key

        # Calculate cosine similarity between the target vector and all other vectors
        similarity_scores = [np.dot(embed_prediction, vectors[i]) / (np.linalg.norm(embed_prediction) * np.linalg.norm(vectors[i])) for i in range(len(vectors))]

        # Find the indices of the k most similar words
        top_k_indices = np.argsort(similarity_scores)[-k:][::-1]

        # Get the corresponding words and their similarity scores
        nearest_words = [(words[i], round(similarity_scores[i], 4)) for i in top_k_indices]

        return nearest_words

        
    else: 
        embedding_matrix = [word_vectors[i] for i in word_vectors]
        
        similarity_scores = cosine_similarity([embed_prediction], embedding_matrix)[0]

        top_k_indices = np.argsort(similarity_scores)[-k:][::-1]
        nearest_words = [(word,similarity_scores[tokenizer.word_index[word]]) for word in tokenizer.word_index if tokenizer.word_index[word] in top_k_indices]

        return nearest_words




In [9]:
def get_embeddings(model_name):
    
    if model_name == 'fasttext':
        return get_fasttext_embeddings()
    
    if model_name == 'skipgram':
        return get_skipgram_embeddings()
    
    if model_name == 'cbow':
        return get_cbow_embeddings()
    
    if model_name == 'lsa':
        return get_lsa_embeddings()


The method we have created above is relatively simple. Let us consider the major steps of the method. The method boils down to: 

1) concatenating all models such that it is easier to iterate over all models

2) get the embeddings of each model such that we can easily iterate over them

3) store the model names in a list such that we can easily iterate over them
4) create a list of tuples of size four where each word in the tuple represents a word in the analogy
5) iterate over each tuple in the analogies we want to look at
6) compute the embedding of each word in the tuple
7) fill in the analogy function using the first three words
8) make a prediction based on the outcome of the analogy function and return the nr nearest words using the cosine distance
9) compare if the actual word (given as input parameter) is equal to the predicted word. 

This is the main idea behind the method. We have also made it easier to return the top 10 of nearest words and print the top 10 nearest words for each prediction together with the cosine distances to give us more of an idea as to what the model is predicting.

In [14]:
def print_analogy(analogy, model_names):
    
    # Retrieve the words from the analogy we need to compute
    word_a, word_b, word_c, word_true = analogy    
        
    # Formulate the analogy task
    analogy_task = f"{word_a} is to {word_b} as {word_c} is to ?"

    print(f"Analogy Task: {analogy_task}")
    print("---------------------------------------------------")
    
    g_word = glove.most_similar(positive=[word_a, word_c], negative=[ word_b], topn=1)
    print(f"Glove prediction for Analogy is : {g_word}\n\n")
    
    
    if word_true not in vocab or word_a not in vocab or word_b not in vocab or word_c not in vocab:
        model_names.remove('lsa')
        print("Some input word or words not in vocab of LSA\n\n")
    
    if word_true not in tokenizer.word_index or word_a not in tokenizer.word_index or word_b not in tokenizer.word_index or word_c not in tokenizer.word_index:
        model_names.remove('skipgram')
        model_names.remove('cbow')
        print("Some input word or words not in vocab of skipgram and cbow\n\n")


    for model in model_names:
        embeddings = get_embeddings(model)
        embed_a, embed_b, embed_c, embed_true = embeddings[word_a],embeddings[word_b],embeddings[word_c],embeddings[word_true]
        embed_prediction = embed_b - embed_a + embed_c
        sim1 = round(cosine(embed_true, embed_prediction), 4)
        nearest_words = get_k_nearest_using_embedding(10, embed_prediction, embeddings, model, fasttext)  
        sorted_similarities = sorted(nearest_words, key=lambda x: x[1], reverse=True)
        
        word_prediction, sim2 = sorted_similarities[0]

    
        # Print whether or not the true word was in the top nr 
        partially_correct = word_true in [word[0] for word in nearest_words]
        
        print(f"Embedding: {model}")
        # Print all top nr words with their distance
        for word in nearest_words:
            print(f"{word[0]} => {round(word[1], 4)}")
        print(f"Predicted: {word_prediction} ({round(sim2, 4)}) - True: {word_true} ({sim1})")
        print(f"Correct? {word_prediction == word_true} - In the top {10}? {partially_correct}")
        print("---------------------------------------------------\n")


In [15]:
%%time
analogies = [('he', 'is', 'we', 'are'), ('love', 'hate', 'little', 'large'), ('small', 'smaller', 'large', 'larger'), ('man', 'woman', 'king', 'queen'), ('mouse', 'mice', 'cat', 'cats')]
for analogy in analogies:
    print_analogy(analogy, ['skipgram', 'cbow', 'fasttext', 'lsa'])

Analogy Task: he is to is as we is to ?
---------------------------------------------------
Glove prediction for Analogy is : [('did', 0.6850758194923401)]


Some input word or words not in vocab of LSA


Embedding: skipgram
it => 0.6
had => 0.5821
grow => 0.289
missing => 0.2676
debts => 0.2881
examining => 0.2983
clarion => 0.2867
yankee => 0.3059
surround => 0.2824
encountered => 0.2754
Predicted: it (0.6) - True: are (0.8858)
Correct? False - In the top 10? False
---------------------------------------------------

Embedding: cbow
it => 0.6132
had => 0.4804
competition => 0.2715
gave => 0.2701
fogg => 0.2793
alive => 0.2659
examining => 0.3121
prioritised => 0.2855
fishermen => 0.3056
osborne => 0.2678
Predicted: it (0.6132) - True: are (0.918)
Correct? False - In the top 10? False
---------------------------------------------------

Embedding: fasttext
is => 0.7296000123023987
we => 0.5976999998092651
re => 0.5037999749183655
why. => 0.3529999852180481
are => 0.34950000047683716
w



## Conclusion
In terms of performance, we observe that none of the six models managed to correctly predict any of the analogies; all predictions were incorrect. The major reason for this concerns the size of our corpus. Our corpus, namely alice.txt, has only 26,283 words before processing the file, which is a lot less than the millions of words that are mentioned by Mikolov, Tomas, et al. "Efficient Estimation of Word Representations in Vector Space" Advances in neural information processing systems. 2013.

This suggests that our models are simply not able to learn much from our corpus since there is not that much to learn from.

**Note** : However, for analogy "small is to smaller as large is to ? (larger)" Fasttext predicted "smaller" and for "man is to woman as king is to ? (queen)" Fasttext predicted king.
Also, "small is to smaller as large is to ? (larger)" LSA predicted one of the simlar word as "bigger" which is somewhat related to "larger".

Given the poor performance of both all 4 models on the analogies (on our corpus), it is tough to say which model performs better. To be able to draw sound conclusions on which type of model is more suitable for predicting analogies, we believe that we would require a larger corpus. If we were to draw a conclusion from the results we have right now, we would argue that Fasttext performs slightly better since it almost predicted the somewhat similar word, as mentioned above. Also as per "Comparative study of LSA vs Word2vec embeddings
in small corpora: a case study in dreams database" by Edgar Altszyler commented that when the corpus size is reduced, Word2vec performance has a severe decrease, thus LSA becoming the more suitable tool comapred to Skipgram. Same is concluded from above analogy results.
