# Semantic Similarity Among Words

This notebook explores the semantic similarity between words using word embeddings. A word embedding represents a word as a vector of features, so it stores information about the word as a point in a vector space. By semantic similarity we mean that two words with similar meanings (or words that are mostly used in the same contexts) lie closer together in that space than words with dissimilar meanings.
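To make "closer in the vector space" concrete, here is a minimal sketch using made-up 3-dimensional vectors (the values are invented for illustration and are not real embeddings). It uses `scipy.spatial.distance.cosine`, which returns the cosine distance, so a smaller number means more similar.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Hypothetical 3-dimensional vectors, invented purely for illustration.
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.2]),
    "apple": np.array([0.1, 0.0, 0.9]),
}

# scipy's `cosine` returns the cosine *distance* (1 - cosine similarity),
# so smaller values mean the vectors point in more similar directions.
d_king_queen = cosine(toy["king"], toy["queen"])
d_king_apple = cosine(toy["king"], toy["apple"])

print(d_king_queen < d_king_apple)  # True for these toy vectors
```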

## Importing libraries

In [26]:
import numpy as np
from tqdm import tqdm
from scipy.spatial.distance import cosine
from sklearn.decomposition import PCA
%matplotlib notebook
import matplotlib.pyplot as plt

## Importing Pre-Trained Word Embeddings

Here, we will be using pre-trained GloVe embeddings stored in the file *glove.6B.50d.txt*. The 50d in the name is the dimensionality of the embedding vectors. It is a text file in which each line contains a word followed by the components of its vector.

In [1]:
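# The file is UTF-8 text; setting the encoding avoids relying on the platform default.
f = open('../input/glove.6B.50d.txt', encoding='utf-8')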

We will create a dictionary of words and their embeddings by parsing this file. The result is a Python dictionary in which the words are the keys and their vectors are the values.

In [5]:
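embedding_values = {}
for line in tqdm(f):
    # Each line holds a word followed by its vector components, separated by spaces.
    value = line.split(' ')
    word = value[0]
    coef = np.array(value[1:], dtype='float32')
    embedding_values[word] = coef
f.close()  # the file is no longer needed once the dictionary is built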

400000it [00:05, 66863.00it/s]
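As a quick check of the parsing logic without the full file, the sketch below runs the same idea on two made-up lines held in memory. The words and numbers are illustrative only, with 4 components per line instead of 50, and `sample_embeddings` is a hypothetical name.

```python
import io
import numpy as np

# Two made-up lines in the "word v1 v2 ... vn" layout described above.
# Real lines in the 50d file carry 50 numbers; 4 are used here for brevity.
sample = io.StringIO(
    "the 0.418 0.24968 -0.41242 0.1217\n"
    "cat 0.45281 -0.50108 -0.53714 -0.015697\n"
)

sample_embeddings = {}
for line in sample:
    parts = line.split()  # split on whitespace, which also drops the trailing newline
    sample_embeddings[parts[0]] = np.array(parts[1:], dtype="float32")

print(sample_embeddings["cat"].shape)  # (4,)
```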


Next, we create two dictionaries that let us look up words by index and indices by word.<br>
>ix_to_word : stores the index as the key and the word as the value;<br>
>word_to_ix : stores the word as the key and its index as the value.

In [29]:
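ix_to_word = {}
word_to_ix = {}

# enumerate gives each word the same index in both mappings
# (taking len(ix_to_word) after the insert would leave word_to_ix off by one).
for ix, word in enumerate(tqdm(embedding_values)):
    ix_to_word[ix] = word
    word_to_ix[word] = ix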

100%|██████████| 400000/400000 [00:00<00:00, 839942.23it/s]




Now, let's create a function that prints the n words most similar to a given word and see whether the results are relevant.

In [34]:
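def most_similar(word, count):
    # Cosine distance from `word` to every word in the vocabulary (smaller = more similar).
    cos = []
    for i in tqdm(embedding_values):
        cos.append(cosine(embedding_values[word], embedding_values[i]))
    # argsort orders indices by increasing distance, so `word` itself normally comes first.
    # It also avoids repeating a word when two distances tie.
    for idx in np.argsort(cos)[:count]:
        print(ix_to_word[idx])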

Let's check for the word *king*.<br>

Note: The function takes some time because it computes the similarity between the given word and every word in the dictionary.

In [37]:
most_similar('king', 10)

100%|██████████| 400000/400000 [00:18<00:00, 21920.25it/s]


king
prince
queen
ii
emperor
son
uncle
kingdom
throne
brother


As we can see, the results match what we expected: the words closest to *king* are mostly related to royalty and family.
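Because the loop above visits every word one at a time, each query takes several seconds. One possible speed-up, sketched here as an assumption rather than as part of the original notebook, is to stack the vectors into a single matrix and compute all cosine similarities with one matrix product. The function name, its `words` and `matrix` parameters, and the toy data are all hypothetical.

```python
import numpy as np

def most_similar_vectorized(word, count, words, matrix):
    """Return the `count` words with the highest cosine similarity to `word`.

    `words` is a list of vocabulary words and `matrix` is a 2-D array whose
    i-th row is the vector for `words[i]` (both are hypothetical names).
    """
    # Normalise each row to unit length so a dot product equals cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = unit[words.index(word)]
    similarity = unit @ query
    # Sort by descending similarity; the query word itself comes first.
    best = np.argsort(-similarity)[:count]
    return [words[i] for i in best]

# Toy vocabulary with invented 3-dimensional vectors.
toy_words = ["king", "queen", "apple", "banana"]
toy_matrix = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.1, 0.0, 0.9],
    [0.0, 0.1, 0.8],
])
print(most_similar_vectorized("king", 2, toy_words, toy_matrix))  # ['king', 'queen']
```

With the real data, `words` could be built as `list(embedding_values)` and `matrix` as `np.stack(list(embedding_values.values()))`, so that row order matches `ix_to_word`.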

### Finding analogies

Now that we can measure the similarity between two words, we can also use it to answer analogy-based questions such as <br> *“man is to king as woman is to ..?”*<br>
This is explained in this paper: <a href = "https://arxiv.org/pdf/1901.09813.pdf">Link to the paper</a>

The above problem can be solved by taking the difference between the vectors of the first two words and adding it to the vector of the third word. The word whose vector lies closest to the result is our answer. Let's try to implement it.
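Written as a formula, and assuming the usual vector-offset reading of this idea, the target vector and the answer are as follows. Here $v_w$ denotes the embedding of word $w$, $V$ is the vocabulary, and the symbol names are chosen for illustration.

```latex
\hat{v} = v_{\text{word2}} - v_{\text{word1}} + v_{\text{word3}},
\qquad
\text{word4} = \arg\min_{w \in V}
\left( 1 - \frac{\hat{v} \cdot v_w}{\lVert \hat{v} \rVert \, \lVert v_w \rVert} \right)
```

For the example question, word1 is *man*, word2 is *king* and word3 is *woman*, so $\hat{v} = v_{\text{king}} - v_{\text{man}} + v_{\text{woman}}$.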

In [72]:
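def analogy(word1, word2, word3):
    # Shift word3 by the offset that leads from word1 to word2.
    embeds = embedding_values[word2]+embedding_values[word3]-embedding_values[word1]

    # Cosine distance from the target vector to every word in the vocabulary.
    cos = []
    for i in tqdm(embedding_values):
        cos.append(cosine(embeds, embedding_values[i]))

    # Take the second-closest word, on the assumption that the closest one
    # is one of the input words.
    idx = np.array(cos).argsort()[1]
    word4 = ix_to_word[idx]

    return word4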


In [76]:
analogy('man', 'king', 'woman')

100%|██████████| 400000/400000 [00:17<00:00, 23124.75it/s]


'queen'

In [78]:
analogy('india', 'delhi', 'italy')

100%|██████████| 400000/400000 [00:17<00:00, 23291.04it/s]


'rome'
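The `analogy` function above picks the second-closest word, which relies on the closest word being one of the inputs. A variant that skips the three input words explicitly might look like the sketch below. It is an illustrative alternative rather than the notebook's method, and the function name, its `vectors` parameter and the toy vectors are hypothetical.

```python
import numpy as np

def analogy_excluding_inputs(word1, word2, word3, vectors):
    """Answer "word1 is to word2 as word3 is to ?" while ignoring the input words."""
    target = vectors[word2] - vectors[word1] + vectors[word3]
    best_word, best_sim = None, -np.inf
    for candidate, vec in vectors.items():
        if candidate in (word1, word2, word3):
            continue  # never return one of the question words
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = candidate, sim
    return best_word

# Invented 3-dimensional vectors: axis 0 ~ "male", axis 1 ~ "female", axis 2 ~ "royal".
toy_vectors = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([0.5, 0.5, -1.0]),
}
print(analogy_excluding_inputs("man", "king", "woman", toy_vectors))  # queen
```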