# Analogies with Word Embeddings



In this exercise we will ask the model the following **analogy** questions:

### If Italy is to Italian, then Spain is to? \_\_\_\_\_\_\_\_\_\_\_

### If India is to Delhi, then Japan is to? \_\_\_\_\_\_\_\_\_\_\_

### If man is to woman, then boy is to? \_\_\_\_\_\_\_\_\_\_\_

### If small is to smaller, then large is to? \_\_\_\_\_\_\_\_\_\_\_

Using pre-trained GloVe word vectors, I will build a function that answers the questions above.



## GloVe: Global Vectors for Word Representation
https://nlp.stanford.edu/pubs/glove.pdf

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

### Imports

In [1]:
import numpy as np
import pandas as pd

### Read the GloVe model

In [2]:
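def read_glove_vecs(glove_file):
    """Read a GloVe text file and return (set of words, dict mapping word -> vector)."""
    words = set()
    word_to_vec_map = {}

    with open(glove_file, 'r', encoding='utf8') as f:
        for line in f:
            # Each line holds a word followed by its space-separated vector components
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

    return words, word_to_vec_map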


In [3]:
words, word_to_vec_map = read_glove_vecs('glove.6B.50d.txt')

### Set of words available in the model

In [4]:
# Note: a set has no defined order, so despite its name this list is not sorted
sorted_words = list(words)

In [5]:
print('There are '+str(len(sorted_words))+' words available')

There are 400000 words available


In [6]:
# Arrange the 400,000 words in a 1000 x 400 grid so a slice can be viewed as a table
sorted_words = np.array(sorted_words).reshape([1000,400])

### Sample Words

In [7]:
pd.DataFrame(sorted_words).iloc[400:420,100:110]

|     | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 |
| :---- | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| 400 | 61508 | girouard | hamoked | capitals | seder | 2328 | hopsin | cherkessia | antigone | tenser |
| 401 | wagp | bitterlich | thot | 3969 | werke | 37 | thonet | caymmi | 41.83 | halflife |
| 402 | scheving | powderly | toribio | sisman | admissions | someplace | mcmeans | bundesstraßen | ypsilon | spiciness |
| 403 | euro595 | memorably | non-essential | maculata | batalov | lms | deyong | swpa | 4-cd | hendrickx |
| 404 | reincorporation | balkline | cooler | subsides | grandkids | inside-forward | mingus | byplay | hemudu | stocco |
| 405 | 3229 | mangano | tdcs | ppf | cullen@globe.com | dispositions | kitto | re-surfaced | renin-angiotensin | defrantz |
| 406 | sandretti | tuckwell | ckgm | alodia | fioravanti | shenay | wanchai | siewierz | filer | satpayev |
| 407 | hirschsprung | municipales | july/august | krupskaya | pittermann | yearslong | mapudungun | wollman | tubau | clayderman |
| 408 | cctlds | mithradates | 80.33 | thompson | 86.95 | bernie | esoterica | itaim | 1500cc | straps |
| 409 | charismatic | dairy | hoopoes | abida | frankton | 1269 | fotis | mcconaughy | -9:00 | alltime |


### Word to Vec Map (Embedding Matrix)

In [8]:
print('Our embedding has a dimension of '+str(pd.DataFrame(word_to_vec_map).shape[0]))

Our embedding has a dimension of 50


### Brief Explanation of Embedding

Below is a simplified example of an embedding. Each row is an attribute, which can be anything such as gender, color, or smell. A word that satisfies the attribute gets a value between zero and one; the more strongly it satisfies the attribute, the closer the value is to one. A word that is the opposite of the attribute gets a value between zero and negative one; the more opposite it is, the closer the value is to negative one. A row can also represent a combination of several attributes, and an embedding with more dimensions (rows) can store more information.

|                   | husband | wife  |
| :---------------- | :-----: | :---: |
| Attribute 1: Male |  0.98   | -0.99 |
| Attribute 2       |  0.01   | 0.06  |
| Attribute 3       | -0.03   | 0.02  |
| ...               |  ...    | ...   |
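
As a rough illustration of the table above, the sketch below stores the same made-up values as NumPy vectors and finds the row on which the two words differ most. The name `toy_embedding` and the three-dimensional vectors are hypothetical and are not part of the GloVe data.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings copied from the illustrative table
# (rows: Attribute 1 "Male", Attribute 2, Attribute 3)
toy_embedding = {
    "husband": np.array([0.98, 0.01, -0.03]),
    "wife": np.array([-0.99, 0.06, 0.02]),
}

# The row where the two words differ most is the attribute that separates them
diff = toy_embedding["husband"] - toy_embedding["wife"]
largest_row = int(np.argmax(np.abs(diff)))
print("Row with the largest difference:", largest_row)
```

In this toy case the largest difference falls on the first row, the one labeled "Male".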

### Real Data

As you can see below, the real data does not make much sense on its own because the rows are not labeled, and each row can be a combination of attributes that we don't know. Labeled rows would make the embedding much easier to interpret.

In [9]:
pd.DataFrame(word_to_vec_map).loc[7:13,['husband','wife']]

|     | husband | wife |
| :---- | :----: | :----: |
| 7 | 0.1954 | 0.64556 |
| 8 | -0.60738 | -0.46543 |
| 9 | 0.008143 | -0.51727 |
| 10 | -0.07934 | -0.15117 |
| 11 | 0.37204 | 0.42836 |
| 12 | -0.25451 | -0.16713 |
| 13 | -0.45528 | -0.69901 |


### Getting the distance between two words: Cosine Similarity

There are many distance metrics available (Euclidean distance, Manhattan distance, etc.), but cosine similarity is a common choice for word embeddings because it compares the direction of two vectors and ignores their length. The formula is as follows.


![title](cossimilarity.png)
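
In case the image does not render, here is the standard definition of cosine similarity written out, assuming that is what the image shows. The symbols $a$ and $b$ are the two word vectors and $\theta$ is the angle between them:

$$
\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2}
= \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2}\,\sqrt{\sum_{i} b_i^2}}
$$

The value ranges from $-1$ (opposite directions) through $0$ (orthogonal) to $1$ (same direction).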

In [10]:
def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    # Dot product (numerator)
    numerator = np.dot(a, b)

    # L2 norm of each vector
    l2_a = np.sqrt(np.sum(a**2, axis=0))
    l2_b = np.sqrt(np.sum(b**2, axis=0))

    # Product of the norms (denominator)
    denominator = l2_a * l2_b

    return numerator / denominator
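
A quick sanity check on toy vectors can confirm the expected range of the measure. The sketch below is self-contained and uses a hypothetical helper, `cosine_sim`, built on `np.linalg.norm`; it is an equivalent formulation for illustration, not the notebook's own function.

```python
import numpy as np

def cosine_sim(a, b):
    # Same quantity as above, using NumPy's built-in L2 norm
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([1.0, 2.0, 3.0])

same_direction = cosine_sim(x, 2 * x)   # scaling a vector does not change its direction
opposite = cosine_sim(x, -x)            # a flipped vector points the opposite way
orthogonal = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(same_direction, opposite, orthogonal)
```

Up to floating-point rounding, the three values are 1, -1 and 0.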

In [11]:
husband = word_to_vec_map["husband"]
wife = word_to_vec_map["wife"]
print('Cosine Similarity of "husband" and "wife" is '+str(cosine_similarity(husband, wife)))

Cosine Similarity of "husband" and "wife" is 0.950674320925


The words "husband" and "wife" are usually used in the same context, so their cosine similarity is high.

In [12]:
begin = word_to_vec_map["begin"]
start = word_to_vec_map["start"]
print('Cosine Similarity of "begin" and "start" is '+str(cosine_similarity(begin, start)))

Cosine Similarity of "begin" and "start" is 0.8819238975


Synonyms also have high cosine similarity.

In [13]:
hot = word_to_vec_map["hot"]
cold = word_to_vec_map["cold"]
print('Cosine Similarity of "hot" and "cold" is '+str(cosine_similarity(hot, cold)))

Cosine Similarity of "hot" and "cold" is 0.801052788872


Even antonyms have high cosine similarity because they are used in the same contexts. In the example above, both words are used when talking about temperature.

In [14]:
sink = word_to_vec_map["sink"]
wisdom = word_to_vec_map["wisdom"]
print('Cosine Similarity of "sink" and "wisdom" is '+str(cosine_similarity(sink, wisdom)))

Cosine Similarity of "sink" and "wisdom" is 0.285277721749


Unrelated words have low cosine similarity.

### Answering the Analogy Questions

To answer "word1 is to word2 as word3 is to ?", the function below searches every word in the vocabulary for the one whose vector, minus the vector of word3, points in the direction closest to the vector of word2 minus the vector of word1, as measured by cosine similarity.

In [37]:
def analogy(word1, word2, word3, word_to_vec_map):

    #Convert the words into lowercase
    word1, word2, word3 = word1.lower(), word2.lower(), word3.lower()
    
    # Look up the embedding vector of each input word
    embed_w1, embed_w2, embed_w3 = word_to_vec_map[word1],word_to_vec_map[word2],word_to_vec_map[word3]

    #Get the words
    words = word_to_vec_map.keys()
    
    # Initialize max_cosine_sim to a large negative number
    max_cosine_sim = -100              
    
    # Initialize best_word with None
    best_word = None                   

    # Loop through the words
    for word in words:
        # Skip the input words so that best_word is never one of them
        if word in [word1, word2, word3]:
            continue

        # Compute the cosine similarity between (embed_w2 - embed_w1) and (candidate word's vector - embed_w3)
        cosine_sim = cosine_similarity(np.subtract(embed_w2, embed_w1), np.subtract(word_to_vec_map[word], embed_w3))

        # If cosine_sim exceeds the best value seen so far,
        # store it as the new max_cosine_sim and remember the current word
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = word
        
    return best_word
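
The loop above calls `cosine_similarity` once per vocabulary word, which can be slow for 400,000 words. As a minimal sketch, the same search can be expressed with NumPy array operations. The function name `analogy_vectorized` and the toy two-dimensional vectors in `toy_map` are hypothetical, chosen so that the example runs without the GloVe file.

```python
import numpy as np

def analogy_vectorized(word1, word2, word3, word_to_vec_map):
    """Return the word w maximizing cosine similarity between
    (e_word2 - e_word1) and (e_w - e_word3), computed for all words at once."""
    word1, word2, word3 = word1.lower(), word2.lower(), word3.lower()

    # Candidate words exclude the three inputs, as in the loop version
    candidates = [w for w in word_to_vec_map if w not in (word1, word2, word3)]
    matrix = np.stack([word_to_vec_map[w] for w in candidates])   # (n_words, dim)

    target = word_to_vec_map[word2] - word_to_vec_map[word1]       # (dim,)
    diffs = matrix - word_to_vec_map[word3]                        # (n_words, dim)

    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)

    # Assumption: zero-length difference vectors get similarity -inf
    # instead of triggering a division by zero
    sims = np.full(len(candidates), -np.inf)
    valid = norms > 0
    sims[valid] = (diffs[valid] @ target) / norms[valid]

    return candidates[int(np.argmax(sims))]

# Toy 2-dimensional embedding (hypothetical values, not GloVe)
toy_map = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "boy": np.array([0.5, 0.0]),
    "girl": np.array([0.5, 1.0]),
    "apple": np.array([-1.0, -1.0]),
}

result = analogy_vectorized("man", "woman", "boy", toy_map)
print(result)
```

With the real data, the call would take the same arguments as `analogy`, for example `analogy_vectorized('man', 'woman', 'boy', word_to_vec_map)`. Stacking all vectors into one matrix trades extra memory for speed.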

In [38]:
triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, analogy(*triad, word_to_vec_map)))

italy -> italian :: spain -> spanish
india -> delhi :: japan -> tokyo
man -> woman :: boy -> girl
small -> smaller :: large -> larger
