# Word2Vec Similarity



##### Introduction

- We will borrow Google's massive set of pretrained word vectors, trained on Google News text ([Google Vectors]()).
- Gensim is used to explore the basic semantic queries that you can perform with Word2Vec.

In [2]:
# NLP tools
import nltk
import gensim

# Data tools
import numpy as np
import pandas as pd


In [3]:
# Load the Google vectors

google_vec_file = "/Users/tanpohkeam/Downloads/GoogleNews-vectors-negative300.bin.gz"
google_model = gensim.models.KeyedVectors.load_word2vec_format(google_vec_file, binary=True)

Google's model contains an extensive vocabulary: about 3 million words and phrases, each represented by a 300-dimensional vector.

### Compare similarities between words
We can use the similarity() method to compare two words.
- Let us compare a word with itself. We should get a similarity score of 1.0.
- Let us also compare an unrelated pair of words. We should get a similarity score close to 0.




In [4]:
google_model.similarity('ocean', 'ocean')

1.0

In [5]:
google_model.similarity('shark', 'volcano')

0.19793199
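
The similarity() method returns the cosine similarity between the two word vectors. As a minimal sketch of that calculation, here it is on made-up 3-dimensional vectors (the real Google vectors have 300 dimensions). The function name `cosine_similarity` and the toy vectors are assumptions for illustration, not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only (not taken from the Google model)
ocean = [0.9, 0.1, 0.3]
volcano = [0.1, 0.8, 0.2]

same = cosine_similarity(ocean, ocean)         # a vector compared with itself
different = cosine_similarity(ocean, volcano)  # vectors pointing in different directions
print(same, different)
```

A vector compared with itself always scores 1.0 (up to rounding), which is why the 'ocean' and 'ocean' comparison above returns 1.0.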

### Exercise A:
- For the following pairs of words, use your intuition to rank how similar the two words in each pair are.
- Pair 1: (ocean and sea)
- Pair 2: (ocean and river)
- Pair 3: (river and salmon)
- Pair 4: (ocean and mountain)
- Pair 5: (frog and snail)

- Then write the code to get the similarity scores. Were the Word2Vec results consistent with human judgement?

Note: There is no agreed threshold above which two words count as similar.
What is interesting is that Word2Vec learns vector representations whose similarity scores come close to human judgement.

In [6]:
# your code


In [7]:
# Answer:
    
print('ocean-sea similarity score : ' , google_model.similarity('ocean', 'sea'))
print('ocean-river similarity score : ' , google_model.similarity('ocean', 'river'))
print('river-salmon similarity score : ' , google_model.similarity('river', 'salmon'))
print('ocean-mountain similarity score : ' , google_model.similarity('ocean', 'mountain'))
print('frog-snail similarity score : ' , google_model.similarity('frog', 'snail'))

ocean-sea similarity score :  0.76435417
ocean-river similarity score :  0.47718135
river-salmon similarity score :  0.29229903
ocean-mountain similarity score :  0.27277803
frog-snail similarity score :  0.60943407


### Exercise B
Let us see how the pretrained model handles the names of personalities (real or fictional) who frequently appear in the news.

- Find the similarity of every pair combination among 'Lee_Kuan_Yew', 'Mahathir_Mohamad', 'Clinton', 'Bush', 'Spiderman' and 'Superman'.


In [8]:
# Your code


In [9]:
# Answer
word_list = ['Lee_Kuan_Yew', 'Mahathir_Mohamad', 'Clinton', 'Bush', 'Spiderman', 'Superman']

# Pair each word with the next one in the list (the last word wraps around to the first).
# Note: this gives 6 adjacent pairs, not all 15 possible combinations.
pairs = list(zip(word_list, word_list[1:] + word_list[:1]))

for w0, w1 in pairs:
    print('Similarity between {} and {} is {}'.format(w0, w1, google_model.similarity(w0, w1)))


Similarity between Lee_Kuan_Yew and Mahathir_Mohamad is 0.5671272277832031
Similarity between Mahathir_Mohamad and Clinton is 0.11909453570842743
Similarity between Clinton and Bush is 0.6407691240310669
Similarity between Bush and Spiderman is 0.07075392454862595
Similarity between Spiderman and Superman is 0.5477700233459473
Similarity between Superman and Lee_Kuan_Yew is 0.14365069568157196
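
The answer above pairs each name only with its neighbour in the list, so it covers 6 of the 15 possible pairs. A sketch of enumerating every unordered pair with `itertools.combinations` follows. The helper `all_pair_scores` and the placeholder `dummy_similarity` are hypothetical names, used so that the snippet runs without the model loaded. In the notebook you would pass `google_model.similarity` instead.

```python
from itertools import combinations

word_list = ['Lee_Kuan_Yew', 'Mahathir_Mohamad', 'Clinton', 'Bush', 'Spiderman', 'Superman']

def all_pair_scores(words, similarity):
    """Return (w0, w1, score) for every unordered pair of distinct words."""
    return [(w0, w1, similarity(w0, w1)) for w0, w1 in combinations(words, 2)]

# Placeholder scorer (hypothetical) so the example runs without the model;
# replace it with google_model.similarity in the notebook.
def dummy_similarity(w0, w1):
    return 0.0

scores = all_pair_scores(word_list, dummy_similarity)
print(len(scores))  # 6 words give 6 * 5 / 2 = 15 pairs
```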


### Find similar words

Gensim's similar_by_word() method returns a list of the words most similar to a given word (the top 10 by default).
Let us try to find some similar words for 'broom'.

In [10]:
google_model.similar_by_word('broom')

[('brooms', 0.7050668001174927),
 ('broomstick', 0.6114087104797363),
 ('dustpan', 0.5667167901992798),
 ('fly_swatter', 0.5471037030220032),
 ('whisk_broom', 0.5430972576141357),
 ('dustpans', 0.5339849591255188),
 ('wand', 0.531521201133728),
 ('shovel', 0.5279634594917297),
 ('pitchfork', 0.524329662322998),
 ('mattock', 0.521378755569458)]

### Exercise C:
- Suppose you are writing an article and find that you are using the word "happy" too often. You would like alternative words that can replace "happy". Name 5 words that come to mind.

- Ask Word2Vec to suggest some replacement words. Are the returned words suitable as replacements?

In [11]:
# your code

In [12]:
# answer

google_model.similar_by_word('happy')

[('glad', 0.7408890128135681),
 ('pleased', 0.6632171273231506),
 ('ecstatic', 0.6626912355422974),
 ('overjoyed', 0.6599286794662476),
 ('thrilled', 0.6514049768447876),
 ('satisfied', 0.6437950134277344),
 ('proud', 0.636042058467865),
 ('delighted', 0.627237856388092),
 ('disappointed', 0.6269949674606323),
 ('excited', 0.6247666478157043)]

### Find similar words with some negative examples
We can also fine-tune the search for similar words by supplying dissimilar words as negative examples.
In the first example, Word2Vec returns countries in Europe.
In the second example, Word2Vec returns countries and regions in Asia.
In the third example, Word2Vec does NOT return countries in Europe, but rather terms associated with the positive words.
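
As a rough sketch of the idea behind most_similar(): the positive vectors are added and the negative vectors subtracted (each normalised to unit length first), and the remaining vocabulary is ranked by cosine similarity to the combined vector. The toy 2-dimensional vectors and the helper `toy_most_similar` below are invented for illustration. This is a simplified approximation under those assumptions, not gensim's actual implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def toy_most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    excluded = set(positive) | set(negative)  # never return the input words
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Toy vectors (made up): first axis ~ "European", second axis ~ "Asian"
vectors = {
    'Norway': [1.0, 0.1],
    'Spain': [0.9, 0.2],
    'Sweden': [1.0, 0.15],
    'Japan': [0.1, 1.0],
    'Thailand': [0.2, 0.9],
}

result_pos = toy_most_similar(vectors, positive=['Norway', 'Spain'], topn=2)
result_neg = toy_most_similar(vectors, positive=['Norway', 'Japan'], negative=['Spain'], topn=1)
print(result_pos)
print(result_neg)
```

In this toy setup, subtracting 'Spain' cancels most of the "European" direction contributed by 'Norway', so the ranking shifts toward the remaining Asian entry. The real model behaves less predictably, as the third example shows.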

In [13]:
#Example 1:
google_model.most_similar(positive=['Norway', 'Spain', 'Germany', "Austria"])

[('Sweden', 0.766769528388977),
 ('Switzerland', 0.7535009384155273),
 ('Denmark', 0.7359018325805664),
 ('Netherlands', 0.7167850732803345),
 ('Belgium', 0.7037012577056885),
 ('Hungary', 0.6940023899078369),
 ('Finland', 0.6854315996170044),
 ('Italy', 0.6712003946304321),
 ('Slovakia', 0.668652355670929),
 ('Croatia', 0.6683071255683899)]

In [14]:
#Example 2:
google_model.most_similar(positive=['Korea', 'Japan', 'Vietnam', "Indonesia"])

[('South_Korea', 0.7823339700698853),
 ('Thailand', 0.704065203666687),
 ('Southeast_Asia', 0.69697105884552),
 ('Viet_Nam', 0.6925818920135498),
 ('China', 0.6915749311447144),
 ('Taiwan', 0.6532434225082397),
 ('Philippines', 0.6405261158943176),
 ('Malaysia', 0.6370214819908142),
 ('Cambodia', 0.6269365549087524),
 ('Asia', 0.6238887906074524)]

In [15]:
#Example 3:
google_model.most_similar(positive=['Norway', 'Spain', 'Germany', 'Austria'], negative=['Sweden','Italy', 'England'])

[('Austrians', 0.5269070863723755),
 ('Austrian', 0.5137248039245605),
 ('Andrea_Rothfuss', 0.4956299066543579),
 ('Juergen', 0.49309805035591125),
 ('German', 0.49204903841018677),
 ('Andreas', 0.490742564201355),
 ('Baden_Wurttemberg', 0.4817953109741211),
 ('Müller', 0.48091602325439453),
 ('Holger', 0.4781285226345062),
 ('Switzerland', 0.4780736267566681)]

### Exercise D:
Run the code below. Describe the 'similar' words that were returned by Word2Vec.

In [16]:
google_model.most_similar(positive=['Korea', 'Japan', 'Vietnam', "Indonesia"], negative = ['Thailand', 'China', 'Brunei'])

# your answer

[('South_Korea', 0.5299241542816162),
 ('Korean', 0.4800218939781189),
 ('CAMP_HUMPHREYS_South', 0.4583066403865814),
 ('kamikaze_pilots', 0.44935142993927),
 ('Japanese', 0.44640052318573),
 ('Vietnam_War', 0.43605631589889526),
 ('World_War_II', 0.43605202436447144),
 ('WWII', 0.4276370108127594),
 ('Seoul', 0.4271702170372009),
 ('Coldest_War', 0.4266488254070282)]

### Exercise E:
- Find similar words to 'Norway', 'Spain', 'Egypt', 'Thailand'.
- Repeat the above, but see if you can exclude countries in Europe from being listed.

In [17]:
# your code


In [18]:
#answer

google_model.most_similar(positive=['Norway', 'Spain', 'Egypt', 'Thailand'], topn=5)

[('Morocco', 0.7394067049026489),
 ('Denmark', 0.6790488958358765),
 ('Sweden', 0.6737656593322754),
 ('Algeria', 0.6614785194396973),
 ('Indonesia', 0.6557962894439697)]

In [19]:
# answer

google_model.most_similar(positive=['Norway', 'Spain', 'Egypt', 'Thailand'], negative=['Germany', 'Sweden', 'France'], topn=5)

[('Thai', 0.46678560972213745),
 ('Siam', 0.4549752175807953),
 ('Indonesia', 0.44277215003967285),
 ('Thais', 0.44274333119392395),
 ('Myanmar', 0.4401722848415375)]

### for lessons only ###


In [20]:
google_model.similar_by_word('Thailand')

[('Thai', 0.7535779476165771),
 ('Cambodia', 0.7131428718566895),
 ('Bangkok', 0.7014991044998169),
 ('Thais', 0.6784130334854126),
 ('Malaysia', 0.6679352521896362),
 ('Indonesia', 0.6669518947601318),
 ('Chiang_Mai', 0.6564016938209534),
 ('Thailands', 0.6523814797401428),
 ('Southeast_Asia', 0.621160089969635),
 ('Laos', 0.6205481290817261)]

In [23]:
google_model.similar_by_word('Lee_Kuan_Yew')

[('Lee_Kwan_Yew', 0.7458120584487915),
 ('Lee_Hsien_Loong', 0.6708043217658997),
 ('Goh_Chok_Tong', 0.6635027527809143),
 ('prime_minister_Lee_Kuan', 0.644353985786438),
 ('Goh_Keng_Swee', 0.6256564855575562),
 ('LKY', 0.6218506693840027),
 ('Singapore_Lee_Kuan', 0.6171931028366089),
 ('Dr_Mahathir', 0.6003636717796326),
 ('PAP_strongman', 0.5983213186264038),
 ('Jeyaretnam', 0.5875519514083862)]

In [37]:
google_model.similar_by_word('fever')

[('fevers', 0.6539823412895203),
 ('mania', 0.5566962957382202),
 ('feverish', 0.5537868142127991),
 ('fever_vomiting', 0.5078603625297546),
 ('sore_throat', 0.507377564907074),
 ('sore_throat_fever', 0.5043977499008179),
 ('coughing_runny_nose', 0.5000587701797485),
 ('fever_sore_throat', 0.4963510036468506),
 ('fever_aches', 0.49502936005592346),
 ('chills_sore_throat', 0.4949982166290283)]

In [35]:
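# Look up the raw vectors for three words
A = google_model.word_vec('France')
B = google_model.word_vec('Paris')
C = google_model.word_vec('Japan')
# Combine them arithmetically: France - 3 * Paris + Japan
D = A - B - B - B + C
# D is a vector rather than a word, so query by vector
google_model.similar_by_vector(D)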

[('Japan', 0.31452077627182007),
 ('HAMAMATSU', 0.2824726104736328),
 ('SASEBO_NAVAL_BASE', 0.27741649746894836),
 ('JAPAN', 0.2766423225402832),
 ('captain_Makoto_Hasebe', 0.2756442725658417),
 ('Nippon_Ham', 0.27251601219177246),
 ('Naoto_Kan', 0.2715759575366974),
 ('Shunsuke_Watanabe', 0.26907438039779663),
 ("JAPAN_'S", 0.26834049820899963),
 ('Aomori_prefectures', 0.26726919412612915)]
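
The cell above combines word vectors with ordinary arithmetic. The form usually quoted for this kind of query is "capital minus its country plus another country", and the sketch below illustrates it on hand-made 2-dimensional vectors. The toy vectors (including the extra entries 'Tokyo' and 'Berlin') and the helper `cosine` are assumptions made for illustration; they are not values from the Google model.

```python
import math

# Toy vectors (made up): first axis ~ which country, second axis ~ "is a capital"
toy = {
    'France': [1.0, 0.0],
    'Paris': [1.0, 1.0],
    'Japan': [-1.0, 0.0],
    'Tokyo': [-1.0, 1.0],
    'Berlin': [0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Paris - France + Japan, computed element by element
target = [p - f + j for p, f, j in zip(toy['Paris'], toy['France'], toy['Japan'])]

# Rank the words that were not part of the query
candidates = {w: v for w, v in toy.items() if w not in ('Paris', 'France', 'Japan')}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # Tokyo, by construction of the toy vectors
```

With the real model, the equivalent query would be google_model.most_similar(positive=['Paris', 'Japan'], negative=['France']).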