# Word2Vec Similarity



### Introduction

- We will borrow Google's massive set of pretrained word vectors, trained on the Google News corpus ([Google Vectors]()).
- Gensim is used to explore the basic semantic queries that you can perform with Word2Vec.

In [1]:
# NLP tools
import nltk
import gensim

# Data tools
import numpy as np
import pandas as pd


In [2]:
# Load the Google vectors

#google_vec_file = "/Users/tanpohkeam/Downloads/GoogleNews-vectors-negative300.bin.gz"
google_vec_file = "GoogleNews-vectors-negative300.bin.gz"
google_model = gensim.models.KeyedVectors.load_word2vec_format(google_vec_file, binary=True)

Google's model contains an extensive vocabulary: about 3 million words and phrases, each represented by a 300-dimensional vector.

### Compare similarities between words
We can use the similarity() method to score how similar two words are. The score is the cosine similarity of their vectors.
- Let us compare a word with itself. We should get a similarity score of 1.0.
- Also, let us compare an unrelated pair of words. We should get a similarity score much closer to 0.




In [3]:
google_model.similarity('ocean', 'ocean')

1.0

In [4]:
google_model.similarity('shark', 'volcano')

0.19793196
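
To make the score less of a black box, here is a minimal sketch of cosine similarity, the measure that similarity() reports. The 3-dimensional vectors are hypothetical values chosen for illustration; they are not taken from the Google model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional "word vectors", for illustration only
ocean = [0.9, 0.1, 0.3]
volcano = [-0.1, 0.8, 0.2]

same = cosine_similarity(ocean, ocean)         # identical vectors point the same way
different = cosine_similarity(ocean, volcano)  # nearly orthogonal vectors
print(same, different)
```

A vector compared with itself always scores 1 (up to floating-point rounding), while vectors pointing in unrelated directions score near 0.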

### Exercise A:
- For the following pairs of words, use your human intuition to rank the pairs by how similar the two words are.
- Pair 1: (ocean and sea)
- Pair 2: (ocean and river)
- Pair 3: (river and salmon)
- Pair 4: (ocean and mountain)
- Pair 5: (frog and snail)

- Then write the code to get the similarity scores. Was the result of Word2Vec consistent with human knowledge?

Note: There is no established threshold above which two words count as similar.
What is interesting is that Word2Vec is able to learn vector representations whose similarity scores are close to human judgement.

In [6]:
# your code
w1 = 'ocean'
#w2 = 'sea'
w2 = 'mountain'
google_model.similarity(w1, w2)


0.27277806
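
Rather than editing w1 and w2 by hand for each pair, the five pairs can be scored and ranked in one pass. The sketch below is one possible approach: rank_pairs is a hypothetical helper, and the placeholder scores are arbitrary values (not model output) so that the block runs without loading the vectors. In the notebook, google_model.similarity would be passed in instead.

```python
def rank_pairs(pairs, similarity_fn):
    """Return (word1, word2, score) tuples sorted from most to least similar."""
    scored = [(w1, w2, float(similarity_fn(w1, w2))) for w1, w2 in pairs]
    return sorted(scored, key=lambda item: item[2], reverse=True)

pairs = [
    ("ocean", "sea"),
    ("ocean", "river"),
    ("river", "salmon"),
    ("ocean", "mountain"),
    ("frog", "snail"),
]

# Arbitrary placeholder values, NOT Word2Vec output; they only let the sketch run.
placeholder_scores = dict(zip(pairs, [0.3, 0.5, 0.1, 0.2, 0.4]))

def toy_similarity(w1, w2):
    """Stand-in for google_model.similarity (assumption for this sketch)."""
    return placeholder_scores[(w1, w2)]

for w1, w2, score in rank_pairs(pairs, toy_similarity):
    print(f"{w1} / {w2}: {score:.3f}")

# In the notebook: rank_pairs(pairs, google_model.similarity)
```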

### Exercise B
Let us see how the pretrained model handles the names of personalities (real or fictional) who frequently appear in the news.

- Find the similarity for every pair combination of 'Lee_Kuan_Yew', 'Mahathir_Mohamad', 'Clinton', 'Bush', 'Spiderman' and 'Superman'.


In [9]:
# Your code
w1 = 'Superman'
#w2 = 'sea'
w2 = 'Joker'
google_model.similarity(w1, w2)


0.4515921
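
To cover every pair combination without listing the pairs by hand, itertools.combinations can generate them. This is a minimal sketch: score_pairs is a hypothetical helper that accepts any similarity function, so the block runs without the model; in the notebook, google_model.similarity would be passed in.

```python
from itertools import combinations

names = ["Lee_Kuan_Yew", "Mahathir_Mohamad", "Clinton", "Bush", "Spiderman", "Superman"]

# Every unordered pair, each listed once: n * (n - 1) / 2 = 15 pairs for 6 names
name_pairs = list(combinations(names, 2))
print(len(name_pairs))

def score_pairs(pairs, similarity_fn):
    """Map each (word1, word2) pair to its score from the supplied function."""
    return {pair: float(similarity_fn(*pair)) for pair in pairs}

# In the notebook (assumes all six names are in the model's vocabulary):
# scores = score_pairs(name_pairs, google_model.similarity)
```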

### Find similar words

The similar_by_word() method returns a list of the most similar words (10 by default), each with its similarity score.
Let us try to find some similar words for 'broom'.

In [10]:
google_model.similar_by_word('broom')

[('brooms', 0.7050667405128479),
 ('broomstick', 0.6114087104797363),
 ('dustpan', 0.5667167901992798),
 ('fly_swatter', 0.5471036434173584),
 ('whisk_broom', 0.5430972576141357),
 ('dustpans', 0.533984899520874),
 ('wand', 0.531521201133728),
 ('shovel', 0.5279634594917297),
 ('pitchfork', 0.524329662322998),
 ('mattock', 0.521378755569458)]
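
Conceptually, a query like this is a nearest-neighbour search: score every word in the vocabulary against the query word and keep the best matches. The sketch below illustrates the idea on a hypothetical four-word vocabulary with made-up 2-dimensional vectors; nearest_words is an illustrative helper, not gensim's implementation.

```python
import numpy as np

def nearest_words(query, words, vectors, topn=3):
    """Rank every other word by cosine similarity to `query`."""
    # Normalise rows to unit length so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[words.index(query)]
    order = np.argsort(-scores)  # indices from highest to lowest score
    return [(words[i], float(scores[i])) for i in order if words[i] != query][:topn]

# Hypothetical 2-dimensional vectors, for illustration only
words = ["broom", "brooms", "dustpan", "volcano"]
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.7, 0.7],
    [0.0, 1.0],
])

print(nearest_words("broom", words, vectors))
```

The query word itself is dropped from the results, which matches the output above: 'broom' does not appear in its own list of neighbours.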

### Exercise C:
- Suppose you are writing an article and find that you are using the word "happy" too often. You would like to use alternative words that can replace "happy". List 5 words that come to mind.

- Ask Word2Vec to suggest some replacement words. Are the returned words suitable as replacements?

In [11]:
# your code
google_model.similar_by_word('happy')

[('glad', 0.7408890724182129),
 ('pleased', 0.6632170677185059),
 ('ecstatic', 0.6626912355422974),
 ('overjoyed', 0.6599286794662476),
 ('thrilled', 0.6514049768447876),
 ('satisfied', 0.6437948942184448),
 ('proud', 0.636042058467865),
 ('delighted', 0.6272379159927368),
 ('disappointed', 0.6269949674606323),
 ('excited', 0.6247665882110596)]

### Find similar words with some negative examples
We can also fine-tune the search for similar words by providing dissimilar words as negative examples.
In the first example, Word2Vec returns countries in Europe.
In the second example, Word2Vec returns countries in Asia.
In the third example, Word2Vec does NOT return countries in Europe, but rather terms associated with the positive words.

In [12]:
#Example 1:
google_model.most_similar(positive=['Norway', 'Spain', 'Germany', "Austria"])

[('Sweden', 0.766769528388977),
 ('Switzerland', 0.7535009980201721),
 ('Denmark', 0.7359017729759216),
 ('Netherlands', 0.7167849540710449),
 ('Belgium', 0.703701376914978),
 ('Hungary', 0.6940024495124817),
 ('Finland', 0.6854315996170044),
 ('Italy', 0.6712003946304321),
 ('Slovakia', 0.6686522960662842),
 ('Croatia', 0.6683070659637451)]

In [13]:
#Example 2:
google_model.most_similar(positive=['Korea', 'Japan', 'Vietnam', "Indonesia"])

[('South_Korea', 0.7823339104652405),
 ('Thailand', 0.7040652632713318),
 ('Southeast_Asia', 0.69697105884552),
 ('Viet_Nam', 0.6925818920135498),
 ('China', 0.6915749311447144),
 ('Taiwan', 0.6532434225082397),
 ('Philippines', 0.6405261158943176),
 ('Malaysia', 0.6370214819908142),
 ('Cambodia', 0.6269365549087524),
 ('Asia', 0.6238888502120972)]

In [14]:
#Example 3:
google_model.most_similar(positive=['Norway', 'Spain', 'Germany', 'Austria'], negative=['Sweden','Italy', 'England'])

[('Austrians', 0.5269070863723755),
 ('Austrian', 0.5137249231338501),
 ('Andrea_Rothfuss', 0.4956299066543579),
 ('Juergen', 0.49309805035591125),
 ('German', 0.49204903841018677),
 ('Andreas', 0.49074259400367737),
 ('Baden_Wurttemberg', 0.4817953109741211),
 ('Müller', 0.48091599345207214),
 ('Holger', 0.4781285226345062),
 ('Switzerland', 0.4780736267566681)]
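
To see why Example 3 drifts away from country names, it helps to sketch what most_similar() roughly does: it averages the unit-length vectors of the query words, with positive words counted as +1 and negative words as -1, and then ranks the remaining vocabulary by cosine similarity to that average. The block below is a simplified sketch on hypothetical 2-dimensional vectors (all values and the axis interpretation are assumptions for illustration), not gensim's actual code.

```python
import numpy as np

def most_similar_sketch(positive, negative, vocab, topn=2):
    """Average signed unit vectors, then rank other words by cosine similarity."""
    def unit(v):
        return v / np.linalg.norm(v)

    signed = [unit(vocab[w]) for w in positive] + [-unit(vocab[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))

    excluded = set(positive) | set(negative)  # query words are never returned
    scored = [
        (word, float(unit(vec) @ query))
        for word, vec in vocab.items()
        if word not in excluded
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Hypothetical vectors: axis 0 ~ "country-ness", axis 1 ~ "German-speaking context"
vocab = {
    "Germany": np.array([1.0, 0.5]),
    "Austria": np.array([1.0, 0.6]),
    "Sweden": np.array([1.0, 0.0]),
    "Italy": np.array([1.0, 0.1]),
    "Denmark": np.array([1.0, 0.2]),
    "Austrian": np.array([0.1, 1.0]),
}

print(most_similar_sketch(["Germany", "Austria"], [], vocab))
print(most_similar_sketch(["Germany", "Austria"], ["Sweden", "Italy"], vocab))
```

In this toy setup, the positive-only query ranks another country first, while subtracting two countries cancels most of the shared "country" direction and leaves the word tied to what is specific to the positive words on top. This is one plausible reading of why Example 3 returns terms such as 'Austrians' and 'German' instead of countries.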

### Exercise D:
Run the code below. Describe the 'similar' words that were returned by Word2Vec.

In [15]:
google_model.most_similar(positive=['Korea', 'Japan', 'Vietnam', "Indonesia"], negative = ['Thailand', 'China', 'Brunei'])

# your answer

[('South_Korea', 0.5299240946769714),
 ('Korean', 0.4800219237804413),
 ('CAMP_HUMPHREYS_South', 0.45830661058425903),
 ('kamikaze_pilots', 0.44935142993927),
 ('Japanese', 0.4464004337787628),
 ('Vietnam_War', 0.4360562860965729),
 ('World_War_II', 0.4360520541667938),
 ('WWII', 0.4276370108127594),
 ('Seoul', 0.42717018723487854),
 ('Coldest_War', 0.4266488254070282)]

### Exercise E:
- Find similar words to 'Norway', 'Spain', 'Egypt' and 'Thailand'
- Repeat the above, but see if you can exclude countries in Europe from being listed

In [None]:
# your code
google_model.most_similar(positive=['Norway', 'Spain', 'Egypt', 'Thailand'])