In this short programming assignment we will look into applications of word embeddings in similarity search and nearest neighbors. We will also look at creating a video based on tSNE embeddings of images.

## -1. Create a video from images
Download any 1000 or more 'appropriate' and publicly available images from the web. This could be part of a data set or something specific that you picked up or are interested in.

We discussed using tSNE to find image embeddings for these images. Apply the tSNE library to these images and construct low-dimensional embeddings for the images. Use these embeddings to then:

a) Start at any random image in the data set 

b) Sequentially chain the next image to the previous image using a scoring function/probability based on the tSNE embedding. So you want to chain the most similar image to the current one and so on. Choose a frame rate that is appropriate to convert this chain of images into a video. Your video shouldn't be more than 3 minutes long.

c) Upload this video to youtube and share a link with your submission. 

d) Feel free to share your video to Discord to see what cool videos we come up with!
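
Steps a) and b) can be sketched as a greedy nearest-neighbor chain. This is only one possible scoring rule (plain Euclidean distance in the embedding space), not a prescribed solution. The name `greedy_chain` and the toy coordinates are made up for illustration; in practice the array would be the output of something like `sklearn.manifold.TSNE(n_components=2).fit_transform(X)` on the flattened images.

```python
import numpy as np

def greedy_chain(embedding, start=0):
    """Order points so that each one is followed by its nearest unvisited neighbor."""
    embedding = np.asarray(embedding, dtype=float)
    n = len(embedding)
    visited = np.zeros(n, dtype=bool)
    order = [start]
    visited[start] = True
    for _ in range(n - 1):
        current = embedding[order[-1]]
        # Euclidean distance from the current point to every point
        dist = np.linalg.norm(embedding - current, axis=1)
        dist[visited] = np.inf  # never revisit an image
        nxt = int(np.argmin(dist))
        order.append(nxt)
        visited[nxt] = True
    return order

# Toy 2-D "embedding" standing in for the t-SNE output (hypothetical values)
toy_embedding = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 0.0], [5.0, 6.0], [1.0, 1.0]])
chain = greedy_chain(toy_embedding, start=0)
print(chain)
```

The resulting index order is the frame order for the video; the frame rate then determines the total length.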



https://www.youtube.com/watch?v=3nnE0nqffPc

## Diving into the Pandas DataFrame Cheat Sheet
The functions below are useful for indexing and slicing data frames in the problems that follow. Let's go over them.
More materials can be found here: https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf
1. DataFrame() Construct a dataframe. Use it for putting a numpy ndarray into a dataframe.
1. DataFrame.loc[] Purely label-based indexer for selection by label (note the square brackets: it is an indexer, not a function). Use it for selecting word vectors in the dataframe.
1. DataFrame.dot() Matrix multiplication with a DataFrame. Use it for the dot product of word vectors.
1. DataFrame.sort_values() Sort by the values along either axis. Use it for sorting distances from short to long.

Below are examples of using these functions. You don't have to code anything in this block; just focus on understanding the functions and how they work in pandas.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np

# Define a ndarray
d = np.array([[0.1,0.3,0.4,0.5],[0.3,0.4,0.9,0.5],[0.2,0.8,0.7,0.5]], dtype=float, order='F')
print("Define sample word vectors")
print(d)
#Construct a dataframe from ndarray and index each row as word vectors
df = pd.DataFrame(d,index = ['word1','word2','word3'])
print("\nPandas Data Frame for word vectors")
print(df)
#Select word vector1 by its label
print("\nFind the row corresponding to word1")
print(df.loc['word1'])
#Calculate dot product of word vector1 and word vector2
print("\nCalculate the dot product between word1 and word2")
dot_product = df.loc['word2'].dot(df.loc['word1'])
print(dot_product)
#Calculate dot product of word vector1 to the rest of words
print("\nCalculate the dot product between word1 and rest of the words")
words_rest = ['word2','word3']
dot_product2 = df.loc[words_rest].dot(df.loc['word1'])
print(dot_product2)
#Sort Values of dot_product2 by high to low
print("\nSorted dot product values")
print(dot_product2.sort_values(ascending = False))

Define sample word vectors
[[0.1 0.3 0.4 0.5]
 [0.3 0.4 0.9 0.5]
 [0.2 0.8 0.7 0.5]]

Pandas Data Frame for word vectors
         0    1    2    3
word1  0.1  0.3  0.4  0.5
word2  0.3  0.4  0.9  0.5
word3  0.2  0.8  0.7  0.5

Find the row corresponding to word1
0    0.1
1    0.3
2    0.4
3    0.5
Name: word1, dtype: float64

Calculate the dot product between word1 and word2
0.76

Calculate the dot product between word1 and rest of the words
word2    0.76
word3    0.79
dtype: float64

Sorted dot product values
word3    0.79
word2    0.76
dtype: float64


## Dataset Details - Stanford's GloVe pre-trained word vectors

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus. The GloVe pre-trained word vectors dataset contains English word vectors pre-trained on the combined Wikipedia 2014 + Gigaword 5th Edition corpora (6B tokens, 400K vocab). All tokens are in lowercase. This dataset contains 50-dimensional, 100-dimensional and 200-dimensional pre-trained word vectors. In this problem we are going to use the 50-dimensional dataset. 

## \# 0. Get an overview of what GloVe is
Read the documentation on GloVe embeddings, especially where they get applied, here: https://nlp.stanford.edu/projects/glove/

## Load Dataset
Let's load the dataset first. Each row is indexed as a word vector. Dimension of word vectors is 50. How many words are there in this dataset? Print a few words and see what they are. You don't need to code anything here, just understand the data structure.

In [None]:
import pandas as pd
import numpy as np
import csv
# Load GloVe pre-trained vectors 
local_file1='/content/drive/MyDrive/22WINTER/596A/HW4/glove.6B.50d.txt' # Make sure this file exists!
df = pd.read_csv(local_file1,sep = ' ',index_col=0,header=None,engine='python',on_bad_lines='skip',quoting = csv.QUOTE_NONE)  # on_bad_lines replaces the deprecated error_bad_lines (pandas >= 1.3)
print("dataset shape - Rows: %d, Cols: %d" % (df.shape[0], df.shape[1]))
words = list(df.index)
print("print a few words in the dataset:", words[30:40])





dataset shape - Rows: 400001, Cols: 50
print a few words in the dataset: ['be', 'has', 'are', 'have', 'but', 'were', 'not', 'this', 'who', 'they']
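
To make the data structure concrete, here is a small sketch of the file layout: each line is a token followed by its vector components, separated by single spaces. The three-line sample below reuses the first three components of rows shown later in this notebook and is parsed from memory rather than from the real file; `toy_df` is an illustrative name.

```python
import csv
import io
import pandas as pd

# Three lines in GloVe text format, truncated to 3 dimensions for illustration
sample = (
    "the 0.418 0.24968 -0.41242\n"
    ", 0.013441 0.23682 -0.16899\n"
    "\" 0.25769 0.45629 -0.76974\n"
)

# QUOTE_NONE keeps the double-quote token from being treated as a quoting character
toy_df = pd.read_csv(io.StringIO(sample), sep=" ", index_col=0,
                     header=None, quoting=csv.QUOTE_NONE)
print(toy_df.shape)
print(list(toy_df.index))
```

The token column becomes the index, so a row can be looked up by word with `toy_df.loc['the']`, exactly as with the full data frame.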


## \# 1. Print the first 11 rows of the pandas data frame below

In [None]:
# Your code HERE - It should execute as expected! 
# (Search for a pandas functionality that can help you do this!)
df.iloc[:11]

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50
the,0.418,0.24968,-0.41242,0.1217,0.34527,-0.044457,-0.49688,-0.17862,-0.00066,-0.6566,0.27843,-0.14767,-0.55677,0.14658,-0.00951,0.011658,0.10204,-0.12792,-0.8443,-0.12181,-0.016801,-0.33279,-0.1552,-0.23131,-0.19181,-1.8823,-0.76746,0.099051,-0.42125,-0.19526,4.0071,-0.18594,-0.52287,-0.31681,0.000592,0.007445,0.17778,-0.15897,0.012041,-0.054223,-0.29871,-0.15749,-0.34758,-0.045637,-0.44251,0.18785,0.002785,-0.18411,-0.11514,-0.78581
",",0.013441,0.23682,-0.16899,0.40951,0.63812,0.47709,-0.42852,-0.55641,-0.364,-0.23938,0.13001,-0.063734,-0.39575,-0.48162,0.23291,0.090201,-0.13324,0.078639,-0.41634,-0.15428,0.10068,0.48891,0.31226,-0.1252,-0.037512,-1.5179,0.12612,-0.02442,-0.042961,-0.28351,3.5416,-0.11956,-0.014533,-0.1499,0.21864,-0.33412,-0.13872,0.31806,0.70358,0.44858,-0.080262,0.63003,0.32111,-0.46765,0.22786,0.36034,-0.37818,-0.56657,0.044691,0.30392
.,0.15164,0.30177,-0.16763,0.17684,0.31719,0.33973,-0.43478,-0.31086,-0.44999,-0.29486,0.16608,0.11963,-0.41328,-0.42353,0.59868,0.28825,-0.11547,-0.041848,-0.67989,-0.25063,0.18472,0.086876,0.46582,0.015035,0.043474,-1.4671,-0.30384,-0.023441,0.30589,-0.21785,3.746,0.004228,-0.18436,-0.46209,0.098329,-0.11907,0.23919,0.1161,0.41705,0.056763,-6.4e-05,0.068987,0.087939,-0.10285,-0.13931,0.22314,-0.080803,-0.35652,0.016413,0.10216
of,0.70853,0.57088,-0.4716,0.18048,0.54449,0.72603,0.18157,-0.52393,0.10381,-0.17566,0.078852,-0.36216,-0.11829,-0.83336,0.11917,-0.16605,0.061555,-0.012719,-0.56623,0.013616,0.22851,-0.14396,-0.067549,-0.38157,-0.23698,-1.7037,-0.86692,-0.26704,-0.2589,0.1767,3.8676,-0.1613,-0.13273,-0.68881,0.18444,0.005246,-0.33874,-0.078956,0.24185,0.36576,-0.34727,0.28483,0.075693,-0.062178,-0.38988,0.22902,-0.21617,-0.22562,-0.093918,-0.80375
to,0.68047,-0.039263,0.30186,-0.17792,0.42962,0.032246,-0.41376,0.13228,-0.29847,-0.085253,0.17118,0.22419,-0.10046,-0.43653,0.33418,0.67846,0.057204,-0.34448,-0.42785,-0.43275,0.55963,0.10032,0.18677,-0.26854,0.037334,-2.0932,0.22171,-0.39868,0.20912,-0.55725,3.8826,0.47466,-0.95658,-0.37788,0.20869,-0.32752,0.12751,0.088359,0.16351,-0.21634,-0.094375,0.018324,0.21048,-0.03088,-0.19722,0.082279,-0.09434,-0.073297,-0.064699,-0.26044
and,0.26818,0.14346,-0.27877,0.016257,0.11384,0.69923,-0.51332,-0.47368,-0.33075,-0.13834,0.2702,0.30938,-0.45012,-0.4127,-0.09932,0.038085,0.029749,0.10076,-0.25058,-0.51818,0.34558,0.44922,0.48791,-0.080866,-0.10121,-1.3777,-0.10866,-0.23201,0.012839,-0.46508,3.8463,0.31362,0.13643,-0.52244,0.3302,0.33707,-0.35601,0.32431,0.12041,0.3512,-0.069043,0.36885,0.25168,-0.24517,0.25381,0.1367,-0.31178,-0.6321,-0.25028,-0.38097
in,0.33042,0.24995,-0.60874,0.10923,0.036372,0.151,-0.55083,-0.074239,-0.092307,-0.32821,0.09598,-0.82269,-0.36717,-0.67009,0.42909,0.016496,-0.23573,0.12864,-1.0953,0.43334,0.57067,-0.1036,0.20422,0.078308,-0.42795,-1.7984,-0.27865,0.11954,-0.12689,0.031744,3.8631,-0.17786,-0.082434,-0.62698,0.26497,-0.057185,-0.073521,0.46103,0.30862,0.12498,-0.48609,-0.008027,0.031184,-0.36576,-0.42699,0.42164,-0.11666,-0.50703,-0.027273,-0.53285
a,0.21705,0.46515,-0.46757,0.10082,1.0135,0.74845,-0.53104,-0.26256,0.16812,0.13182,-0.24909,-0.44185,-0.21739,0.51004,0.13448,-0.43141,-0.03123,0.20674,-0.78138,-0.20148,-0.097401,0.16088,-0.61836,-0.18504,-0.12461,-2.2526,-0.22321,0.5043,0.32257,0.15313,3.9636,-0.71365,-0.67012,0.28388,0.21738,0.14433,0.25926,0.23434,0.4274,-0.44451,0.13813,0.36973,-0.64289,0.024142,-0.039315,-0.26037,0.12017,-0.043782,0.41013,0.1796
"""",0.25769,0.45629,-0.76974,-0.37679,0.59272,-0.063527,0.20545,-0.57385,-0.29009,-0.13662,0.32728,1.4719,-0.73681,-0.12036,0.71354,-0.46098,0.65248,0.48887,-0.51558,0.039951,-0.34307,-0.014087,0.86488,0.3546,0.7999,-1.4995,-1.8153,0.41128,0.23921,-0.43139,3.6623,-0.79834,-0.54538,0.16943,-0.82017,-0.3461,0.69495,-1.2256,-0.17992,-0.057474,0.030498,-0.39543,-0.38515,-1.0002,0.087599,-0.31009,-0.34677,-0.31438,0.75004,0.97065
's,0.23727,0.40478,-0.20547,0.58805,0.65533,0.32867,-0.81964,-0.23236,0.27428,0.24265,0.054992,0.16296,-1.2555,-0.086437,0.44536,0.096561,-0.16519,0.058378,-0.38598,0.086977,0.003387,0.55095,-0.77697,-0.62096,0.092948,-2.5685,-0.67739,0.10151,-0.48643,-0.057805,3.1859,-0.017554,-0.16138,0.055486,-0.25885,-0.33938,-0.19928,0.26049,0.10478,-0.55934,-0.12342,0.65961,-0.51802,-0.82995,-0.082739,0.28155,-0.423,-0.27378,-0.007901,-0.030231


## \# 2. Words Similarity

Similar words have similar embeddings (or vector values). We can use cosine similarity, i.e. cos(u, v) = u·v / (|u| |v|), to measure vector similarity, where u·v is the dot product of the vectors and |u| is the L2 norm of u. Remember that we spoke in class about computing similarity based on cosine similarity (as an alternative to correlation)?

1. Normalize matrix df by norm of word vectors. 
1. Define a function to find words similarity to a given word.
1. Use the function defined to find the word in examples that is most similar to "happy".
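
As a quick check of the formula, the sketch below computes the cosine similarity of the two toy vectors `word1` and `word2` from the cheat-sheet example in two ways: directly from the definition, and as a plain dot product after normalizing each vector to unit length. The helper name `cosine` is illustrative.

```python
import numpy as np

def cosine(u, v):
    """cos(u, v) = u.v / (|u| |v|)"""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([0.1, 0.3, 0.4, 0.5])  # word1 from the earlier toy example
v = np.array([0.3, 0.4, 0.9, 0.5])  # word2 from the earlier toy example

direct = cosine(u, v)

# After dividing each vector by its L2 norm, the dot product alone is the cosine
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
via_normalization = float(u_hat @ v_hat)

print(direct, via_normalization)
```

This is why step 1 normalizes the whole matrix once: afterwards every similarity query reduces to a dot product.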


In [None]:
from sklearn import preprocessing
## YOUR CODE HERE
# 1a. Calculate norm of word vectors
# What would be the dimension of the vector_norm array?
# L2 norm of every row: a Series indexed by word, with shape (400001,)
vector_norm = np.sqrt((df ** 2).sum(axis=1))

# 1b. Normalize matrix df by norm using .div()
# axis=0 aligns the norms with the row labels, so each row is divided by its own norm
dfn = df.div(vector_norm, axis=0)
# Alternative with scikit-learn:
# dfn = pd.DataFrame(preprocessing.normalize(df, norm='l2'), index=df.index, columns=df.columns)

# 2. Define a function to find words similar to a given word in a normalized dataframe

def find_word_similarity(word, examples, dataframe):
    # Input: word - one string
    #        examples - List of strings
    #        dataframe - An indexed normalized dataframe
    ## YOUR CODE HERE
    # Calculate dot product of each word in examples to the given word, sorted by value high to low
    # Once you have the sorted values of dot products (notice because of normalization, the dot product is the cosine similarity!),
    # obtain the words corresponding to the sorted values and call it similar_words
    target = dataframe.loc[word]
    candidates = dataframe.loc[examples]
    # Dividing by the norms keeps the result a cosine similarity even if
    # `dataframe` was not normalized beforehand (the norms are 1 if it was)
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(target)
    cos_values = candidates.dot(target) / norms
    similar_words = pd.DataFrame({'value': cos_values.values, 'words': examples})
    similar_words = similar_words.sort_values(by='value', ascending=False)

    # Return words similar to the given word
    return similar_words

examples = ["sad", "bad", "evil", "healthy", "ill",
            "beaming", "cheerful", "joyful", "radiant", "glad", "upset",
            "disco", "probably", "hardly", "ephemeral", "close", "cleaning", 
            "maths", "word", "distribution"]

# 3.
# Use above function to calculate examples' similarity to happy (both "happy" and words in examples are in dfn)
print(find_word_similarity("happy", examples, dfn))

       value         words
9   0.865877          glad
13  0.816272        hardly
12  0.747581      probably
1   0.708395           bad
0   0.689063           sad
3   0.640579       healthy
18  0.599150          word
6   0.575719      cheerful
10  0.566424         upset
7   0.555032        joyful
15  0.554029         close
4   0.522978           ill
2   0.452148          evil
11  0.324669         disco
5   0.289510       beaming
16  0.246022      cleaning
19  0.160149  distribution
8   0.134971       radiant
14  0.132886     ephemeral
17 -0.011991         maths


**Answer:**

The word most similar to 'happy' is '**glad**'.

In the scikit-learn library, there is a cosine_similarity function that directly calculates vector similarity (you don't need to normalize vectors first). Let's use this function to calculate similarity again to confirm we get the same results. 
For more information, please see here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

In [None]:
examples = ["sad", "bad", "evil", "healthy", "ill",
            "beaming", "cheerful", "joyful", "radiant", "glad", "upset",
            "disco", "probably", "hardly", "ephemeral", "close", "cleaning", 
            "maths", "word", "distribution"]

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
similarity2 = pd.DataFrame(cosine_similarity(np.array(df.loc['happy']).reshape(1,50),np.array(df.loc[examples])),columns = examples)
similarity2 = similarity2.T
similarity2 = similarity2.sort_values(by = 0,ascending = False)
print(similarity2)

                     0
glad          0.865877
hardly        0.816272
probably      0.747581
bad           0.708395
sad           0.689063
healthy       0.640579
word          0.599150
cheerful      0.575719
upset         0.566424
joyful        0.555032
close         0.554029
ill           0.522978
evil          0.452148
disco         0.324669
beaming       0.289510
cleaning      0.246022
distribution  0.160149
radiant       0.134971
ephemeral     0.132886
maths        -0.011991


## \# 3. Goodness of similarity
Comment on how good the GloVe embeddings are at finding similar words to a given word using cosine similarity. GloVe and word2vec embeddings are based on co-occurrence of words in sentences across hundreds of thousands of documents on the web. Would this help explain your observations on word similarity?
What if you replace "happy" with "sad"? How do the results change?


**Answer:**

As a word-vector representation algorithm, GloVe combines the global statistics used by LSA with the local context used by word2vec. Additionally, GloVe is very efficient, and its computational scale is proportional to the size of the corpus. Even when the corpus is not large or the word-vector dimension is small, it can still perform well.

Yes, it would. Documents on the web cover almost every form of human language: formal and informal writing, scholarly articles, novels, even Old English, slang and uncommon usages. This variety greatly enriches the type and capacity of the corpus, giving the model enough material to learn from and keeping it from settling into poor local solutions. This is a great benefit for word-similarity judgments.

In [None]:
print (find_word_similarity("sad", examples,df))

       value         words
0   1.000000           sad
13  0.707888        hardly
1   0.664580           bad
9   0.644321          glad
7   0.587249        joyful
18  0.559652          word
12  0.535477      probably
6   0.531766      cheerful
10  0.501137         upset
4   0.487844           ill
2   0.401493          evil
3   0.372976       healthy
11  0.362746         disco
15  0.311398         close
8   0.244188       radiant
14  0.234799     ephemeral
5   0.194717       beaming
16  0.135319      cleaning
17  0.030129         maths
19  0.008951  distribution


## \# 4. Correlations
(This question is more of a reflection, to build your intuition on how the correlations we discussed in class connect to a real-world data set. It is open ended!)
What are some of the words most correlated with "happy" and "sad" in the similarity search you did earlier? Likewise, what are some of the words least correlated with "happy" and "sad"? Does this make sense? How would you improve the results? If "happy" and "sad" were each a random variable, what factors would make the correlation between them (as you computed above) high?


**Answer:**

The word that is highly correlated with both '**happy**' and '**sad**' is '**hardly**': it ranks second for 'happy' (after 'glad') and first for 'sad' (excluding 'sad' itself). It does not make sense. 

The least correlated word is '**maths**' for 'happy' and '**distribution**' (closely followed by 'maths') for 'sad'. I suppose that makes sense. 

I think the model could be improved by adding restrictions and conditions that keep unrelated words such as 'hardly' away from the input word.
A practical method that I prefer is to introduce a detector, a word different from both the inputs and the results, to show the relationship between inputs and outputs. The relationship could be measured by cosine similarity, the dot product of vectors, or something else.

In my opinion, both **sad** and **happy** are emotional words that people use all the time, so they may appear together in sentences where people express their thoughts and feelings or make comparisons. Additionally, they appear in the same contexts as other emotional words, for example glad, upset, frustrated and angry, which also leads to a high correlation. Thus, the correlation between happy and sad is relatively high.
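
To connect cosine similarity with the correlation discussed in class: the Pearson correlation of two vectors equals the cosine similarity of the same vectors after each has been mean-centered. The sketch below checks this numerically on the two toy vectors from the cheat-sheet example (not on the GloVe vectors themselves); the helper name `cosine` is illustrative.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([0.1, 0.3, 0.4, 0.5])
v = np.array([0.3, 0.4, 0.9, 0.5])

# Pearson correlation between the components of u and v
pearson = float(np.corrcoef(u, v)[0, 1])

# Cosine similarity after subtracting each vector's own mean
centered_cosine = cosine(u - u.mean(), v - v.mean())

print(pearson, centered_cosine)
```

The two numbers agree, so the cosine scores above behave like correlations whenever the vector components have a mean close to zero.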

## \# 5. Find nearest neighbourhood

It is helpful to compute the nearest neighbors of a word based on the cosine similarity that we defined earlier, so that given a word we can find the other words most similar to it. Sometimes, the nearest neighbors according to this metric reveal an overlap of concepts or topics that a word shares. E.g. "government" might be related to the word "politics" because they both share topics related to public policy, politicians, parties, elections, etc. The idea is that whatever embeddings we are using, word2vec or GloVe, are "hopefully" able to capture these correlations.

1. Define a function to find the top n similar words to a given word. You can use either the dot product of vectors or the cosine_similarity function. Note that the search space for words comes from your pandas data frame (so unlike the similarity problem we worked on earlier, we are not restricted to only a few words to search from: the search space here is the entire vocab captured in your data frame).
1. Find the 20 nearest neighbors of the words 'duck' and 'animal'.
1. Find the neighborhood intersection of 'duck' and 'animal', to find which words are similar to both 'duck' and 'animal'. This is also related to a similarity measure called "Jaccard Similarity". Read up on this here: https://en.wikipedia.org/wiki/Jaccard_index
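
The Jaccard index mentioned in the last item is the size of the intersection divided by the size of the union of two sets. A minimal sketch on two made-up neighbor lists (the function name and the lists are illustrative only):

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

# Hypothetical neighbor lists: 3 shared words, 6 distinct words overall
neighbors_a = ["pig", "fish", "dog", "crab"]
neighbors_b = ["pig", "fish", "dog", "bird", "cow"]
score = jaccard_index(neighbors_a, neighbors_b)
print(score)
```

Applied to two top-n neighbor lists, a score near 1 means the two words share almost the same neighborhood, and a score near 0 means the neighborhoods barely overlap.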


In [None]:
# define a function to find the top n similar words to a given word in the 'df'

# PART 1
from sklearn.metrics.pairwise import cosine_similarity
def find_most_similar(df, word, n):
    # INPUT: 
    # df: Given Data frame
    # word: String
    # n: Number of similar words to return
    
    # OUTPUT:
    # the list of similar words to return
    
    ## YOUR CODE HERE
    # define and compute the most similar words
    # Use a similarity measure like cosine similarity (like earlier) to do so
  similarity2 = pd.DataFrame(cosine_similarity(np.array(df.loc[word]).reshape(1,50),np.array(df)),columns = df.index.values.tolist())
  similarity2 = similarity2.T
  similarity2 = similarity2.sort_values(by = 0,ascending = False)
  similar_words = similarity2.iloc[:n]
  return similar_words


# PART 2
# find top 20 similar words to duck
simil1 = find_most_similar(df, "duck", 20)
# find top 20 similar words to animal
simil2 = find_most_similar(df, "animal", 20)

# PART 3
# find the intersection of simil1 and simil2
#intersection =  (concat function of pandas is helpful here)
intersection = [key for key in simil2.index if key in simil1.index]
intersection = pd.concat([simil1.loc[intersection],simil2.loc[intersection]],axis = 1)
print (intersection)

word_labels = ['duck', 'animal'] + list(intersection.index)

             0         0
pig   0.739596  0.735997
fish  0.676019  0.728633
dog   0.693291  0.725226


In [None]:
print(simil1)

                 0
duck      1.000000
crab      0.775702
lobster   0.762002
lame      0.746981
rabbit    0.745671
pig       0.739596
goose     0.735946
chicken   0.725202
grilled   0.722439
fried     0.712361
shrimp    0.707118
cat       0.703251
dog       0.693291
darkwing  0.685793
goat      0.681566
monkey    0.678014
confit    0.677600
fish      0.676019
bite      0.673196
broiled   0.670877


In [None]:
print(simil2)

                  0
animal     1.000000
animals    0.900348
bird       0.800324
human      0.772533
dogs       0.758666
pet        0.747176
pig        0.735997
feeding    0.731338
fish       0.728633
insect     0.727963
humans     0.726976
pigs       0.726021
dog        0.725226
elephant   0.723749
found      0.722678
cow        0.722619
birds      0.719840
livestock  0.719640
eating     0.714771
breeding   0.705041


In [None]:
# jaccard similarity
def jaccard(p, q):
    # Compare the word labels (the index) of the two result frames
    a, b = set(p.index), set(q.index)
    c = a & b
    return float(len(c)) / (len(a) + len(b) - len(c))

print('the jaccard similarity is :', '%.4f' % jaccard(simil1, simil2))

the jaccard similarity is : 0.0811


## \# 6 Word analogies

Suppose you know the word vectors for King, Man and Woman. What is your intuitive answer for the 'riddle', King - Man + Woman = ? 
Let's go through below steps to derive the answer for this 'riddle' using the word embeddings.

1. Use vector arithmetic to define a new vector equal to k - m + w (e.g. the king, man and woman combination).
2. Calculate similarity of all the words in the corpus to the new vector and sort them by their similarity high to low. 
3. Return the top n vectors which have the highest similarity to the new vector.
1. Find the answers for the riddles, 
    1. good:bad::up:?
    1. germany:merkel::america:?
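
The steps above can be sketched on a tiny hand-made vocabulary before running them on GloVe. The 3-dimensional vectors below are invented so that the arithmetic works out exactly; real embeddings only approximate this. The names `toy`, `cosine` and `analogy` are illustrative.

```python
import numpy as np

# Hypothetical 3-D vectors: (royalty, male, female)
toy = {
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "child": np.array([0.0, 0.5, 0.5]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(x, y, a, vectors):
    """Solve 'x is to y as a is to ?' by ranking words against a - x + y."""
    target = vectors[a] - vectors[x] + vectors[y]
    # Leave out the three input words so they cannot be returned as the answer
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in (x, y, a)}
    return max(scores, key=scores.get)

answer = analogy("man", "woman", "king", toy)
print(answer)
```

Excluding the input words matters in practice because the combined vector often stays closest to one of them.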



In [None]:
def find_most_similar_withoutkeywords(df, vec, n,x,y,a):#vec: new vector which is not in df
  similarity2 = pd.DataFrame(cosine_similarity(vec.reshape(1,50),np.array(df)),columns = df.index.values.tolist())
  similarity2 = similarity2.T
  similarity2 = similarity2.sort_values(by = 0,ascending = False)
  similarity2.drop([x,y,a],inplace = True)
  similar_words = similarity2.iloc[:n]
  return similar_words

In [None]:
def find_most_similar_withkeywords(df, vec, n):#vec: new vector which is not in df
  similarity2 = pd.DataFrame(cosine_similarity(vec.reshape(1,50),np.array(df)),columns = df.index.values.tolist())
  similarity2 = similarity2.T
  similarity2 = similarity2.sort_values(by = 0,ascending = False)
  similar_words = similarity2.iloc[:n]
  return similar_words

In [None]:
# define a function to solve the problem of x is to y as a is to ?
# 'n' is the number of top words similar to the vector to return
# 'dataframe' is the indexed dataframe that contains all the words

# PART 1,2,3 above (Fill in the missing pieces)
def solve_riddle(x, y, a, n, dataframe):
    ## YOUR CODE HERE
    # calculate the vector of a + y - x, where a, x, y are in dataframe
    # e.g. x = man, y = woman, a = king
    cal_vec = np.array(dataframe.loc[a] + dataframe.loc[y] - dataframe.loc[x])
    
    # calculate similarity of words in dataframe to cal_vec, sorted by similarity high to low
    # (use the `dataframe` argument rather than the global `df`)
    similarity_with = find_most_similar_withkeywords(dataframe, cal_vec, n)
    similarity_without = find_most_similar_withoutkeywords(dataframe, cal_vec, n, x, y, a)

    # return the top n words and similarities closest to cal_vec, with and without the input words
    return similarity_with, similarity_without
result_with_keywords, result_without_keywords = solve_riddle("man", "woman", "king", 5,df)
# Call the solve_riddle function to compute the top answers
print('the result of riddle with known words is:')
print(result_with_keywords)
print('the result of riddle without known words is:')
print(result_without_keywords)

the result of riddle with known words is:
                 0
king      0.885983
queen     0.860958
daughter  0.768451
prince    0.764070
throne    0.763497
the result of riddle without known words is:
                 0
queen     0.860958
daughter  0.768451
prince    0.764070
throne    0.763497
princess  0.751273


The answer is '**queen**'

In [None]:
## YOUR CODE HERE
# Solve the other two riddles
# good:bad::up:?
# PART 4
#print solve_riddle()
result_with_keywords2, result_without_keywords2 = solve_riddle("good", "bad", "up", 5,df)
# Call the solve_riddle function to compute the top answers
print('the result of riddle with known words is:')
print(result_with_keywords2)
print('the result of riddle without known words is:')
print(result_without_keywords2)


the result of riddle with known words is:
                 0
down      0.850167
up        0.818045
falling   0.813844
out       0.792837
dropping  0.782064
the result of riddle without known words is:
                 0
down      0.850167
falling   0.813844
out       0.792837
dropping  0.782064
off       0.778428


The answer is '**down**'

In [None]:
# germany:merkel::america:?
#print solve_riddle()
result_with_keywords3, result_without_keywords3 = solve_riddle("germany", "merkel", "america", 5,df)
# Call the solve_riddle function to compute the top answers
print('the result of riddle with known words is:')
print(result_with_keywords3)
print('the result of riddle without known words is:')
print(result_without_keywords3)

the result of riddle with known words is:
                0
obama    0.694289
barack   0.682594
hillary  0.660910
bush     0.657911
clinton  0.655765
the result of riddle without known words is:
                0
obama    0.694289
barack   0.682594
hillary  0.660910
bush     0.657911
clinton  0.655765


The answer is '**obama**' or '**barack obama**'