<a href="https://colab.research.google.com/github/dasmiq/cs6200-hw3/blob/main/embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cross-Language Word Embeddings

In class, we discussed how we can reduce the dimensionality of word representations from their original vector space to an embedding space on the order of a few hundred dimensions. Different modeling choices for word embeddings may be ultimately evaluated by the effectiveness of classification or retrieval models.

In this assignment, however, we will consider another common method of evaluating word embeddings: by judging the usefulness of pairwise distances between words in the embedding space.

Follow along with the examples in this notebook, and implement the sections of code flagged with **TODO**.

In [1]:
import gensim
import numpy as np
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

In [2]:
!pip3 install wget



In [3]:
import wget
wget.download("http://www.ccs.neu.edu/home/dasmith/courses/cs6200/shakespeare_plays.txt")

  0% [                                                      ]       0 / 4746840  0% [                                                      ]    8192 / 4746840  0% [                                                      ]   16384 / 4746840  0% [                                                      ]   24576 / 4746840  0% [                                                      ]   32768 / 4746840  0% [                                                      ]   40960 / 4746840  1% [                                                      ]   49152 / 4746840  1% [                                                      ]   57344 / 4746840  1% [                                                      ]   65536 / 4746840  1% [                                                      ]   73728 / 4746840  1% [                                                      ]   81920 / 4746840  1% [.                                                     ]   90112 / 4746840  2% [.                                

 44% [.......................                               ] 2097152 / 4746840 44% [.......................                               ] 2105344 / 4746840 44% [........................                              ] 2113536 / 4746840 44% [........................                              ] 2121728 / 4746840 44% [........................                              ] 2129920 / 4746840 45% [........................                              ] 2138112 / 4746840 45% [........................                              ] 2146304 / 4746840 45% [........................                              ] 2154496 / 4746840 45% [........................                              ] 2162688 / 4746840 45% [........................                              ] 2170880 / 4746840 45% [........................                              ] 2179072 / 4746840 46% [........................                              ] 2187264 / 4746840 46% [........................         

'shakespeare_plays (3).txt'

We'll start by downloading a plain-text version of the plays of William Shakespeare.

In [4]:
#!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6200/shakespeare_plays.txt
lines = [s.split() for s in open('shakespeare_plays.txt')]

Then, we'll estimate a simple word2vec model on the Shakespeare texts.

In [14]:
model = Word2Vec(lines)

Even with such a small training set size, you can perform some standard analogy tasks.

In [15]:
model.wv.most_similar(positive=['king','woman'], negative=['man'])

[('queen', 0.7801874279975891),
 ('prince', 0.7033681273460388),
 ('princess', 0.6965100765228271),
 ('york', 0.6962099671363831),
 ('france', 0.6931193470954895),
 ('duke', 0.6859343647956848),
 ('warwick', 0.6854237914085388),
 ('clarence', 0.6798115968704224),
 ('gloucester', 0.6746828556060791),
 ('suffolk', 0.6614434123039246)]

In other words, we want a vector close to `king` and `woman` but subtracting the dimensions that are important to `man`, i.e., `queen`. Other words are mostly noble titles important in Shakespeare's "history" plays.

For the rest of this assignment, we will focus on finding words with similar embeddings, both within and across languages. For example, what words are similar to the name of the title character of *Othello*?

In [16]:
model.wv.most_similar(positive=['othello'])
#model.wv.most_similar(positive=['brutus'])

[('iago', 0.9569085240364075),
 ('desdemona', 0.9544375538825989),
 ('emilia', 0.9262716174125671),
 ('cleopatra', 0.9184913039207458),
 ('cassio', 0.8948360681533813),
 ('ham', 0.891790509223938),
 ('troilus', 0.8884052634239197),
 ('cressida', 0.8848922848701477),
 ('imogen', 0.8826351165771484),
 ('kent', 0.880477249622345)]

If you know the play, you might see some familiar names.

This search uses cosine similarity. In the default API, you should see the same similarity between the words `othello` and `desdemona` as in the search results above.

In [17]:
model.wv.similarity('othello', 'desdemona')

0.95443773

**TODO**: Your **first task**, therefore, is to implement your own cosine similarity function so that you can reuse it outside of the context of the gensim model object.

In [19]:
## TODO: Implement cosim
import numpy as np
from numpy.linalg import norm
def cosim(v1, v2):
  ## return cosine similarity between v1 and v2
  cosine = np.dot(v1,v2)/(norm(v1)*norm(v2))
  return cosine
  #return 0

## This should give a result similar to model.wv.similarity:
cosim(model.wv['othello'], model.wv['desdemona'])

0.95443785

## Evaluation

We could collect a lot of human judgments about how similar pairs of words, or pairs of Shakespearean characters, are. Then we could compare different word-embedding models by their ability to replicate these human judgments.

If we extend our ambition to multiple languages, however, we can use a word translation task to evaluate word embeddings.

We will use a subset of [Facebook AI's FastText cross-language embeddings](https://fasttext.cc/docs/en/aligned-vectors.html) for several languages. Your task will be to compare English both to French, and to *one more language* from the following set: Arabic, German, Portuguese, Russian, Spanish, Vietnamese, and Chinese.

In [20]:
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6200/30k.en.vec
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6200/30k.fr.vec
wget.download("http://www.ccs.neu.edu/home/dasmith/courses/cs6200/30k.en.vec")
wget.download("http://www.ccs.neu.edu/home/dasmith/courses/cs6200/30k.fr.vec")
# TODO: uncomment at least one of these to work with another language
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6200/30k.ar.vec
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6200/30k.de.vec
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6200/30k.pt.vec
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6200/30k.ru.vec
wget.download("http://www.ccs.neu.edu/home/dasmith/courses/cs6200/30k.es.vec")
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6200/30k.vi.vec
# !wget http://www.ccs.neu.edu/home/dasmith/courses/cs6200/30k.zh.vec

  0% [                                                    ]        0 / 67681172  0% [                                                    ]     8192 / 67681172  0% [                                                    ]    16384 / 67681172  0% [                                                    ]    24576 / 67681172  0% [                                                    ]    32768 / 67681172  0% [                                                    ]    40960 / 67681172  0% [                                                    ]    49152 / 67681172  0% [                                                    ]    57344 / 67681172  0% [                                                    ]    65536 / 67681172  0% [                                                    ]    73728 / 67681172  0% [                                                    ]    81920 / 67681172  0% [                                                    ]    90112 / 67681172  0% [                                 

  2% [.                                                   ]  1835008 / 67681172  2% [.                                                   ]  1843200 / 67681172  2% [.                                                   ]  1851392 / 67681172  2% [.                                                   ]  1859584 / 67681172  2% [.                                                   ]  1867776 / 67681172  2% [.                                                   ]  1875968 / 67681172  2% [.                                                   ]  1884160 / 67681172  2% [.                                                   ]  1892352 / 67681172  2% [.                                                   ]  1900544 / 67681172  2% [.                                                   ]  1908736 / 67681172  2% [.                                                   ]  1916928 / 67681172  2% [.                                                   ]  1925120 / 67681172  2% [.                                

 13% [.......                                             ]  9240576 / 67681172 13% [.......                                             ]  9248768 / 67681172 13% [.......                                             ]  9256960 / 67681172 13% [.......                                             ]  9265152 / 67681172 13% [.......                                             ]  9273344 / 67681172 13% [.......                                             ]  9281536 / 67681172 13% [.......                                             ]  9289728 / 67681172 13% [.......                                             ]  9297920 / 67681172 13% [.......                                             ]  9306112 / 67681172 13% [.......                                             ]  9314304 / 67681172 13% [.......                                             ]  9322496 / 67681172 13% [.......                                             ]  9330688 / 67681172 13% [.......                          

 17% [.........                                           ] 11812864 / 67681172 17% [.........                                           ] 11821056 / 67681172 17% [.........                                           ] 11829248 / 67681172 17% [.........                                           ] 11837440 / 67681172 17% [.........                                           ] 11845632 / 67681172 17% [.........                                           ] 11853824 / 67681172 17% [.........                                           ] 11862016 / 67681172 17% [.........                                           ] 11870208 / 67681172 17% [.........                                           ] 11878400 / 67681172 17% [.........                                           ] 11886592 / 67681172 17% [.........                                           ] 11894784 / 67681172 17% [.........                                           ] 11902976 / 67681172 17% [.........                        

 32% [................                                    ] 21708800 / 67681172 32% [................                                    ] 21716992 / 67681172 32% [................                                    ] 21725184 / 67681172 32% [................                                    ] 21733376 / 67681172 32% [................                                    ] 21741568 / 67681172 32% [................                                    ] 21749760 / 67681172 32% [................                                    ] 21757952 / 67681172 32% [................                                    ] 21766144 / 67681172 32% [................                                    ] 21774336 / 67681172 32% [................                                    ] 21782528 / 67681172 32% [................                                    ] 21790720 / 67681172 32% [................                                    ] 21798912 / 67681172 32% [................                 

 38% [....................                                ] 26288128 / 67681172 38% [....................                                ] 26296320 / 67681172 38% [....................                                ] 26304512 / 67681172 38% [....................                                ] 26312704 / 67681172 38% [....................                                ] 26320896 / 67681172 38% [....................                                ] 26329088 / 67681172 38% [....................                                ] 26337280 / 67681172 38% [....................                                ] 26345472 / 67681172 38% [....................                                ] 26353664 / 67681172 38% [....................                                ] 26361856 / 67681172 38% [....................                                ] 26370048 / 67681172 38% [....................                                ] 26378240 / 67681172 38% [....................             

 44% [.......................                             ] 29966336 / 67681172 44% [.......................                             ] 29974528 / 67681172 44% [.......................                             ] 29982720 / 67681172 44% [.......................                             ] 29990912 / 67681172 44% [.......................                             ] 29999104 / 67681172 44% [.......................                             ] 30007296 / 67681172 44% [.......................                             ] 30015488 / 67681172 44% [.......................                             ] 30023680 / 67681172 44% [.......................                             ] 30031872 / 67681172 44% [.......................                             ] 30040064 / 67681172 44% [.......................                             ] 30048256 / 67681172 44% [.......................                             ] 30056448 / 67681172 44% [.......................          

 57% [.............................                       ] 38731776 / 67681172 57% [.............................                       ] 38739968 / 67681172 57% [.............................                       ] 38748160 / 67681172 57% [.............................                       ] 38756352 / 67681172 57% [.............................                       ] 38764544 / 67681172 57% [.............................                       ] 38772736 / 67681172 57% [.............................                       ] 38780928 / 67681172 57% [.............................                       ] 38789120 / 67681172 57% [.............................                       ] 38797312 / 67681172 57% [.............................                       ] 38805504 / 67681172 57% [.............................                       ] 38813696 / 67681172 57% [.............................                       ] 38821888 / 67681172 57% [.............................    

 71% [.....................................               ] 48185344 / 67681172 71% [.....................................               ] 48193536 / 67681172 71% [.....................................               ] 48201728 / 67681172 71% [.....................................               ] 48209920 / 67681172 71% [.....................................               ] 48218112 / 67681172 71% [.....................................               ] 48226304 / 67681172 71% [.....................................               ] 48234496 / 67681172 71% [.....................................               ] 48242688 / 67681172 71% [.....................................               ] 48250880 / 67681172 71% [.....................................               ] 48259072 / 67681172 71% [.....................................               ] 48267264 / 67681172 71% [.....................................               ] 48275456 / 67681172 71% [.................................

 85% [............................................        ] 58097664 / 67681172 85% [............................................        ] 58105856 / 67681172 85% [............................................        ] 58114048 / 67681172 85% [............................................        ] 58122240 / 67681172 85% [............................................        ] 58130432 / 67681172 85% [............................................        ] 58138624 / 67681172 85% [............................................        ] 58146816 / 67681172 85% [............................................        ] 58155008 / 67681172 85% [............................................        ] 58163200 / 67681172 85% [............................................        ] 58171392 / 67681172 85% [............................................        ] 58179584 / 67681172 85% [............................................        ] 58187776 / 67681172 85% [.................................

 94% [.................................................   ] 64217088 / 67681172 94% [.................................................   ] 64225280 / 67681172 94% [.................................................   ] 64233472 / 67681172 94% [.................................................   ] 64241664 / 67681172 94% [.................................................   ] 64249856 / 67681172 94% [.................................................   ] 64258048 / 67681172 94% [.................................................   ] 64266240 / 67681172 94% [.................................................   ] 64274432 / 67681172 94% [.................................................   ] 64282624 / 67681172 94% [.................................................   ] 64290816 / 67681172 95% [.................................................   ] 64299008 / 67681172 95% [.................................................   ] 64307200 / 67681172 95% [.................................

  0% [                                                    ]        0 / 67802327  0% [                                                    ]     8192 / 67802327  0% [                                                    ]    16384 / 67802327  0% [                                                    ]    24576 / 67802327  0% [                                                    ]    32768 / 67802327  0% [                                                    ]    40960 / 67802327  0% [                                                    ]    49152 / 67802327  0% [                                                    ]    57344 / 67802327  0% [                                                    ]    65536 / 67802327  0% [                                                    ]    73728 / 67802327  0% [                                                    ]    81920 / 67802327  0% [                                                    ]    90112 / 67802327  0% [                                 

  2% [.                                                   ]  1507328 / 67802327  2% [.                                                   ]  1515520 / 67802327  2% [.                                                   ]  1523712 / 67802327  2% [.                                                   ]  1531904 / 67802327  2% [.                                                   ]  1540096 / 67802327  2% [.                                                   ]  1548288 / 67802327  2% [.                                                   ]  1556480 / 67802327  2% [.                                                   ]  1564672 / 67802327  2% [.                                                   ]  1572864 / 67802327  2% [.                                                   ]  1581056 / 67802327  2% [.                                                   ]  1589248 / 67802327  2% [.                                                   ]  1597440 / 67802327  2% [.                                

 10% [.....                                               ]  6881280 / 67802327 10% [.....                                               ]  6889472 / 67802327 10% [.....                                               ]  6897664 / 67802327 10% [.....                                               ]  6905856 / 67802327 10% [.....                                               ]  6914048 / 67802327 10% [.....                                               ]  6922240 / 67802327 10% [.....                                               ]  6930432 / 67802327 10% [.....                                               ]  6938624 / 67802327 10% [.....                                               ]  6946816 / 67802327 10% [.....                                               ]  6955008 / 67802327 10% [.....                                               ]  6963200 / 67802327 10% [.....                                               ]  6971392 / 67802327 10% [.....                            

 18% [.........                                           ] 12812288 / 67802327 18% [.........                                           ] 12820480 / 67802327 18% [.........                                           ] 12828672 / 67802327 18% [.........                                           ] 12836864 / 67802327 18% [.........                                           ] 12845056 / 67802327 18% [.........                                           ] 12853248 / 67802327 18% [.........                                           ] 12861440 / 67802327 18% [.........                                           ] 12869632 / 67802327 18% [.........                                           ] 12877824 / 67802327 19% [.........                                           ] 12886016 / 67802327 19% [.........                                           ] 12894208 / 67802327 19% [.........                                           ] 12902400 / 67802327 19% [.........                        

 23% [............                                        ] 16244736 / 67802327 23% [............                                        ] 16252928 / 67802327 23% [............                                        ] 16261120 / 67802327 23% [............                                        ] 16269312 / 67802327 24% [............                                        ] 16277504 / 67802327 24% [............                                        ] 16285696 / 67802327 24% [............                                        ] 16293888 / 67802327 24% [............                                        ] 16302080 / 67802327 24% [............                                        ] 16310272 / 67802327 24% [............                                        ] 16318464 / 67802327 24% [............                                        ] 16326656 / 67802327 24% [............                                        ] 16334848 / 67802327 24% [............                     

 38% [...................                                 ] 26025984 / 67802327 38% [...................                                 ] 26034176 / 67802327 38% [...................                                 ] 26042368 / 67802327 38% [...................                                 ] 26050560 / 67802327 38% [...................                                 ] 26058752 / 67802327 38% [...................                                 ] 26066944 / 67802327 38% [...................                                 ] 26075136 / 67802327 38% [....................                                ] 26083328 / 67802327 38% [....................                                ] 26091520 / 67802327 38% [....................                                ] 26099712 / 67802327 38% [....................                                ] 26107904 / 67802327 38% [....................                                ] 26116096 / 67802327 38% [....................             

 53% [...........................                         ] 36257792 / 67802327 53% [...........................                         ] 36265984 / 67802327 53% [...........................                         ] 36274176 / 67802327 53% [...........................                         ] 36282368 / 67802327 53% [...........................                         ] 36290560 / 67802327 53% [...........................                         ] 36298752 / 67802327 53% [...........................                         ] 36306944 / 67802327 53% [...........................                         ] 36315136 / 67802327 53% [...........................                         ] 36323328 / 67802327 53% [...........................                         ] 36331520 / 67802327 53% [...........................                         ] 36339712 / 67802327 53% [...........................                         ] 36347904 / 67802327 53% [...........................      

 64% [.................................                   ] 43859968 / 67802327 64% [.................................                   ] 43868160 / 67802327 64% [.................................                   ] 43876352 / 67802327 64% [.................................                   ] 43884544 / 67802327 64% [.................................                   ] 43892736 / 67802327 64% [.................................                   ] 43900928 / 67802327 64% [.................................                   ] 43909120 / 67802327 64% [.................................                   ] 43917312 / 67802327 64% [.................................                   ] 43925504 / 67802327 64% [.................................                   ] 43933696 / 67802327 64% [.................................                   ] 43941888 / 67802327 64% [.................................                   ] 43950080 / 67802327 64% [.................................

 76% [.......................................             ] 51937280 / 67802327 76% [.......................................             ] 51945472 / 67802327 76% [.......................................             ] 51953664 / 67802327 76% [.......................................             ] 51961856 / 67802327 76% [.......................................             ] 51970048 / 67802327 76% [.......................................             ] 51978240 / 67802327 76% [.......................................             ] 51986432 / 67802327 76% [.......................................             ] 51994624 / 67802327 76% [.......................................             ] 52002816 / 67802327 76% [.......................................             ] 52011008 / 67802327 76% [.......................................             ] 52019200 / 67802327 76% [.......................................             ] 52027392 / 67802327 76% [.................................

 87% [.............................................       ] 59629568 / 67802327 87% [.............................................       ] 59637760 / 67802327 87% [.............................................       ] 59645952 / 67802327 87% [.............................................       ] 59654144 / 67802327 87% [.............................................       ] 59662336 / 67802327 88% [.............................................       ] 59670528 / 67802327 88% [.............................................       ] 59678720 / 67802327 88% [.............................................       ] 59686912 / 67802327 88% [.............................................       ] 59695104 / 67802327 88% [.............................................       ] 59703296 / 67802327 88% [.............................................       ] 59711488 / 67802327 88% [.............................................       ] 59719680 / 67802327 88% [.................................

 96% [..................................................  ] 65372160 / 67802327 96% [..................................................  ] 65380352 / 67802327 96% [..................................................  ] 65388544 / 67802327 96% [..................................................  ] 65396736 / 67802327 96% [..................................................  ] 65404928 / 67802327 96% [..................................................  ] 65413120 / 67802327 96% [..................................................  ] 65421312 / 67802327 96% [..................................................  ] 65429504 / 67802327 96% [..................................................  ] 65437696 / 67802327 96% [..................................................  ] 65445888 / 67802327 96% [..................................................  ] 65454080 / 67802327 96% [..................................................  ] 65462272 / 67802327 96% [.................................

  0% [                                                    ]        0 / 67762853  0% [                                                    ]     8192 / 67762853  0% [                                                    ]    16384 / 67762853  0% [                                                    ]    24576 / 67762853  0% [                                                    ]    32768 / 67762853  0% [                                                    ]    40960 / 67762853  0% [                                                    ]    49152 / 67762853  0% [                                                    ]    57344 / 67762853  0% [                                                    ]    65536 / 67762853  0% [                                                    ]    73728 / 67762853  0% [                                                    ]    81920 / 67762853  0% [                                                    ]    90112 / 67762853  0% [                                 

  1% [                                                    ]  1228800 / 67762853  1% [                                                    ]  1236992 / 67762853  1% [                                                    ]  1245184 / 67762853  1% [                                                    ]  1253376 / 67762853  1% [                                                    ]  1261568 / 67762853  1% [                                                    ]  1269760 / 67762853  1% [                                                    ]  1277952 / 67762853  1% [                                                    ]  1286144 / 67762853  1% [                                                    ]  1294336 / 67762853  1% [                                                    ]  1302528 / 67762853  1% [.                                                   ]  1310720 / 67762853  1% [.                                                   ]  1318912 / 67762853  1% [.                                

  5% [...                                                 ]  4046848 / 67762853  5% [...                                                 ]  4055040 / 67762853  5% [...                                                 ]  4063232 / 67762853  6% [...                                                 ]  4071424 / 67762853  6% [...                                                 ]  4079616 / 67762853  6% [...                                                 ]  4087808 / 67762853  6% [...                                                 ]  4096000 / 67762853  6% [...                                                 ]  4104192 / 67762853  6% [...                                                 ]  4112384 / 67762853  6% [...                                                 ]  4120576 / 67762853  6% [...                                                 ]  4128768 / 67762853  6% [...                                                 ]  4136960 / 67762853  6% [...                              

  9% [.....                                               ]  6586368 / 67762853  9% [.....                                               ]  6594560 / 67762853  9% [.....                                               ]  6602752 / 67762853  9% [.....                                               ]  6610944 / 67762853  9% [.....                                               ]  6619136 / 67762853  9% [.....                                               ]  6627328 / 67762853  9% [.....                                               ]  6635520 / 67762853  9% [.....                                               ]  6643712 / 67762853  9% [.....                                               ]  6651904 / 67762853  9% [.....                                               ]  6660096 / 67762853  9% [.....                                               ]  6668288 / 67762853  9% [.....                                               ]  6676480 / 67762853  9% [.....                            

 19% [.........                                           ] 12926976 / 67762853 19% [.........                                           ] 12935168 / 67762853 19% [.........                                           ] 12943360 / 67762853 19% [.........                                           ] 12951552 / 67762853 19% [.........                                           ] 12959744 / 67762853 19% [.........                                           ] 12967936 / 67762853 19% [.........                                           ] 12976128 / 67762853 19% [.........                                           ] 12984320 / 67762853 19% [.........                                           ] 12992512 / 67762853 19% [.........                                           ] 13000704 / 67762853 19% [.........                                           ] 13008896 / 67762853 19% [.........                                           ] 13017088 / 67762853 19% [.........                        

100% [....................................................] 67762853 / 67762853

'30k.es (1).vec'

We'll start by loading the word vectors from their textual file format to a dictionary mapping words to numpy arrays.

In [21]:
def vecref(s):
  (word, srec) = s.split(' ', 1)
  return (word, np.fromstring(srec, sep=' '))

def ftvectors(fname):
  return { k:v for (k, v) in [vecref(s) for s in open(fname)] if len(v) > 1} 

envec = ftvectors('30k.en.vec')
frvec = ftvectors('30k.fr.vec')

# TODO: load vectors for one more language, such as zhvec (Chinese)
# arvec = ftvectors('30k.ar.vec')
# devec = ftvectors('30k.de.vec')
# ptvec = ftvectors('30k.pt.vec')
# ruvec = ftvectors('30k.ru.vec')
esvec = ftvectors('30k.es.vec')
# vivec = ftvectors('30k.vi.vec')
# zhvec = ftvectors('30k.zh.vec')

**TODO**: Your next task is to write a simple function that takes a vector and a dictionary of vectors and finds the most similar item in the dictionary. For this assignment, a linear scan through the dictionary using your `cosim` function from above is acceptible.

In [31]:
## TODO: implement this search function
import numpy as np

def mostSimilar(vec, vd, k, k_most_similar=1):
    n = k_most_similar
    vec = np.array(vec)
    similarity = np.dot(vd,vec)/(norm(vd, axis = 1)*norm(vec))
    idx = np.argpartition(similarity, -n)[-n:]
    indices = idx[np.argsort((-similarity)[idx])]
    most_similar = k[indices]
    l = []
    for i in range(len(indices)):
        l.append([k[indices[i]],similarity[indices[i]]])
        
    return (l)

## some example searches
k = np.array(list(frvec.keys()))
v = np.array(list(frvec.values()))
[mostSimilar(envec[e], v, k) for e in ['computer', 'germany', 'matrix', 'physics', 'yeast']]

[[['informatique', 0.5023827767603762]],
 [['allemagne', 0.5937184138759636]],
 [['matrice', 0.5088361302065516]],
 [['physique', 0.45555434347963947]],
 [['fermentation', 0.3504105196166513]]]

Some matches make more sense than others. Note that `computer` most closely matches `informatique`, the French term for *computer science*. If you looked further down the list, you would see `ordinateur`, the term for *computer*. This is one weakness of a focus only on embeddings for word *types* independent of context.

To evalute cross-language embeddings more broadly, we'll look at a dataset of links between Wikipedia articles.

In [24]:
#!wget http://www.ccs.neu.edu/home/dasmith/courses/cs6200/links.tab
wget.download("http://www.ccs.neu.edu/home/dasmith/courses/cs6200/links.tab")
links = [s.split() for s in open('links.tab')]

  0% [                                                      ]       0 / 1408915  0% [                                                      ]    8192 / 1408915  1% [                                                      ]   16384 / 1408915  1% [                                                      ]   24576 / 1408915  2% [.                                                     ]   32768 / 1408915  2% [.                                                     ]   40960 / 1408915  3% [.                                                     ]   49152 / 1408915  4% [..                                                    ]   57344 / 1408915  4% [..                                                    ]   65536 / 1408915  5% [..                                                    ]   73728 / 1408915  5% [...                                                   ]   81920 / 1408915  6% [...                                                   ]   90112 / 1408915  6% [...                              

100% [......................................................] 1408915 / 1408915

This `links` variable consists of triples of `(English term, language, term in that language)`. For example, here is the link between English `academy` and French `académie`:

In [25]:
links[302]

['academy', 'fr', 'académie']

**TODO**: Evaluate the English and French embeddings by computing the proportion of English Wikipedia articles whose corresponding French article is also the closest word in embedding space. Skip English articles not covered by the word embedding dictionary. Since many articles, e.g., about named entities have the same title in English and French, compute the baseline accuracy achieved by simply echoing the English title as if it were French. Remember to iterate only over English Wikipedia articles, not the entire embedding dictionary.

In [26]:
## TODO: Compute English-French Wikipedia retrieval accuracy.
french_links = []
for i,j,k in links:
  if j == 'fr':
    frenchh = [i,j,k]
    french_links.append(frenchh)
accuracy = 0
baselineAccuracy = 0

eng_words_in_french =[]
for i in french_links:
  eng_words_in_french.append(i[0])

import numpy as np
eng_words_in_frenchnum = np.array(eng_words_in_french)

k = np.array(list(frvec.keys()))
v = np.array(list(frvec.values()))
similarity_list = dict()
for e in eng_words_in_frenchnum:
    if e in envec:
        most_similar,similarity= mostSimilar(envec[e], v, k)
        similarity_list[e] = most_similar[0], similarity[0]
    
similarity_res = []
article_res = []
for i in similarity_list.values():
  similarity_res.append(i[0])

for i in french_links:
  article_res.append(i[2])

count = 0
proportion = 0
for i in range(len(similarity_list)):
  if similarity_res[i]==article_res[i]:
    count+=1
print(count)

proportion = count/len(eng_words_in_frenchnum)
print("Proportion of Accuracy:",proportion)

baseline_count = 0
for i in range(len(similarity_list)):
   if eng_words_in_frenchnum[i] == article_res[i]:
    baseline_count=baseline_count+1
print("Proportion of baseline Accuracy: ",baseline_count/len(eng_words_in_frenchnum))

5289
Proportion of Accuracy: 0.5359205593271862
Proportion of baseline Accuracy:  0.6742324450298915


**TODO**: Compute accuracy and baseline (identity function) acccuracy for Englsih and another language besides French. Although the baseline will be lower for languages not written in the Roman alphabet (i.e., Arabic or Chinese), there are still many articles in those languages with headwords written in Roman characters.

In [30]:
## TODO: Compute English-X [Spanish] Wikipedia retrieval accuracy.
import numpy as np
spanish_links = []
for i,j,k in links:
  if j == 'es':
    spanish = [i,j,k]
    spanish_links.append(spanish)

eng_words_in_spanish =[]
for i in spanish_links:
  eng_words_in_spanish.append(i[0])

eng_words_in_spanishnum = np.array(eng_words_in_spanish)

k = np.array(list(esvec.keys()))
v = np.array(list(esvec.values()))
similarity_spanishlist = dict()
for e in eng_words_in_spanishnum:
    if e in envec:
        most_similar,similarity= mostSimilar(envec[e], v, k)
        similarity_spanishlist[e] = most_similar[0], similarity[0]
    
similarity_result = []
article_result = []
for i in similarity_spanishlist.values():
  similarity_result.append(i[0])

for i in spanish_links:
  article_result.append(i[2])

count1 = 0
proportion1 = 0
for i in range(len(similarity_spanishlist)):
  if similarity_result[i]==article_result[i]:
    count1+=1
print(count1)

proportion1 = count1/len(eng_words_in_spanishnum)
print("Proportion of Accuracy:",proportion1)

baseline_count1 = 0
for i in range(len(similarity_spanishlist)):
   if eng_words_in_spanishnum[i] == article_result[i]:
    baseline_count1=baseline_count1+1
print("Proportion of baseline Accuracy: ",baseline_count1/len(eng_words_in_spanishnum))
    



4355
Proportion of Accuracy: 0.5432884231536926
Proportion of baseline Accuracy:  0.5173403193612774


**TODO**: Find the 10 nearest neighbors of each English term to compute "recall at 10" and "mean reciprocal rank at 10".

In [34]:
## TODO: Compute recall@10 and MRR@10 when retrieving 10 nearest neighbors in French and some other language.
def getMapping(word1, language='fr'):
    for i in range(len(links)):
        if links[i][0] == word1 and links[i][1] == language:
            return links[i][2]
    return ''


In [39]:
def getRank(similarity, mappingWord):
    
    #print(mappingWord)
    for i in range(len(similarity)):
        #print (similarity[i][0])
        if similarity[i][0] == mappingWord:
            return i
    return -1


In [43]:
#Example execution
example = ['computer', 'germany', 'matrix', 'physics', 'yeast']
k = np.array(list(frvec.keys()))
v = np.array(list(frvec.values()))
l = []
mappings = []
l.append([mostSimilar(envec[e], v, k, 10) for e in example])
#mmr= 1/rank
mappings.append([getMapping(word1) for word1 in example])
count=0
mmr = 0
for i in range(len(example)):
    rank = getRank(l[0][i], mappings[0][i]) 
    if rank>-1:
        count+=1
        mmr += 1/(rank+1)
print ("MMR@10 = ",mmr/len(example))  
print("Recall@10 = "count/len(example))



0.5666666666666667
0.8


In [46]:
k = np.array(list(frvec.keys()))
v = np.array(list(frvec.values()))
l = []
mappings = []
l.append([mostSimilar(envec[e], v, k, 10) for e in eng_words_in_frenchnum])
#mmr= 1/rank
mappings.append([getMapping(word1) for word1 in eng_words_in_frenchnum])
count=0
mmr = 0
for i in range(len(eng_words_in_frenchnum)):
    rank = getRank(l[0][i], mappings[0][i]) 
    if rank>-1:
        count+=1
        mmr += 1/(rank+1)
print ("MMR@10 = ",mmr/len(eng_words_in_frenchnum)) 
print ("Recall@10 = ",count/len(eng_words_in_frenchnum))

MMR@10 =  0.5646970954423574
Recall@10 =  0.6130307021988043


The list of Wikipedia headwords is short enough that a linear scan through the non-English language embeddings takes some time but is feasible. In a production system, you could index the word embeddings using SimHash or some other locality sensitive hashing scheme, as we discussed for duplicate detection, to speed up this process.

In [48]:
k = np.array(list(esvec.keys()))
v = np.array(list(esvec.values()))
l = []
mappings = []
l.append([mostSimilar(envec[e], v, k, 10) for e in eng_words_in_spanishnum])
#mmr= 1/rank
mappings.append([getMapping(word1,'es') for word1 in eng_words_in_spanishnum])
count=0
mmr = 0
for i in range(len(eng_words_in_spanishnum)):
    rank = getRank(l[0][i], mappings[0][i]) + 1
    if rank>-1:
        count+=1
        mmr += 1/(rank+1)
        
print ("MMR@10 = ",mmr/len(eng_words_in_spanishnum)) 
print ("Recall@10 = ",count/len(eng_words_in_spanishnum))

MMR@10 =  0.6688740304384007
Recall@10 =  1.0
