# AAI612: Deep Learning & its Applications

*Notebook 7.3: Loading Pretrained Embeddings*

<a href="https://colab.research.google.com/github/harmanani/AAI612/blob/main/Week7/Notebook7.3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [14]:
"""
The MIT License (MIT)
Copyright (c) 2021 NVIDIA
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""


## GloVe

Download the precomputed GloVe embeddings trained on 2014 English Wikipedia from https://nlp.stanford.edu/projects/glove. The archive is an 822 MB zip file called `glove.6B.zip`; it includes 100-dimensional embedding vectors for 400,000 words (or nonword tokens). Unzip it inside a directory `data`, keep the file `glove.6B.100d.txt`, and delete the rest. The code in this notebook reads the file from `./data/glove.6B/glove.6B.100d.txt`, so make sure it ends up at that path.

### Preprocessing the Embeddings

Read the embeddings. Open the file and read it line by line, splitting each line into its elements. The first element is the word itself; the remaining elements form its vector. Insert each word and its vector into a dictionary, which the function returns.

In [19]:
import numpy as np
import scipy.spatial

# Read embeddings from file.
def read_embeddings():
    FILE_NAME = './data/glove.6B/glove.6B.100d.txt'
    embeddings = {}
    # The with-statement closes the file even if parsing raises an error.
    with open(FILE_NAME, 'r', encoding='utf-8') as file:
        for line in file:
            values = line.split()
            word = values[0]                                   # first token is the word
            vector = np.asarray(values[1:], dtype='float32')   # the rest is its vector
            embeddings[word] = vector
    print('Read %s embeddings.' % len(embeddings))
    return embeddings
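
To check the parsing logic without downloading the full file, the same steps can be run on a couple of in-memory lines. This is a minimal sketch: `parse_glove_lines` is a hypothetical helper that is not part of the notebook, and the two 3-dimensional sample vectors are made up for illustration.

```python
import io
import numpy as np

def parse_glove_lines(lines):
    """Parse an iterable of 'word v1 v2 ...' lines into a dict of float32 vectors."""
    embeddings = {}
    for line in lines:
        values = line.split()
        if not values:
            continue  # skip blank lines
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')
    return embeddings

# Made-up sample in the same "word followed by numbers" layout as the GloVe file.
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
toy = parse_glove_lines(sample)
print(sorted(toy.keys()), toy['cat'].shape, toy['cat'].dtype)
```

Because the helper accepts any iterable of lines, it works equally on an open file handle or on a list of strings.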

The next function computes the cosine distance between a given vector and every embedding, then prints the n closest words. Euclidean distance would also work, but the results would sometimes differ because the GloVe vectors are not normalized.

In [20]:
       
def print_n_closest(embeddings, vec0, n):
    # Collect (distance, word) pairs. A list keeps words whose distances tie;
    # a dict keyed by distance would silently overwrite them.
    word_distances = []
    for (word, vec1) in embeddings.items():
        distance = scipy.spatial.distance.cosine(vec1, vec0)
        word_distances.append((distance, word))
    # Print words sorted by distance.
    for (distance, word) in sorted(word_distances)[:n]:
        print(word + ': %6.3f' % distance)
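
The remark that cosine and Euclidean distance can rank neighbours differently for unnormalized vectors can be seen on a toy case. The sketch below uses made-up 2-dimensional vectors and a small NumPy-only `cosine_distance` helper (an illustrative stand-in for `scipy.spatial.distance.cosine`, written out so the formula is visible).

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); ignores vector length.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    # Straight-line distance; sensitive to vector length.
    return float(np.linalg.norm(a - b))

query = np.array([1.0, 0.0])
same_direction_long = np.array([10.0, 0.0])  # same direction, much larger norm
other_angle_short = np.array([1.0, 1.0])     # 45 degrees away, similar norm

cos_long = cosine_distance(query, same_direction_long)
cos_short = cosine_distance(query, other_angle_short)
euc_long = euclidean_distance(query, same_direction_long)
euc_short = euclidean_distance(query, other_angle_short)

print('cosine:    long=%.3f short=%.3f' % (cos_long, cos_short))
print('euclidean: long=%.3f short=%.3f' % (euc_long, euc_short))
```

Cosine distance treats the long vector as the closer one because it points in the same direction, while Euclidean distance prefers the short vector because it lies nearer in space.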

First, read the embeddings by invoking `read_embeddings()`.

In [21]:
embeddings = read_embeddings()

Read 400000 embeddings.


Retrieve the embedding for **hello** and print the closest embeddings using `print_n_closest()`.

In [22]:
lookup_word = 'hello'
print('\nWords closest to ' + lookup_word)
print_n_closest(embeddings, embeddings[lookup_word], 3)


Words closest to hello
hello:  0.000
goodbye:  0.209
hey:  0.283


Retrieve the embedding for **dog** and print the closest embeddings using `print_n_closest()`.

In [25]:
lookup_word = 'dog'
print('\nWords closest to ' + lookup_word)
print_n_closest(embeddings, embeddings[lookup_word], 10)


Words closest to dog
dog:  0.000
cat:  0.120
dogs:  0.166
pet:  0.255
puppy:  0.276
horse:  0.289
animal:  0.318
pig:  0.345
boy:  0.345
cats:  0.353
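
Each lookup above scans all embeddings in a Python loop. One possible alternative, sketched here under the assumption that all vectors fit in memory as a single matrix, is to normalize the rows once and compute every cosine distance with one matrix-vector product. `build_index` and `n_closest_vectorized` are hypothetical names, and the toy vectors are made up.

```python
import numpy as np

def build_index(embeddings):
    """Stack all vectors into one matrix with unit-length rows."""
    words = list(embeddings.keys())
    matrix = np.stack([embeddings[w] for w in words]).astype('float32')
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    return words, matrix / norms

def n_closest_vectorized(words, unit_matrix, vec0, n):
    """Return the n (word, cosine distance) pairs closest to vec0."""
    unit_vec = vec0 / np.linalg.norm(vec0)
    distances = 1.0 - unit_matrix @ unit_vec  # one dot product per row
    order = np.argsort(distances)[:n]
    return [(words[i], float(distances[i])) for i in order]

# Made-up 3-dimensional vectors for illustration only.
toy = {
    'dog': np.array([1.0, 0.9, 0.0]),
    'cat': np.array([0.9, 1.0, 0.0]),
    'car': np.array([0.0, 0.1, 1.0]),
}
words, unit_matrix = build_index(toy)
result = n_closest_vectorized(words, unit_matrix, toy['dog'], 2)
for word, distance in result:
    print(word + ': %6.3f' % distance)
```

The index is built once and reused for every query, so repeated lookups avoid re-walking the dictionary.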


### What is the capital of Jordan?

In [26]:
vec = embeddings['beirut'] - embeddings['lebanon'] + embeddings['jordan']
print_n_closest(embeddings, vec, 3)

amman:  0.250
jordan:  0.268
cairo:  0.321


### King − Man + Woman = ?

Retrieve the embedding for **king** and print the closest embeddings using `print_n_closest()`.

In [27]:
lookup_word = 'king'
print('\nWords closest to ' + lookup_word)
print_n_closest(embeddings, embeddings[lookup_word], 3)


Words closest to king
king:  0.000
prince:  0.232
queen:  0.249


Print the words closest to the vector resulting from computing `king - man + woman`.

In [28]:
lookup_word = '(king - man + woman)'
print('\nWords closest to ' + lookup_word)
vec = embeddings['king'] - embeddings['man'] + embeddings['woman']
print_n_closest(embeddings, vec, 3)


Words closest to (king - man + woman)
king:  0.145
queen:  0.217
monarch:  0.307
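
In the output above, **king** itself ranks first and **queen** second, since the result vector stays close to one of its own inputs. A common workaround, shown here only as a sketch and not as part of the original notebook, is to skip the query words when ranking. `n_closest_excluding` is a hypothetical helper, and the 2-dimensional toy vectors are made up so the example runs without the GloVe file.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def n_closest_excluding(embeddings, vec0, n, exclude):
    """Return the n closest (word, distance) pairs, skipping words in exclude."""
    excluded = set(exclude)
    scored = [
        (cosine_distance(vec1, vec0), word)
        for word, vec1 in embeddings.items()
        if word not in excluded
    ]
    return [(word, distance) for distance, word in sorted(scored)[:n]]

# Made-up vectors: first axis loosely "royalty", second axis loosely "gender".
toy = {
    'king':  np.array([0.9, 0.8]),
    'queen': np.array([0.9, -0.8]),
    'man':   np.array([0.1, 0.8]),
    'woman': np.array([0.1, -0.8]),
    'apple': np.array([-0.5, 0.1]),
}
vec = toy['king'] - toy['man'] + toy['woman']
hits = n_closest_excluding(toy, vec, 2, exclude=['king', 'man', 'woman'])
for word, distance in hits:
    print(word + ': %6.3f' % distance)
```

With the real embeddings, the same call would take the `embeddings` dictionary in place of `toy`.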


### Madrid − Spain + Sweden = ?

A more impressive example follows. We first print the words closest to **sweden** and **madrid**, and then the words closest to the result of the computation `madrid − spain + sweden`. We would expect the answer to be `Stockholm`.

In [32]:
lookup_word = 'sweden'
print('\nWords closest to ' + lookup_word)
print_n_closest(embeddings, embeddings[lookup_word], 3)


Words closest to sweden
sweden:  0.000
denmark:  0.138
norway:  0.193


In [33]:
lookup_word = 'madrid'
print('\nWords closest to ' + lookup_word)
print_n_closest(embeddings, embeddings[lookup_word], 3)


Words closest to madrid
madrid:  0.000
barcelona:  0.157
valencia:  0.197


Now print the words closest to the result of the computation `madrid − spain + sweden`.

In [34]:
lookup_word = '(madrid - spain + sweden)'
print('\nWords closest to ' + lookup_word)
vec = embeddings['madrid'] - embeddings['spain'] + embeddings['sweden']
print_n_closest(embeddings, vec, 3)


Words closest to (madrid - spain + sweden)
stockholm:  0.271
sweden:  0.300
copenhagen:  0.305
