# ANLP Lab 2 - Solutions

In this lab we are going to play with the pretrained GloVe embeddings and use them to solve some analogy problems.

We start by loading some required modules:



In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

Then we obtain the GloVe embeddings as follows (this may take a few minutes):

In [None]:
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip -q glove.6B.zip
!wget http://john.mccr.ae/downloads/glove.6B.50d.txt.gz
!gunzip glove.6B.50d.txt.gz

--2020-10-19 10:34:59--  http://john.mccr.ae/downloads/glove.6B.50d.txt.gz
Resolving john.mccr.ae (john.mccr.ae)... 128.199.47.101
Connecting to john.mccr.ae (john.mccr.ae)|128.199.47.101|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69182520 (66M) [application/x-gzip]
Saving to: ‘glove.6B.50d.txt.gz’


2020-10-19 10:35:02 (21.7 MB/s) - ‘glove.6B.50d.txt.gz’ saved [69182520/69182520]



We load the embeddings into a dictionary that maps each word to its vector:

In [None]:
path_to_glove_file = "glove.6B.50d.txt"

embeddings = {}
with open(path_to_glove_file, encoding="utf8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings[word] = coefs
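
Each line of the file holds a token followed by its space-separated coefficients. As a minimal sketch of that parsing step, the snippet below reads a made-up two-line, three-dimensional sample from memory; `sample` and `toy_embeddings` are hypothetical names used only for illustration.

```python
import io
import numpy as np

# Made-up sample in the GloVe text format: token, then coefficients.
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

toy_embeddings = {}
for line in sample:
    # Split off the token; the rest of the line is the vector.
    word, coefs = line.split(maxsplit=1)
    toy_embeddings[word] = np.asarray([float(x) for x in coefs.split()], dtype="float32")

print(sorted(toy_embeddings))
print(toy_embeddings["cat"].shape)
```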

## Question 1

Write a function to calculate the cosine similarity between two words using the GloVe vectors.


In [None]:
from numpy import dot
from numpy.linalg import norm

def sim(word1, word2):
  # Cosine similarity: dot product divided by the product of the vector norms
  v1, v2 = embeddings[word1], embeddings[word2]
  return dot(v1, v2) / (norm(v1) * norm(v2))

assert(sim("cat","dog")>0)
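
A quick way to check a cosine similarity function is to try it on vectors whose angle is known. The sketch below uses made-up vectors and a standalone helper, `cosine`, which is a hypothetical name and not part of the lab code: vectors pointing the same way should score 1, opposite vectors -1, and orthogonal vectors 0.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
same_direction = cosine(a, 2 * a)      # scaling does not change the angle
opposite = cosine(a, -a)
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(same_direction, opposite, orthogonal)
```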

## Question 2

Which word is closer to "nurse": "man" or "woman"? How about "programmer"?

In [None]:
print(sim("nurse","man"))
print(sim("nurse","woman"))
print(sim("programmer","man"))
print(sim("programmer","woman"))

0.5718703
0.7155021
0.26579538
0.2192782


## Question 3

Implement word similarity using Euclidean distance. Do you get the same result as in Question 2?

In [None]:
# Euclidean distance between the two vectors: smaller values mean closer words
def sim_ed(word1, word2):
  return norm(embeddings[word1] - embeddings[word2])

print(sim_ed("nurse","man"))
print(sim_ed("nurse","woman"))
print(sim_ed("programmer","man"))
print(sim_ed("programmer","woman"))

4.715742
4.0068955
5.7321897
6.151346
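
In the outputs above both measures agree: "woman" is closer to "nurse" and "man" is closer to "programmer". This is not guaranteed in general, because Euclidean distance also depends on vector length. The sketch below uses made-up two-dimensional vectors (not GloVe data) to show a case where the two measures disagree, and checks that for unit-length vectors the squared distance equals $2 - 2\cos$, so the rankings coincide after normalisation. The helpers `cosine` and `unit` are hypothetical names.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def unit(v):
    return v / np.linalg.norm(v)

# Made-up vectors: b points the same way as a but is much longer.
a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])
c = np.array([0.0, 1.0])

# Raw vectors: cosine prefers b, Euclidean distance prefers c.
cos_prefers_b = cosine(a, b) > cosine(a, c)
dist_prefers_c = np.linalg.norm(a - c) < np.linalg.norm(a - b)
print(cos_prefers_b, dist_prefers_c)

# Unit vectors: squared distance equals 2 - 2 * cosine.
sq_dist = float(np.sum((unit(a) - unit(c)) ** 2))
identity_holds = bool(np.isclose(sq_dist, 2 - 2 * cosine(a, c)))
print(identity_holds)
```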


## Question 4

According to the model of analogy, we would expect that $v_{queen} \simeq v_{king} - v_{man} + v_{woman}$. Test this hypothesis. Do you think it holds?

In [None]:
def cos(v1, v2):
  return dot(v1,v2)/norm(v1)/norm(v2)

print(cos(embeddings["queen"], embeddings["king"] - embeddings["man"] + embeddings["woman"]))

0.8609582


## Question 5

Write a function that, given a vector, finds the 10 words with the most similar embeddings. Use it to find the words that are most similar to $v_{king} - v_{man} + v_{woman}$.

In [None]:
from collections import Counter

def analogy(v):
  return Counter({ word: cos(e,v) for word, e in embeddings.items() }).most_common(10)

analogy(embeddings["king"] - embeddings["man"] + embeddings["woman"])

[('king', 0.8859835),
 ('queen', 0.8609582),
 ('daughter', 0.76845115),
 ('prince', 0.7640699),
 ('throne', 0.76349694),
 ('princess', 0.7512728),
 ('elizabeth', 0.75064886),
 ('father', 0.73144966),
 ('kingdom', 0.7296158),
 ('mother', 0.72800094)]
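
The top hit here is "king" itself, one of the words used to build the query vector. Analogy evaluations commonly skip the input words, which in the list above would leave "queen" in first place. A minimal sketch of that variant follows; it runs on a made-up five-word dictionary so that it is self-contained, and `analogy_excluding`, `cosine` and `toy` are hypothetical names rather than part of the lab code.

```python
from collections import Counter
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy_excluding(vectors, a, b, c, n=10):
    """Rank words by cosine similarity to vectors[c] - vectors[a] + vectors[b],
    skipping the three input words."""
    target = vectors[c] - vectors[a] + vectors[b]
    scores = {w: cosine(e, target) for w, e in vectors.items() if w not in {a, b, c}}
    return Counter(scores).most_common(n)

# Made-up 3-d vectors, chosen so that king - man + woman equals queen.
toy = {
    "man": np.array([1.0, 0.0, 0.1]),
    "woman": np.array([1.0, 1.0, 0.1]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, 0.2, -1.0]),
}
result = analogy_excluding(toy, "man", "woman", "king", n=2)
print(result)
```

With the lab's data, the same function could be called as `analogy_excluding(embeddings, "man", "woman", "king")`.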

## Question 6

Repeat the example using the 3CosMul method as defined in the lectures. Do you get a different result?

In [None]:
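def cos2(v1,v2):
  # Shift cosine similarity from [-1, 1] to [0, 1]
  return (cos(v1,v2) + 1) / 2

def three_cosmul_analogy(pos, neg, base):
  # Score each word by its similarity to base and pos, divided by its similarity to neg;
  # the 1e-6 term avoids division by zero
  return Counter({ word: cos2(v,embeddings[base]) * cos2(v,embeddings[pos]) / (cos2(v,embeddings[neg]) + 1e-6) for word, v in embeddings.items() }).most_common(10)

three_cosmul_analogy("woman","man","king")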


[('queen', 0.9288907978427993),
 ('king', 0.9218768373440346),
 ('throne', 0.882325271864579),
 ('elizabeth', 0.8789501295328435),
 ('princess', 0.8767548497811588),
 ('daughter', 0.8705160447955236),
 ('prince', 0.8702554959921912),
 ('kingdom', 0.8607221035520414),
 ('eldest', 0.8595449106596545),
 ('monarch', 0.8584720608347555)]
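
The dictionary comprehension above recomputes vector norms for every word and every comparison. As a sketch of one alternative, the same scores can be computed for all words at once by stacking the vectors into a matrix and normalising its rows. The function `three_cosmul_matrix` and the toy `words` and `matrix` below are hypothetical and made up for illustration; they are not GloVe data.

```python
import numpy as np

def three_cosmul_matrix(words, matrix, a, b, c, n=10, eps=1e-6):
    """Rank words for the analogy a : b :: c : ? using multiplicative scores."""
    index = {w: i for i, w in enumerate(words)}
    # Normalise rows once so that a matrix-vector product gives cosines.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    def shifted_cos(word):
        # Cosine of every row with `word`, moved from [-1, 1] to [0, 1].
        return (unit @ unit[index[word]] + 1) / 2

    scores = shifted_cos(c) * shifted_cos(b) / (shifted_cos(a) + eps)
    best = np.argsort(-scores)[:n]
    return [(words[i], float(scores[i])) for i in best]

words = ["man", "woman", "king", "queen", "apple"]
matrix = np.array([
    [1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.2, -1.0],
])
ranking = three_cosmul_matrix(words, matrix, "man", "woman", "king", n=3)
print(ranking)
```

With the lab's data, one could build the inputs as `words = list(embeddings)` and `matrix = np.stack(list(embeddings.values()))`.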