# LING 4100/5800 Machine Learning and Linguistics

# Exercise: calculating word embedding similarity, finding the closest word, and finding the best analogy


## Cosine similarity

Recall that cosine similarity, which we use to compare vectors representing words, is defined as follows:

$$sim({\bf x}, {\bf y}) = \frac{{\bf x} \cdot {\bf y}}{||{\bf x}||||{\bf y}||}$$

where ${\bf x} \cdot {\bf y}$ is the standard dot product of two vectors and $||{\bf x}||$ and $||{\bf y}||$ are the norms of the individual vectors. You will need to implement this in Python. A dot product function from an earlier exercise is included below.

### Norm

Additionally, recall that the norm (length) for some vector ${\bf z}$ was defined as:

$$||{\bf z}|| = \sqrt{(z_1)^2 + (z_2)^2 + \ldots + (z_n)^2}$$

That is, we simply take the square root of the summed squares of the individual components in the vector. You'll also need to implement this.
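
As a quick worked check (the vectors here are made up for illustration and are not part of the exercise data), take ${\bf z} = (3, 4)$:

$$||{\bf z}|| = \sqrt{3^2 + 4^2} = \sqrt{25} = 5$$

Plugging this into the similarity formula with ${\bf x} = (3, 4)$ and ${\bf y} = (4, 3)$, both of which have norm 5:

$$sim({\bf x}, {\bf y}) = \frac{3 \cdot 4 + 4 \cdot 3}{5 \cdot 5} = \frac{24}{25} = 0.96$$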

### Most similar word

If we have a word (say *cat*), to find the most similar word among all vectors, we would simply loop over all word representations ${\bf x}$ in our vocabulary and find one that has maximum similarity to the vector for *cat*:

$$sim({\bf v}_{\rm cat}, {\bf x})$$

We of course don't compare *cat* with *cat* since that would trivially always have the highest similarity. But we do compare *cat* with every other word in the vocabulary.
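
The loop-and-skip pattern can be sketched on an unrelated toy problem. This is only an illustration under stated assumptions, not the exercise solution: the `lengths` table, the function name `closest_by_value`, and the stand-in score (negative absolute difference) are all hypothetical. In the exercise, the score would instead be the similarity between two word vectors.

```python
# Hypothetical toy data: one number per word (these are NOT word embeddings).
lengths = {'cat': 3.0, 'dog': 3.5, 'horse': 5.0, 'elephant': 8.0}

def closest_by_value(word, table):
    """Return (other_word, score) with the highest score, skipping `word` itself."""
    best_word, best_score = None, float('-inf')
    for other, value in table.items():
        if other == word:                   # never compare a word with itself
            continue
        score = -abs(table[word] - value)   # stand-in score: larger means closer
        if score > best_score:              # keep the best candidate seen so far
            best_word, best_score = other, score
    return best_word, best_score

print(closest_by_value('cat', lengths))  # ('dog', -0.5)
```

Starting `best_score` at negative infinity means the first candidate examined always becomes the initial best, so the loop needs no special case for the first word.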

## Analogies

![Image of Vectors](https://adriancolyer.files.wordpress.com/2016/04/word2vec-king-queen-vectors.png?w=300)

For calculating an analogy a:b::c:x for some unknown ${\bf x}$, we want to find the ${\bf x}$ such that the following quantity is maximized:

$$sim({\bf x}, {\bf b} - {\bf a} + {\bf c})$$

For example, we'd expect that if ${\bf a}$ is the vector representation for *king*, ${\bf b}$ for *queen* and ${\bf c}$ for *man*, then the best ${\bf x}$ would be the vector for *woman*.


# Sub-exercise 1: similarity

Here, you should define a function `sim()` that takes two vectors as arguments and returns their similarity, as given by the formula above. To do that, it's best to also define a helper function `norm()` that takes a single vector and returns its norm.

If your `sim()` function is correctly implemented, it should return the following, for example:

```
sim(vec['cat'], vec['dog'])
0.9638665904796965
```

In [None]:
from math import sqrt # You'll need this function

# Some made-up toy word vectors
vec = {}
vec['man'] = [1,2,3,4,5]
vec['woman'] = [6,7,8,9,10]
vec['king'] = [-5,-4,-3,-2,-1]
vec['queen'] = [0,1,2,3,4]
vec['dog'] = [-10.2,-3.2,-2.3,-4.3,3.1]
vec['cat'] = [-8.3,-3.01,-2.0,-1.3,1.1]

# Old definitions that are useful

def dp(x,y):       # Dot product
    return sum(a*b for a,b in zip(x,y))

def vec_sub(x,y):  # Vector subtraction
    return [a - b for a,b in zip(x,y)]

def vec_add(x,y):  # Vector addition
    return [a + b for a,b in zip(x,y)]

### Your code here ###

# def norm(x):
    # Your code here

# def sim(x,y):
    # Your code here
    
# sim(vec['cat'], vec['dog'])
# should produce 0.9638665904796965 if correctly implemented

# Sub-exercise 2: find closest word

Here, you should define a function `find_closest()` that takes a word and the dictionary of vectors as its two arguments and returns a tuple containing the closest other word and its similarity score. We need to pass the dictionary as an argument because the function has to loop over all word vectors to find the closest one.

Your function should behave as follows:

```
find_closest('cat', vec)
('dog', 0.9638665904796965)
```

Note that as you loop over the vectors, you should not compare a word with itself, e.g. `sim(v['cat'], v['cat'])`, since that will always have the maximum similarity.


In [None]:
### Your code here ###

# def find_closest(word, v):
    # Your code here

# find_closest('cat', vec)
# should produce ('dog', 0.9638665904796965)


# Sub-exercise 3: analogy

Here, you should define a function `analogy()` that takes four arguments: the first three are words, and the last is the dictionary of all vectors. Again, we're passing the vector dictionary to the function so it can make use of it.

Your function should work as follows:

```
analogy('man','woman','king', vec)
'queen'
```
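
The target vector from the Analogies section, ${\bf b} - {\bf a} + {\bf c}$, can be checked by hand on the toy data. The sketch below repeats the `vec_sub` and `vec_add` helpers and four of the toy vectors from Sub-exercise 1 so that it runs on its own. It computes only the target vector; the search for the most similar word is left to your `analogy()` function.

```python
def vec_sub(x, y):  # Vector subtraction (same helper as in Sub-exercise 1)
    return [a - b for a, b in zip(x, y)]

def vec_add(x, y):  # Vector addition (same helper as in Sub-exercise 1)
    return [a + b for a, b in zip(x, y)]

# Four of the made-up toy vectors from Sub-exercise 1
vec = {
    'man':   [1, 2, 3, 4, 5],
    'woman': [6, 7, 8, 9, 10],
    'king':  [-5, -4, -3, -2, -1],
    'queen': [0, 1, 2, 3, 4],
}

# man : woman :: king : x, so a = man, b = woman, c = king
a, b, c = vec['man'], vec['woman'], vec['king']
target = vec_add(vec_sub(b, a), c)   # b - a + c

print(target)                  # [0, 1, 2, 3, 4]
print(target == vec['queen'])  # True
```

With these toy vectors the target coincides exactly with the vector for *queen*. With real embeddings it generally will not match any word exactly, which is why the analogy is defined as the word whose vector is most similar to the target.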


In [None]:
### Your code here ###

# def analogy(a,b,c,v):
    # Your code here
    
# analogy('man','woman','king', vec)
# should return 'queen'
