# Word vectors

In this notebook we will explore word vectors. We will use the vectors provided by [spaCy](https://spacy.io).

## Inspecting the vocabulary

We will use a preprocessed version of the vocabulary from the large English language model.

In [None]:
from spacy.vocab import Vocab

vocab = Vocab()
vocab.from_disk('en_core_web_lg-preprocessed.vocab')

The following cell prints the number of entries in the vocabulary:

In [None]:
print(len(vocab))

Every word in the vocabulary comes with a 300-dimensional vector, represented as a NumPy array. Here is the vector for *cheese*:

In [None]:
vocab['cheese'].vector

## Computing cosine similarities

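Now we define a function that computes the cosine similarity between a word and every word in the vocabulary, and returns the `k` most similar words. The query word itself is part of the vocabulary, so it is included in the ranking.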

In [None]:
import numpy as np

from sklearn.metrics import pairwise_distances

def most_similar(word, k=10):
    m = word.vocab.vectors.data
    x = np.array([word.vector])
    c = np.reshape(1 - pairwise_distances(m, x, metric='cosine'), -1)
    return sorted(word.vocab, key=lambda w: c[word.vocab.vectors.key2row[w.orth]], reverse=True)[:k]
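
As a rough illustration of what `most_similar` computes, the following sketch evaluates cosine similarity, cos(u, v) = u·v / (‖u‖ ‖v‖), directly in NumPy for a tiny hand-made matrix. The vectors and the helper name `cosine_similarities` are invented for this example and are not part of spaCy or scikit-learn.

```python
import numpy as np

def cosine_similarities(m, x):
    """Cosine similarity between each row of m and the vector x."""
    # Dot product of every row with x, divided by the product of the norms.
    return (m @ x) / (np.linalg.norm(m, axis=1) * np.linalg.norm(x))

# Toy 2-dimensional "word vectors" (made up for illustration).
m = np.array([
    [1.0, 0.0],   # same direction as x
    [0.0, 1.0],   # orthogonal to x
    [-1.0, 0.0],  # opposite direction
    [3.0, 3.0],   # 45 degrees away; vector length does not matter
])
x = np.array([2.0, 0.0])

sims = cosine_similarities(m, x)
ranking = np.argsort(-sims)  # row indices, most similar first
print(sims)
print(ranking)
```

Cosine distance is defined as one minus cosine similarity, which is why the function above computes `1 - pairwise_distances(m, x, metric='cosine')`.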

What are the most similar words to *cheese*?

In [None]:
for word in most_similar(vocab['cheese']):
    print(word.orth_)

## Visualising word similarities

To visualise word vectors, we project them onto a two-dimensional plane using [t-SNE](https://lvdmaaten.github.io/tsne/) and plot the result.

In [None]:
%matplotlib inline

from sklearn.manifold import TSNE

import matplotlib.pyplot as plt

def display_most_similar(*words):
    xs = []
    ls = []
    for word in words:
        for w in most_similar(word):
            xs.append(w.vector)
            ls.append(w.orth_)
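    # Stack the vectors into one array. Recent scikit-learn versions require
    # the perplexity to be smaller than the number of points, so cap it.
    xs = np.array(xs)
    tsne = TSNE(n_components=2, perplexity=min(30, len(xs) - 1), random_state=0)
    coords = tsne.fit_transform(xs)
    x_coords = coords[:, 0]
    y_coords = coords[:, 1]
    plt.figure(figsize=(12, 8))
    plt.scatter(x_coords, y_coords)
    # Label each point with its word.
    for label, x, y in zip(ls, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(5, 5), textcoords='offset points', size=15)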
    plt.xlim(x_coords.min()-50, x_coords.max()+50)
    plt.ylim(y_coords.min()-50, y_coords.max()+50)
    plt.show()
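
One practical detail: scikit-learn's t-SNE has a `perplexity` parameter (default 30), and recent scikit-learn releases reject a perplexity that is not smaller than the number of points being projected. The sketch below runs t-SNE on random stand-in vectors rather than real word vectors, purely to show the input and output shapes. The sample count of 20 and the perplexity of 5 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-ins for 20 word vectors of dimension 300 (not real embeddings).
rng = np.random.RandomState(0)
vectors = rng.normal(size=(20, 300))

# The perplexity is kept below the number of samples (20).
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
coords = tsne.fit_transform(vectors)

print(coords.shape)  # one 2-dimensional point per input vector
```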

Here are the neighbours of *cheese*.

In [None]:
display_most_similar(vocab['cheese'])

In [None]:
display_most_similar(vocab['goat'])

When adding more words, we see a cluster structure:

In [None]:
display_most_similar(vocab['cheese'], vocab['goat'], vocab['sweden'], vocab['university'], vocab['computer'])

## Analogies

We define a function that finds the word closest to a given vector (which need not be a word vector itself). Words listed in `exclude` are skipped.

In [None]:
from sklearn.metrics import pairwise_distances

def closest_word(vocab, x, exclude=()):
    m = vocab.vectors.data
    x = np.array([x])
    c = np.reshape(1 - pairwise_distances(m, x, metric='cosine'), -1)
    for word in sorted(vocab, key=lambda w: c[vocab.vectors.key2row[w.orth]], reverse=True):
        if word not in exclude:
            return word
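
The same search can be sketched on toy data without spaCy: compute the cosine similarities, then visit the rows in order of decreasing similarity and return the first one that is not excluded. The word list, the vectors, and the name `closest_index` are hypothetical and exist only for this illustration.

```python
import numpy as np

def closest_index(m, x, exclude=()):
    """Index of the row of m most cosine-similar to x, skipping excluded rows."""
    sims = (m @ x) / (np.linalg.norm(m, axis=1) * np.linalg.norm(x))
    # argsort sorts ascending, so negate to visit the most similar rows first.
    for i in np.argsort(-sims):
        if i not in exclude:
            return int(i)
    return None  # every row was excluded

# Made-up 2-dimensional vectors for three words.
words = ["cheese", "cheddar", "goat"]
m = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]])

best = closest_index(m, m[0])
runner_up = closest_index(m, m[0], exclude={0})
print(words[best], words[runner_up])  # prints: cheese cheddar
```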

Of course, the closest word to *cheese* is *cheese*:

In [None]:
closest_word(vocab, vocab['cheese'].vector).orth_

What is the closest word to *cheese* if we exclude *cheese* itself?

In [None]:
closest_word(vocab, vocab['cheese'].vector, exclude=[vocab['cheese']]).orth_

We can now write a function that ‘calculates’ with words.

In [None]:
def analogy(word1, word2, word3):
    x = word1.vector - word2.vector + word3.vector
    return closest_word(word1.vocab, x, exclude=[word1, word2, word3])
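
To see why vector arithmetic can answer analogy questions, consider a hand-built toy embedding in which one dimension encodes royalty and the other encodes gender. This is only a sketch: the five vectors and the helper `toy_analogy` are made up, and real embeddings do not have such cleanly interpretable dimensions.

```python
import numpy as np

# Hypothetical 2-dimensional embedding: [royalty, gender].
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.1, 1.0]),
    "woman": np.array([0.1, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def toy_analogy(a, b, c):
    """Return the word closest (by cosine similarity) to a - b + c, excluding a, b and c."""
    x = toy[a] - toy[b] + toy[c]
    best, best_sim = None, -np.inf
    for word, v in toy.items():
        if word in (a, b, c):
            continue
        sim = (v @ x) / (np.linalg.norm(v) * np.linalg.norm(x))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(toy_analogy("king", "man", "woman"))  # prints: queen
```

Subtracting *man* from *king* removes the gender component while keeping royalty, and adding *woman* puts the opposite gender back in, so the result lands on the toy vector for *queen*.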

Here is the famous *king* − *man* + *woman* = ? example.

In [None]:
analogy(vocab['king'], vocab['man'], vocab['woman']).orth_

The model ‘knows’ the capital of Sweden.

In [None]:
analogy(vocab['berlin'], vocab['germany'], vocab['sweden']).orth_

The embedding also ‘knows’ some syntactic analogies, such as the analogy between the past-tense and present-tense forms of verbs (here: *jump* and *eat*):

In [None]:
analogy(vocab['jumped'], vocab['jump'], vocab['eat']).orth_

## Limitations

Words with opposite meanings tend to occur in similar contexts, so the model is not good at distinguishing between synonyms and antonyms:

In [None]:
[w.orth_ for w in most_similar(vocab['alive'])]

When experimenting with analogy examples, you will find that the model has picked up common stereotypes:

In [None]:
analogy(vocab['doctor'], vocab['man'], vocab['woman']).orth_

Is a *cat* more similar to a *dog* or to a *tiger*?

In [None]:
vocab['cat'].similarity(vocab['dog'])

In [None]:
vocab['cat'].similarity(vocab['tiger'])

That’s all folks!