# Word vectors

In this notebook we will explore word vectors. We will use the vectors from [spaCy](http://spacy.io).

In [None]:
import spacy

## Inspecting what is there

Load the large English language model. This can take a few seconds.

In [None]:
nlp = spacy.load('en_core_web_lg')

Every word in the vocabulary comes with a 300-dimensional vector, represented as a NumPy array. Here is the vector for *cheese*:

In [None]:
nlp.vocab['cheese'].vector

## Cleaning the vocabulary

Because the language model was built from web data, its vocabulary is rather large (ca. 1.3 million entries) and contains a lot of non-conventional words:

In [None]:
for i, word in enumerate(nlp.vocab):
    if i >= 10:
        break
    print(word.orth_)

We remove words that have no word vector (null vectors) or that contain non-alphabetic characters, and normalise the remaining words to lowercase. For each normalised word, however, we keep the word vectors of all its non-normalised forms.

In [None]:
w2v = {}
for word in nlp.vocab:
    if word.has_vector:
        w = word.orth_.lower()
        if w.isalpha():
            if w not in w2v:
                w2v[w] = []
            w2v[w].append(word.vector)
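
The same filter-and-group logic can be tried on a handful of made-up entries without loading the model. This is only a minimal sketch: the `entries` list and its two-dimensional vectors are invented for illustration, and `None` stands in for a word without a vector.

```python
from collections import defaultdict

# Hypothetical (surface form, vector) pairs; None marks a missing vector.
entries = [
    ("Cheese", [1.0, 0.0]),
    ("cheese", [0.0, 1.0]),
    ("CHEESE", None),
    ("cheese!", [0.5, 0.5]),
    ("goat", [1.0, 1.0]),
]

groups = defaultdict(list)
for form, vector in entries:
    if vector is None:      # skip words without a vector
        continue
    key = form.lower()      # normalise to lowercase
    if key.isalpha():       # skip forms with non-alphabetic characters
        groups[key].append(vector)

print({key: len(vectors) for key, vectors in groups.items()})
```

Only two of the four *cheese* entries survive: one is dropped because it has no vector, the other because of the exclamation mark.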

The table `w2v` now maps each lowercase word to the word vectors of all its forms. For example, here is the number of different forms of *cheese*:

In [None]:
len(w2v['cheese'])

We now construct a new vocabulary where the vector for each word is the average of the vectors of the different word forms in the old vocabulary. For this we need to load NumPy.

In [None]:
from spacy.vocab import Vocab

import numpy as np

lc_vocab = Vocab(strings=w2v.keys())
for w in lc_vocab:
    lc_vocab.set_vector(w.orth, np.mean(np.array(w2v[w.orth_]), axis=0))
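
To see what the averaging step does, here is a small sketch on three made-up three-dimensional vectors (the numbers are hypothetical and not taken from the model). Taking the mean with `axis=0` averages each dimension across the word forms.

```python
import numpy as np

# Three made-up vectors standing in for different forms of one word.
forms = [
    np.array([1.0, 0.0, 2.0]),
    np.array([3.0, 0.0, 2.0]),
    np.array([2.0, 3.0, 2.0]),
]

# Stack into a (3, 3) matrix and average over the rows (axis=0),
# which yields one vector with the same dimensionality as the inputs.
mean_vector = np.mean(np.array(forms), axis=0)
print(mean_vector)
```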

Here is the size of the new vocabulary:

In [None]:
len(lc_vocab)

Here is the new vector for *cheese*:

In [None]:
lc_vocab['cheese'].vector

## Computing cosine similarities

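Now we define a function that computes the cosine similarity between a word and every word in the vocabulary, and returns the *k* most similar words. Because the word itself is part of the vocabulary, it is normally the first result.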

In [None]:
from sklearn.metrics import pairwise_distances

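def most_similar(word, k=10):
    """Return the k words in word.vocab that are most similar to word."""
    vocab = word.vocab
    m = vocab.vectors.data  # one row per stored vector
    x = np.array([word.vector])
    # Cosine similarity is 1 minus cosine distance.
    c = np.reshape(1 - pairwise_distances(m, x, metric='cosine'), -1)
    return sorted(vocab, key=lambda w: c[vocab.vectors.key2row[w.orth]], reverse=True)[:k]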
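
As a sketch of what happens under the hood, cosine similarity can also be computed directly with NumPy: normalise all vectors to unit length and take dot products. The helper `top_k_cosine` and the toy matrix below are hypothetical names and data introduced for illustration, and the sketch assumes that no row is a null vector (we removed those above).

```python
import numpy as np

def top_k_cosine(matrix, query, k=3):
    """Return the indices and similarities of the k rows closest to query."""
    # The dot product of two unit-length vectors is their cosine similarity.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    sims = unit_rows @ unit_query
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# Made-up 2-D vectors; row 0 doubles as the query.
toy = np.array([[1.0, 0.0], [2.0, 1.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
indices, scores = top_k_cosine(toy, toy[0], k=3)
print(indices, scores)
```

Using `argsort` on the similarity array avoids sorting the vocabulary objects themselves, which may be faster for large vocabularies.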

What are the most similar words to *cheese*?

In [None]:
for word in most_similar(lc_vocab['cheese']):
    print(word.orth_)

## Visualising word similarities

To visualise word vectors, we project them onto a 2-dimensional plane using [t-SNE](https://lvdmaaten.github.io/tsne/), and plot the result.

In [None]:
%matplotlib inline

from sklearn.manifold import TSNE

import matplotlib.pyplot as plt

def display_most_similar(*words):
    xs = []
    ls = []
    for word in words:
        for w in most_similar(word):
            xs.append(w.vector)
            ls.append(w.orth_)
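    # Recent scikit-learn versions require perplexity < number of samples.
    tsne = TSNE(n_components=2, perplexity=min(30, len(xs) - 1), random_state=0)
    coords = tsne.fit_transform(np.array(xs))
    x_coords = coords[:, 0]
    y_coords = coords[:, 1]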
    plt.figure(figsize=(12, 8))
    plt.scatter(x_coords, y_coords)
    for label, x, y in zip(ls, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(5, 5), textcoords='offset points', size=15)
    plt.xlim(x_coords.min()-50, x_coords.max()+50)
    plt.ylim(y_coords.min()-50, y_coords.max()+50)
    plt.show()

Here are the neighbours of *cheese*.

In [None]:
display_most_similar(lc_vocab['cheese'])

In [None]:
display_most_similar(lc_vocab['goat'])

When adding more words, we see a cluster structure:

In [None]:
display_most_similar(lc_vocab['cheese'], lc_vocab['goat'], lc_vocab['sweden'], lc_vocab['university'], lc_vocab['computer'])

## Analogies

We start by defining a function that will find the closest word for a given vector (not necessarily a word vector).

In [None]:
from sklearn.metrics import pairwise_distances

def closest_word(vocab, x, exclude=()):
    m = vocab.vectors.data
    x = np.array([x])
    c = np.reshape(1 - pairwise_distances(m, x, metric='cosine'), -1)
    for word in sorted(vocab, key=lambda w: c[vocab.vectors.key2row[w.orth]], reverse=True):
        if word not in exclude:
            return word

Of course, the closest word to *cheese* is *cheese*:

In [None]:
closest_word(lc_vocab, lc_vocab['cheese'].vector).orth_

What is the closest word to *cheese* if we exclude *cheese* itself?

In [None]:
closest_word(lc_vocab, lc_vocab['cheese'].vector, exclude=[lc_vocab['cheese']]).orth_

We can now write a function that ‘calculates’ with words.

In [None]:
def analogy(word1, word2, word3):
    x = word1.vector - word2.vector + word3.vector
    return closest_word(word1.vocab, x, exclude=[word1, word2, word3])
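
The arithmetic behind `analogy` is easier to follow on hand-made vectors. In the sketch below, the dictionary `toy` and the helpers `cosine` and `toy_analogy` are hypothetical and exist only for illustration; the first axis loosely stands for "royalty" and the second for "gender".

```python
import numpy as np

# Hand-made 2-D vectors, not taken from the model.
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two non-null vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_analogy(a, b, c):
    """Return the word closest to toy[a] - toy[b] + toy[c], excluding the inputs."""
    target = toy[a] - toy[b] + toy[c]
    candidates = [w for w in toy if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(toy[w], target))

print(toy_analogy("king", "man", "woman"))
```

Excluding the three input words matters: without it, the nearest neighbour of the target vector is often one of the inputs themselves.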

Here is the famous king − man + woman = ? example.

In [None]:
analogy(lc_vocab['king'], lc_vocab['man'], lc_vocab['woman']).orth_

The same arithmetic can recover the capital of Sweden from the pair *Berlin* and *Germany*.

In [None]:
analogy(lc_vocab['berlin'], lc_vocab['germany'], lc_vocab['sweden']).orth_

The embedding also ‘learns’ some syntactic analogies, such as the analogy between the past-tense and present-tense forms of verbs (here: *jump* and *eat*):

In [None]:
analogy(lc_vocab['jumped'], lc_vocab['jump'], lc_vocab['eat']).orth_

## Limitations

The model is not good at distinguishing between synonyms and antonyms:

In [None]:
[w.orth_ for w in most_similar(lc_vocab['alive'])]

When experimenting with analogy examples, you will find that the embedding picks up common stereotypes:

In [None]:
analogy(lc_vocab['doctor'], lc_vocab['man'], lc_vocab['woman']).orth_

In [None]:
analogy(lc_vocab['germany'], lc_vocab['beer'], lc_vocab['wine']).orth_

Is a *cat* more closely related to a *dog* or to a *tiger*?

In [None]:
[w.orth_ for w in most_similar(lc_vocab['cat'])]

In [None]:
[w.orth_ for w in most_similar(lc_vocab['tiger'])]