## Word vectors in spaCy

Okay, let's have some fun with real word vectors. We're going to use the GloVe vectors that come with spaCy to creatively analyze and manipulate the text of Bram Stoker's *Dracula*. First, make sure you've got `spacy` imported:

In [381]:
from __future__ import unicode_literals
import spacy

The following cell loads the language model and parses the input text:

In [382]:
nlp = spacy.load('en')
doc = nlp(open("pg345.txt").read().decode('utf8'))

And the cell below creates a list of the unique alphabetic words (or tokens) in the text, as strings.

In [383]:
# all of the words in the text file
tokens = list(set([w.text for w in doc if w.is_alpha]))

You can see the vector of any word in spaCy's vocabulary using the `vocab` attribute, like so:

In [385]:
nlp.vocab['cheese'].vector

array([ -5.52519977e-01,   1.88940004e-01,   6.87370002e-01,
        -1.97889999e-01,   7.05749989e-02,   1.00750005e+00,
         5.17890006e-02,  -1.56029999e-01,   3.19409996e-01,
         1.17019999e+00,  -4.72479999e-01,   4.28669989e-01,
        -4.20249999e-01,   2.48030007e-01,   6.81940019e-01,
        -6.74880028e-01,   9.24009979e-02,   1.30890000e+00,
        -3.62779982e-02,   2.00979993e-01,   7.60049999e-01,
        -6.67179972e-02,  -7.77940005e-02,   2.38440007e-01,
        -2.43509993e-01,  -5.41639984e-01,  -3.35399985e-01,
         2.98049986e-01,   3.52690011e-01,  -8.05939972e-01,
        -4.36109990e-01,   6.15350008e-01,   3.42119992e-01,
        -3.36030006e-01,   3.32819998e-01,   3.80650014e-01,
         5.74270003e-02,   9.99180004e-02,   1.25249997e-01,
         1.10389996e+00,   3.66780013e-02,   3.04899991e-01,
        -1.49419993e-01,   3.29120010e-01,   2.32999995e-01,
         4.33950007e-01,   1.56660005e-01,   2.27779999e-01,
        -2.58300006e-02,

For the sake of convenience, the following function gets the vector of a given string from spaCy's vocabulary:

In [384]:
def vec(s):
    return nlp.vocab[s].vector

### Cosine similarity and finding closest neighbors

The cell below defines a function `cosine()`, which returns the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) of two vectors. Cosine similarity is another way of determining how similar two vectors are: it compares the directions in which they point rather than the distance between them, which makes it better suited to high-dimensional spaces. [See the Encyclopedia of Distances for more information and even more ways of determining vector similarity.](http://www.uco.es/users/ma1fegan/Comunes/asignaturas/vision/Encyclopedia-of-distances-2009.pdf)

(You'll need to install `numpy` to get this to work. If you haven't already: `pip install numpy`. Use `sudo` if you need to and make sure you've upgraded to the most recent version of `pip` with `sudo pip install --upgrade pip`.)

In [387]:
import numpy as np
from numpy import dot
from numpy.linalg import norm

# cosine similarity
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0
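
To see how cosine similarity differs from straight-line distance, here is a small sketch with made-up two-dimensional vectors. The vectors and the `euclidean` helper are illustrative assumptions, not part of spaCy. Making a vector longer changes its Euclidean distance to another vector, but leaves the cosine similarity alone, because cosine only looks at the angle between the two.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosine(v1, v2):
    # same definition as the cell above: 0.0 when either vector has zero length
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0

def euclidean(v1, v2):
    # hypothetical helper for comparison: straight-line distance
    return norm(v1 - v2)

a = np.array([1.0, 2.0])
b = np.array([10.0, 20.0])   # same direction as a, ten times longer
c = np.array([2.0, -1.0])    # perpendicular to a

sim_ab = cosine(a, b)        # same direction: similarity of 1, up to rounding
sim_ac = cosine(a, c)        # perpendicular: similarity of 0
dist_ab = euclidean(a, b)    # large, even though a and b point the same way
zero_case = cosine(a, np.zeros(2))  # the guarded case: returns 0.0
```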

The following cell shows that the cosine similarity between `dog` and `puppy` is larger than the similarity between `trousers` and `octopus`, which suggests that the vectors are working how we expect them to:

In [389]:
cosine(vec('dog'), vec('puppy')) > cosine(vec('trousers'), vec('octopus'))

True

The following cell defines a function that iterates through a list of tokens and returns the `n` tokens (ten by default) whose vectors are most similar to a given vector, ordered from most to least similar.

In [390]:
def spacy_closest(token_list, vec_to_check, n=10):
    return sorted(token_list,
                  key=lambda x: cosine(vec_to_check, vec(x)),
                  reverse=True)[:n]
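
Calling `cosine()` once per token is fine for the vocabulary of a single novel. As a sketch of an alternative, the same ranking can be computed in one matrix operation with numpy. The `toy_vectors` dictionary and the `closest_vectorized` function below are hypothetical stand-ins with made-up values, not spaCy's vectors or API.

```python
import numpy as np

# made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    "cat":    np.array([0.9, 0.1, 0.0]),
    "kitten": np.array([0.8, 0.3, 0.0]),
    "car":    np.array([0.0, 0.2, 0.9]),
    "truck":  np.array([0.1, 0.1, 0.8]),
    "blank":  np.array([0.0, 0.0, 0.0]),  # zero vector, like a word with no vector
}

def closest_vectorized(vectors, vec_to_check, n=10):
    words = list(vectors.keys())
    # one row per word
    matrix = np.array([vectors[w] for w in words])
    # product of each row's length and the query's length
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec_to_check)
    # similarity stays 0.0 wherever a zero-length vector is involved
    sims = np.zeros(len(words))
    nonzero = norms > 0
    sims[nonzero] = matrix[nonzero].dot(vec_to_check) / norms[nonzero]
    # indices of the n largest similarities, highest first
    order = np.argsort(-sims)[:n]
    return [words[i] for i in order]

nearest_to_cat = closest_vectorized(toy_vectors, toy_vectors["cat"], n=2)
nearest_to_car = closest_vectorized(toy_vectors, toy_vectors["car"], n=2)
```

The zero-vector guard mirrors the `else` branch of `cosine()`: words without a usable vector get a similarity of 0.0 instead of raising a division warning.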

Using this function, we can get a list of synonyms, or words closest in meaning (or distribution, depending on how you look at it), to any arbitrary word in spaCy's vocabulary. In the following example, we're finding the words in *Dracula* closest to "basketball":

In [391]:
# what's the closest equivalent of basketball?
spacy_closest(tokens, vec("basketball"))

[u'tennis',
 u'coach',
 u'game',
 u'teams',
 u'Junior',
 u'junior',
 u'Team',
 u'school',
 u'boys',
 u'leagues']

### Fun with spaCy, Dracula, and vector arithmetic

Now we can start doing vector arithmetic and finding the closest words to the resulting vectors. For example, what word is closest to the halfway point between day and night?

In [393]:
# halfway between day and night
spacy_closest(tokens, meanv([vec("day"), vec("night")]))

[u'night',
 u'day',
 u'Day',
 u'evening',
 u'Evening',
 u'Morning',
 u'morning',
 u'afternoon',
 u'Nights',
 u'nights']

Variations of `night` and `day` are still closest, but after that we get words like `evening` and `morning`, which are indeed halfway between day and night!
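
The `meanv` helper used above, along with the `addv` and `subtractv` helpers used later in this section, is not defined in the cells shown here. If you're following along without them, a minimal numpy-based sketch could look like the following. This is an assumption about their behavior, inferred from how they're called, and not necessarily the original definitions.

```python
import numpy as np

def addv(coord1, coord2):
    # element-wise sum of two vectors
    return np.asarray(coord1) + np.asarray(coord2)

def subtractv(coord1, coord2):
    # element-wise difference of two vectors
    return np.asarray(coord1) - np.asarray(coord2)

def meanv(coords):
    # element-wise mean of a list of equal-length vectors
    return np.mean(coords, axis=0)

# made-up 2-dimensional vectors, for illustration only
day = np.array([1.0, 0.0])
night = np.array([0.0, 1.0])
midpoint = meanv([day, night])
```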

Here are the closest words in _Dracula_ to "wine":

In [395]:
spacy_closest(tokens, vec("wine"))

[u'wine',
 u'beer',
 u'bottle',
 u'Drink',
 u'drink',
 u'fruit',
 u'bottles',
 u'taste',
 u'coffee',
 u'tasted']

If you subtract "alcohol" from "wine" and find the closest words to the resulting vector, you're left with simply a lovely dinner:

In [397]:
spacy_closest(tokens, subtractv(vec("wine"), vec("alcohol")))

[u'wine',
 u'Dinner',
 u'dinner',
 u'lovely',
 u'delicious',
 u'salad',
 u'treasure',
 u'wonderful',
 u'Wonderful',
 u'cheese']

The closest words to "water":

In [398]:
spacy_closest(tokens, vec("water"))

[u'water',
 u'waters',
 u'salt',
 u'Salt',
 u'dry',
 u'liquid',
 u'ocean',
 u'boiling',
 u'heat',
 u'sand']

But if you add "frozen" to "water," then "cold" and "ice" rise to the top of the list, right behind "water" itself:

In [400]:
spacy_closest(tokens, addv(vec("water"), vec("frozen")))

[u'water',
 u'cold',
 u'ice',
 u'salt',
 u'Salt',
 u'dry',
 u'fresh',
 u'liquid',
 u'boiling',
 u'milk']

You can even do analogies! For example, the words most similar to "grass":

In [401]:
spacy_closest(tokens, vec("grass"))

[u'grass',
 u'lawn',
 u'trees',
 u'garden',
 u'GARDEN',
 u'sand',
 u'tree',
 u'soil',
 u'Green',
 u'green']

If you take the difference of "blue" and "sky" and add it to "grass," the analogous word ("green") comes out right behind "grass" itself:

In [280]:
# analogy: blue is to sky as X is to grass
blue_to_sky = subtractv(vec("blue"), vec("sky"))
spacy_closest(tokens, addv(blue_to_sky, vec("grass")))

[u'grass',
 u'Green',
 u'green',
 u'GREEN',
 u'yellow',
 u'red',
 u'Red',
 u'purple',
 u'lawn',
 u'pink']
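
Notice that the query words themselves (`wine`, `water`, `grass`) tend to top these lists, because the result of the arithmetic usually stays close to where it started. A common convention is to leave the input words out of the ranking. The sketch below illustrates that with made-up three-dimensional vectors; `toy` and `closest_excluding` are hypothetical names, and the numbers are chosen by hand rather than taken from GloVe.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosine(v1, v2):
    # same definition as earlier in this section
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0

# hand-made vectors; the axes loosely mean (plant-ness, sky-ness, colour-ness)
toy = {
    "sky":   np.array([0.0, 4.0, 0.0]),
    "blue":  np.array([0.0, 3.5, 1.0]),
    "grass": np.array([4.0, 0.0, 0.0]),
    "green": np.array([2.0, 0.0, 2.0]),
    "lawn":  np.array([3.0, 2.0, 0.0]),
}

def closest_excluding(vectors, vec_to_check, exclude=(), n=10):
    # rank every word except the ones listed in `exclude`
    candidates = [w for w in vectors if w not in exclude]
    return sorted(candidates,
                  key=lambda w: cosine(vec_to_check, vectors[w]),
                  reverse=True)[:n]

# analogy: blue is to sky as X is to grass
target = toy["blue"] - toy["sky"] + toy["grass"]
with_inputs = closest_excluding(toy, target, n=2)
without_inputs = closest_excluding(toy, target,
                                   exclude=("blue", "sky", "grass"), n=2)
```

With these toy numbers, `grass` leads the unfiltered ranking, and `green` moves to the front once the three input words are excluded.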

## Sentence similarity

To get the vector for a sentence, we simply average its component vectors, like so:

In [303]:
def sentvec(s):
    sent = nlp(s)
    return meanv([w.vector for w in sent])
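
One consequence of averaging is worth keeping in mind: the result ignores word order, so two sentences built from the same words get identical vectors. The sketch below shows this with made-up word vectors. `toy_vectors` and `toy_sentvec` are hypothetical stand-ins for spaCy's vectors and the `sentvec()` function above, and the whitespace split is a simplification of spaCy's tokenizer.

```python
import numpy as np

# made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 0.0, 1.0]),
}

def toy_sentvec(s):
    # split on whitespace and average the word vectors
    words = s.lower().split()
    return np.mean([toy_vectors[w] for w in words], axis=0)

v1 = toy_sentvec("dog bites man")
v2 = toy_sentvec("man bites dog")
same = np.allclose(v1, v2)   # the two averages are the same vector
```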

Let's find the sentence in our text file that is closest in "meaning" to an arbitrary input sentence. First, we'll get the list of sentences:

In [304]:
sentences = list(doc.sents)

The following function takes a list of sentences from a spaCy parse and compares them to an input sentence, sorting them by cosine similarity.

In [402]:
def spacy_closest_sent(space, input_str, n=10):
    input_vec = sentvec(input_str)
    return sorted(space,
                  key=lambda x: cosine(np.mean([w.vector for w in x], axis=0), input_vec),
                  reverse=True)[:n]

Here are the sentences in *Dracula* closest in meaning to "My favorite food is strawberry ice cream." (Extra linebreaks are present because we didn't strip them out when we originally read in the source text.)

In [315]:
for sent in spacy_closest_sent(sentences, "My favorite food is strawberry ice cream."):
    print sent.text
    print "---"

This, with some cheese
and a salad and a bottle of old Tokay, of which I had two glasses, was
my supper.
---
I got a cup of tea at the Aërated Bread Company
and came down to Purfleet by the next train.


---
We get hot soup, or coffee, or tea; and
off we go.
---
There is not even a toilet glass on my
table, and I had to get the little shaving glass from my bag before I
could either shave or brush my hair.
---
My own heart grew cold as ice,
and I could hear the gasp of Arthur, as we recognised the features of
Lucy Westenra.
---
I dined on what they
called "robber steak"--bits of bacon, onion, and beef, seasoned with red
pepper, and strung on sticks and roasted over the fire, in the simple
style of the London cat's meat!
---
I believe they went to the trouble of putting an
extra amount of garlic into our food; and I can't abide garlic.
---
Drink it off, like a good
child.
---
I had for dinner, or
rather supper, a chicken done up some way with red pepper, which was
very g

## Further resources

* [Word2vec](https://en.wikipedia.org/wiki/Word2vec) is another procedure for producing word vectors which uses a predictive approach rather than a context-counting approach. [This paper](http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf) compares and contrasts the two approaches. (Spoiler: it's kind of a wash.)
* If you want to train your own word vectors on a particular corpus, the popular Python library [gensim](https://radimrehurek.com/gensim/) has an implementation of Word2Vec that is relatively easy to use. [There's a good tutorial here.](https://rare-technologies.com/word2vec-tutorial/)
* When you're working with vector spaces with high dimensionality and millions of vectors, iterating through your entire space calculating cosine similarities can be a drag. I use [Annoy](https://pypi.python.org/pypi/annoy) to make these calculations faster, and you should consider using it too.