# Word Vectors
Word vectors, also called *word embeddings*, are numerical representations of individual words, built so that words appearing in similar contexts end up with similar values. This lets us reason about *context* mathematically. As mentioned above, the word vector for "lion" will be closer in value to "cat" than to "dandelion".

## Vector values
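So what does a word vector look like? The spaCy model loaded here uses 300 dimensions, so each vector is stored as a 300-item array. The example below shows the vector of a whole Doc, which spaCy computes by default as the average of its token vectors; the vector of an individual token has the same shape.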

In [2]:
# Import spaCy and load the language library
import spacy
nlp = spacy.load('en_core_web_md')  # use a medium or large model; the small models do not ship with word vectors

In [3]:
doc = nlp(u'The quick brown fox jumped over the lazy dogs.')

doc.vector

array([-1.96635887e-01, -2.32740352e-03, -5.36607020e-02, -6.10564947e-02,
       -4.08843048e-02,  1.45266443e-01, -1.08268000e-01, -6.27789786e-03,
        1.48455709e-01,  1.90697408e+00, -2.57692993e-01, -1.95818534e-03,
       -1.16141019e-02, -1.62858292e-01, -1.62938282e-01,  1.18210977e-02,
        5.12646027e-02,  1.00078702e+00, -2.01447997e-02, -2.54611671e-01,
       -1.28316596e-01, -1.97198763e-02, -2.89733019e-02, -1.94347113e-01,
        1.26644447e-01, -8.69869068e-02, -2.20812604e-01, -1.58452198e-01,
        9.86308008e-02, -1.79210991e-01, -1.55290633e-01,  1.95643142e-01,
        2.66436003e-02, -1.64984968e-02,  1.18824698e-01, -1.17830629e-03,
        4.99809943e-02, -4.23077159e-02, -3.86111848e-02, -7.47400150e-03,
        1.23448208e-01,  9.60620027e-03, -3.32463719e-02, -1.77848607e-01,
        1.19390726e-01,  1.87545009e-02, -1.84173390e-01,  6.91781715e-02,
        1.28520593e-01,  1.48827005e-02, -1.78013414e-01,  1.10003807e-01,
       -3.35464999e-02, -

In [4]:
doc.vector.shape

(300,)
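
To make the averaging concrete, here is a minimal sketch in plain Python. It assumes the default behavior in which a Doc vector is the element-wise mean of its token vectors; `average_vectors` and the two-dimensional toy vectors are hypothetical stand-ins, not spaCy code or real embeddings.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(component) / n for component in zip(*vectors)]

# Made-up 2-dimensional "token vectors" (real ones have 300 dimensions)
toy_token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

toy_doc_vector = average_vectors(toy_token_vectors)
print(toy_doc_vector)  # [3.0, 4.0]
```

The result has the same length as each input vector, which is why `doc.vector.shape` matches the shape of a single token vector.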

## Identifying similar vectors
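A convenient way to expose vector relationships is the `.similarity()` method of Doc tokens, which by default returns the cosine similarity of the two token vectors: 1.0 for vectors pointing in the same direction, lower values for less similar ones.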

In [5]:
# Create a three-token Doc object:
tokens = nlp(u'car bike bus')

# Iterate through token combinations:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

car car 1.0
car bike 0.5357731
car bus 0.48169607
bike car 0.5357731
bike bike 1.0
bike bus 0.39376006
bus car 0.48169607
bus bike 0.39376006
bus bus 1.0
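
The scores above are cosine similarities. As a rough sketch of what that measure computes, the plain-Python function below reproduces the formula on made-up three-dimensional vectors. The names `toy_car`, `toy_bike`, and `toy_tree` are illustrative assumptions, not values taken from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only
toy_car = [1.0, 2.0, 0.0]
toy_bike = [2.0, 4.0, 0.0]   # same direction as toy_car, twice as long
toy_tree = [0.0, 0.0, 3.0]   # orthogonal to toy_car

same_direction = cosine_similarity(toy_car, toy_bike)  # approximately 1.0
orthogonal = cosine_similarity(toy_car, toy_tree)      # 0.0
print(same_direction, orthogonal)
```

Because the measure depends only on direction, not length, scaling a vector leaves its similarity to other vectors unchanged. This is also why every token scores 1.0 against itself.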


## Vector norms
It's sometimes helpful to aggregate the 300 dimensions into a single [Euclidean (L2) norm](https://en.wikipedia.org/wiki/Norm_%28mathematics%29#Euclidean_norm), computed as the square root of the sum of the squared vector components. This is accessible as the `.vector_norm` token attribute. Other helpful attributes include `.has_vector` and `.is_oov` (*out of vocabulary*).

In [6]:
tokens = nlp(u'car bike bus')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

car True 7.149045 False
bike True 7.237066 False
bus True 7.0950456 False
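
For reference, the L2 norm described above can be sketched in a few lines of plain Python. `l2_norm` and the toy vector are hypothetical illustrations, not part of spaCy.

```python
import math

def l2_norm(vector):
    """Square root of the sum of the squared components."""
    return math.sqrt(sum(x * x for x in vector))

# A made-up 2-dimensional vector: sqrt(3^2 + 4^2) = 5
toy_vector = [3.0, 4.0]
norm = l2_norm(toy_vector)
print(norm)  # 5.0
```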


## Vector arithmetic
Believe it or not, we can actually calculate new vectors by adding & subtracting related vectors. A famous example suggests
<pre>"king" - "man" + "woman" = "queen"</pre>

In [11]:
from scipy import spatial

cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

# Now we find the closest vectors in the vocabulary to the result of "king" - "man" + "woman"
new_vector = king - man + woman
computed_similarities = []

for word in nlp.vocab:
    # Keep only lowercase, alphabetic words that have a vector:
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word, similarity))

computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

print([w[0].text for w in computed_similarities[:10]])

['king', 'woman', 'she', 'who', 'fox', 'brown', 'when', 'dare', 'was', 'not']
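
Note that "queen" does not appear in the list above. One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that recent spaCy versions load vocabulary entries lazily, so iterating over `nlp.vocab` may only visit words already seen in this session (which would account for "fox" and "brown" from the earlier sentence).

To show the idea working end to end, here is a self-contained sketch on a tiny hand-made vocabulary. The four dimensions (royalty, male, female, fruit) and all vector values are invented so that the arithmetic comes out cleanly; real embeddings are not this tidy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical toy vocabulary: [royalty, male, female, fruit]
toy_vocab = {
    'king':  [1.0, 1.0, 0.0, 0.0],
    'queen': [1.0, 0.0, 1.0, 0.0],
    'man':   [0.0, 1.0, 0.0, 0.0],
    'woman': [0.0, 0.0, 1.0, 0.0],
    'apple': [0.0, 0.0, 0.0, 1.0],
}

# "king" - "man" + "woman", computed component by component
new_vector = [k - m + w for k, m, w in
              zip(toy_vocab['king'], toy_vocab['man'], toy_vocab['woman'])]

# Rank every word in the toy vocabulary by similarity to the result
ranked = sorted(
    ((word, cosine_similarity(new_vector, vec)) for word, vec in toy_vocab.items()),
    key=lambda item: -item[1],
)
print([word for word, _ in ranked[:3]])
```

In this toy setup the result vector coincides with "queen", so it ranks first, followed by "woman" and "king". With real vectors the input words often rank near the top as well, which is why many implementations exclude them from the candidates.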
