### Word vectors and similarity 

To learn more about word vectors, how to **customize them** and how to load
**your own vectors** into spaCy, see the usage guide on
[using word vectors and semantic similarities](/usage/linguistic-features#vectors-similarity).



## Word vectors and similarity

To use word vectors in spaCy, consider installing one of the larger models for your language, because the commonly installed default packages are the small models. The larger models can be installed as described on the [spaCy vectors page](https://spacy.io/usage/vectors-similarity):

    python -m spacy download en_core_web_lg

The large model *en_core_web_lg* contains several hundred thousand unique word vectors, each with 300 dimensions; the exact count depends on the model version.

Let us download the large model and import spaCy:

In [3]:
!python -m spacy download en_core_web_lg
import spacy

2022-10-20 10:15:00.377399: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1-py3-none-any.whl (587.7 MB)
     |████████████████████████████████| 587.7 MB 16 kB/s
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


We can now load the English NLP pipeline to process a list of words. The small spaCy models include only context-sensitive tensors, so we use the downloaded large model to get better word vectors. We load the large model as follows:

In [4]:
nlp = spacy.load('en_core_web_lg')
#nlp = spacy.load("en_core_web_sm")

We can process a list of words with the pipeline by calling the *nlp* object:

In [5]:
tokens = nlp(u'dog poodle beagle cat banana apple')

As described in the spaCy chapter *[Word Vectors and Semantic Similarity](https://spacy.io/usage/vectors-similarity)*, *Doc*, *Span*, and *Token* objects provide a *similarity()* method, which returns a similarity score between two objects. Here we compare every token with every other token:

In [6]:
for token1 in tokens:
    for token2 in tokens:
        print(token1, token2, token1.similarity(token2))

dog dog 1.0
dog poodle 0.6339900493621826
dog beagle 0.5964534282684326
dog cat 0.8220816850662231
dog banana 0.2090904712677002
dog apple 0.22881005704402924
poodle dog 0.6339900493621826
poodle poodle 1.0
poodle beagle 0.6217650771141052
poodle cat 0.6388016939163208
poodle banana 0.2899792790412903
poodle apple 0.2370169311761856
beagle dog 0.5964534282684326
beagle poodle 0.6217650771141052
beagle beagle 1.0
beagle cat 0.5943629145622253
beagle banana 0.10636148601770401
beagle apple 0.120062917470932
cat dog 0.8220816850662231
cat poodle 0.6388016939163208
cat beagle 0.5943629145622253
cat cat 1.0
cat banana 0.2235882580280304
cat apple 0.2036806046962738
banana dog 0.2090904712677002
banana poodle 0.2899792790412903
banana beagle 0.10636148601770401
banana cat 0.2235882580280304
banana banana 1.0
banana apple 0.6646699905395508
apple dog 0.22881005704402924
apple poodle 0.2370169311761856
apple beagle 0.120062917470932
apple cat 0.2036806046962738
apple banana 0.6646699905395508
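
By default, spaCy's *similarity()* is the cosine of the angle between the two vectors. The following is a minimal sketch of that computation in plain Python. The function name `cosine_similarity` and the three-dimensional toy vectors are illustrative assumptions; they are not taken from *en_core_web_lg*, so the printed numbers will differ from the output above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, NOT real model vectors.
toy = {
    "dog": [1.0, 0.9, 0.1],
    "cat": [0.9, 1.0, 0.2],
    "banana": [0.1, 0.2, 1.0],
}

# Same nested loop as in the cell above, but on the toy vectors.
for w1 in toy:
    for w2 in toy:
        print(w1, w2, round(cosine_similarity(toy[w1], toy[w2]), 3))
```

As in the real output, the measure is symmetric, and a vector compared with itself scores 1.0 up to floating-point rounding.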


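For each token we can check whether it has a vector, inspect the norm of that vector, and see whether the token is out-of-vocabulary. The vector itself is available through the *vector* attribute: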

In [12]:
tokens = nlp(u'dog cat banana grungle')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 75.254234 False
cat True 63.188496 False
banana True 31.620354 False
grungle False 0.0 True


The attribute *has_vector* returns a boolean that indicates whether the model contains a vector for the token. The token *grungle* has no vector. It is also out-of-vocabulary (OOV), as the fourth column shows. Consequently its vector norm (the L2 norm, i.e. the Euclidean length of the vector) is $0$.

Each token vector here has $300$ dimensions. We can print out the vector for a token:
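
The norm reported by *vector_norm* is the L2 norm, the square root of the sum of squared components. A small sketch in plain Python shows why an all-zero vector has norm $0$ and why a similarity computation needs a guard for that case. The helper names `l2_norm` and `safe_cosine`, the two-dimensional toy vectors, and the choice to return `0.0` for a zero vector are assumptions made for illustration, not a description of spaCy's internals.

```python
import math

def l2_norm(vector):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))

def safe_cosine(a, b):
    """Cosine similarity that returns 0.0 if either vector has zero norm."""
    norm_a, norm_b = l2_norm(a), l2_norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: no vector means no measurable similarity.
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

known = [3.0, 4.0]    # hypothetical vector of an in-vocabulary word
unknown = [0.0, 0.0]  # all-zero vector, as for an OOV token

print(l2_norm(known))               # 5.0
print(l2_norm(unknown))             # 0.0
print(safe_cosine(known, unknown))  # 0.0
```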

In [8]:
n = 0
print(tokens[n].text, len(tokens[n].vector), tokens[n].vector)

dog 300 [ 1.2330e+00  4.2963e+00 -7.9738e+00 -1.0121e+01  1.8207e+00  1.4098e+00
 -4.5180e+00 -5.2261e+00 -2.9157e-01  9.5234e-01  6.9880e+00  5.0637e+00
 -5.5726e-03  3.3395e+00  6.4596e+00 -6.3742e+00  3.9045e-02 -3.9855e+00
  1.2085e+00 -1.3186e+00 -4.8886e+00  3.7066e+00 -2.8281e+00 -3.5447e+00
  7.6888e-01  1.5016e+00 -4.3632e+00  8.6480e+00 -5.9286e+00 -1.3055e+00
  8.3870e-01  9.0137e-01 -1.7843e+00 -1.0148e+00  2.7300e+00 -6.9039e+00
  8.0413e-01  7.4880e+00  6.1078e+00 -4.2130e+00 -1.5384e-01 -5.4995e+00
  1.0896e+01  3.9278e+00 -1.3601e-01  7.7732e-02  3.2218e+00 -5.8777e+00
  6.1359e-01 -2.4287e+00  6.2820e+00  1.3461e+01  4.3236e+00  2.4266e+00
 -2.6512e+00  1.1577e+00  5.0848e+00 -1.7058e+00  3.3824e+00  3.2850e+00
  1.0969e+00 -8.3711e+00 -1.5554e+00  2.0296e+00 -2.6796e+00 -6.9195e+00
 -2.3386e+00 -1.9916e+00 -3.0450e+00  2.4890e+00  7.3247e+00  1.3364e+00
  2.3828e-01  8.4388e-02  3.1480e+00 -1.1128e+00 -3.5598e+00 -1.2115e-01
 -2.0357e+00 -3.2731e+00 -7.7205e+00  4.094

Here is another example of similarities, this time for some well-known words:

In [9]:
tokens = nlp(u'queen king chef')

for token1 in tokens:
    for token2 in tokens:
        print(token1, token2, token1.similarity(token2))

queen queen 1.0
queen king 0.6108841896057129
queen chef 0.13113069534301758
king queen 0.6108841896057129
king king 1.0
king chef 0.04403642565011978
chef queen 0.13113069534301758
chef king 0.04403642565011978
chef chef 1.0


### Similarities in Context

In spaCy, the parsing, tagging, and NER models make use of vector representations of context that capture the *meaning of words*. Such a *meaning representation* of a text is an array of floats, i.e. a tensor, computed while the NLP pipeline processes the text. With this approach, words that have not been seen before can still be typed or classified. spaCy uses a 4-layer convolutional network to compute these tensors, so that each tensor models a context of up to four words to the left and right of a given word.
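
The link between four layers and a context of four words on each side can be sketched with simple arithmetic. The sketch assumes (this is not stated in the text above) that every convolutional layer sees exactly one neighbouring word on each side, so each additional layer widens the context by one word per side. The function name `receptive_field_per_side` is hypothetical.

```python
def receptive_field_per_side(depth, window=1):
    """Words visible on each side after `depth` stacked layers.

    Assumes each layer looks `window` words to the left and right.
    """
    return depth * window

for depth in range(1, 5):
    side = receptive_field_per_side(depth)
    # Total span = words on the left + the word itself + words on the right
    print("layers:", depth, "per side:", side, "total span:", 2 * side + 1)
```

Under this assumption, four layers yield four words on either side, which matches the context size given above.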

Let us use the example from the spaCy documentation and check the word *labrador*:

In [None]:
tokens = nlp(u'labrador')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

labrador True 6.850418 False


We can now test the effect of context by comparing each whole sentence with the word *dog*:

In [None]:
doc1 = nlp(u"The labrador barked.")
doc2 = nlp(u"The labrador swam.")
doc3 = nlp(u"The people on Labrador are Canadians.")

dog = nlp(u"dog")

for count, doc in enumerate([doc1, doc2, doc3], start=1):
    # Compare the whole sentence (the Doc) with the word "dog"
    print(str(count) + ":", doc.similarity(dog))

1: 0.6907751984080799
2: 0.5961927660740638
3: 0.5374588437026319


Using the same method, we can also compute similarities between whole documents or texts:

In [None]:
docs = ( nlp(u"Paris is the largest city in France."),
        nlp(u"Vilnius is the capital of Lithuania."),
        nlp(u"An emu is a large bird.") )

for x in range(len(docs)):
    zset = set(range(len(docs)))
    zset.remove(x)
    for y in zset:
        print(x, y, docs[x].similarity(docs[y]))

0 1 0.7554966079333336
0 2 0.6921463288355282
1 0 0.7554966079333336
1 2 0.5668025741640493
2 0 0.6921463288355282
2 1 0.5668025741640493


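We can also vary the word order in sentences and compare them. Sentences that contain the same words in a different order receive a similarity of practically 1.0, up to floating-point rounding: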

In [None]:
docs = [nlp(u"dog bites man"), nlp(u"man bites dog"),
        nlp(u"man dog bites"), nlp(u"cat eats mouse")]

for doc in docs:
    for other_doc in docs:
        print('"' + doc.text + '"', '"' + other_doc.text + '"', doc.similarity(other_doc))

"dog bites man" "dog bites man" 1.0
"dog bites man" "man bites dog" 0.9999999711588186
"dog bites man" "man dog bites" 1.000000047362914
"dog bites man" "cat eats mouse" 0.7096954239846529
"man bites dog" "dog bites man" 0.9999999711588186
"man bites dog" "man bites dog" 1.0
"man bites dog" "man dog bites" 1.0000000462548106
"man bites dog" "cat eats mouse" 0.709695423198237
"man dog bites" "dog bites man" 1.000000047362914
"man dog bites" "man bites dog" 1.0000000462548106
"man dog bites" "man dog bites" 1.0
"man dog bites" "cat eats mouse" 0.7096954242750528
"cat eats mouse" "dog bites man" 0.7096954239846529
"cat eats mouse" "man bites dog" 0.709695423198237
"cat eats mouse" "man dog bites" 0.7096954242750528
"cat eats mouse" "cat eats mouse" 1.0

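The near-identical scores for the reordered sentences follow from how a document vector is built: by default, spaCy's *Doc.vector* is the average of the token vectors, and an average does not depend on word order. A minimal sketch in plain Python illustrates this. The helper names and the three-dimensional toy vectors are assumptions for illustration and are not taken from the model.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy word vectors, NOT real model vectors.
toy = {
    "dog": [1.0, 0.9, 0.1],
    "bites": [0.2, 0.1, 0.7],
    "man": [0.8, 0.3, 0.4],
}

def doc_vector(text):
    """Average the toy vectors of the whitespace-separated words."""
    return mean_vector([toy[word] for word in text.split()])

a = doc_vector("dog bites man")
b = doc_vector("man bites dog")

# Equal to 1.0 up to floating-point rounding: the word order is lost.
print(cosine_similarity(a, b))
```

A representation that should distinguish "dog bites man" from "man bites dog" therefore needs more than averaged word vectors.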

### Custom Models