#### Word vectors and semantic similarity

In this lesson, you'll learn how to use spaCy to predict how similar documents, spans or tokens are to each other.

You'll also learn how to use word vectors and how to take advantage of them in your NLP application.

spaCy can compare two objects and predict how similar they are – for example, documents, spans or single tokens.

The Doc, Token and Span objects have a .similarity method that takes another object and returns a floating point number indicating how similar they are. The score usually falls between 0 and 1, and higher means more similar.

One thing that's very important: to use similarity, you need a larger spaCy pipeline that has word vectors included, for example the medium or large English pipeline, but not the small one.

So if you want to use vectors, always go with a pipeline that ends in "md" or "lg". You can find more details on this in the documentation.

#### Comparing semantic similarity
- spaCy can compare two objects and predict similarity
- Doc.similarity(), Span.similarity() and Token.similarity()
- Take another object and return a similarity score (usually 0 to 1)

Important: needs a pipeline that has word vectors included, for example:
- ✅ en_core_web_md (medium)
- ✅ en_core_web_lg (large)
- 🚫 NOT en_core_web_sm (small)
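
The requirement above can be checked in code before any similarity call is made. The following is a minimal sketch, not spaCy's own mechanism: the helper names `pipeline_has_vectors` and `load_with_vectors` are hypothetical, and the check assumes that a pipeline without word vectors reports an empty table through `nlp.vocab.vectors.shape`.

```python
def pipeline_has_vectors(nlp):
    """Return True if the pipeline's vocabulary has a non-empty vectors table.

    Assumption: pipelines without word vectors expose an empty table,
    i.e. zero rows and/or zero dimensions in nlp.vocab.vectors.shape.
    """
    n_rows, n_dims = nlp.vocab.vectors.shape
    return n_rows > 0 and n_dims > 0


def load_with_vectors(name="en_core_web_md"):
    """Load a pipeline by name and fail early if it has no word vectors."""
    import spacy  # imported lazily so the helper above works without spaCy

    nlp = spacy.load(name)
    if not pipeline_has_vectors(nlp):
        raise ValueError(
            f"{name!r} has no word vectors; try a pipeline ending in 'md' or 'lg'"
        )
    return nlp
```

With the pipelines listed above installed, `load_with_vectors("en_core_web_md")` would be expected to return the pipeline, while `load_with_vectors("en_core_web_sm")` would be expected to raise the error.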

#### Here's an example.

Let's say we want to find out whether two documents are similar.

First, we load the medium English pipeline, "en_core_web_md".

We can then create two doc objects and use the first doc's similarity method to compare it to the second.

Here, a fairly high similarity score of about 0.87 is predicted for "I like fast food" and "I like pizza".

The same works for tokens.

According to the word vectors, the tokens "pizza" and "pasta" are fairly similar and receive a score of about 0.69.

#### Similarity examples

In [4]:
import spacy

In [6]:
nlp = spacy.load("en_core_web_md")

In [8]:
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")

print(doc1.similarity(doc2))

0.8698332283318978


In [9]:
doc = nlp("I like pizza and pasta")

token1 = doc[2]
token2 = doc[4]

print(token1.similarity(token2))

0.685019850730896


You can also use the similarity methods to compare different types of objects.

For example, a document and a token.

Here, the similarity score is pretty low and the two objects are considered fairly dissimilar.

Here's another example comparing a span – "pizza and pasta" – to a document about McDonalds.

The score returned here is about 0.47, so the two are considered only moderately similar.

In [10]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]

print(doc.similarity(token))

0.1821369691957915


In [11]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")

print(span.similarity(doc))

0.47190033157126826


But how does spaCy do this under the hood?

Similarity is determined using word vectors, multi-dimensional representations of meanings of words.

You might have heard of Word2Vec, which is an algorithm that's often used to train word vectors from raw text.

Vectors can be added to spaCy's pipelines.

By default, the similarity returned by spaCy is the cosine similarity between two vectors – but this can be adjusted if necessary.
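
To make the cosine similarity mentioned above concrete, here is a minimal pure-Python sketch of the formula. It is an illustration rather than spaCy's actual implementation, and returning 0.0 for a zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention for this sketch: a zero vector is similar to nothing
        return 0.0
    return dot / (norm_a * norm_b)


# Same direction, different length: maximally similar
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))   # 1.0
# Perpendicular vectors: no similarity
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
# Opposite directions: the score can drop below zero
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0
```

Because only the direction of the vectors matters, two vectors of very different magnitude can still score as highly similar.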

Vectors for objects consisting of several tokens, like the Doc and Span, default to the average of their token vectors.

That's also why you usually get more value out of shorter phrases with fewer irrelevant words.
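
The effect of averaging can be sketched with a few made-up vectors. Everything in this example is an assumption for illustration: the three-dimensional `toy_vectors` are invented and not taken from any spaCy pipeline, and the helper functions are plain Python stand-ins.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


# Invented 3-D "word vectors", purely for illustration
toy_vectors = {
    "pizza": [1.0, 0.0, 0.0],
    "really": [0.0, 1.0, 0.0],
    "yesterday": [0.0, 0.0, 1.0],
}

query = toy_vectors["pizza"]
short_phrase = average_vector([toy_vectors["pizza"]])
long_phrase = average_vector(
    [toy_vectors[word] for word in ("pizza", "really", "yesterday")]
)

short_score = cosine_similarity(query, short_phrase)
long_score = cosine_similarity(query, long_phrase)

# The unrelated words pull the average away from "pizza"
print(short_score > long_score)  # True
```

In this toy setup, each extra unrelated word moves the averaged vector further from the word that matters, which is one way to picture why shorter, focused phrases tend to compare better.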

#### How does spaCy predict similarity?
- Similarity is determined using word vectors
- Multi-dimensional meaning representations of words
- Generated using an algorithm like Word2Vec and lots of text
- Can be added to spaCy's pipelines
- Default: cosine similarity, but can be adjusted
- Doc and Span vectors default to average of token vectors
- Short phrases are better than long documents with many irrelevant words

To give you an idea of what those vectors look like, here's an example.

1. First, we load the medium pipeline again, which ships with word vectors.

2. Next, we can process a text and look up a token's vector using the .vector attribute.

3. The result is a 300-dimensional vector of the word "banana".

In [12]:
nlp = spacy.load("en_core_web_md")

In [13]:
doc = nlp("I like banana")

In [15]:
doc[2].vector

array([ 0.20778 , -2.4151  ,  0.36605 ,  2.0139  , -0.23752 , -3.1952  ,
       -0.2952  ,  1.2272  , -3.4129  , -0.54969 ,  0.32634 , -1.0813  ,
        0.55626 ,  1.5195  ,  0.97797 , -3.1816  , -0.37207 , -0.86093 ,
        2.1509  , -4.0845  ,  0.035405,  3.5702  , -0.79413 , -1.7025  ,
       -1.6371  , -3.198   , -1.9387  ,  0.91166 ,  0.85409 ,  1.8039  ,
       -1.103   , -2.5274  ,  1.6365  , -0.82082 ,  1.0278  , -1.705   ,
        1.5511  , -0.95633 , -1.4702  , -1.865   , -0.19324 , -0.49123 ,
        2.2361  ,  2.2119  ,  3.6654  ,  1.7943  , -0.20601 ,  1.5483  ,
       -1.3964  , -0.50819 ,  2.1288  , -2.332   ,  1.3539  , -2.1917  ,
        1.8923  ,  0.28472 ,  0.54285 ,  1.2309  ,  0.26027 ,  1.9542  ,
        1.1739  , -0.40348 ,  3.2028  ,  0.75381 , -2.7179  , -1.3587  ,
       -1.1965  , -2.0923  ,  2.2855  , -0.3058  , -0.63174 ,  0.70083 ,
        0.16899 ,  1.2325  ,  0.97006 , -0.23356 , -2.094   , -1.737   ,
        3.6075  , -1.511   , -0.9135  ,  0.53878 , 

#### Similarity depends on the application context

Predicting similarity can be useful for many types of applications. For example, you can recommend similar texts to a user based on the ones they have read. It can also be helpful to flag duplicate content, like posts on an online platform.

However, it's important to keep in mind that there's no objective definition of what's similar and what isn't. It always depends on the context and what your application needs to do.

Here's an example: spaCy's default word vectors assign a very high similarity score to "I like cats" and "I hate cats". This makes sense, because both texts express sentiment about cats. But in a different application context, you might want to consider the phrases as very dissimilar, because they talk about opposite sentiments.

- Useful for many applications: recommendation systems, flagging duplicates etc.
- There's no objective definition of "similarity"
- Depends on the context and what the application needs to do

In [16]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))

0.9530094042245597
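
As a rough sketch of the duplicate-flagging use case, the function below compares every pair of texts with a scoring function you pass in and reports the pairs at or above a threshold. The names `flag_near_duplicates` and `token_overlap` are hypothetical, the 0.9 threshold is an arbitrary assumption, and `token_overlap` is a simple word-overlap stand-in rather than a vector-based score.

```python
from itertools import combinations


def flag_near_duplicates(texts, similarity, threshold=0.9):
    """Return (i, j, score) for each pair of texts scoring at or above threshold."""
    flagged = []
    for (i, text_a), (j, text_b) in combinations(enumerate(texts), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            flagged.append((i, j, score))
    return flagged


def token_overlap(text_a, text_b):
    """Stand-in scorer: shared share of lowercased whitespace-separated words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


posts = ["I like cats", "i like cats", "Pizza is great"]
print(flag_near_duplicates(posts, token_overlap))  # [(0, 1, 1.0)]
```

To use word vectors instead, you could pass a function such as `lambda a, b: nlp(a).similarity(nlp(b))`. Given the score of roughly 0.95 shown above, a threshold of 0.9 would then flag "I like cats" and "I hate cats" as duplicates, so the right threshold and scorer depend on what your application treats as "the same".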
