#### Word vectors and semantic similarity

In this lesson, you'll learn how to use spaCy to predict how similar documents, spans or tokens are to each other.

You'll also learn how to use word vectors and how to take advantage of them in your NLP application.

spaCy can compare two objects and predict how similar they are – for example, documents, spans or single tokens.

The Doc, Token and Span objects have a .similarity method that takes another object and returns a floating-point number, generally between 0 and 1, indicating how similar they are.
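To give a sense of what such a score measures: by default, spaCy computes similarity as the cosine similarity between two vectors. The following is a minimal plain-Python sketch of that calculation, not spaCy's own implementation. The cosine_similarity helper and the three-dimensional vectors are made up for illustration; real pipeline vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction, so treat it as "not similar"
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not taken from any real pipeline)
pizza = [0.9, 0.8, 0.1]
pasta = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(pizza, pasta))  # vectors point in a similar direction
print(cosine_similarity(pizza, car))    # vectors point in different directions
```

Vectors that point in similar directions score close to 1. Cosine similarity can in principle be negative, although scores for real word vectors mostly fall between 0 and 1.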

One thing that's very important: In order to use similarity, you need a larger spaCy pipeline that has word vectors included.

For example, the medium or large English pipeline will work, but the small one won't. So if you want to use vectors, always go with a pipeline that ends in "md" or "lg". You can find more details on this in the documentation.

#### Comparing semantic similarity
- spaCy can compare two objects and predict similarity
- Doc.similarity(), Span.similarity() and Token.similarity()
- Take another object and return a similarity score (generally 0 to 1)

Important: needs a pipeline that has word vectors included, for example:
- ✅ en_core_web_md (medium)
- ✅ en_core_web_lg (large)
- 🚫 NOT en_core_web_sm (small)

#### Here's an example.

Let's say we want to find out whether two documents are similar.

First, we load the medium English pipeline, "en_core_web_md".

We can then create two doc objects and use the first doc's similarity method to compare it to the second.

Here, a fairly high similarity score of about 0.87 is predicted for "I like fast food" and "I like pizza".

The same works for tokens.

According to the word vectors, the tokens "pizza" and "pasta" are moderately similar, and receive a score of about 0.69.
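One way to see why two documents that share words tend to score highly: by default, spaCy represents a Doc or Span by the average of its token vectors and compares those averages. The sketch below mimics that with plain Python. The toy_vectors table, the mean_vector and doc_vector helpers, and the whitespace tokenization are all assumptions made for illustration, not spaCy code or real vector values.

```python
import math


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented 3-dimensional word vectors (not from a real pipeline)
toy_vectors = {
    "I": [0.5, 0.1, 0.0],
    "like": [0.4, 0.3, 0.1],
    "fast": [0.1, 0.2, 0.7],
    "food": [0.1, 0.9, 0.3],
    "pizza": [0.2, 0.8, 0.4],
}


def doc_vector(text):
    """Average the vectors of the whitespace-separated words in text."""
    return mean_vector([toy_vectors[word] for word in text.split()])


doc1 = doc_vector("I like fast food")
doc2 = doc_vector("I like pizza")

# The shared words "I" and "like" pull the two averages close together
print(cosine_similarity(doc1, doc2))
print(cosine_similarity(toy_vectors["fast"], toy_vectors["pizza"]))
```

Because both texts contain "I" and "like", their averaged vectors end up close to each other even though the remaining words differ, which is one reason short texts with shared words can receive high scores.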

#### Similarity examples

In [4]:
import spacy

In [6]:
# Load the medium English pipeline, which includes word vectors
nlp = spacy.load("en_core_web_md")

In [8]:
# Process two texts
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")

# Compare the two documents
print(doc1.similarity(doc2))

0.8698332283318978


In [9]:
# Process a text and pick out two tokens by index
doc = nlp("I like pizza and pasta")

token1 = doc[2]  # "pizza"
token2 = doc[4]  # "pasta"

# Compare the two tokens
print(token1.similarity(token2))

0.685019850730896
