## **Word vectors and Semantic similarity**

**Comparing semantic similarity**
- `spaCy` can compare two objects and predict how similar they are
  - `Doc.similarity()`
  - `Span.similarity()`
  - `Token.similarity()`
- Returns a similarity score, typically between `0` and `1` (higher means more similar; the underlying cosine measure can in principle be negative)
- Requires a pipeline that includes word vectors:
  - ✅ `en_core_web_md` (Medium)
  - ✅ `en_core_web_lg` (Large)
  - ❌ `en_core_web_sm` (Small)
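
One way to check whether a loaded pipeline actually carries word vectors is to look at the shape of its vector table. The snippet below is only a minimal sketch: `has_word_vectors` is a hypothetical helper (not part of `spaCy`), and `spacy.blank("en")` is used as a stand-in for a pipeline without vectors so that no model download is needed.

```python
import spacy


def has_word_vectors(nlp):
    """Hypothetical helper: True if the vocab holds a non-empty vector table."""
    n_rows, n_dims = nlp.vocab.vectors.shape
    return n_rows > 0 and n_dims > 0


# A blank English pipeline ships no vectors (stand-in for a vector-less pipeline).
blank_nlp = spacy.blank("en")
print(has_word_vectors(blank_nlp))
```

With `en_core_web_md` or `en_core_web_lg` loaded instead, the same check should report that vectors are available.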

In [1]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [2]:
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")

**Check similarity between 2 documents**

In [6]:
# Compare 2 documents
print(f"Doc similarity: {doc1.similarity(doc2):.3f}")

Doc similarity: 0.870


**Check similarity between 2 tokens**

In [5]:
# Compare 2 tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]

In [7]:
print(f"Token similarity: {token1.similarity(token2):.3f}")

Token similarity: 0.685


**Check similarity between a document and a token**

In [8]:
# Compare a token with a document
doc = nlp("I like pizza")
token = nlp("soap")[0]

In [9]:
print(f"Doc Token similarity: {doc.similarity(token):.3f}")

Doc Token similarity: 0.182


**Check similarity between a span and a document**

In [10]:
# Compare a span with a document
span = nlp("I like pizza, extra salami")[2:]
doc = nlp("McDonalds sells burgers")

In [11]:
print(f"Span Document similarity: {span.similarity(doc):.3f}")

Span Document similarity: 0.506


**How does `spaCy` predict similarity?**
- Similarity is determined using **word vectors**
- Word vectors are multi-dimensional representations of word meaning
- Generated using an algorithm like `Word2Vec` trained on lots of text
- Default: cosine similarity, but can be adjusted
- `Doc` and `Span` vectors default to the average of their token vectors
- Short phrases work better than long documents with many irrelevant words, since those words dilute the average
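
The points above can be illustrated without `spaCy` at all. The following is a rough sketch in plain Python, not `spaCy`'s actual implementation: `cosine_similarity` and `average_vector` are hypothetical helpers, the 3-dimensional vectors are made-up toy values rather than output from a real model, and returning `0.0` for zero vectors is just a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Toy 3-dimensional "word vectors" (made-up values, not from any real model).
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.0],
    "the": [0.0, 0.1, 0.9],
    "of": [0.0, 0.2, 0.8],
}

# A one-word "phrase" versus a longer one padded with unrelated words.
short_phrase = average_vector([toy_vectors["pizza"]])
long_phrase = average_vector([toy_vectors[w] for w in ("pizza", "the", "of")])
target = toy_vectors["pasta"]

print(f"short vs pasta: {cosine_similarity(short_phrase, target):.3f}")
print(f"long vs pasta:  {cosine_similarity(long_phrase, target):.3f}")
```

In this toy setup the padded phrase scores lower against `pasta` than the single word does, because the unrelated words pull the averaged vector away from the food direction. Two vectors pointing in opposite directions would give `-1`, which is why the score is not strictly bounded below by `0`.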

### **Word Vectors with `spaCy`**

In [12]:
doc = nlp("I have a banana")
doc[3].vector

array([ 0.20778 , -2.4151  ,  0.36605 ,  2.0139  , -0.23752 , -3.1952  ,
       -0.2952  ,  1.2272  , -3.4129  , -0.54969 ,  0.32634 , -1.0813  ,
        0.55626 ,  1.5195  ,  0.97797 , -3.1816  , -0.37207 , -0.86093 ,
        2.1509  , -4.0845  ,  0.035405,  3.5702  , -0.79413 , -1.7025  ,
       -1.6371  , -3.198   , -1.9387  ,  0.91166 ,  0.85409 ,  1.8039  ,
       -1.103   , -2.5274  ,  1.6365  , -0.82082 ,  1.0278  , -1.705   ,
        1.5511  , -0.95633 , -1.4702  , -1.865   , -0.19324 , -0.49123 ,
        2.2361  ,  2.2119  ,  3.6654  ,  1.7943  , -0.20601 ,  1.5483  ,
       -1.3964  , -0.50819 ,  2.1288  , -2.332   ,  1.3539  , -2.1917  ,
        1.8923  ,  0.28472 ,  0.54285 ,  1.2309  ,  0.26027 ,  1.9542  ,
        1.1739  , -0.40348 ,  3.2028  ,  0.75381 , -2.7179  , -1.3587  ,
       -1.1965  , -2.0923  ,  2.2855  , -0.3058  , -0.63174 ,  0.70083 ,
        0.16899 ,  1.2325  ,  0.97006 , -0.23356 , -2.094   , -1.737   ,
        3.6075  , -1.511   , -0.9135  ,  0.53878 , 