# Text similarity with spaCy

spaCy provides several word vector models (https://spacy.io/usage/vectors-similarity). It lets us estimate how similar two words are by comparing their word vectors.

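First, download the model. The cell below installs en_core_web_lg, a large English model that ships with word vectors (the vectors-only package en_vectors_web_lg, available for spaCy 2.x, includes over 1 million unique vectors). The download is about 828 MB, so it takes a few minutes.

Then restart the Colab runtime so that the newly installed model can be loaded: in the Colab menu, choose Runtime > Restart runtime...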




In [1]:
!python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9MB)
     |████████████████████████████████| 827.9MB 1.3MB/s 
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-cp37-none-any.whl size=829180945 sha256=d0c70060e27fb263bba1897ae4439505e659858a9d84b7c763dc0e5a9102a222
  Stored in directory: /tmp/pip-ephem-wheel-cache-6yktiegr/wheels/2a/c1/a6/fc7a877b1efca9bc6a089d6f506f16d3868408f9ff89f8dbfc
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


**After the restart**, load the model (see https://spacy.io/usage/models for the available models):


In [1]:
import spacy
nlp = spacy.load("en_core_web_lg")

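spaCy parses an input text and exposes information about each token, such as its lemma, whether it has a word vector, and whether it is out of vocabulary.

See https://spacy.io/api/token for the full list of token attributes.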


In [3]:
tokens = nlp("The horse ate an apple and the cow an orange tukutuu")

for token in tokens:
    # Skip stop words ("The", "an", "and", "the") so that only content words are printed
    if not token.is_stop:
        print("Token: ", token.text)
        print("  Lemma: ", token.lemma_)
        print("  has vector?: ", token.has_vector)
        print("  L2 norm: ", token.vector_norm)
        print("  out-of-vocabulary?: ", token.is_oov)
        print()

Token:  horse
  Lemma:  horse
  has vector?:  True
  L2 norm:  6.760544
  out-of-vocabulary?:  False

Token:  ate
  Lemma:  eat
  has vector?:  True
  L2 norm:  6.7929115
  out-of-vocabulary?:  False

Token:  apple
  Lemma:  apple
  has vector?:  True
  L2 norm:  7.1346846
  out-of-vocabulary?:  False

Token:  cow
  Lemma:  cow
  has vector?:  True
  L2 norm:  6.239251
  out-of-vocabulary?:  False

Token:  orange
  Lemma:  orange
  has vector?:  True
  L2 norm:  6.5420218
  out-of-vocabulary?:  False

Token:  tukutuu
  Lemma:  tukutuu
  has vector?:  False
  L2 norm:  0.0
  out-of-vocabulary?:  True
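
The `L2 norm` printed for each token is the Euclidean length of its word vector. The following minimal sketch uses made-up 3-dimensional vectors (the values and the helper name `l2_norm` are illustrative and are not taken from the model) to show why a token whose vector is all zeros, as appears to be the case for the out-of-vocabulary token above, reports a norm of 0.0:

```python
import math


def l2_norm(vector):
    """Return the Euclidean (L2) length of a sequence of numbers."""
    return math.sqrt(sum(x * x for x in vector))


# Toy vectors for illustration only; real word vectors have many more dimensions.
known_word = [3.0, 4.0, 0.0]
unknown_word = [0.0, 0.0, 0.0]  # stand-in for a token without a vector

print(l2_norm(known_word))    # 5.0
print(l2_norm(unknown_word))  # 0.0
```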



In [4]:
token1 = tokens[0]  # first token, "The"
print(token1, token1.vector)

The [ 2.7204e-01 -6.2030e-02 -1.8840e-01  2.3225e-02 -1.8158e-02  6.7192e-03
 -1.3877e-01  1.7708e-01  1.7709e-01  2.5882e+00 -3.5179e-01 -1.7312e-01
  4.3285e-01 -1.0708e-01  1.5006e-01 -1.9982e-01 -1.9093e-01  1.1871e+00
 -1.6207e-01 -2.3538e-01  3.6640e-03 -1.9156e-01 -8.5662e-02  3.9199e-02
 -6.6449e-02 -4.2090e-02 -1.9122e-01  1.1679e-02 -3.7138e-01  2.1886e-01
  1.1423e-03  4.3190e-01 -1.4205e-01  3.8059e-01  3.0654e-01  2.0167e-02
 -1.8316e-01 -6.5186e-03 -8.0549e-03 -1.2063e-01  2.7507e-02  2.9839e-01
 -2.2896e-01 -2.2882e-01  1.4671e-01 -7.6301e-02 -1.2680e-01 -6.6651e-03
 -5.2795e-02  1.4258e-01  1.5610e-01  5.5510e-02 -1.6149e-01  9.6290e-02
 -7.6533e-02 -4.9971e-02 -1.0195e-02 -4.7641e-02 -1.6679e-01 -2.3940e-01
  5.0141e-03 -4.9175e-02  1.3338e-02  4.1923e-01 -1.0104e-01  1.5111e-02
 -7.7706e-02 -1.3471e-01  1.1900e-01  1.0802e-01  2.1061e-01 -5.1904e-02
  1.8527e-01  1.7856e-01  4.1293e-02 -1.4385e-02 -8.2567e-02 -3.5483e-02
 -7.6173e-02 -4.5367e-02  8.9281e-02  3.3672e-0

spaCy also lets us compare two objects and calculate how similar they are.

Each Doc, Span and Token comes with a .similarity() method that lets you compare it with another object, and determine the similarity.

For example, the following cell compares the tokens defined in the previous cell:



In [7]:
token1 = tokens[1]   # horse
token2 = tokens[7]   # cow
token3 = tokens[4]   # apple
token4 = tokens[9]   # orange
token5 = tokens[10]  # tukutuu


print("Similarity:", token1, token2, token1.similarity(token2))
print("Similarity:", token1, token3, token1.similarity(token3))
print("Similarity:", token3, token4, token3.similarity(token4))
print()
print("Similarity:", token1, token5, token1.similarity(token5))

Similarity: horse cow 0.6237078
Similarity: horse apple 0.254428
Similarity: apple orange 0.56189173

Similarity: horse tukutuu 0.0
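
According to the spaCy documentation, the default similarity estimate is cosine similarity. The sketch below reimplements that measure in plain Python on made-up 2-dimensional vectors. `cosine_similarity` is a hypothetical helper, not spaCy's own code, and returning 0.0 when either vector has zero length is an assumption chosen to mirror the 0.0 reported for tukutuu above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector has zero length (an assumption made
    here to avoid dividing by zero).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only.
v = [3.0, 4.0]
same_direction = [6.0, 8.0]      # v scaled by 2
opposite_direction = [-3.0, -4.0]
zero_vector = [0.0, 0.0]         # stand-in for a token without a vector

print(cosine_similarity(v, same_direction))      # 1.0
print(cosine_similarity(v, opposite_direction))  # -1.0
print(cosine_similarity(v, zero_vector))         # 0.0
```

The score depends only on the direction of the vectors, not on their length, which is why scaling a vector leaves the similarity unchanged.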




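# Sentence similarity with spaCy

spaCy can also compare two texts directly. By default, Doc.similarity() returns the cosine similarity of the two document vectors, and each document vector defaults to the average of its token vectors. Cosine similarity can in principle range from -1 to 1, but for natural-language texts the score usually falls between 0 and 1: a value near 1 means the averaged vectors point in almost the same direction, and a value near 0 means there is little similarity.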



In [8]:
doc1 = nlp("How can I uninstall the application?")
doc2 = nlp("How can I uninstall the app?")
doc1.similarity(doc2)

0.9724940903861203

In [9]:
doc1 = nlp("How can I uninstall the application?")
doc2 = nlp("How can I remove the software?")
doc1.similarity(doc2)

0.956193218894137

In [11]:
s1="AI is our friend and it has been friendly"
s2="AI and humans have always been friendly"
doc1 = nlp(s1)#1,1,0,1
doc2 = nlp(s2)#1,1,0,1
doc1.similarity(doc2)

0.9112253039739721

In [10]:
doc1 = nlp("How can I uninstall the application?")
doc2 = nlp("How can I find true love?")
doc1.similarity(doc2)

0.851284224996053

In [None]:
doc1 = nlp("How can I uninstall the application?")
doc2 = nlp("How can we open the window?")
doc1.similarity(doc2)

0.8712929400057715

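Because a document vector is an average of word vectors, this similarity ignores word order and does not capture the meaning of a sentence well. The unrelated questions above still score over 0.85, and the two sentences below use the same words to say opposite things yet score almost exactly 1.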


In [12]:
doc1 = nlp("The hotel was very expensive and not good")#1,1,0,1
doc2 = nlp("The hotel was very good and not expensive")#1,1,0,1
doc1.similarity(doc2)

0.9999999802058442
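
One way to see why this score is almost exactly 1: a Doc vector defaults to the average of its token vectors, and an average does not depend on the order of its inputs. The sketch below uses a made-up toy embedding table (`toy_vectors`, with hypothetical values that do not come from en_core_web_lg) and a hypothetical helper `mean_vector` to show that two sentences built from the same words get the same averaged vector.

```python
def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


# Toy 2-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "the": [1.0, 0.0],
    "hotel": [2.0, 3.0],
    "was": [0.0, 1.0],
    "very": [1.0, 1.0],
    "expensive": [-2.0, 4.0],
    "and": [0.0, 0.0],
    "not": [-1.0, -1.0],
    "good": [3.0, 2.0],
}

sentence1 = "the hotel was very expensive and not good".split()
sentence2 = "the hotel was very good and not expensive".split()

doc1_vector = mean_vector([toy_vectors[word] for word in sentence1])
doc2_vector = mean_vector([toy_vectors[word] for word in sentence2])

# Same words in a different order give the same average.
print(doc1_vector == doc2_vector)  # True
```

Any similarity measure computed from these averaged vectors therefore cannot tell the two sentences apart, regardless of how good the individual word vectors are.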