
# Word Embeddings using spaCy

spaCy provides several pretrained models with different levels of word vector support (https://spacy.io/usage/vectors-similarity). The small models (their names end in sm) are fast to load and run, but they do not ship with real word vectors, so their similarity results are less accurate than those of larger models such as en_core_web_lg.

spaCy computes the similarity between two words by comparing their word vectors.
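
By default, spaCy estimates similarity as the cosine similarity between vectors. The following is a minimal sketch of that calculation in plain Python. The three-dimensional vectors are hypothetical values chosen for illustration (real spaCy vectors in the large models have 300 dimensions), and the zero-vector guard is an assumption that mirrors the 0.0 score a word without a vector receives.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # An all-zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, for illustration only.
horse = [0.9, 0.8, 0.1]
cow = [0.8, 0.9, 0.2]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(horse, cow))    # high: the vectors point the same way
print(cosine_similarity(horse, apple))  # noticeably lower
```

Because cosine similarity depends only on the direction of the vectors, scaling a vector does not change the score.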

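First, you will have to download the model. The cell below downloads **en_core_web_lg**, a large English model that includes word vectors. (The vectors-only package en_vectors_web_lg, available for spaCy 2.x, includes over 1 million unique vectors.)
Downloading the model can take a few minutes...

Then, restart the Colab runtime so that the newly installed model can be loaded. To do this, go to Runtime > Restart runtime... in the Colab menu.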




In [0]:
!python -m spacy download en_core_web_lg

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


Now we can use the model to check whether some words have word vectors.

In [0]:
import spacy

nlp = spacy.load("en_core_web_lg")
tokens = nlp("horse cow apple orange iansdiufnas")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

horse True 6.760544 False
cow True 6.239251 False
apple True 7.1346846 False
orange True 6.5420218 False
iansdiufnas False 0.0 True
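
The third column in the output above is vector_norm, the L2 (Euclidean) norm of the token's vector. An out-of-vocabulary token has an all-zero vector, so its norm is 0.0. As a minimal sketch of how that number is obtained, using hypothetical two-dimensional vectors rather than real model vectors:

```python
import math

def l2_norm(vector):
    """Euclidean length of a vector: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in vector))

toy_vector = [3.0, 4.0]      # hypothetical vector, for illustration only
missing_vector = [0.0, 0.0]  # stands in for a word that has no vector

print(l2_norm(toy_vector))      # 5.0
print(l2_norm(missing_vector))  # 0.0
```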


You can also access the word vector through the .vector attribute of each token:

In [0]:
token1 = tokens[0]  # horse
print(token1.vector)

[-1.4949e-01  3.7432e-01 -2.2050e-01 -4.5928e-01  7.8750e-02 -2.5019e-01
 -3.9290e-01 -2.1967e-01 -1.3093e-01  2.1229e+00 -5.1749e-01  6.5582e-01
  1.9008e-01  4.6526e-01 -1.0754e-01  8.4592e-04  1.2311e-01  7.1000e-01
  1.8634e-02 -2.2504e-01 -6.4710e-01 -1.5335e-01 -1.5011e-01  1.3864e-01
 -1.5222e-01  2.8469e-01 -8.5883e-02 -1.8998e-01 -1.6721e-01  5.5677e-02
 -2.1522e-01 -1.8072e-01  1.4413e-01  3.0392e-01  1.1802e-01  2.6905e-01
  2.8804e-01 -5.9657e-01  3.2062e-01 -1.0510e-02 -3.2404e-01  6.6192e-02
 -2.1103e-01 -4.0243e-02  3.4579e-01 -5.3813e-01 -9.6661e-02 -4.6308e-01
  3.6171e-01 -1.1145e-01  2.7529e-02  2.2443e-01 -6.0474e-02  3.8606e-01
  1.1518e-01  6.1889e-01  5.3072e-01  1.7185e-01 -7.4372e-01  2.0626e-01
 -1.3338e-01  7.8106e-01  3.4880e-01  7.9740e-01 -2.0633e-01 -3.3352e-01
  5.6860e-01 -5.1850e-01  1.7523e-01  1.5407e-01 -3.5782e-01 -4.2825e-01
  1.4979e-01 -1.6131e-01  1.7891e-01 -1.7277e-01 -1.0291e-01 -3.1026e-02
 -2.5864e-02 -3.9754e-01 -3.1961e-01 -5.6548e-01  1

spaCy also allows us to compare two objects and calculate how similar they are.

Each Doc, Span and Token comes with a .similarity() method that lets you compare it with another object and determine how similar they are.

For example, the following cell compares the tokens defined in the previous cell:

In [0]:
token1 = tokens[0]  # horse
token2 = tokens[1]  # cow
token3 = tokens[2]  # apple
token4 = tokens[3]  # orange
token5 = tokens[4]  # iansdiufnas (out of vocabulary, no vector)
print(token1, token2, token1.similarity(token2))
print(token1, token3, token1.similarity(token3))
print(token3, token4, token3.similarity(token4))
print(token1, token5, token1.similarity(token5))

horse cow 0.6237078
horse apple 0.254428
apple orange 0.56189173
horse iansdiufnas 0.0




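You can also compare two sentences. The similarity method returns the cosine similarity of the two document vectors: a value close to 1 means the sentences are very similar, and a value close to 0 means they have little in common. Negative values are mathematically possible but uncommon in practice.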

In [0]:
doc1 = nlp("How can I uninstall the application?")
doc2 = nlp("How can I find true love?")
doc1.similarity(doc2)

0.8447064005814853

In the previous example, the result is quite high even though the sentences don’t seem to be related from a human perspective. This is because both sentences start with “How can I” and end with the symbol “?”. By default, the vector of a document is the average of its token vectors, so the shared tokens dominate the comparison.
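
The effect of shared words can be reproduced with a small sketch. It assumes that a sentence vector is the plain average of its word vectors and uses hypothetical three-dimensional word vectors (not values from the model), where the shared words point along one axis and the content words of each sentence point along different axes.

```python
import math

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Hypothetical word vectors, for illustration only.
toy_vectors = {
    "how": [1.0, 0.0, 0.0], "can": [1.0, 0.0, 0.0], "i": [1.0, 0.0, 0.0],
    "uninstall": [0.0, 1.0, 0.0], "application": [0.0, 1.0, 0.0],
    "find": [0.0, 0.0, 1.0], "love": [0.0, 0.0, 1.0],
}

def sentence_vector(words):
    """Average the toy vectors of the given words."""
    return average([toy_vectors[word] for word in words])

full_1 = sentence_vector(["how", "can", "i", "uninstall", "application"])
full_2 = sentence_vector(["how", "can", "i", "find", "love"])
trimmed_1 = sentence_vector(["uninstall", "application"])
trimmed_2 = sentence_vector(["find", "love"])

print(cosine_similarity(full_1, full_2))        # high: shared words dominate
print(cosine_similarity(trimmed_1, trimmed_2))  # 0.0: no overlap remains
```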

To avoid this, keep only the relevant parts of each sentence, for example:

In [0]:
doc1 = nlp("uninstall the application")
doc2 = nlp("find true lovet")
doc1.similarity(doc2)

0.2554321705657904
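
Trimming sentences by hand does not scale, so the filtering step can be automated. The sketch below is one possible approach in plain Python: it assumes a small, hand-written stop-word list (STOP_WORDS is hypothetical and far from complete) and a helper named keep_content_words that is not part of spaCy. With spaCy itself, the token attributes is_stop and is_punct can serve the same purpose.

```python
import string

# Hypothetical, hand-written stop-word list, for illustration only.
STOP_WORDS = {"how", "can", "i", "the", "a", "an", "to", "is"}

def keep_content_words(text, stop_words=STOP_WORDS):
    """Lower-case the text, strip punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in cleaned.split() if word not in stop_words)

print(keep_content_words("How can I uninstall the application?"))  # uninstall application
print(keep_content_words("How can I find true love?"))             # find true love
```

The filtered strings can then be passed to nlp() and compared with .similarity() as in the cell above. Note that this sketch also drops "the", which the manual example keeps.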