<a href="https://colab.research.google.com/github/isegura/seminarioUPM/blob/main/2_SpacyWordEmb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embeddings from spaCy

spaCy provides several word vector models (https://spacy.io/usage/vectors-similarity).

spaCy lets us measure how similar two words are by computing the similarity between their word vectors.

First, download the model. The cell below downloads en_core_web_lg, a large English model that ships with word vectors (the related vectors-only package, en_vectors_web_lg, includes over 1 million unique vectors). The download is roughly 830 MB, so it will take a few minutes.

Then restart the Colab runtime: in the Colab menu, go to Runtime > Restart runtime...




In [None]:
!python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9MB)
     |████████████████████████████████| 827.9MB 1.3MB/s
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-cp37-none-any.whl size=829180945 sha256=4b4947b84b09f44b6528c1e3a792b94eab921510e311ec15d55028ea28142596
  Stored in directory: /tmp/pip-ephem-wheel-cache-l0fjyrqf/wheels/2a/c1/a6/fc7a877b1efca9bc6a089d6f506f16d3868408f9ff89f8dbfc
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


After restarting the runtime, load the model:

In [None]:
import spacy
nlp = spacy.load("en_core_web_lg")

Obtain word vectors:

In [None]:
tokens = nlp("horse cow apple orange tukutuu")

for token in tokens:
    #print(token.text, token.has_vector, token.vector_norm, token.is_oov)
    print("Token: ", token.text)
    print("  has vector?: ", token.has_vector)
    print("  L2 norm: ", token.vector_norm)
    print("  out-of-vocabulary?: ", token.is_oov)
    print()

Token:  horse
  has vector?:  True
  L2 norm:  6.760544
  out-of-vocabulary?:  False

Token:  cow
  has vector?:  True
  L2 norm:  6.239251
  out-of-vocabulary?:  False

Token:  apple
  has vector?:  True
  L2 norm:  7.1346846
  out-of-vocabulary?:  False

Token:  orange
  has vector?:  True
  L2 norm:  6.5420218
  out-of-vocabulary?:  False

Token:  tukutuu
  has vector?:  False
  L2 norm:  0.0
  out-of-vocabulary?:  True
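
The "L2 norm" printed above is the Euclidean length of the word vector. As a minimal sketch in plain Python (the toy vectors below are hypothetical and are not taken from the spaCy model), it can be computed as the square root of the sum of squared components. An all-zero vector has norm 0.0, which is consistent with the value reported for the unknown token `tukutuu`.

```python
import math


def l2_norm(vector):
    """Return the Euclidean (L2) norm: sqrt of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


# Hypothetical toy vectors, invented for illustration only.
known_word = [3.0, 4.0]
unknown_word = [0.0, 0.0]  # assumption: a token without a vector acts like a zero vector

print(l2_norm(known_word))    # 5.0
print(l2_norm(unknown_word))  # 0.0
```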



In [None]:
token1 = tokens[0]
print(token1, token1.vector)

horse [-1.4949e-01  3.7432e-01 -2.2050e-01 -4.5928e-01  7.8750e-02 -2.5019e-01
 -3.9290e-01 -2.1967e-01 -1.3093e-01  2.1229e+00 -5.1749e-01  6.5582e-01
  1.9008e-01  4.6526e-01 -1.0754e-01  8.4592e-04  1.2311e-01  7.1000e-01
  1.8634e-02 -2.2504e-01 -6.4710e-01 -1.5335e-01 -1.5011e-01  1.3864e-01
 -1.5222e-01  2.8469e-01 -8.5883e-02 -1.8998e-01 -1.6721e-01  5.5677e-02
 -2.1522e-01 -1.8072e-01  1.4413e-01  3.0392e-01  1.1802e-01  2.6905e-01
  2.8804e-01 -5.9657e-01  3.2062e-01 -1.0510e-02 -3.2404e-01  6.6192e-02
 -2.1103e-01 -4.0243e-02  3.4579e-01 -5.3813e-01 -9.6661e-02 -4.6308e-01
  3.6171e-01 -1.1145e-01  2.7529e-02  2.2443e-01 -6.0474e-02  3.8606e-01
  1.1518e-01  6.1889e-01  5.3072e-01  1.7185e-01 -7.4372e-01  2.0626e-01
 -1.3338e-01  7.8106e-01  3.4880e-01  7.9740e-01 -2.0633e-01 -3.3352e-01
  5.6860e-01 -5.1850e-01  1.7523e-01  1.5407e-01 -3.5782e-01 -4.2825e-01
  1.4979e-01 -1.6131e-01  1.7891e-01 -1.7277e-01 -1.0291e-01 -3.1026e-02
 -2.5864e-02 -3.9754e-01 -3.1961e-01 -5.6548e

spaCy also allows us to compare two objects and calculate how similar they are.

Each Doc, Span and Token comes with a .similarity() method that lets you compare it with another object, and determine the similarity.

For example, the following cell compares the tokens defined in the previous cell:



In [None]:
token1 = tokens[0]  # horse
token2 = tokens[1]  # cow
token3 = tokens[2]  # apple
token4 = tokens[3]  # orange
token5 = tokens[4]  # tukutuu
print("Similarity:", token1, token2, token1.similarity(token2))
print("Similarity:", token1, token3, token1.similarity(token3))
print("Similarity:", token3, token4, token3.similarity(token4))
print("Similarity:", token1, token5, token1.similarity(token5))

Similarity: horse cow 0.6237078
Similarity: horse apple 0.254428
Similarity: apple orange 0.56189173
Similarity: horse tukutuu 0.0
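
By default, spaCy's `.similarity()` is based on the cosine similarity of the two vectors. The following is a minimal plain-Python sketch of that measure, not spaCy's actual implementation; the function name `cosine_similarity` and the toy vectors are hypothetical. The guard for zero-length vectors is an assumption chosen to mirror the 0.0 reported above for `tukutuu`, which has no vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Cosine is undefined for a zero vector; return 0.0 instead of dividing by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors, invented for illustration only.
horse = [1.0, 2.0, 0.0]
cow = [2.0, 3.0, 1.0]
unknown = [0.0, 0.0, 0.0]

print(cosine_similarity(horse, cow))      # close to 1: similar direction
print(cosine_similarity(horse, unknown))  # 0.0: no vector to compare against
```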




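spaCy can also compare two texts directly. By default, Doc.similarity returns the cosine similarity between the two document vectors, where each document vector is the average of its token vectors. In principle, cosine similarity ranges from -1 to 1. A score close to 1 means the two vectors point in nearly the same direction; it does not guarantee that the texts mean the same thing, as the examples below show.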



In [None]:
doc1 = nlp("How can I uninstall the application?")
doc2 = nlp("How can I uninstall the app?")
doc1.similarity(doc2)

0.9724940903861203

In [None]:
doc1 = nlp("How can I uninstall the application?")
doc2 = nlp("How can I remove the software?")
print(doc1.vector)
print(doc2.vector)

doc1.similarity(doc2)

0.956193218894137

In [None]:
doc1 = nlp("How can I uninstall the application?")
doc2 = nlp("How can I find true love?")
doc1.similarity(doc2)

0.851284224996053

In [None]:
doc1 = nlp("How can I uninstall the application?")
doc2 = nlp("How can we open the window?")
doc1.similarity(doc2)

0.8712929400057715

In [None]:
doc1 = nlp("The hotel was very expensive and not good")
doc2 = nlp("The hotel was very good and not expensive")
doc1.similarity(doc2)

0.9999999802058442
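
The last score is almost exactly 1 even though the two reviews say opposite things. A likely explanation, assuming the document vector is the plain average of the token vectors (spaCy's documented default), is that averaging ignores word order: both sentences contain exactly the same words. The sketch below reproduces the effect in plain Python; the `toy_vectors` table and helper names are hypothetical and unrelated to the real model.

```python
import math

# Hypothetical 2-dimensional word vectors, invented for illustration only.
toy_vectors = {
    "the": [0.1, 0.0], "hotel": [0.9, 0.2], "was": [0.0, 0.1],
    "very": [0.2, 0.3], "expensive": [0.7, -0.5], "and": [0.0, 0.0],
    "not": [-0.3, 0.4], "good": [0.5, 0.8],
}


def doc_vector(text):
    """Average the toy vectors of all whitespace-separated words in text."""
    words = text.lower().split()
    dims = len(next(iter(toy_vectors.values())))
    return [sum(toy_vectors[w][i] for w in words) / len(words) for i in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


v1 = doc_vector("The hotel was very expensive and not good")
v2 = doc_vector("The hotel was very good and not expensive")

# Same words in a different order give the same average, up to rounding.
print(cosine_similarity(v1, v2))
```

Because of this, averaged word vectors are a reasonable signal for topical overlap but a poor one for negation or sentiment.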