In [None]:
!pip install spacy

Download the large English pipeline (`en_core_web_lg`), which ships with word vectors.

In [None]:
!python -m spacy download en_core_web_lg

Consider an example borrowed from the spaCy documentation.

In [3]:
import spacy

nlp = spacy.load("en_core_web_lg")
tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print(f"text = {token.text}\t present = {token.has_vector} \tNORM = {token.vector_norm:.2f}\tOOV = {token.is_oov}")

text = dog	 present = True 	NORM = 7.03	OOV = False
text = cat	 present = True 	NORM = 6.68	OOV = False
text = banana	 present = True 	NORM = 6.70	OOV = False
text = afskfsd	 present = False 	NORM = 0.00	OOV = True


What about similarity? What happens if we reorder the words? Refer to [the spaCy docs on vectors and similarity](https://spacy.io/usage/linguistic-features#vectors-similarity).

In [4]:
doc1 = nlp("Cats eat mice.")
doc2 = nlp("Mice eat cats.")

# TODO: complete the code to estimate the similarity between the sentences
# refer to https://spacy.io/usage/linguistic-features#vectors-similarity
# print the sentence similarity
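
Before running the exercise, it may help to see why word order is unlikely to matter here. According to the spaCy documentation, `Doc.vector` defaults to the average of the token vectors, and `Doc.similarity` compares documents by cosine similarity. The sketch below mimics that with hypothetical 3-dimensional toy vectors (the names `toy_vectors`, `average_vector`, and `cosine_similarity` are illustrative, not spaCy APIs). In the notebook itself the call would be `doc1.similarity(doc2)`; capitalization differences such as "Cats" vs. "cats" may keep the real score from being exactly 1.

```python
import math

# Hypothetical 3-d word vectors, for illustration only
# (the real en_core_web_lg vectors have far more dimensions).
toy_vectors = {
    "cats": [0.9, 0.1, 0.3],
    "eat": [0.2, 0.8, 0.5],
    "mice": [0.7, 0.2, 0.6],
}


def average_vector(words):
    """Average the word vectors; the result cannot depend on word order."""
    dims = len(next(iter(toy_vectors.values())))
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(toy_vectors[word]):
            total[i] += value
    return [value / len(words) for value in total]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vec1 = average_vector(["cats", "eat", "mice"])
vec2 = average_vector(["mice", "eat", "cats"])
similarity = cosine_similarity(vec1, vec2)
print(f"{similarity:.4f}")

# With the notebook's objects, the equivalent spaCy call would be:
# print(doc1.similarity(doc2))
```

Because the average discards order, the two toy sentences get the same vector, which is the limitation that motivates sentence-level embeddings.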

OK, let's move on to a more advanced case: sentence-level embeddings. For this we will use a dedicated library.

In [5]:
# To install PyTorch in CPU-only mode, refer to
# https://pytorch.org/get-started/locally/
# for details. Either of the following should work:
# !conda install pytorch torchvision torchaudio cpuonly -c pytorch
# !conda install pytorch cpuonly -c pytorch

We will base sentence embeddings on `sentence_transformers` from [here](https://github.com/UKPLab/sentence-transformers).

In [None]:
!pip install transformers
!pip install -U sentence-transformers

In [7]:
from sentence_transformers import SentenceTransformer
# the model is downloaded the first time it is loaded
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

[[-0.17621411  0.12060118 -0.2936239  ...  0.32965153  0.06130076
  -0.32493356]
 [ 0.32208768 -0.00123896  0.179374   ... -0.08103761  0.27076894
   0.11700248]
 [ 0.5897936  -0.23598365 -0.25411704 ...  0.14036159  1.0559162
   0.5301812 ]]


And now, the test with mice! Refer to [these docs](https://github.com/UKPLab/sentence-transformers) to obtain embeddings, and use `scipy.spatial.distance.cosine` to estimate the distance between them.

In [8]:
from scipy.spatial.distance import cosine as cos
sents = ["Cats eat mice.", "Mice eat cats."]
emb = model.encode(sents)

# TODO: complete the code to compute the embedding similarities

0.017541348934173584
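
Note that `scipy.spatial.distance.cosine` returns a distance, that is, one minus the cosine similarity. If the value printed above is that distance, the two sentences have a similarity of roughly 0.98: close, but no longer identical. The following is a minimal pure-Python sketch of the same quantity, using hypothetical 2-dimensional vectors in place of `emb[0]` and `emb[1]`; the helper name `cosine_distance` is illustrative only.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity, the quantity scipy's `cosine` computes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical stand-ins for the two sentence embeddings.
u = [1.0, 0.0]
v = [1.0, 1.0]

distance = cosine_distance(u, v)                         # vectors 45 degrees apart
same = cosine_distance(u, u)                             # identical direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])     # perpendicular vectors

print(distance, same, orthogonal)

# With the notebook's variables, the scipy call would be:
# print(cos(emb[0], emb[1]))
```

A distance of 0 means the vectors point in the same direction, and a distance of 1 means they are orthogonal, so smaller values indicate more similar sentences.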