<a href="https://colab.research.google.com/github/guillermomore/guillermomore/blob/main/Word_prediction_with_SpaCy_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

___
# Word prediction using spaCy

This notebook uses spaCy word embeddings to predict a word from an analogy between other words, ranking candidate words by cosine similarity.

We'll use spaCy's large English model, which ships with word vectors:

> [**en_core_web_lg**](https://spacy.io/models/en#en_core_web_lg) (812MB) Vectors: 685k keys, 685k unique vectors (300 dimensions)


In [1]:
!python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
     |████████████████████████████████| 827.9 MB 1.2 MB/s
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-py3-none-any.whl size=829180942 sha256=75c907d6b726a47636e83b8e04be79481e430b28475edee90325f70a0133d5d1
  Stored in directory: /tmp/pip-ephem-wheel-cache-ew4_i5a2/wheels/11/95/ba/2c36cc368c0bd339b44a791c2c1881a1fb714b78c29a4cb8f5
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


After downloading and installing, the Colab runtime must be restarted (Runtime > Restart runtime, or Ctrl+M followed by `.`).

In [1]:
import spacy
nlp = spacy.load('en_core_web_lg')

# Vector operations

New vectors are computed by addition and subtraction, following the famous Word2vec example:

<pre>"king" - "man" + "woman" = "queen"</pre>
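
To make the arithmetic concrete before running it on real embeddings, here is a minimal sketch on hypothetical 2-D vectors. The `toy` values are invented for illustration (axis 0 loosely stands for "royalty", axis 1 for "gender") and are not taken from `en_core_web_lg`; the cosine similarity is written out with NumPy rather than SciPy.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between x and y: dot(x, y) / (|x| * |y|)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical 2-D "embeddings", invented for this example only.
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

# Same arithmetic as the Word2vec example: king - man + woman
target = toy["king"] - toy["man"] + toy["woman"]

# Rank every toy word by its cosine similarity to the target vector
ranked = sorted(toy, key=lambda w: cosine_similarity(target, toy[w]), reverse=True)
best = ranked[0]
print(best)
```

With these toy values the target vector coincides with the one assigned to "queen", so it ranks first. Real 300-dimensional embeddings only approximate this behaviour.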


In [2]:
from scipy import spatial

# Cosine similarity: 1 minus the cosine distance returned by SciPy
def cosine_similarity(x, y):
    return 1 - spatial.distance.cosine(x, y)

# Word vectors
king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

new_vector = king - man + woman
computed_similarities = []  # List of (lexeme, similarity) tuples

# Compare the new vector with every entry in the vocabulary
for word in nlp.vocab:
    # Keep only lowercase, alphabetic entries that have an embedding in the model
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        # Each tuple holds the lexeme first and its similarity second
        computed_similarities.append((word, similarity))

# Sort by similarity, descending
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

print([f"{w[0].text}:{w[1]}" for w in computed_similarities[:10]])

['king:0.8024259805679321', 'queen:0.7880843877792358', 'prince:0.6401076912879944', 'kings:0.6208544373512268', 'princess:0.6125636100769043', 'royal:0.5800970792770386', 'throne:0.5787012577056885', 'queens:0.5743793845176697', 'monarch:0.563362181186676', 'kingdom:0.5520980954170227']
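
In the output above, "king" itself ranks first, because the computed vector stays close to one of its own inputs. A common convention in analogy tasks is to drop the query words before picking the answer. The sketch below assumes that convention; `drop_query_words` is a hypothetical helper, not part of spaCy, and it works on plain `(text, similarity)` pairs (the first five results above, rounded to four decimals). With the notebook's own list you would compare against `word.text` instead.

```python
# First five (word, similarity) pairs from the output above, rounded.
results = [
    ("king", 0.8024),
    ("queen", 0.7881),
    ("prince", 0.6401),
    ("kings", 0.6209),
    ("princess", 0.6126),
]

def drop_query_words(pairs, query_words):
    """Return the pairs whose word is not a query word, preserving order."""
    excluded = {w.lower() for w in query_words}
    return [(word, score) for word, score in pairs if word.lower() not in excluded]

filtered = drop_query_words(results, ["king", "man", "woman"])
top_word = filtered[0][0]
print(top_word)
```

After filtering, "queen" becomes the top-ranked candidate, matching the expected answer of the analogy.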
