# **Word vectors**


In the previous exercise we observed that colors that we think of as similar are 'closer' to each other in RGB vector space. Is it possible to create a vector space for all English words that has this same 'closer in space is closer in meaning' property?

The answer is yes! Luckily, you don't need to create those vectors from scratch. Many researchers have made downloadable databases of pre-trained vectors. One such project is [Stanford's Global Vectors for Word Representation (GloVe)](https://nlp.stanford.edu/projects/glove/). 

These $300$-dimensional vectors are included with $\texttt{spaCy}$'s `en_core_web_lg` model, and they're the vectors we'll be using in this exercise.

![cosine similarity: picture](https://d33wubrfki0l68.cloudfront.net/d2742976a92aa4d6c39f19c747ec5f56ed1cec30/3803f/images/guide-to-word-vectors-with-gensim-and-keras_files/word2vec-king-queen-vectors.png)

In [None]:
# The following will download the language model.
# Restart the runtime (Runtime -> Restart runtime) after running this cell
# (and don't run it a second time).
!python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
     |████████████████████████████████| 827.9 MB 1.2 MB/s
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-py3-none-any.whl size=829180942 sha256=0986e2ab9c5ff381074648561420e11e12ced6b3c3d8efa6bd6d20743fe6a08d
  Stored in directory: /tmp/pip-ephem-wheel-cache-00uuv5tg/wheels/11/95/ba/2c36cc368c0bd339b44a791c2c1881a1fb714b78c29a4cb8f5
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


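Let's load the model now. Note that the cells below load the small model `en_core_web_sm`, which has no pre-trained word vectors and returns $96$-dimensional context tensors instead. To work with the $300$-dimensional vectors, restart the runtime and run the commented-out `spacy.load('en_core_web_lg')` cell.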

In [None]:
import en_core_web_sm

In [None]:
nlp = en_core_web_sm.load()

In [None]:
# import spacy

# nlp = spacy.load('en_core_web_lg')

OSError: ignored

## **Word vectors: the first glance**

You can see the vector of any word in $\texttt{spaCy}$'s vocabulary using the $\texttt{vector}$ attribute:

In [None]:
# Vector length: 300 with en_core_web_lg, 96 with the en_core_web_sm model loaded above
len(nlp('dog').vector)

96

In [None]:
nlp('dog').vector

array([ 3.0170894 , -1.5468277 ,  1.4642837 , -0.45664647,  2.416998  ,
       -0.82837516,  0.773814  ,  0.7099814 ,  0.73783636,  1.9741133 ,
        3.7342863 ,  2.0679865 ,  3.8942056 , -0.6749698 ,  0.37507713,
       -2.0970044 , -0.6250715 ,  2.6508548 , -1.5724103 , -4.0325656 ,
       -1.4097672 ,  0.39648557, -0.70805675, -1.0381888 ,  1.6989393 ,
       -1.0706389 ,  0.66801304, -3.9096825 ,  2.607851  , -0.7741172 ,
        3.8687487 , -0.28618616,  0.40867335,  2.0196295 , -0.8187747 ,
       -1.3746587 ,  1.1600451 , -0.06880021, -1.3988796 ,  0.5209464 ,
        4.9956036 ,  2.896077  ,  0.08491665, -3.1742032 ,  0.00753534,
        1.8921385 , -0.12929648,  0.30110502, -0.8420582 , -0.76468706,
        0.44588238, -1.4486729 , -2.1735194 , -0.56612396, -1.6122862 ,
        0.677354  ,  3.816813  , -1.1397399 ,  0.25616455, -1.4188657 ,
        0.62450516,  0.42642492, -1.1126095 , -1.6981561 ,  0.53187704,
       -3.6243727 ,  1.3320243 , -0.53186584, -4.1490126 ,  0.51

## **Cosine similarity**

**Cosine similarity** is a common way of assessing similarity between words in NLP. It is essentially defined as the cosine of the angle between the vectors representing the words of interest.

Recall that the angle $\phi$ between two non-zero vectors $u$ and $v$ satisfies the following, where $(u, v)$ denotes the dot product of $u$ and $v$:

$\cos(\phi) = \frac{(u, v)}{\|u\| \cdot \|v\|}$

![](https://miro.medium.com/max/1394/1*_Bf9goaALQrS_0XkBozEiQ.png)
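
As a quick worked example of the formula, take $u = (1, 0)$ and $v = (1, 1)$ (vectors chosen purely for illustration, not taken from the model):

$(u, v) = 1 \cdot 1 + 0 \cdot 1 = 1, \quad \|u\| = \sqrt{1^2 + 0^2} = 1, \quad \|v\| = \sqrt{1^2 + 1^2} = \sqrt{2}$

$\cos(\phi) = \frac{1}{1 \cdot \sqrt{2}} \approx 0.707, \quad \phi = 45^\circ$

Vectors pointing in the same direction give $\cos(\phi) = 1$, orthogonal vectors give $0$, and opposite vectors give $-1$.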



Define a function computing cosine similarity between two vectors.

In [None]:
import numpy as np

def cosine(v1, v2):
  return np.dot(v1, v2) / (np.sqrt(np.sum(v1**2)) * np.sqrt(np.sum(v2**2)))

In [None]:
from numpy.linalg import norm

def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return np.dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0
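
Before trying real words, it can help to check the behaviour on vectors whose answer is known in advance. The sketch below is a dependency-free variant that works on plain Python lists; the name `cosine_plain` and the test vectors are assumptions made for illustration, not part of the exercise.

```python
import math

def cosine_plain(v1, v2):
    """Cosine similarity for plain Python sequences (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    # Same zero-vector guard as in the NumPy version: avoid dividing by zero
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

same_direction = cosine_plain([1, 2, 3], [2, 4, 6])  # parallel vectors: close to 1
orthogonal = cosine_plain([1, 0], [0, 1])            # perpendicular vectors: 0
opposite = cosine_plain([1, 2], [-1, -2])            # opposite vectors: close to -1
zero_vector = cosine_plain([0, 0], [1, 2])           # guarded case: 0.0

print(same_direction, orthogonal, opposite, zero_vector)
```

If the NumPy-based `cosine` gives noticeably different values on the same inputs, that points to a bug in one of the two versions.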

Test your function by computing similarities of some random pairs of words, e.g. $dog$ and $puppy$ vs. $dog$ and $kitten$. 

In [None]:
dog = nlp('dog').vector
kitten = nlp('kitten').vector

cosine(dog, kitten)

0.35801542

## **Loading the text**

Let's load the full text of *Alice in Wonderland*. It will serve us as a corpus of English words.

In [None]:
import requests

# Alice in Wonderland
response = requests.get('https://www.gutenberg.org/files/11/11-0.txt')

# If you prefer Dracula, load this instead:
#response = requests.get('https://www.gutenberg.org/cache/epub/345/pg345.txt')

# Extracting separate words from the text
doc = nlp(response.text)
tokens = list(set([w.text for w in doc if w.is_alpha]))
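
Because `set` compares strings exactly, differently cased spellings such as 'Tarts' and 'tarts' stay separate entries in $\texttt{tokens}$. If that is not wanted, one option is to lowercase before deduplicating. The sketch below shows the idea on a toy list; `words` and `unique_words` are illustrative names, and Python's `str.isalpha` stands in for the token-level `is_alpha` check.

```python
# Toy stand-in for the token texts extracted from the book
words = ['Tarts', 'tarts', 'Please', 'PLEASE', 'please', 'Alice', '42', "don't"]

# Keep purely alphabetic words and fold case before deduplicating
unique_words = sorted({w.lower() for w in words if w.isalpha()})

print(unique_words)
```

Sorting is optional; it only makes the result reproducible, since sets have no fixed order.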

Check out the content of $\texttt{tokens}$ now.

In [None]:
tokens

['INDIRECT',
 'leaving',
 'legal',
 'addresses',
 'paused',
 'hot',
 'theirs',
 'severity',
 'would',
 'ordered',
 'doubled',
 'performed',
 'affectionately',
 'word',
 'DAMAGE',
 'choose',
 'tails',
 'electronically',
 'pop',
 'acceptance',
 'teeth',
 'wander',
 'rumbling',
 'engine',
 'Lacie',
 'all',
 'kiss',
 'pencils',
 'trouble',
 'change',
 'errors',
 'main',
 'fit',
 'outdated',
 'thrown',
 'night',
 'ridges',
 'poker',
 'chains',
 'waste',
 'explained',
 'prohibition',
 'butter',
 'current',
 'hear',
 'passion',
 'Because',
 'screaming',
 'consultation',
 'nervous',
 'flowers',
 'warranties',
 'best',
 'remembered',
 'helpless',
 'French',
 'empty',
 'AND',
 'turtles',
 'chorus',
 'fancying',
 'Seven',
 'order',
 'shutting',
 'prizes',
 'noise',
 'wants',
 'money',
 'kettle',
 'humbly',
 'future',
 'consented',
 'mile',
 'mournfully',
 'Race',
 'your',
 'enjoy',
 'made',
 'fixed',
 'perfectly',
 'venture',
 'dainties',
 'fits',
 'noises',
 'terribly',
 'an',
 'Use',
 'adding',

Define a function that takes a word and lists the $n$ most similar words in our corpus.

In [None]:
def spacy_closest(tokens, new_vec, n=10):
  # Score every corpus word by its cosine similarity to new_vec
  d = dict()
  for w in tokens:
    vec = nlp(w.lower()).vector
    d[w] = cosine(vec, new_vec)
  # Sort by similarity, highest first, and keep the top n entries
  ranked = sorted(d.items(), key=lambda item: item[1], reverse=True)
  return dict(ranked[:n])

Try to find words similar to some random words, e.g. $good$.

In [None]:
cl = spacy_closest(tokens, nlp('good').vector)

In [None]:
print(cl)

{'exists': 0.017897528, 'turtles': 0.041663148, 'seems': 0.05380389, 'eats': 0.074532226, 'trusts': 0.07759, 'them': 0.078195356, 'says': 0.08203166, 'tells': 0.08383591, 'Tarts': 0.09429473, 'tarts': 0.09429473}


You can also get creative and search for combinations of words. For example, what is similar to $king - man + woman$? 

In [None]:
king = nlp('king').vector
man = nlp('man').vector
woman = nlp('woman').vector

v = king - man + woman

In [None]:
c1 = spacy_closest(tokens, v)

In [None]:
print(c1)

{'Ah': -0.18519199, 'oh': -0.18424168, 'Oh': -0.18424168, 'are': -0.1717206, 'Please': -0.11054639, 'please': -0.11054639, 'PLEASE': -0.11054639, 'wow': -0.10372446, 'least': -0.1019255, 'HAVE': -0.08215847}
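
The idea behind this kind of word arithmetic is easiest to see on hand-made vectors. The sketch below uses made-up two-dimensional vectors whose axes loosely mean "royalty" and "masculinity"; the values, the `toy` dictionary and the `cosine_plain` helper are assumptions for illustration only and have nothing to do with the real model.

```python
import math

def cosine_plain(v1, v2):
    """Cosine similarity for plain Python sequences (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Made-up vectors: [royalty, masculinity]
toy = {
    'king':  [0.9, 0.9],
    'queen': [0.9, 0.1],
    'man':   [0.1, 0.9],
    'woman': [0.1, 0.1],
}

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(toy['king'], toy['man'], toy['woman'])]

# Rank all toy words by similarity to the target, most similar first
ranked = sorted(toy.items(),
                key=lambda item: cosine_plain(item[1], target),
                reverse=True)

for word, vec in ranked:
    print(word, round(cosine_plain(vec, target), 3))
```

Subtracting `man` removes the "masculinity" component while keeping "royalty", so the result points in the direction of `queen`. Real embeddings are far noisier, and results depend heavily on the model and the corpus vocabulary.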


## **Sentence vectors**

We can also construct a vector representation for a whole sentence. For example, we can define it as the *average* of the vectors representing its words.

Let's take the sentence *My favorite food is strawberry ice cream* and construct its vector representation.

In [None]:
sent = nlp('My favorite food is strawberry ice cream.')

# Your code here
# sentv ...
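
As a hint rather than a full solution, averaging means taking the mean of each component across all word vectors. The sketch below shows this on made-up three-dimensional vectors; `word_vectors` and `mean_vector` are hypothetical names. With $\texttt{spaCy}$, the same idea would apply to the `vector` of each token in `sent`, and to my understanding `sent.vector` already defaults to such an average.

```python
# Toy 3-D word vectors (made-up values standing in for real embeddings)
word_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
]

def mean_vector(vectors):
    """Component-wise average of equally long vectors."""
    # zip(*vectors) groups the first components, the second components, etc.
    return [sum(column) / len(column) for column in zip(*vectors)]

sentence_vector = mean_vector(word_vectors)
print(sentence_vector)
```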

Let's also extract sentences (as opposed to individual words) from our corpus:

In [None]:
sents = list(doc.sents)

In [None]:
sents

Define a function that takes a sentence vector and lists the $n$ most similar sentences from our corpus.

In [None]:
def spacy_closest_sent(sentences, input_vec, n=10):
  # Your code here
  pass
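
One possible shape for such a function is sketched below on toy data. It assumes each sentence has already been paired with its vector; `closest_sentences`, `cosine_plain` and the toy sentences are hypothetical and only illustrate the ranking step. `heapq.nlargest` picks the top $n$ items without sorting the whole list.

```python
import heapq
import math

def cosine_plain(v1, v2):
    """Cosine similarity for plain Python sequences (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def closest_sentences(sentence_vectors, input_vec, n=10):
    """Return the n (sentence, vector) pairs most similar to input_vec."""
    return heapq.nlargest(
        n,
        sentence_vectors,
        key=lambda pair: cosine_plain(pair[1], input_vec),
    )

# Made-up sentences with made-up 2-D vectors
toy_sentences = [
    ('I like ice cream.', [0.9, 0.1]),
    ('The Queen shouted.', [0.1, 0.9]),
    ('Dessert is tasty.', [0.8, 0.3]),
]

top_two = [s for s, _ in closest_sentences(toy_sentences, [1.0, 0.0], n=2)]
print(top_two)
```

In the exercise itself, the vectors would come from the sentences in `sents`, and the input would be `sentv`.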

Let's try it out!

In [None]:
for s in spacy_closest_sent(sents, sentv, n=10):
  print(s)
  print('\n---')

## **References**

This notebook is inspired by a [tutorial by Allison Parrish](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469).