# Semantics and Word Vectors

We are going to use the **word vectors** (embeddings) that ship with spaCy. They are only included in the medium and large language models, which need to be installed explicitly:

```bash
python -m spacy download en_core_web_md # spacy.load('en_core_web_md')
python -m spacy download en_core_web_lg # spacy.load('en_core_web_lg')
```

When we load any of these models, each token has its vector representation. The concept of representing words with vectors was popularized by Mikolov et al. in 2013 (Google) -- see `../literature/Mikolov_Word2Vec_2013.pdf`.

The idea is that we get an `N`-dimensional vector representation of each word in the vocabulary, such that:
- **Vectors that are close correspond to semantically related words**, and associations can be inferred: `man` is to `boy` as `woman` is to `girl`.
- **Vector operations are reflected in the semantic space**: `vector(queen) ~ vector(king) - vector(man) + vector(woman)`.

To generate those word embeddings, a model is trained on large text corpora: it learns from the words that appear close to each other and maps each word to a numerical vector. As I understand it, words are initially represented as one-hot encoded vectors of dimension `M`, where `M` is the size of the vocabulary:

`[0, 1, 0, ..., 0] (M: vocabulary size) -> [0.2, 0.5, ..., 0.1] (N: latent word vector space)`
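
The mapping above can be sketched with NumPy. This is only an illustration under stated assumptions: the sizes `M = 5` and `N = 3` and the random `embedding_matrix` are made up for the example, whereas in a trained model the matrix entries are learned from text.

```python
import numpy as np

# Toy sizes, chosen only for illustration
M = 5  # vocabulary size
N = 3  # dimension of the latent word vector space

rng = np.random.default_rng(0)
# Hypothetical embedding matrix: one row of N values per vocabulary word.
# Random here; a real model would learn these values.
embedding_matrix = rng.normal(size=(M, N))

# One-hot vector for the word with index 1
word_index = 1
one_hot = np.zeros(M)
one_hot[word_index] = 1.0

# Multiplying the one-hot vector by the matrix selects a single row
dense_vector = one_hot @ embedding_matrix

print(dense_vector.shape)
print(np.allclose(dense_vector, embedding_matrix[word_index]))
```

In other words, the product with a one-hot vector is equivalent to a row lookup, which is why embeddings are usually stored as a table indexed by word.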

A common metric to measure similarity between word vectors is the **cosine similarity**: the cosine of the angle formed by the two word vectors.
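
As a minimal sketch of that metric, the cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper name `cosine_sim` and the two-dimensional inputs are hypothetical, chosen only to keep the example small.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same direction, different length: similarity close to 1
same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])
# Opposite direction: similarity close to -1
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Because the result depends only on the angle, two vectors of very different length can still have a similarity of 1.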

In spaCy, a word vector has dimension `N = 300`; however, not all language models have word vectors!
- `en_core_web_sm` (35MB): no word vector representations
- `en_core_web_md` (116MB): 685k keys, 20k unique vectors (300 dimensions)
- `en_core_web_lg` (812MB): 685k keys, 685k unique vectors (300 dimensions)

Overview of contents:

1. Word Vectors: Token & Doc Vectors
2. Vector Similarity: Cosine Similarity
3. Vector Norms
4. Vector Arithmetic

*Disclaimer: I made this notebook while following the Udemy course [NLP - Natural Language Processing with Python](https://www.udemy.com/course/nlp-natural-language-processing-with-python/) by José Marcial Portilla. The original course notebooks and materials were provided with a download link; I haven't found a repository to fork from.*

## 1. Word Vectors: Token & Doc Vectors

In [116]:
import numpy as np
import spacy
# Word vectors are only available in the medium and large models; loading them takes longer
#nlp = spacy.load('en_core_web_md')
nlp = spacy.load('en_core_web_lg')

In [117]:
nlp(u'lion').vector.shape

(300,)

In [118]:
nlp(u'lion').vector

array([ 1.8963e-01, -4.0309e-01,  3.5350e-01, -4.7907e-01, -4.3311e-01,
        2.3857e-01,  2.6962e-01,  6.4332e-02,  3.0767e-01,  1.3712e+00,
       -3.7582e-01, -2.2713e-01, -3.5657e-01, -2.5355e-01,  1.7543e-02,
        3.3962e-01,  7.4723e-02,  5.1226e-01, -3.9759e-01,  5.1333e-03,
       -3.0929e-01,  4.8911e-02, -1.8610e-01, -4.1702e-01, -8.1639e-01,
       -1.6908e-01, -2.6246e-01, -1.5983e-02,  1.2479e-01, -3.7276e-02,
       -5.7125e-01, -1.6296e-01,  1.2376e-01, -5.5464e-02,  1.3244e-01,
        2.7519e-02,  1.2592e-01, -3.2722e-01, -4.9165e-01, -3.5559e-01,
       -3.0630e-01,  6.1185e-02, -1.6932e-01, -6.2405e-02,  6.5763e-01,
       -2.7925e-01, -3.0450e-03, -2.2400e-02, -2.8015e-01, -2.1975e-01,
       -4.3188e-01,  3.9864e-02, -2.2102e-01, -4.2693e-02,  5.2748e-02,
        2.8726e-01,  1.2315e-01, -2.8662e-02,  7.8294e-02,  4.6754e-01,
       -2.4589e-01, -1.1064e-01,  7.2250e-02, -9.4980e-02, -2.7548e-01,
       -5.4097e-01,  1.2823e-01, -8.2408e-02,  3.1035e-01, -6.33

In [119]:
# Number of unique vectors loaded in the model
len(nlp.vocab.vectors)

684830

In [120]:
# Doc and Span objects themselves have vectors,
# derived from the averages of individual token vectors. 
# This makes it possible to compare similarities between whole documents.
doc1 = nlp(u'The quick brown fox jumped over the lazy dogs.')
doc1.vector.shape

(300,)

In [121]:
v1 = doc1.vector

In [122]:
# According to the documentation, the vector of a Doc is the average of its
# token vectors, so the position of each word should not matter.
# The vectors of the two sentences still differ slightly (distance of about 1e-7, see below).
# A difference that small is consistent with floating-point rounding
# rather than with any positional encoding.
#doc = nlp(u'The quick brown jumped fox over the lazy dogs.')
doc2 = nlp(u'The brown quick jumped fox over the lazy dogs.')

In [123]:
v2 = doc2.vector

In [124]:
d1=v1-v1
d2=v1-v2

In [125]:
np.sqrt(sum(d1*d1))

0.0

In [126]:
np.sqrt(sum(d2*d2))

8.607210729790503e-08
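
To check whether rounding alone could explain a difference of that size, here is a sketch that averages the same set of `float32` vectors in two different orders. The random vectors are stand-ins for token vectors, not real embeddings, and the ten rows only mirror the ten tokens of the sentence above.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-ins for ten 300-dimensional token vectors (random, not real embeddings)
token_vectors = rng.normal(size=(10, 300)).astype(np.float32)

# Same vectors with tokens 1<->2 and 3<->4 swapped, as in the reordered sentence
reordered = token_vectors[[0, 2, 1, 4, 3, 5, 6, 7, 8, 9]]

# Averaging is mathematically independent of the order...
mean_original = token_vectors.mean(axis=0)
mean_reordered = reordered.mean(axis=0)

# ...but float32 additions are rounded, so the results may differ in the last bits
difference = float(np.linalg.norm(mean_original - mean_reordered))
print(difference)
```

The printed distance is either zero or tiny compared with the size of the vectors, which suggests that a value around `1e-7` does not require any positional encoding to explain it.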

## 2. Vector Similarity: Cosine Similarity

In [127]:
# Create a three-token Doc object
tokens = nlp(u'lion cat pet')

# Iterate through token combinations
# Note: token1.similarity(token2) == token2.similarity(token1)
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

lion lion 1.0
lion cat 0.5265437
lion pet 0.39923766
cat lion 0.5265437
cat cat 1.0
cat pet 0.7505457
pet lion 0.39923766
pet cat 0.7505457
pet pet 1.0


In [128]:
# Opposites are not necessarily dissimilar: antonyms tend to appear in similar contexts
tokens = nlp(u'like love hate')

# Iterate through token combinations:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

like like 1.0
like love 0.657904
like hate 0.65746516
love like 0.657904
love love 1.0
love hate 0.63930994
hate like 0.65746516
hate love 0.63930994
hate hate 1.0


## 3. Vector Norms

In [129]:
# Number of unique vectors loaded in the model
len(nlp.vocab.vectors)

684830

Note that common words, including names, usually have vector representations; however, we can come up with words that have no vector.

In [130]:
tokens = nlp(u'dog cat nargle')
# token.has_vector: True/False
# token.vector_norm: L2 norm or Euclidean length of the vector
# token.is_oov: is out-of-vocabulary, True/False (maybe it is in vocabulary, but has no vector)
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
nargle False 0.0 True
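
The L2 norm mentioned in the comments above is the square root of the sum of squared components. The sketch below uses a hypothetical helper `l2_norm` and made-up vectors; the all-zero vector stands in for a word without a vector, matching the `0.0` norm printed for `nargle`.

```python
import numpy as np

def l2_norm(vector):
    """Euclidean length: square root of the sum of squared components."""
    return float(np.sqrt(np.sum(np.square(vector))))

v = np.array([3.0, 4.0])
zero = np.zeros(300)  # stand-in for a word that has no vector

norm_v = l2_norm(v)
norm_zero = l2_norm(zero)

print(norm_v)     # 5.0
print(norm_zero)  # 0.0
# The helper agrees with NumPy's built-in norm
print(np.isclose(norm_v, np.linalg.norm(v)))  # True
```

A zero norm also means the cosine similarity is undefined for that vector, since the formula divides by the norm.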


## 4. Vector Arithmetic

With word vector embeddings we can perform arithmetic that is reflected in meaningful semantic operations:

`vector(queen) ~ vector(king) - vector(man) + vector(woman)`

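However, in my case at least, it does not seem to work that well. A possible explanation (an assumption, not verified here) is that iterating over `nlp.vocab` only visits words already seen in this session rather than the full vector table: the top results include `lion`, `fox` and `brown`, which all appear in earlier cells.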

In [139]:
from scipy import spatial

# Our custom similarity function
cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

# Now we find the closest vector in the vocabulary to the result of
# "king" - "man" + "woman"
new_vector = king - man + woman
computed_similarities = []

# Visit all words/tokens in the vocabulary
# and if they have a valid vector, compute the similarity
for word in nlp.vocab:
    # Keep only lowercase, alphabetic words that have a vector
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word, similarity))

# Sort all words by similarity, descending
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

# Unfortunately, it does not seem to work that well...
print([w[0].text for w in computed_similarities[:10]])

['king', 'monarch', 'woman', 'female', 'she', 'lion', 'male', 'who', 'fox', 'brown']
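
For comparison, here is a sketch of the same search on a tiny, hand-made vocabulary. Everything in it is an assumption for illustration: the three-dimensional `toy_vectors` are invented so that the analogy holds by construction, and `most_similar` is a hypothetical helper, not a spaCy function. It also excludes the input words from the candidates, as is commonly done when evaluating analogies, because the input words themselves tend to rank near the top.

```python
import numpy as np

# Hand-made toy vectors (not real embeddings): [royalty, gender, person]
toy_vectors = {
    "king":    [0.9,  0.8,  0.5],
    "queen":   [0.9, -0.8,  0.5],
    "man":     [0.1,  0.8,  0.5],
    "woman":   [0.1, -0.8,  0.5],
    "monarch": [0.9,  0.0,  0.5],
    "apple":   [0.0,  0.0, -0.9],
}

def most_similar(target, vectors, exclude=(), top_n=3):
    """Rank words by cosine similarity to target, skipping excluded words."""
    words = [w for w in vectors if w not in exclude]
    matrix = np.array([vectors[w] for w in words], dtype=float)
    # Normalize rows and target so that a dot product equals the cosine similarity
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    target_unit = target / np.linalg.norm(target)
    scores = matrix_unit @ target_unit
    order = np.argsort(-scores)[:top_n]
    return [(words[i], float(scores[i])) for i in order]

target = (np.array(toy_vectors["king"])
          - np.array(toy_vectors["man"])
          + np.array(toy_vectors["woman"]))

ranked = most_similar(target, toy_vectors, exclude={"king", "man", "woman"})
print(ranked)
```

With these toy vectors the top candidate is `queen`. Computing all similarities with one matrix product also avoids the Python loop over the vocabulary, which matters when the vocabulary has hundreds of thousands of entries.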
