# Embedding sparse word vectors into high-but-lower-dimensional dense space

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/parrt/playdl/blob/master/mnist/notebooks/word-embeddings.ipynb)

Word vectors are simple, but a large vocabulary makes them extremely long (high-dimensional). They are also very sparse: mostly zeros.

Word embeddings, on the other hand, map that high-dimensional space into a smaller, dense space. For example, [GloVe](https://nlp.stanford.edu/projects/glove/) has pre-trained word embeddings of various sizes such as 50 and 300 dimensions. Unlike word vectors, we need to do some training to compute embeddings. I've used pre-trained word-to-embedding dictionaries to good effect, but we can also train an embedding specific to our task as part of our model, using an embedding layer.
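To make "embedding layer" a bit more concrete, here is a minimal NumPy sketch of the idea: an embedding is a matrix with one dense row per word, and looking up a word's row gives the same result as multiplying that word's one-hot vector by the matrix. The toy vocabulary, the matrix `E`, and the helper `one_hot` are made up for illustration; in a real model the values in `E` would be learned during training rather than drawn at random.

```python
import numpy as np

vocab = ["cat", "dog", "car"]  # hypothetical toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
embedding_dim = 4
# One dense row per word; random here, but learned in a real model
E = rng.normal(size=(len(vocab), embedding_dim))

def one_hot(i, n):
    """Sparse word vector: all zeros except a 1 at position i."""
    v = np.zeros(n)
    v[i] = 1.0
    return v

i = word_to_index["dog"]
via_matmul = one_hot(i, len(vocab)) @ E  # sparse vector times embedding matrix
via_lookup = E[i]                        # plain row lookup, no multiplication needed

print(np.allclose(via_matmul, via_lookup))  # True
```

The row lookup avoids ever building the long sparse vector, which is the practical reason to index into the matrix directly.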

## Getting random but dense, not sparse, word vectors

In [word-vectors](word-vectors.ipynb), we created sparse vectors representing words. When added together, these create bag of words (BOW) representations of documents. We can just turn on the particular position of a word in the vector if that word is present in the document. 

If we are passing words individually to a recurrent neural network (RNN), then these sparse vectors can get pretty big. If there are 20,000 words in the dictionary, we might have vectors of size 20,000. What we need is a dense representation that still gives us a unique representation of each word. (We also used the hash trick to try to shrink the size of the sparse vectors, but they are still sparse.)
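As a rough illustration of how sparse these vectors are, the sketch below builds a bag-of-words vector for a 20,000-word dictionary by adding one-hot vectors. The word indexes in `doc_word_indexes` are hypothetical, chosen only to show the effect.

```python
import numpy as np

vocab_size = 20_000                    # dictionary size from the text
doc_word_indexes = [17, 42, 42, 1999]  # hypothetical word indexes for one short document

def one_hot(i, n):
    """Sparse word vector: all zeros except a 1 at position i."""
    v = np.zeros(n)
    v[i] = 1.0
    return v

# Adding the one-hot vectors gives a bag-of-words count vector
bow = np.sum([one_hot(i, vocab_size) for i in doc_word_indexes], axis=0)

nonzero = np.count_nonzero(bow)
print(f"{nonzero} of {vocab_size} positions are nonzero")
print(f"fraction of zeros: {1 - nonzero / vocab_size:.5f}")
```

Even for a full tweet-length document, only a few dozen of the 20,000 positions would be nonzero.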

In [3]:
import numpy as np
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [59]:
# sample tweets from my twitter inbox with added text for experimentation
samples = [
    """Tesla Motors has nothing to do with this tweet.
    On those rare occasions when I really, really need to reduce the
    size of a file I use "xz -9". Today I found out about the "extreme" setting
    and "xz -e9" squeezed files down another 15% or so. It is not exactly quick,
    but that doesn't really matter in such cases!""",
    
    """Securities and exchange commission has nothing to do with this tweet.
    Do grad students get paid a lot? No. But do we at least have solid
    job security? Also, no. But are we at least ensured a stress-free work
    environment with a healthy work-life balance? Still, also no.""",

    """A design process hyperfocused on A/B testing can result in dark patterns even
    if that’s not the intent. That’s because most A/B tests are based on metrics
    that are relevant to the company’s bottom line, even if they result in harm to users."""
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(samples)
words = tokenizer.word_index.keys() # get tokenized words

If we are not using a hash function but instead dense vectors, we need a dictionary to keep track of the word-to-vector mapping. But we can create vectors of any length with very little chance of collision. For example, even with a single floating-point number between 0 and 1, with our 96 words in the vocabulary, there's almost no chance of collision.
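A back-of-the-envelope check of that claim, using the standard birthday-problem approximation. This is a sketch under a stated assumption: that a random double in [0, 1) takes one of about 2**53 equally likely values. The function name `collision_probability` is made up for this example.

```python
import math

def collision_probability(n_words, n_values):
    """Birthday-problem approximation: chance that at least two of n_words
    independent uniform draws from n_values possible values coincide."""
    pairs = n_words * (n_words - 1) / 2
    # expm1 keeps precision when the probability is tiny
    return -math.expm1(-pairs / n_values)

# Assumption: about 2**53 equally likely values for one random float in [0, 1)
p = collision_probability(96, 2**53)
print(f"chance of any collision among 96 words: {p:.1e}")
```

Under that assumption the chance is on the order of 1e-13 for a single number per word, and it shrinks further with every extra dimension.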

In [60]:
def hashwords(words, dimensionality = 4):
    return {w:np.random.random(size=dimensionality) for w in words}

In [61]:
index = hashwords(words, dimensionality=2)
list(index.items())[:10]

[('a', array([0.19961118, 0.11705843])),
 ('to', array([0.85572254, 0.54628484])),
 ('do', array([0.66253901, 0.52195502])),
 ('the', array([0.04109797, 0.48363888])),
 ('with', array([0.01202187, 0.46650898])),
 ('on', array([0.84055389, 0.35435807])),
 ('i', array([0.27237413, 0.22134055])),
 ('really', array([0.08232775, 0.64253263])),
 ('but', array([0.72443352, 0.8018862 ])),
 ('in', array([0.44320091, 0.3192354 ]))]

In [63]:
ncollisions = len(index) - len(set([tuple(a) for a in index.values()]))
print(f"There were {ncollisions} collisions between dense word vectors")

There were 0 collisions between dense word vectors


## Word embeddings using Keras

Those vectors are dense, but there is literally no meaning to the values in the vector positions. Two similar words are in no way close to each other in any kind of semantic space, so we are not helping the model very much.
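One way to see this is to compare random vectors with cosine similarity, a common measure of how closely two vectors point in the same direction. The sketch below uses a hypothetical three-word list and random vectors like those from `hashwords()`; any similarity it reports between "car" and "truck" is pure chance.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
words = ["car", "truck", "banana"]          # hypothetical word list
index = {w: rng.random(4) for w in words}   # random dense vectors, no training

# Related words get no special treatment from random vectors
print("car vs truck: ", cosine_similarity(index["car"], index["truck"]))
print("car vs banana:", cosine_similarity(index["car"], index["banana"]))
```

Rerunning with a different seed changes which pair looks "closer", which is exactly the problem: nothing ties the geometry to word meaning.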

The key idea for improvement is "*You shall know a word by the company it keeps.*" ([John Firth](https://en.wikipedia.org/wiki/John_Rupert_Firth), 1957).  Here is a [nice paper on GloVe](https://nlp.stanford.edu/pubs/glove.pdf).  I think this was the [word2vec paper that got word vectors started](https://arxiv.org/pdf/1301.3781.pdf).

BTW, this also works for embedding other things like movies/users (from the Netflix challenge) and [airlines](https://djcordhose.github.io/ml-workshop/2019-embeddings.html). If we want to get dense vectors for airline names, we need some mechanism to indicate similarity between airlines so that, rather than random dense vectors, we get dense vectors where similar airlines are close to each other in some appropriate dense space.