# Getting random but dense (not sparse) vectors

In [word-vectors](word-vectors.ipynb), we created sparse vectors representing words. When added together, these create bag of words (BOW) representations of documents. We can just turn on the particular position of a word in the vector if that word is present in the document. 

If we are passing words individually to a recurrent neural network (RNN), then these sparse vectors can get pretty big. If there are 20,000 words in the dictionary, we might have vectors of size 20,000. What we need is a dense representation that still gives us a unique representation of each word. (We also used the hash trick to try to shrink the size of the sparse vectors, but they are still sparse.)

In [30]:
import numpy as np
from keras.preprocessing.text import Tokenizer

In [31]:
# sample tweets from my twitter inbox with added text for experimentation
samples = [
    """Tesla Motors has nothing to do with this tweet.
    On those rare occasions when I really, really need to reduce the
    size of a file I use "xz -9". Today I found out about the "extreme" setting
    and "xz -e9" squeezed files down another 15% or so. It is not exactly quick,
    but that doesn't really matter in such cases!""",
    
    """Securities and exchange commission has nothing to do with this tweet.
    Do grad students get paid a lot? No. But do we at least have solid
    job security? Also, no. But are we at least ensured a stress-free work
    environment with a healthy work-life balance? Still, also no.""",

    """A design process hyperfocused on A/B testing can result in dark patterns even
    if that’s not the intent. That’s because most A/B tests are based on metrics
    that are relevant to the company’s bottom line, even if they result in harm to users."""
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(samples)
words = tokenizer.word_index.keys() # get tokenized words

If we are not using a hash function but instead dense vectors, we need a dictionary to keep track of the word-to-vector mapping. But we can create vectors of any length and get very little chance of collision. For example, even with a single floating-point number between 0 and 1, with our 96 words in the vocabulary, there's almost no chance of collision.

In [32]:
def hashwords(words, dimensionality=4):
    # Not a real hash: each word simply gets its own random vector with values in [0, 1)
    return {w: np.random.random(size=dimensionality) for w in words}

In [33]:
index = hashwords(words, dimensionality=4)
list(index.items())[:10]

[('a', array([0.84, 0.18, 0.47, 0.46])),
 ('to', array([0.97, 0.97, 0.32, 0.59])),
 ('do', array([0.98, 0.04, 0.44, 0.89])),
 ('the', array([0.52, 0.85, 0.35, 0.62])),
 ('with', array([0.12, 0.92, 0.61, 0.14])),
 ('on', array([0.31, 0.23, 0.21, 0.9 ])),
 ('i', array([0.54, 0.68, 0.72, 0.23])),
 ('really', array([0.39, 0.08, 0.93, 0.77])),
 ('but', array([0.16, 0.53, 0.15, 0.83])),
 ('in', array([0.49, 0.38, 0.37, 0.89]))]

In [34]:
# Arrays are not hashable, so convert each vector to a tuple before counting unique ones
ncollisions = len(index) - len({tuple(a) for a in index.values()})
print(f"There were {ncollisions} collisions between dense word vectors")

There were 0 collisions between dense word vectors
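
To put a rough number on "almost no chance of collision", here is a minimal sketch using the standard birthday-problem approximation. It assumes each random float carries about 53 bits of randomness, which is my assumption about `np.random.random` rather than something established above, and `collision_probability` is a hypothetical helper name used only for illustration.

```python
import math

def collision_probability(n_items: int, n_slots: int) -> float:
    """Birthday-problem approximation of P(at least one collision) when
    n_items values are drawn uniformly from n_slots equally likely slots."""
    pairs = n_items * (n_items - 1) / 2
    # 1 - exp(-pairs / n_slots), written with expm1 to stay accurate for tiny values
    return -math.expm1(-pairs / n_slots)

# Assumption: one random float in [0, 1) can take about 2**53 distinct values.
slots_per_float = 2 ** 53

p_96 = collision_probability(96, slots_per_float)       # vocabulary size quoted in the text
p_20k = collision_probability(20_000, slots_per_float)  # the 20,000-word dictionary case

print(f"96 words, 1 float each:     {p_96:.2e}")
print(f"20,000 words, 1 float each: {p_20k:.2e}")
```

Under that assumption, even a single float per word makes a collision extremely unlikely, and each extra dimension multiplies the number of available slots again.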


Note that if we only use dimensionality=1, then we are right back to label encoding, just with a floating-point number instead of an integer. We need at least a dimensionality of two.

What happens if we need to send an entire document, not just a single word, into a model? We need a continuous bag of words (CBOW), which is easy for one-hot encoding: we just turn on all relevant word columns. For dense vectors, we need to either concatenate them or sum or average them into a single vector. If documents have different lengths, then concatenation doesn't work because models typically require fixed-length input.
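
As a minimal sketch of the sum-or-average option, the code below turns documents of different lengths into vectors of the same size. The `doc_vector` helper, the whitespace tokenization, and the two short sample documents are assumptions made for illustration; they are not part of the notebook cells above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

def doc_vector(tokens, index, how="mean"):
    """Combine the dense vectors of a document's words into one fixed-length vector."""
    dim = len(next(iter(index.values())))
    vectors = [index[t] for t in tokens if t in index]  # skip unknown words
    if not vectors:
        return np.zeros(dim)
    stacked = np.vstack(vectors)
    return stacked.mean(axis=0) if how == "mean" else stacked.sum(axis=0)

# Two documents of different lengths (5 and 10 words)
docs = ["do we have job security",
        "i really need to reduce the size of a file"]
tokenized = [d.split() for d in docs]

# Random dense vector per word, as in the cells above
vocab = sorted({t for toks in tokenized for t in toks})
index = {w: rng.random(4) for w in vocab}

doc_vecs = np.array([doc_vector(toks, index) for toks in tokenized])
print(doc_vecs.shape)  # (2, 4): one fixed-length row per document
```

Averaging keeps the values in the same range regardless of document length, while summing lets longer documents produce larger values; which one suits a given model is a judgment call.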

These vectors are dense, but there is no meaning to the values in the vector positions. Two similar words are in no way close to each other in any kind of semantic space, so we are not helping the model very much.
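
One way to see the lack of meaning is to compare cosine similarities between random vectors. The sketch below is illustrative only: `cosine` and `random_index` are hypothetical helpers, and the three words are picked from the sample tweets. Whether "file" lands closer to "files" or to "tweet" depends entirely on the random seed.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def random_index(words, dimensionality=4, seed=0):
    """Assign each word a random dense vector with values in [0, 1)."""
    rng = np.random.default_rng(seed)
    return {w: rng.random(dimensionality) for w in words}

words = ["file", "files", "tweet"]
sims = {}
for seed in (0, 1):
    index = random_index(words, seed=seed)
    sims[seed] = (cosine(index["file"], index["files"]),
                  cosine(index["file"], index["tweet"]))
    print(f"seed={seed}: file~files={sims[seed][0]:.2f}, file~tweet={sims[seed][1]:.2f}")
```

Because nothing ties the numbers to word usage, the similarities change from one seed to the next and carry no semantic signal.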