# Text tokenization, word indexing, word vectors

In [10]:
import numpy as np
from keras.preprocessing.text import Tokenizer

In [102]:
# sample tweets from my twitter inbox
samples = [
    """On those rare occasions when I really, really need to reduce the
    size of a file I use "xz -9". Today I found out about the "extreme" setting
    and "xz -e9" squeezed files down another 15% or so. It is not exactly quick,
    but that doesn't really matter in such cases!""",
    
    """Do grad students get paid a lot? No. But do we at least have solid
    job security? Also, no. But are we at least ensured a stress-free work
    environment with a healthy work-life balance? Still, also no.""",

    """A design process hyperfocused on A/B testing can result in dark patterns even
    if that’s not the intent. That’s because most A/B tests are based on metrics
    that are relevant to the company’s bottom line, even if they result in harm to users."""
]

# num_words=30 keeps only word indices 1..29 (the 29 most frequent words); index 0 is reserved
tokenizer = Tokenizer(num_words=30)
tokenizer.fit_on_texts(samples)

In [103]:
sequences = tokenizer.texts_to_sequences(samples)
np.array(sequences[0]) # token sequence for first sample

array([ 3, 25, 26, 27, 28,  4,  5,  5, 29,  6,  2,  1,  4, 11,  4,  2, 11,
       12,  7, 13,  5,  8])
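
For intuition, the word indexing above can be approximated in plain Python. This is only a minimal sketch, not how Keras implements it: the helpers `tokenize`, `build_word_index`, and `to_sequences` are made-up names, and the regex tokenization is a crude stand-in that will not match the `Tokenizer` filters exactly.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude stand-in for the Tokenizer's filtering: lowercase, then keep
    # runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts):
    counts = Counter(w for t in texts for w in tokenize(t))
    # Most frequent word gets index 1; index 0 is left unused.
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    seqs = []
    for t in texts:
        ids = [word_index[w] for w in tokenize(t) if w in word_index]
        if num_words is not None:
            # Drop every word whose index is num_words or larger.
            ids = [i for i in ids if i < num_words]
        seqs.append(ids)
    return seqs

toy = ["the cat sat on the mat", "the dog sat"]
word_index = build_word_index(toy)
print(word_index)
print(to_sequences(toy, word_index, num_words=3))
```

With `num_words=3` only indices 1 and 2 survive, which mirrors why the sequence for the first tweet is much shorter than the tweet itself.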

In [104]:
# binary bag-of-words row per sample: 1 where the word occurs (multi-hot rather than strictly one-hot)
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
one_hot_results

array([[0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 1.,
        1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 1., 1., 0., 0., 1., 0., 1., 0., 1., 0., 1., 1., 0., 0.,
        0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0.]])

In [105]:
# same index positions as above, but each entry holds the word's count in that sample
count_results = tokenizer.texts_to_matrix(samples, mode='count')
count_results

array([[0., 1., 2., 1., 3., 3., 1., 1., 1., 0., 0., 2., 1., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [0., 3., 0., 0., 0., 0., 0., 2., 0., 3., 1., 0., 0., 0., 2., 2.,
        2., 2., 2., 2., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 3., 2., 2., 0., 0., 2., 0., 2., 0., 2., 0., 1., 1., 0., 0.,
        0., 0., 0., 0., 2., 2., 2., 2., 2., 0., 0., 0., 0., 0.]])
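
The difference between the two modes can be sketched without Keras. The helper `sequences_to_rows` below is a hypothetical name and works on small hand-written index sequences; it only illustrates the idea and assumes every index is smaller than `num_words`.

```python
def sequences_to_rows(sequences, num_words, mode="binary"):
    rows = []
    for seq in sequences:
        row = [0] * num_words      # column 0 stays 0: no word has index 0
        for i in seq:
            if mode == "count":
                row[i] += 1        # how many times the word occurs
            else:
                row[i] = 1         # whether the word occurs at all
        rows.append(row)
    return rows

seqs = [[1, 2, 1], [1, 2, 2, 3]]
binary = sequences_to_rows(seqs, num_words=4, mode="binary")
counts = sequences_to_rows(seqs, num_words=4, mode="count")
print(binary)
print(counts)
```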

## "One-hot hash trick"

Instead of assigning each word a unique index into a dictionary built from all samples, compute a hash of the word and use that as its code. This lets us fix the length of the word vector in advance rather than tying it to the vocabulary size, at the cost of possible collisions when two words hash to the same code.
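
As a rough illustration of the idea, here is a sketch that turns a text into a fixed-length binary vector with no vocabulary pass at all. It swaps in `zlib.crc32` as the hash purely because it is stable across runs; the function name `hashed_bag_of_words` and the whitespace tokenization are assumptions made for the example.

```python
import zlib

def hashed_bag_of_words(text, dimensionality=16):
    vec = [0] * dimensionality
    for w in text.lower().split():
        # The hash of the word, reduced modulo the vector length, is its slot.
        code = zlib.crc32(w.encode("utf-8")) % dimensionality
        vec[code] = 1   # two different words may land in the same slot
    return vec

v = hashed_bag_of_words("the cat sat on the mat", dimensionality=16)
print(v)
```

The vector length is always `dimensionality`, however many distinct words the text contains.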

In [106]:
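def simple_hash(w):
    # shift-and-add string hash; named simple_hash so it does not shadow the builtin hash()
    h = 0
    for c in w:
        h = (h<<3) + ord(c)
    return h

def hashwords(dimensionality = 1000):
    # NOTE: reads the global `words` defined below
    collisions = set()  # codes assigned to more than one word
    codes = set()       # codes seen so far
    index = {}          # word -> hashed code
    for w in words:
        wcode = simple_hash(w)%dimensionality
        if wcode in codes:
            collisions.add(wcode)
        codes.add(wcode)
        index[w] = wcode
    return index, collisions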

words = tokenizer.word_index.keys() # get tokenized words
dimensionality=10
index, collisions = hashwords(dimensionality)
print("INDEX:", index)
print(f"{len(collisions)}/{len(words)}={(len(collisions)*100)/len(words):.1f}% collisions using {dimensionality} unique codes")

INDEX: {'a': 7, 'the': 7, 'on': 8, 'i': 5, 'really': 9, 'to': 9, 'but': 4, 'in': 0, 'no': 1, 'are': 1, 'xz': 2, 'not': 4, 'that': 0, 'do': 1, 'we': 3, 'at': 2, 'least': 4, 'also': 7, 'work': 1, 'b': 8, 'result': 6, 'even': 2, 'if': 2, 'that’s': 1, 'those': 9, 'rare': 9, 'occasions': 5, 'when': 2, 'need': 2, 'reduce': 9, 'size': 7, 'of': 0, 'file': 9, 'use': 9, '9': 7, 'today': 5, 'found': 2, 'out': 6, 'about': 4, 'extreme': 5, 'setting': 9, 'and': 8, 'e9': 5, 'squeezed': 2, 'files': 7, 'down': 6, 'another': 4, '15': 5, 'or': 2, 'so': 1, 'it': 6, 'is': 5, 'exactly': 3, 'quick': 1, "doesn't": 2, 'matter': 2, 'such': 4, 'cases': 1, 'grad': 8, 'students': 5, 'get': 6, 'paid': 2, 'lot': 6, 'have': 1, 'solid': 4, 'job': 0, 'security': 5, 'ensured': 2, 'stress': 3, 'free': 9, 'environment': 4, 'with': 0, 'healthy': 9, 'life': 3, 'balance': 3, 'still': 4, 'design': 0, 'process': 3, 'hyperfocused': 4, 'testing': 7, 'can': 2, 'dark': 7, 'patterns': 9, 'intent': 2, 'because': 7, 'most': 8, 'tests

In [107]:
for dimensionality in [10,50,100,500,1000,2000,5000,10000]:
    index, collisions = hashwords(dimensionality)
    print(f"{len(collisions):2d}/{len(words):2d}={(len(collisions)*100)/len(words):4.1f}% collisions using {dimensionality:6d} unique codes")

10/96=10.4% collisions using     10 unique codes
28/96=29.2% collisions using     50 unique codes
24/96=25.0% collisions using    100 unique codes
 6/96= 6.2% collisions using    500 unique codes
 4/96= 4.2% collisions using   1000 unique codes
 3/96= 3.1% collisions using   2000 unique codes
 1/96= 1.0% collisions using   5000 unique codes
 0/96= 0.0% collisions using  10000 unique codes
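
The counts above are sizes of the `collisions` set, so they count shared codes rather than the words that share them. A small sketch that reports both numbers, using the same shift-and-add hash on the first ten words of the index printed earlier (`shift_add_hash` and `collision_stats` are hypothetical helper names):

```python
from collections import defaultdict

def shift_add_hash(w):
    # Same shift-and-add scheme as the notebook's hash function.
    h = 0
    for c in w:
        h = (h << 3) + ord(c)
    return h

def collision_stats(words, dimensionality):
    buckets = defaultdict(list)
    for w in words:
        buckets[shift_add_hash(w) % dimensionality].append(w)
    shared = [ws for ws in buckets.values() if len(ws) > 1]
    # (number of codes used by 2+ words, number of words in those codes)
    return len(shared), sum(len(ws) for ws in shared)

toy_words = ["a", "the", "on", "i", "really", "to", "but", "in", "no", "are"]
print(collision_stats(toy_words, 10))   # (3, 6)
```

Here 3 codes are shared, but 6 of the 10 words are affected, so the word-level collision rate is noticeably higher than the code-level one.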


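**Conclusion**: With my typical hash function, it takes a large dimensionality before collisions go away. There are only 96 unique words here, so a unique (perfect) hash would need a dimensionality of just 96, yet the hashing trick needs thousands of hash buckets before we get 0 collisions. Note that `collisions` holds the distinct codes shared by more than one word, not the words involved, which is why the 10-code row reports only 10 collisions even though every one of the 96 words shares its code with another word. I suppose that if a simple hash function worked well, we wouldn't need word embeddings. :)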
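
For a sense of how much of this is inherent to hashing rather than to this particular hash function, here is a back-of-the-envelope sketch. It assumes an idealized hash that drops each word into a uniformly random bucket, which the shift-and-add hash above is not. Under that assumption the expected number of occupied buckets for n words and m buckets is m(1 - (1 - 1/m)^n), and the rest of the words land in a bucket that is already taken. The function name `expected_collisions` is made up for the example.

```python
def expected_collisions(n_words, n_buckets):
    # Expected number of distinct buckets hit by n_words uniform random draws.
    occupied = n_buckets * (1 - (1 - 1 / n_buckets) ** n_words)
    # Every word beyond the first in a bucket counts as one collision.
    return n_words - occupied

for m in [96, 500, 1000, 5000, 10000]:
    print(f"{m:6d} buckets: {expected_collisions(96, m):5.2f} expected collisions")
```

Even under this idealized model, the expectation only drops below one collision once the number of buckets is in the thousands, far more than the 96 a perfect hash would need.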