# One-hot encoding of words or characters

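One-hot encoding is the most common and most basic way to turn a token into a vector. It associates a unique integer index with every word, then turns that index i into a binary vector whose length is the size of the vocabulary: all zeros except for a 1 at position i.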

![](https://imgur.com/pTbwpVD.png)

Suppose we have only five words in our vocabulary: King, Queen, Man, Woman and Child. We can encode the word 'Queen' as:
![](https://imgur.com/Wy4M4fO.png)

With one-hot encoding, each word is represented by a vector of length 5 in which the entry at the word's index i is 1 and all other entries are 0.
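
The five-word example can be written out in a few lines of NumPy. This is only a minimal sketch: the ordering of the vocabulary and the helper name `one_hot` are assumptions made for illustration.

```python
import numpy as np

# The five-word vocabulary from the example; the ordering is an assumption.
vocabulary = ['King', 'Queen', 'Man', 'Woman', 'Child']
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1.
    return vector

queen_vector = one_hot('Queen')
print(queen_vector)
```

With this ordering, 'Queen' sits at index 1, so the second entry of its vector is the only non-zero one.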

Of course, one-hot encoding can also be done at the character level. The two simple examples below show how to implement it: one at the word level and one at the character level.



In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.1.2'

## One-hot encoding at word level 


In [3]:
import numpy as np

# This is our initial data; each "sample" is just a sentence in this toy example,
# but in practice it could be the entire document.
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# First, build an index of all tokens in the data.
# Note that no token is assigned index 0.
token_index = {}

for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

# Next, we vectorize our samples.

# Only the first `max_length` words of each sample are considered.
# The right value depends on the data; it is set to 10 here
# because none of our samples exceeds 10 words.
max_length = 10

results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.

In [14]:
token_index

{'The': 1,
 'ate': 8,
 'cat': 2,
 'dog': 7,
 'homework.': 10,
 'mat.': 6,
 'my': 9,
 'on': 4,
 'sat': 3,
 'the': 5}

In [15]:
# Print the one-hot encoding of the first sample:
# 'The cat sat on the mat.'

results[0]

array([[ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
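
To check how to read this array, we can decode it back into tokens. The sketch below rebuilds the same tensor and maps every non-empty row to its token; the reverse dictionary `index_token` and the `decode` helper are names introduced here for illustration only.

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Rebuild the index and the one-hot tensor in the same way as the cell above.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

max_length = 10
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        results[i, j, token_index[word]] = 1.

# Reverse mapping from integer index back to token.
index_token = {index: word for word, index in token_index.items()}

def decode(encoded_sample):
    """Turn a (max_length, vocabulary size) one-hot matrix back into a list of tokens."""
    words = []
    for row in encoded_sample:
        if row.sum() == 0:  # all-zero row: padding, no token at this position
            continue
        words.append(index_token[int(row.argmax())])
    return words

print(results.shape)
print(decode(results[0]))
```

The tensor has shape (2, 10, 11): two samples, ten positions per sample, and eleven columns (ten tokens plus the unused index 0). The first sample decodes to six tokens, with 'The' and 'the' kept as different tokens and the period still attached to 'mat.', because the text was only split on whitespace.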

## One-hot encoding at the character level

In [16]:
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# A string containing all printable ASCII characters
characters = string.printable

print(characters)

# Build a dictionary that maps each character to an integer index.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 	



In [21]:
print(token_index)

{'0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10, 'a': 11, 'b': 12, 'c': 13, 'd': 14, 'e': 15, 'f': 16, 'g': 17, 'h': 18, 'i': 19, 'j': 20, 'k': 21, 'l': 22, 'm': 23, 'n': 24, 'o': 25, 'p': 26, 'q': 27, 'r': 28, 's': 29, 't': 30, 'u': 31, 'v': 32, 'w': 33, 'x': 34, 'y': 35, 'z': 36, 'A': 37, 'B': 38, 'C': 39, 'D': 40, 'E': 41, 'F': 42, 'G': 43, 'H': 44, 'I': 45, 'J': 46, 'K': 47, 'L': 48, 'M': 49, 'N': 50, 'O': 51, 'P': 52, 'Q': 53, 'R': 54, 'S': 55, 'T': 56, 'U': 57, 'V': 58, 'W': 59, 'X': 60, 'Y': 61, 'Z': 62, '!': 63, '"': 64, '#': 65, '$': 66, '%': 67, '&': 68, "'": 69, '(': 70, ')': 71, '*': 72, '+': 73, ',': 74, '-': 75, '.': 76, '/': 77, ':': 78, ';': 79, '<': 80, '=': 81, '>': 82, '?': 83, '@': 84, '[': 85, '\\': 86, ']': 87, '^': 88, '_': 89, '`': 90, '{': 91, '|': 92, '}': 93, '~': 94, ' ': 95, '\t': 96, '\n': 97, '\r': 98, '\x0b': 99, '\x0c': 100}


In [22]:
print(token_index['T'])
print(token_index['h'])
print(token_index['e'])

56
18
15
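
One detail to watch with this index: `token_index.get(character)` returns `None` for any character outside `string.printable`, and NumPy interprets `None` in an index as a new axis, so the assignment would silently set the whole row to 1. The sketch below is one possible way around this, assuming we are happy to reuse the otherwise unused index 0 for unknown characters; `UNKNOWN_INDEX` and `encode_characters` are hypothetical names, not part of the original notebook.

```python
import string
import numpy as np

characters = string.printable
token_index = dict(zip(characters, range(1, len(characters) + 1)))

# Assumption: index 0 is not used by any character, so we reuse it for unknowns.
UNKNOWN_INDEX = 0

def encode_characters(text, max_length=50):
    """One-hot encode a string character by character, padded or truncated to max_length."""
    encoded = np.zeros((max_length, len(characters) + 1))
    for j, character in enumerate(text[:max_length]):
        # A default value keeps None from ever reaching NumPy as an index.
        index = token_index.get(character, UNKNOWN_INDEX)
        encoded[j, index] = 1.
    return encoded

encoded = encode_characters('café')
print(encoded.shape)
print(encoded[:4].argmax(axis=1))
```

Here 'c', 'a' and 'f' keep their usual indices, while 'é' (not a printable ASCII character) falls into column 0.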


Note that Keras has built-in utilities for one-hot encoding text at the word level or the character level. These are what you should actually use, because they take care of important details such as stripping special characters from strings, or only taking into account the N most common words in the dataset (a common restriction that avoids dealing with very large input vector spaces).
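
To make these two features concrete, here is a rough plain-Python sketch of the idea: lowercase the text, replace punctuation with spaces, and rank words by frequency. It is not the Keras implementation, and `simple_tokenize` and `build_word_index` are hypothetical helpers whose details (for example how the cutoff is counted) may differ from what Keras does.

```python
import string
from collections import Counter

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

def simple_tokenize(text):
    """Lowercase the text, replace punctuation with spaces, and split on whitespace."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return text.lower().translate(table).split()

def build_word_index(texts, num_words=None):
    """Rank words by frequency (most common first, starting at 1), keeping the top num_words."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    most_common = counts.most_common(num_words)
    return {word: rank for rank, (word, _) in enumerate(most_common, start=1)}

word_index = build_word_index(samples)
print(word_index)
print(build_word_index(samples, num_words=3))
```

After this normalization, 'The' and 'the' collapse into a single token and 'mat.' becomes 'mat', so the vocabulary shrinks from 10 entries to 9.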

## Word-level One-hot encoding using Keras:

In [35]:
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We create a tokenizer that is configured to extract only the 1000 most common words
tokenizer = Tokenizer(num_words=1000)

# Build the word index
tokenizer.fit_on_texts(samples)

# Turn the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)

# You can also get one-hot encoded binary representation directly
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# Retrieve the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 9 unique tokens.


In [36]:
word_index

{'ate': 7,
 'cat': 2,
 'dog': 6,
 'homework': 9,
 'mat': 5,
 'my': 8,
 'on': 4,
 'sat': 3,
 'the': 1}

In [37]:
sequences

[[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

In [39]:
one_hot_results.shape

(2, 1000)
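
The shape (2, 1000) shows that `texts_to_matrix` returns one vector per sample rather than one vector per token: in `'binary'` mode, each row marks which word indices occur in that sample, and word order is lost. The NumPy sketch below reproduces that layout from the integer sequences printed above. It is meant as an illustration of the result, not as the Keras code.

```python
import numpy as np

# Integer sequences as printed above by tokenizer.texts_to_sequences(samples).
sequences = [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
num_words = 1000

# One row per sample, one column per word index; word order is not kept.
binary_matrix = np.zeros((len(sequences), num_words))
for i, sequence in enumerate(sequences):
    for index in sequence:
        binary_matrix[i, index] = 1.

print(binary_matrix.shape)
print(binary_matrix[0, :10])
```

Although 'the' appears twice in the first sentence, its column holds a single 1, so the first row contains five ones for six tokens.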

One variant of one-hot encoding is the "one-hot hashing trick", which can be used when the number of unique tokens in the vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference to these indices in a dictionary, words are hashed into vectors of fixed size.

This is usually done with a very lightweight hash function. The main advantage of this approach is that it does away with maintaining an explicit word index, which saves memory and allows the data to be encoded online (token vectors can be generated right away, before all the available data has been seen).

A disadvantage of this approach is that it is susceptible to "hash collisions": two different words may have the same hash value, and any machine learning model that looks at these hash values will not be able to discern the differences between these words.

When the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed, the likelihood of hash collisions decreases.
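
To get a feel for the numbers, we can estimate the probability that at least two tokens collide. The sketch below assumes an idealized hash function that spreads tokens uniformly and independently over the available dimensions (the classic "birthday problem" setting); real hash functions only approximate this, and `collision_probability` is a helper name chosen for this example.

```python
def collision_probability(num_tokens, dimensionality):
    """Probability that at least two of num_tokens distinct tokens share an index,
    assuming a hash that is uniform over `dimensionality` possible indices."""
    probability_all_distinct = 1.0
    for k in range(num_tokens):
        # The (k+1)-th token must avoid the k indices that are already taken.
        probability_all_distinct *= max(0.0, 1.0 - k / dimensionality)
    return 1.0 - probability_all_distinct

for num_tokens in (10, 50, 200):
    print(num_tokens, round(collision_probability(num_tokens, 1000), 3))
```

Under this assumption, a collision somewhere in the vocabulary becomes likely long before the number of tokens reaches the dimensionality, which is why the hashing space is usually chosen much larger than the expected vocabulary.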

## Word-level one-hot encoding with the hashing trick

In [40]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We store each word as a vector of size 1000.
# Note that if there are close to 1000 words (or more),
# there will be many hash collisions, which will reduce the accuracy of this encoding method.
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into a "random" integer index
        # between 0 and 999
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
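
One caveat with the built-in `hash()`: in Python 3, string hashes are salted per process by default, so the same word can land on a different index each time the script runs, unless `PYTHONHASHSEED` is fixed. If the encoding has to be reproducible across runs, a possible alternative is a hash from `hashlib`. The sketch below uses MD5 purely as an example of a stable hash; `stable_hash_index` is a hypothetical helper, not something used in the original code.

```python
import hashlib
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

dimensionality = 1000
max_length = 10

def stable_hash_index(word, dimensionality):
    """Map a word to an index in [0, dimensionality) that is identical in every run."""
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, stable_hash_index(word, dimensionality)] = 1.

print(results.shape)
print(stable_hash_index('cat', dimensionality))
```

MD5 is slower than the built-in hash, so this trades some speed for reproducibility; any other deterministic hash would serve the same purpose.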

## Reference 
https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
