In [1]:
import numpy as np

# One-hot encoding
In the beginning were the words. So very many words. Our first step is to convert all the words to numbers so we can do math on them.

Imagine that our goal is to create a computer that responds to our voice commands. It’s our job to build the transformer that converts (or transduces) a sequence of sounds into a sequence of words.

We start by choosing our vocabulary, the collection of symbols that we are going to be working with in each sequence. In our case, there will be two different sets of symbols, one for the input sequence to represent vocal sounds and one for the output sequence to represent words.

For now, let's assume we're working with English. There are tens of thousands of words in the English language, and perhaps another few thousand to cover computer-specific terminology. That would give us a vocabulary size that is the better part of a hundred thousand. One way to convert words to numbers is to start counting at one and assign each word its own number. Then a sequence of words can be represented as a list of numbers.

For example, consider a tiny language with a vocabulary size of three: files, find, and my. Each word could be swapped out for a number, perhaps files = 1, find = 2, and my = 3. Then the sentence "Find my files", consisting of the word sequence [ find, my, files ] could be represented instead as the sequence of numbers [2, 3, 1].
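
As a minimal sketch of this numbering scheme, the mapping can be held in a dictionary. The name `word_to_number` is made up for illustration, and the particular numbers are just the arbitrary assignment used in the example above.

```python
# Hypothetical lookup table for the three-word toy language.
# The assignment files = 1, find = 2, my = 3 is arbitrary.
word_to_number = {"files": 1, "find": 2, "my": 3}

sentence = "Find my files"

# Lowercase and split the sentence, then swap each word for its number.
encoded = [word_to_number[word] for word in sentence.lower().split()]
print(encoded)  # [2, 3, 1]
```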

This is a perfectly valid way to convert symbols to numbers, but it turns out that there's another format that's even easier for computers to work with: one-hot encoding. In one-hot encoding, a symbol is represented by an array of mostly zeros, the same length as the vocabulary, with only a single element having a value of one. Each element in the array corresponds to a separate symbol.

Another way to think about one-hot encoding is that each word still gets assigned its own number, but now that number is an index into an array. Here is our example above, in one-hot notation, starting with the vocabulary.

In [4]:
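# Build the vocabulary: the sorted, de-duplicated words of the text.
# (The name `corpus` is kept, but this list plays the role of the vocabulary.)
entire_corpus = "Find my files".lower()
corpus = sorted(set(entire_corpus.split()))
print(corpus)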

['files', 'find', 'my']


![](https://e2eml.school/images/transformers/one_hot_vocabulary.png)

So the sentence "Find my files" becomes a sequence of one-dimensional arrays, which, after you squeeze them together, starts to look like a two-dimensional array.

In [6]:
# Return a list of one-hot vectors, one per word in word_list.
# `corpus` is the sorted vocabulary; each vector has one slot per vocabulary word.
def one_hot_vector(word_list, corpus):
    vectors = []
    for word in word_list:
        vectors.append([1 if word == vocab_word else 0 for vocab_word in corpus])
    return vectors
print(corpus)
one_hot_vector(corpus, corpus)

['files', 'find', 'my']


[[1, 0, 0], [0, 1, 0], [0, 0, 1]]

![](https://e2eml.school/images/transformers/one_hot_sentence.png)

Heads-up, I'll be using the terms "one-dimensional array" and "vector" interchangeably. Likewise with "two-dimensional array" and "matrix".


In [9]:
# Convenience wrapper: the one-hot vector for a single word.
def word_to_one_hot(word, corpus):
    return one_hot_vector([word], corpus)[0]

[word_to_one_hot(word, corpus) for word in ["find", "my", "files"]]

[[0, 1, 0], [0, 0, 1], [1, 0, 0]]
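
The idea that each word's number is an index into an array can also be sketched with NumPy: row `i` of an identity matrix is the one-hot vector for index `i`. This is one possible alternative to the list-based functions above, not a replacement for them. The names `vocabulary`, `word_to_index`, and `one_hot_sentence` are chosen here for illustration.

```python
import numpy as np

# Assumed vocabulary, in the same sorted order as the example above.
vocabulary = ["files", "find", "my"]

# Each word's number is its position in the vocabulary (counting from zero).
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Row i of the identity matrix is the one-hot vector for index i.
identity = np.eye(len(vocabulary), dtype=int)

sentence = ["find", "my", "files"]
indices = [word_to_index[word] for word in sentence]

# Indexing with a list of row numbers stacks the one-hot vectors into a matrix.
one_hot_sentence = identity[indices]
print(one_hot_sentence)
```

Each row of `one_hot_sentence` is one word of the sentence, so the result is the two-dimensional array described earlier.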

# Dot product
One really useful thing about the one-hot representation is that it lets us compute dot products. These are also known by other intimidating names like inner product and scalar product. To get the dot product of two vectors, multiply their corresponding elements, then add the results.

![](https://e2eml.school/images/transformers/dot_product.png)

In [10]:
array_1 = np.array([0, 1, 1, 2])
array_2 = np.array([1, 0, 1, 2])
array_1.dot(array_2)

5
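
To make the "multiply corresponding elements, then add" recipe concrete, here is a small sketch in plain Python. The helper name `dot_product` is invented for this example. It is meant to mirror the definition, not to stand in for NumPy's implementation.

```python
def dot_product(a, b):
    """Multiply corresponding elements of a and b, then add the results."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

# Same values as the NumPy example above:
# 0*1 + 1*0 + 1*1 + 2*2 = 5
result = dot_product([0, 1, 1, 2], [1, 0, 1, 2])
print(result)  # 5
```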

Dot products are especially useful when we're working with our one-hot word representations. The dot product of any one-hot vector with itself is one.

![](https://e2eml.school/images/transformers/match.png)

In [11]:
array_1 = np.array([0, 1, 0, 0])
array_2 = np.array([0, 1, 0, 0])
array_1.dot(array_2)

1

And the dot product of any one-hot vector with any other one-hot vector is zero.

![](https://e2eml.school/images/transformers/non_match.png)

In [12]:
array_1 = np.array([0, 1, 0, 0])
array_2 = np.array([0, 0, 0, 1])
array_1.dot(array_2)

0
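
Both cases can be checked at once for a whole vocabulary. The sketch below assumes a vocabulary of four words and computes the dot product of every one-hot vector with every other. The names `one_hots` and `pairwise` are illustrative.

```python
import numpy as np

# One-hot vectors for an assumed four-word vocabulary, one per row.
one_hots = np.eye(4, dtype=int)

# Entry [i, j] is the dot product of one-hot vector i with one-hot vector j.
pairwise = one_hots @ one_hots.T
print(pairwise)
```

The ones on the diagonal are each vector matched with itself, and the zeros everywhere else are the mismatched pairs.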

The previous two examples show how dot products can be used to measure similarity. As another example, consider a vector of values that represents a combination of words with varying weights. A one-hot encoded word can be compared against it with the dot product to show how strongly that word is represented.

![](https://e2eml.school/images/transformers/similarity.png)

In [13]:
array_1 = np.array([0, 0, 1, 0]) # Word
array_2 = np.array([0.2, 0.7, 0.8, 0.1]) # Combination of words
array_1.dot(array_2)

0.8
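
Repeating that comparison for every word in the vocabulary shows that each one-hot vector simply picks out its own word's weight. This is a sketch using the same made-up weights as above, with `combination` and `scores` as illustrative names.

```python
import numpy as np

# Same weighted combination of words as in the example above.
combination = np.array([0.2, 0.7, 0.8, 0.1])

# One one-hot vector per word in the (assumed) four-word vocabulary.
one_hots = np.eye(len(combination))

# Dot each one-hot word against the combination.
scores = [float(word.dot(combination)) for word in one_hots]
print(scores)
```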