In [1]:
import numpy as np

# One-hot encoding
In the beginning were the words. So very many words. Our first step is to convert all the words to numbers so we can do math on them.

Imagine that our goal is to create the computer that responds to our voice commands. It’s our job to build the transformer that converts (or transduces) a sequence of sounds to a sequence of words.

We start by choosing our vocabulary, the collection of symbols that we are going to be working with in each sequence. In our case, there will be two different sets of symbols, one for the input sequence to represent vocal sounds and one for the output sequence to represent words.

For now, let's assume we're working with English. There are tens of thousands of words in the English language, and perhaps another few thousand to cover computer-specific terminology. That would give us a vocabulary size that is the better part of a hundred thousand. One way to convert words to numbers is to start counting at one and assign each word its own number. Then a sequence of words can be represented as a list of numbers.

For example, consider a tiny language with a vocabulary size of three: files, find, and my. Each word could be swapped out for a number, perhaps files = 1, find = 2, and my = 3. Then the sentence "Find my files", consisting of the word sequence [ find, my, files ] could be represented instead as the sequence of numbers [2, 3, 1].
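The counting scheme above can be sketched in a few lines. This is only an illustration: the names `vocabulary`, `word_to_number`, and `encode` are made up for this example and are not used elsewhere in the notebook.

```python
# Hypothetical names, chosen for this sketch only
vocabulary = ["files", "find", "my"]

# Start counting at one, as in the text: files = 1, find = 2, my = 3
word_to_number = {word: number for number, word in enumerate(vocabulary, start=1)}

def encode(sentence):
    """Convert a sentence into a list of word numbers."""
    return [word_to_number[word] for word in sentence.lower().split()]

encoded = encode("Find my files")
print(encoded)  # [2, 3, 1]
```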

This is a perfectly valid way to convert symbols to numbers, but it turns out that there's another format that's even easier for computers to work with: one-hot encoding. In one-hot encoding a symbol is represented by an array of mostly zeros, the same length as the vocabulary, with only a single element having a value of one. Each element in the array corresponds to a separate symbol.

Another way to think about one-hot encoding is that each word still gets assigned its own number, but now that number is an index to an array. Here is our example above, in one-hot notation.

In [2]:
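# Lowercase the text and split it into words
entire_corpus = "Find my files".lower()
# Despite its name, `corpus` holds the vocabulary: each distinct word once, sorted
corpus = sorted(set(entire_corpus.split()))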
print(corpus)

['files', 'find', 'my']


![](https://e2eml.school/images/transformers/one_hot_vocabulary.png)

So the sentence "Find my files" becomes a sequence of one-dimensional arrays, which, after you squeeze them together, starts to look like a two-dimensional array.

In [3]:
# Return a list of one-hot vectors, one for each word in word_list
def one_hot_vector(word_list, corpus):
    vectors = []
    for word in word_list:
        # 1 at the position of the matching vocabulary word, 0 everywhere else
        vectors.append([1 if word == vocab_word else 0 for vocab_word in corpus])
    return vectors
print(corpus)
one_hot_vector(corpus, corpus)

['files', 'find', 'my']


[[1, 0, 0], [0, 1, 0], [0, 0, 1]]

![](https://e2eml.school/images/transformers/one_hot_sentence.png)

Heads-up, I'll be using the terms "one-dimensional array" and "vector" interchangeably. Likewise with "two-dimensional array" and "matrix".


In [4]:
def word_to_one_hot(word, corpus):
    return one_hot_vector([word], corpus)[0]

[word_to_one_hot(word, corpus) for word in ["find", "my", "files"]]

[[0, 1, 0], [0, 0, 1], [1, 0, 0]]
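Since each word's number is just an index into an array, the same encoding can be sketched with NumPy by picking rows out of an identity matrix. This is one possible alternative to the list-based function above, not a replacement for it; `word_to_index` and `one_hot_matrix` are hypothetical names introduced here.

```python
import numpy as np

vocabulary = ["files", "find", "my"]
# Hypothetical helper: map each word to its position in the vocabulary
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot_matrix(words, vocabulary_size):
    """Stack one one-hot row per word by selecting rows of an identity matrix."""
    indices = [word_to_index[word] for word in words]
    return np.eye(vocabulary_size, dtype=int)[indices]

sentence_matrix = one_hot_matrix(["find", "my", "files"], len(vocabulary))
print(sentence_matrix)
```

Each row of the result is the one-hot vector for one word of the sentence, in order.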

# Dot product
One really useful thing about the one-hot representation is that it lets us compute dot products. These are also known by other intimidating names like inner product and scalar product. To get the dot product of two vectors, multiply their corresponding elements, then add the results.

![](https://e2eml.school/images/transformers/dot_product.png)

In [5]:
array_1 = np.array([0, 1, 1, 2])
array_2 = np.array([1, 0, 1, 2])
array_1.dot(array_2)

5

Dot products are especially useful when we're working with our one-hot word representations. The dot product of any one-hot vector with itself is one.

![](https://e2eml.school/images/transformers/match.png)

In [6]:
array_1 = np.array([0, 1, 0, 0])
array_2 = np.array([0, 1, 0, 0])
array_1.dot(array_2)

1

And the dot product of any one-hot vector with any other one-hot vector is zero.

![](https://e2eml.school/images/transformers/non_match.png)

In [7]:
array_1 = np.array([0, 1, 0, 0])
array_2 = np.array([0, 0, 0, 1])
array_1.dot(array_2)

0

The previous two examples show how dot products can be used to measure similarity. As another example, consider a vector of values that represents a combination of words with varying weights. A one-hot encoded word can be compared against it with the dot product to show how strongly that word is represented.

![](https://e2eml.school/images/transformers/similarity.png)

In [8]:
array_1 = np.array([0, 0, 1, 0]) # Word
array_2 = np.array([0.2, 0.7, 0.8, 0.1]) # Combination of words
array_1.dot(array_2)

0.8
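To make the "multiply corresponding elements, then add" recipe concrete, here is a hand-rolled version in plain Python that reproduces the four results above. It is a sketch for illustration only (`dot_product` is a name made up here); in practice NumPy's `dot` does the same job.

```python
def dot_product(u, v):
    """Multiply corresponding elements, then add the results."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

general = dot_product([0, 1, 1, 2], [1, 0, 1, 2])            # 0 + 0 + 1 + 4
match = dot_product([0, 1, 0, 0], [0, 1, 0, 0])              # same one-hot word
non_match = dot_product([0, 1, 0, 0], [0, 0, 0, 1])          # different one-hot words
strength = dot_product([0, 0, 1, 0], [0.2, 0.7, 0.8, 0.1])   # weight of the third word

print(general, match, non_match, strength)
```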

# Matrix Multiplication

The dot product is the building block of matrix multiplication, a very particular way to combine a pair of two-dimensional arrays. We'll call the first of these matrices A and the second one B. In the simplest case, when A has only one row and B has only one column, the result of matrix multiplication is the dot product of the two.

![](https://e2eml.school/images/transformers/matrix_mult_one_row_one_col.png)

Notice how the number of columns in A and the number of rows in B needs to be the same for the two arrays to match up and for the dot product to work out.

In [9]:
# matrix multiply the vectors [0, 0, 1, 0] and [0.2, 0.7, 0.8, 0.1]
np.matmul(np.array([0, 0, 1, 0]), np.array([0.2, 0.7, 0.8, 0.1]))

0.8
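The one-dimensional arrays above hide the row-and-column bookkeeping. As a sketch, the same product can be written with explicit two-dimensional shapes, which also shows what NumPy does when the inner dimensions disagree. The names `B_wrong` and `shapes_matched` are made up for this example.

```python
import numpy as np

A = np.array([[0, 0, 1, 0]])                 # shape (1, 4): one row
B = np.array([[0.2], [0.7], [0.8], [0.1]])   # shape (4, 1): one column

product = A @ B                              # @ is shorthand for np.matmul
print(product.shape)
print(product)

# A has 4 columns but B_wrong has only 3 rows, so the dot product can't line up
B_wrong = np.array([[0.2], [0.7], [0.8]])
try:
    A @ B_wrong
    shapes_matched = True
except ValueError as error:
    shapes_matched = False
    print("Shape mismatch:", error)
```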

When A and B start to grow, matrix multiplication starts to get trippy. To handle more than one row in A, take the dot product of B with each row separately. The answer will have as many rows as A does.

![](https://e2eml.school/images/transformers/matrix_mult_two_row_one_col.png)

In [10]:
np.matmul(np.array([[1, 0, 0, 0], [0, 0, 1, 0]]), np.array([0.2, 0.7, 0.8, 0.1]))

array([0.2, 0.8])

When B takes on more columns, take the dot product of each column with A and stack the results in successive columns.

![](https://e2eml.school/images/transformers/matrix_mult_one_row_two_col.png)

In [11]:
# matrix multiply the row vector [0, 0, 1, 0] by a matrix B with two columns,
# [0.2, 0.7, 0.8, 0.1] and [0.9, 0, 0.3, 0.4]; B is typed in row by row here, so we transpose it
np.dot(np.array([0, 0, 1, 0]), np.array([[0.2, 0.7, 0.8, 0.1], [0.9, 0, 0.3, 0.4]]).transpose())

array([0.8, 0.3])

Now we can extend this to multiplying any two matrices, as long as the number of columns in A is the same as the number of rows in B. The result will have the same number of rows as A and the same number of columns as B.

![](https://e2eml.school/images/transformers/matrix_mult_three_row_two_col.png)

In [12]:
a = np.array([
    [1, 0, 0, 0], 
    [0, 0, 0, 1],
    [0, 0, 1, 0]
])
b = np.array([
    [0.2, 0.7, 0.8, 0.1], 
    [0.9, 0, 0.3, 0.4]
])
np.matmul(a, b.transpose())

array([[0.2, 0.9],
       [0.1, 0.4],
       [0.8, 0.3]])
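If it helps to see the bookkeeping spelled out, matrix multiplication can be sketched as nothing more than a dot product for every pairing of a row of A with a column of B. The function name `matmul_by_dot_products` is made up for this illustration; it is far slower than `np.matmul` and is only meant to show the structure.

```python
import numpy as np

def matmul_by_dot_products(A, B):
    """Element (i, j) of the result is the dot product of row i of A and column j of B."""
    n_rows, n_inner = A.shape
    n_inner_b, n_cols = B.shape
    if n_inner != n_inner_b:
        raise ValueError("columns of A must match rows of B")
    result = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            result[i, j] = A[i, :].dot(B[:, j])
    return result

A = np.array([
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
])
# Same B as in the cell above: written row by row, then transposed to shape (4, 2)
B = np.array([
    [0.2, 0.7, 0.8, 0.1],
    [0.9, 0, 0.3, 0.4],
]).T

result = matmul_by_dot_products(A, B)
print(result)
```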

If this is the first time you're seeing this, it might feel needlessly complex, but I promise it pays off later.

## Matrix multiplication as a table lookup
Notice how matrix multiplication acts as a lookup table here. Our A matrix is made up of a stack of one-hot vectors. They have ones in the first column, the fourth column, and the third column, respectively. When we work through the matrix multiplication, this serves to pull out the first row, the fourth row, and the third row of the B matrix, in that order. This trick of using a one-hot vector to pull out a particular row of a matrix is at the core of how transformers work.
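A quick way to convince yourself of the lookup behavior is to compare the matrix product against plain row indexing. This is a small sketch; `row_indices`, `looked_up`, and `indexed` are names introduced just for this check.

```python
import numpy as np

# B with four rows and two columns, matching the example above after the transpose
B = np.array([
    [0.2, 0.9],
    [0.7, 0.0],
    [0.8, 0.3],
    [0.1, 0.4],
])

# Ones in the first, fourth, and third columns, in that order
row_indices = [0, 3, 2]
A = np.eye(4)[row_indices]

looked_up = A @ B          # lookup by matrix multiplication
indexed = B[row_indices]   # lookup by ordinary indexing

print(looked_up)
print(np.allclose(looked_up, indexed))
```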

# First order sequence model
We can set aside matrices for a minute and get back to what we really care about, sequences of words. Imagine that as we start to develop our natural language computer interface we want to handle just three possible commands:

```
Show me my directories please.
Show me my files please.
Show me my photos please.
```
Our vocabulary size is now seven:
`{directories, files, me, my, photos, please, show}`.

One useful way to represent sequences is with a transition model. For every word in the vocabulary, it shows what the next word is likely to be. If users ask about photos half the time, files 30% of the time, and directories the rest of the time, the transition model will look like this. The transition probabilities leading away from any word always add up to one.

![](https://e2eml.school/images/transformers/markov_chain.png)

This particular transition model is called a **Markov chain**, because it satisfies the [Markov property](https://en.wikipedia.org/wiki/Markov_property) that the probabilities for the next word depend only on recent words. More specifically, it is a first order Markov model because it only looks at the single most recent word. If it considered the two most recent words it would be a second order Markov model.
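One way to get a feel for the first order property is to generate a sentence by looking only at the current word at each step. The sketch below assumes the probabilities from the text (photos 0.5, files 0.3, directories 0.2); the `transitions` dictionary and `generate_sentence` function are hypothetical names for this example.

```python
import random

# Each word maps to its possible next words and their probabilities
transitions = {
    "show": {"me": 1.0},
    "me": {"my": 1.0},
    "my": {"directories": 0.2, "files": 0.3, "photos": 0.5},
    "directories": {"please": 1.0},
    "files": {"please": 1.0},
    "photos": {"please": 1.0},
}

def generate_sentence(start_word, rng, max_words=10):
    """Walk the chain, choosing each next word from the current word alone."""
    words = [start_word]
    while words[-1] in transitions and len(words) < max_words:
        followers = transitions[words[-1]]
        next_word = rng.choices(list(followers), weights=list(followers.values()))[0]
        words.append(next_word)
    return words

rng = random.Random(0)  # seeded so the run is repeatable
sentence = generate_sentence("show", rng)
print(" ".join(sentence))
```

The walk stops at please because that word has no outgoing transitions in this toy model.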

In [28]:
import pandas as pd
weighted_edge_list = [
    ('show', 'me', 1), 
    ('me', 'my', 1), 
    ('my', 'directories', 0.2),
    ('my', 'files', 0.3),
    ('my', 'photos', 0.5),
    ('directories', 'please', 1),
    ('files', 'please', 1),
    ('photos', 'please', 1),
    ('please', 'please', 0)  # zero-weight placeholder so that 'please' appears as a source
]
wel_df = pd.DataFrame(weighted_edge_list, columns=["source", "target", "weight"])
wel_df

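|   | source | target | weight |
|---|---|---|---|
| 0 | show | me | 1.0 |
| 1 | me | my | 1.0 |
| 2 | my | directories | 0.2 |
| 3 | my | files | 0.3 |
| 4 | my | photos | 0.5 |
| 5 | directories | please | 1.0 |
| 6 | files | please | 1.0 |
| 7 | photos | please | 1.0 |
| 8 | please | please | 0.0 |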


Our break from matrices is over. It turns out that Markov chains can be expressed conveniently in matrix form. Using the same indexing scheme that we used when creating one-hot vectors, each row represents one of the words in our vocabulary. So does each column. The matrix transition model treats the matrix as a lookup table. Find the row that corresponds to the word you’re interested in. The value in each column shows the probability of that column's word coming next. Because the value of each element in the matrix represents a probability, they will all fall between zero and one. Because the probabilities of all possible next words sum to one, the values in each row add up to one. The one exception in this toy example is please, which ends every sentence and has no successor, so its row is all zeros.

![](https://e2eml.school/images/transformers/transition_matrix.png)

In the transition matrix here we can see the structure of our three sentences clearly. Almost all of the transition probabilities are zero or one. There is only one place in the Markov chain where branching happens. After my, the words directories, files, or photos might appear, each with a different probability. Other than that, there’s no uncertainty about which word will come next. That certainty is reflected by having mostly ones and zeros in the transition matrix.

In [29]:
transition_matrix_df = wel_df.pivot(index="source", columns="target", values="weight").fillna(0)
transition_matrix_df

| source (rows) / target (columns) | directories | files | me | my | photos | please |
|---|---|---|---|---|---|---|
| directories | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| files | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| me | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| my | 0.2 | 0.3 | 0.0 | 0.0 | 0.5 | 0.0 |
| photos | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| please | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| show | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
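One thing to watch in the pivoted table above: show never appears as a target, so it gets no column, and the table is 7 by 6 rather than square. If you want rows and columns to share the same indexing scheme, one option is to reindex both axes against the full vocabulary. This is a sketch of that idea, not what the notebook does; `vocabulary`, `square_df`, and `row_sums` are names introduced here, and the zero-weight placeholder edge is left out because reindexing makes it unnecessary.

```python
import pandas as pd

weighted_edge_list = [
    ("show", "me", 1),
    ("me", "my", 1),
    ("my", "directories", 0.2),
    ("my", "files", 0.3),
    ("my", "photos", 0.5),
    ("directories", "please", 1),
    ("files", "please", 1),
    ("photos", "please", 1),
]
wel_df = pd.DataFrame(weighted_edge_list, columns=["source", "target", "weight"])

# Every word that appears anywhere, in sorted order
vocabulary = sorted(set(wel_df["source"]) | set(wel_df["target"]))

# Force both axes onto the full vocabulary; missing transitions become zero
square_df = (
    wel_df.pivot(index="source", columns="target", values="weight")
    .reindex(index=vocabulary, columns=vocabulary)
    .fillna(0)
)

# Each row should sum to one, except the terminal word, which sums to zero
row_sums = square_df.sum(axis=1)
print(square_df)
print(row_sums)
```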


We can revisit our trick of using matrix multiplication with a one-hot vector to pull out the transition probabilities associated with any given word. For instance, if we just wanted to isolate the probabilities of which word comes after my, we can create a one-hot vector representing the word my and multiply it by our transition matrix. This pulls out the relevant row and shows us the probability distribution of what the next word will be.

![](https://e2eml.school/images/transformers/transition_lookups.png)

In [30]:
transition_matrix = transition_matrix_df.to_numpy()
transition_matrix

array([[0. , 0. , 0. , 0. , 0. , 1. ],
       [0. , 0. , 0. , 0. , 0. , 1. ],
       [0. , 0. , 0. , 1. , 0. , 0. ],
       [0.2, 0.3, 0. , 0. , 0.5, 0. ],
       [0. , 0. , 0. , 0. , 0. , 1. ],
       [0. , 0. , 0. , 0. , 0. , 0. ],
       [0. , 0. , 1. , 0. , 0. , 0. ]])

In [33]:
word_vector = np.array([0, 0, 0, 1, 0, 0, 0])

word_vector.dot(transition_matrix)

array([0.2, 0.3, 0. , 0. , 0.5, 0. ])
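Typing the one-hot vector by hand gets tedious, so here is a sketch that builds it from the word itself and labels the result. It assumes the same 7 by 6 matrix shown above, with rows and columns in sorted order; `row_words`, `column_words`, and `next_word_probabilities` are hypothetical names.

```python
import numpy as np

# Row and column labels of the 7 x 6 matrix (there is no "show" column)
row_words = ["directories", "files", "me", "my", "photos", "please", "show"]
column_words = ["directories", "files", "me", "my", "photos", "please"]

transition_matrix = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.2, 0.3, 0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
])

def next_word_probabilities(word):
    """One-hot encode the word, multiply, and label each probability with its word."""
    one_hot = np.zeros(len(row_words))
    one_hot[row_words.index(word)] = 1
    probabilities = one_hot @ transition_matrix
    return dict(zip(column_words, probabilities))

after_my = next_word_probabilities("my")
print(after_my)
```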

# Second order sequence model
Predicting the next word based on only the current word is hard. That's like predicting the rest of a tune after being given just the first note. Our chances are a lot better if we can at least get two notes to go on.

We can see how this works in another toy language model for our computer commands. We expect that this one will only ever see two sentences, in a 40/60 proportion.

```
Check whether the battery ran down please.
Check whether the program ran please.
```
A Markov chain illustrates a first order model for this.

![](https://e2eml.school/images/transformers/markov_chain_2.png)
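As a rough sketch, the first order transition probabilities for these two sentences can be tallied directly from the sentences and their 40/60 weights. The names `weighted_sentences`, `transition_weights`, and `transitions` are made up for this example.

```python
from collections import defaultdict

# The two sentences, lowercased and without punctuation, with their proportions
weighted_sentences = [
    ("check whether the battery ran down please", 0.4),
    ("check whether the program ran please", 0.6),
]

# Add up the weight of every (current word, next word) pair
transition_weights = defaultdict(lambda: defaultdict(float))
for sentence, weight in weighted_sentences:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        transition_weights[current_word][next_word] += weight

# Normalize so the probabilities leaving each word sum to one
transitions = {}
for current_word, followers in transition_weights.items():
    total = sum(followers.values())
    transitions[current_word] = {word: value / total for word, value in followers.items()}

print(transitions["the"])
print(transitions["ran"])
```

Knowing only that the current word is ran, the model has to split its bet between down and please, even though the earlier word (battery or program) would settle the question.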