# Introduction

This notebook is the second in the Makemore series. In part 1 we implemented Makemore using a bigram model. We looked at ONE previous character and produced a probability distribution of what character is likely to come next. We did that using two different approaches and reached the same results. First, we used counts and normalized them into probabilities. Second, we used a simple neural network (i.e. linear layer) to produce those same probabilities.

The limitation of our part 1 implementation is that it looks at only ONE previous character, and with so little context the predictions of the bigram model were **not** good. If we were to consider MORE than one character when predicting the next one, the approach we used in part 1 (storing everything in a tensor) becomes unsustainable because _the number of combinations we would have to store in a tensor grows exponentially with the number of characters we take as context_. For instance, if we were to take just three characters as context when making a prediction, we would have to keep counts for $27 \times 27 \times 27 = 19683$ possible contexts. That's way too many possibilities, and the majority of them would have very few counts, as Andrej pointed out.
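To make the growth concrete, here is a minimal sketch that computes how many distinct contexts a count-based model would need to track for a few context lengths. The helper name `num_contexts` is mine and is not part of the original notebook; the vocabulary size of 27 (26 letters plus the `.` token) comes from the text above.

```python
# Size of the vocabulary: 26 lowercase letters plus the '.' boundary token.
VOCAB_SIZE = 27

def num_contexts(context_len: int, vocab_size: int = VOCAB_SIZE) -> int:
    """Number of distinct contexts of `context_len` characters (hypothetical helper)."""
    return vocab_size ** context_len

for n in range(1, 5):
    # Each context would additionally need one count per possible next character.
    print(f"context length {n}: {num_contexts(n)} contexts")
```

Every extra character of context multiplies the number of rows by 27, while the amount of training data stays the same, which is why most rows end up with very few counts.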

That's why, for this second part, we are moving away from the bigram model. This time, we are implementing a Multi-Layer Perceptron (MLP) to predict the next character. The modeling approach we are going to adopt follows the paper [Bengio et al. 2003](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf).

# Bengio et al. 2003 (MLP language model) paper walkthrough

This paper, Andrej said, was not the first to propose using MLPs to predict the next character/token in a sequence, but it was very influential and is very often cited. Since the paper is long ($19$ pages), Andrej decided to give us the gist of it, but invited us to read the entire work. This paper is what we are going to implement in this notebook.

# Re-building our training dataset

In [1]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # For making figures
%matplotlib inline

In [2]:
# Read in all the words
words = open('names.txt', 'r').read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [3]:
len(words)

32033

Now we build the vocabulary of characters and the mappings from/to integers. It is important to recall that we are building a character-level language model, unlike the paper, which describes a word-level language model.

In [4]:
# Build the vocabulary of characters and mapping to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i + 1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


We now want to set up the dataset such that we can feed training examples to the neural network easily. We are going to adapt code we wrote in the previous tutorial. To form the dataset in part 1, we stored the first character of each bigram (the context) in a tensor, `xs`, and the second character (the correct prediction) in another tensor, `ys`.

We are doing something very similar here, but this time Andrej is using MORE than ONE character as context. `X` will store the inputs of the neural network and `Y` will store the correct labels. If, out of curiosity, you uncomment the commented lines, the code prints all the training examples _per_ word in our dataset. For the output to be manageable, I suggest doing it for a couple of words and not the entire dataset 😉.

In [5]:
block_size = 3 # context length: how many characters do we take to predict the next one?
X, Y = [], []
for w in words[:5]:
  #print(w)
  context = [0] * block_size
  for ch in w + '.':
    idx = stoi[ch]
    X.append(context)
    Y.append(idx)
    #print(''.join(itos[i] for i in context), '--->', itos[idx])
    context = context[1:] + [idx] # crop and append
  
X = torch.tensor(X)
Y = torch.tensor(Y)

The shape of the dataset is as follows:

In [6]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

Our training set `X` contains $32$ training examples of $3$ characters (i.e. the context) each when **considering only $5$ words**. If we were to consider all of the words, our training set would contain $228146$ training examples, each with a context size of $3$. Also, printing the `dtype` shows that the tensors `X` and `Y` store `int64` values. This is because we are not storing the characters directly but rather the unique integers we assigned to each of them using `stoi`, defined earlier in this notebook.
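Where does the $32$ come from? Each word contributes one training example per character, plus one more for the closing `.` token. The following is a small sanity check of that arithmetic, assuming the first five names are the ones printed by `words[:8]` above; `sample_words` and `num_examples` are names I introduced for illustration.

```python
# First five names of the dataset, copied from the output of words[:8] above.
sample_words = ['emma', 'olivia', 'ava', 'isabella', 'sophia']

# One example per character, plus one for the '.' that ends the word.
examples_per_word = {w: len(w) + 1 for w in sample_words}
num_examples = sum(examples_per_word.values())

print(examples_per_word)
print(num_examples)
```

Running the same sum over the full `words` list should give the total number of training examples for the whole dataset, independently of `block_size`.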

Feel free to print out `X` or `Y` and see what's inside.

In [17]:
#X

torch.Size([32, 3])

In [8]:
#Y

# Implementing the embedding lookup table

We have 27 possible characters and we are going to embed each of them in a lower-dimensional space. In the paper, they have $17,000$ total words and embed them all into a space as small as 30 dimensions. Since we have just 27 characters, as mentioned earlier, Andrej suggested we start with a 2D embedding space. That means each of the 27 characters will be associated with a 2D embedding vector. As a result, our embedding matrix will have the shape $(27 \times 2)$. In that sense, it is reasonable to look at the embedding matrix as a **lookup table**.

_REMEMBER: Those embedding vectors are (initially) randomly generated._

In [9]:
C = torch.randn((27, 2))

In [10]:
C

tensor([[-0.3474, -0.1840],
        [ 0.4755, -0.5965],
        [ 1.9120, -1.0360],
        [ 1.9049,  2.7628],
        [ 1.7715,  0.3046],
        [ 0.7580,  1.2531],
        [ 0.1151, -0.2700],
        [ 0.5087, -0.9434],
        [ 1.3534,  0.5546],
        [ 0.7551, -0.9047],
        [-1.3486,  0.1972],
        [ 0.6857, -0.6411],
        [ 0.7509, -1.0326],
        [ 0.3202,  0.7973],
        [-0.8882,  1.4611],
        [-0.3065, -0.2079],
        [-0.4197,  0.7667],
        [-0.2535, -1.2449],
        [ 0.1487,  0.5806],
        [ 0.6719, -0.6666],
        [-0.2297, -1.5458],
        [-2.2268,  0.5456],
        [-1.7653, -0.4148],
        [ 0.5760,  1.4934],
        [ 1.4161, -0.1763],
        [ 1.7875, -0.0120],
        [-1.0754,  0.6780]])

There are multiple ways to index into the embedding matrix `C`. We can do it by directly using the row index or, more interestingly, we can select a specific row of the matrix by multiplying a one-hot encoded vector with it. We illustrated this in the "NOTE 1" section of the `part-1` notebook. For simplicity, Andrej decided to use integers for indexing. With PyTorch, it is possible to do single-dimension indexing, meaning we can use a single number to select a row in a tensor or use a 1-D tensor to select multiple rows at once. Andrej showed us that it is also possible to select rows in a tensor using a 2-D tensor.
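As a quick illustration of these indexing options, the sketch below compares plain integer indexing with the one-hot multiplication and then selects several rows with a 1-D tensor. The seed and the variable names (`row_by_index`, `row_by_matmul`, `rows`) are my own choices, so the values will differ from the `C` printed above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)          # arbitrary seed, only for reproducibility of this sketch
C = torch.randn((27, 2))      # same shape as the embedding matrix in the notebook

# Option 1: index the row directly with an integer.
row_by_index = C[5]

# Option 2: multiply a one-hot vector with C; this picks out the same row.
one_hot = F.one_hot(torch.tensor(5), num_classes=27).float()
row_by_matmul = one_hot @ C   # (27,) @ (27, 2) -> (2,)

print(torch.allclose(row_by_index, row_by_matmul))

# Option 3: a 1-D tensor of indices selects several rows at once.
rows = C[torch.tensor([5, 6, 7])]
print(rows.shape)
```

Integer indexing is the cheaper of the two since it skips the multiplication entirely, which is presumably why it is the one used in the rest of the notebook.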

Using 2-D indexing, we can simultaneously retrieve the embedding vectors of all the integers in the training set `X` like so:

In [11]:
C[X]

tensor([[[-0.3474, -0.1840],
         [-0.3474, -0.1840],
         [-0.3474, -0.1840]],

        [[-0.3474, -0.1840],
         [-0.3474, -0.1840],
         [ 0.7580,  1.2531]],

        [[-0.3474, -0.1840],
         [ 0.7580,  1.2531],
         [ 0.3202,  0.7973]],

        [[ 0.7580,  1.2531],
         [ 0.3202,  0.7973],
         [ 0.3202,  0.7973]],

        [[ 0.3202,  0.7973],
         [ 0.3202,  0.7973],
         [ 0.4755, -0.5965]],

        [[-0.3474, -0.1840],
         [-0.3474, -0.1840],
         [-0.3474, -0.1840]],

        [[-0.3474, -0.1840],
         [-0.3474, -0.1840],
         [-0.3065, -0.2079]],

        [[-0.3474, -0.1840],
         [-0.3065, -0.2079],
         [ 0.7509, -1.0326]],

        [[-0.3065, -0.2079],
         [ 0.7509, -1.0326],
         [ 0.7551, -0.9047]],

        [[ 0.7509, -1.0326],
         [ 0.7551, -0.9047],
         [-1.7653, -0.4148]],

        [[ 0.7551, -0.9047],
         [-1.7653, -0.4148],
         [ 0.7551, -0.9047]],

        [[-1.7653, -0

In [14]:
C[X].shape

torch.Size([32, 3, 2])

Why do we get a 3D tensor back? Remember that each character is represented by a number in tensor `X`. For each number in `X`, the corresponding embedding is returned. Since we have three numbers per row in `X`, the result is printed as a _series of $(3 \times 2)$ matrices_, one per training example. Makes sense? That's my way of looking at it.

Andrej, for instance, said that for each of the elements in the $(32 \times 3)$ matrix (i.e. the training set `X`), we retrieved the corresponding embedding vector. Makes sense?
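Both views can be checked on a tiny example. The sketch below uses only the first four contexts of "emma" (with `.` = 0, `e` = 5, `m` = 13) instead of the full `X`, and a freshly seeded `C`, so the numbers are not those printed above; `X_small` and `emb_small` are names I made up for this illustration.

```python
import torch

torch.manual_seed(0)                 # arbitrary seed for this sketch
C = torch.randn((27, 2))

# First four contexts of 'emma': '...', '..e', '.em', 'emm'
X_small = torch.tensor([[0, 0, 0],
                        [0, 0, 5],
                        [0, 5, 13],
                        [5, 13, 13]])

emb_small = C[X_small]
print(emb_small.shape)               # one (3 x 2) matrix per training example

# The entry at position (i, j) is the embedding of the integer X_small[i, j].
i, j = 2, 1
print(torch.equal(emb_small[i, j], C[X_small[i, j]]))
```

In other words, `C[X]` keeps the shape of `X` and appends the embedding dimension at the end.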

Besides, the following drawing helped solidify my understanding:

[IMAGE HERE]

With the embedding of all the integers in `X` selected, Andrej stored them in the `embed` variable like so:

In [18]:
embed = C[X]
embed.shape

torch.Size([32, 3, 2])

# Implementing the hidden layer + internals of torch.Tensor: storage, views

In this section of the video, we implement the hidden layer of the neural network. Andrej decided to call the weights of this hidden layer `W1`, so we are going to follow him.