# Building makemore: Multi-layer perceptron

https://www.youtube.com/watch?v=TCH_1BHY58I

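Last time: two models, the first using counts and normalizing them to get a probability distribution over the next character in a sequence. Problem: the count table grows exponentially with the number of context characters. With the stop character there are 27 symbols, so a context of $n$ characters has $27^n$ possible values.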
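
To make the growth concrete, here is a quick sketch that tabulates $27^n$ for a few context lengths. The variable names are only illustrative.

```python
# Number of distinct contexts a count-based model would need a row for,
# assuming a 27-symbol vocabulary (a-z plus the '.' stop character).
vocab_size = 27
table_rows = {n: vocab_size ** n for n in range(1, 6)}

for n, rows in table_rows.items():
    print(f"context length {n}: {rows:,} possible contexts")
```

Already at a context of 3 characters there are 19,683 possible contexts, far too many to estimate reliably from counts on about 32,000 names.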

# Bengio et al. approach

Better approach: multi-layer perceptron to predict future characters. Scales much better. See: https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

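The paper has a vocabulary of 17,000 words and embeds each word as a feature vector in a much smaller space, e.g. 30 dimensions (effectively dimensionality reduction compared to a 17,000-dimensional one-hot encoding). Approach otherwise is as seen in the previous lecture.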

Intuitively, "walking" and "running" should be close to one another in the model space, so if the model's never seen "The dog was running in the ___" but it's seen "The cat was walking in the bedroom" it'll consider bedroom a likely word to fill in the blank. (Sounds like word embeddings.)

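Network architecture: an embedding lookup for each input word (all inputs share one lookup table, e.g. 30 dimensions per word), a tanh hidden layer whose size is a hyperparameter, and an output layer with 17,000 logits, one per word in the vocabulary. Softmax turns the logits into a probability distribution over the next word, which is sampled from when generating a sequence.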

# Network architecture

In [1]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline

In [2]:
# Import data
words = open('makemore/names.txt', 'r').read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [3]:
len(words)

32033

In [5]:
# Build vocabulary - same approach as last lecture, identical code
# Put special . as 0 element, and shift the alphabet over by 1
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print(itos)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


## Lookup table

In [23]:
# Build dataset - new approach using block_size to use the previous n characters to predict the next
block_size = 3 # context length: how many characters do we take to predict the next one?
X, Y = [], [] # input, and labels, to neural network

for w in words[:5]: # Test on just first few names
    print(w)
    context = [0] * block_size
    for ch in w + '.':
        ix = stoi[ch]
        X.append(context)
        Y.append(ix)
        print(''.join(itos[i] for i in context), '-->', itos[ix])
        context = context[1:] + [ix] # crop and append

# Save results
X = torch.tensor(X)
Y = torch.tensor(Y)

emma
... --> e
..e --> m
.em --> m
emm --> a
mma --> .
olivia
... --> o
..o --> l
.ol --> i
oli --> v
liv --> i
ivi --> a
via --> .
ava
... --> a
..a --> v
.av --> a
ava --> .
isabella
... --> i
..i --> s
.is --> a
isa --> b
sab --> e
abe --> l
bel --> l
ell --> a
lla --> .
sophia
... --> s
..s --> o
.so --> p
sop --> h
oph --> i
phi --> a
hia --> .


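The name "emma" generates 5 examples: one for each of its 4 letters, plus one for the stop character. Note that initially there is no previous character, so the context is padded with `.`; it then becomes `..e`, `.em`, `emm`, `mma`. The context is a rolling window of `block_size` characters.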

In [24]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([32, 3]), torch.int64, torch.Size([32]), torch.int64)

The paper had 17,000 words, while we have 27 characters (a-z and .), so we need a lookup table of shape (27, 2): one 2-dimensional embedding per character.

Conceptually, we use a one-hot encoded vector (indicator) to pull out the desired row from the lookup matrix as before. But this time we index into the matrix directly, which gives the same result and is faster.

In [25]:
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator = g)
F.one_hot(torch.tensor(5), num_classes = 27).float() @ C

tensor([-0.4713,  0.7868])

Embedding a single integer: easy, just return e.g. `C[5]`. Works the same way for tensors.

In [26]:
C[torch.tensor([5, 6, 7, 7, 7])]

tensor([[-0.4713,  0.7868],
        [-0.3284, -0.4330],
        [ 1.3729,  2.9334],
        [ 1.3729,  2.9334],
        [ 1.3729,  2.9334]])
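
The equivalence between the one-hot product and direct indexing can be checked with a small sketch. It uses its own arbitrary seed, so the values differ from the cells above; only the comparison matters.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)   # arbitrary seed, not the one used in these notes
C = torch.randn((27, 2), generator=g)  # lookup table: 27 characters, 2-dim embeddings
idx = torch.tensor([5, 6, 7, 7, 7])

# One-hot rows select rows of C through a matrix multiply: (5, 27) @ (27, 2) -> (5, 2)
via_one_hot = F.one_hot(idx, num_classes=27).float() @ C
# Direct integer indexing picks the same rows without building the one-hot matrix
via_index = C[idx]

print(torch.allclose(via_one_hot, via_index))
```

Indexing skips both the 27-wide one-hot tensor and the multiplication by zeros, which is why it is the preferred form.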

In [27]:
C[X].shape

torch.Size([32, 3, 2])

In [28]:
C[X][4,2]

tensor([-0.0274, -1.1008])

In [29]:
emb = C[X]
emb.shape

torch.Size([32, 3, 2])

## Hidden layer

Weights are initialized randomly as usual. The layer takes 3 * 2 = 6 inputs (block size of 3 times 2-dimensional embeddings). The number of neurons is up to us; 100 is used here.

In [31]:
W1 = torch.randn((6, 100), generator = g)
b1 = torch.randn(100, generator = g)

`emb @ W1 + b1` doesn't work because `emb` has shape (32, 3, 2) while `W1` has shape (6, 100). The trailing (3, 2) dimensions need to be flattened into 6 before the matrix multiplication with `W1`.

In [34]:
# Plan: use torch.cat to concatenate the embeddings for each input. First check the shape of each slice.
emb[:, 0, :].shape, emb[:, 1, :].shape, emb[:, 2, :].shape

(torch.Size([32, 2]), torch.Size([32, 2]), torch.Size([32, 2]))

In [40]:
# Concatenate along dimension 1 (the columns)
torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1).shape

torch.Size([32, 6])

This has the correct dimensions, but it hard-codes a block size of 3. To generalize to any block size, use `torch.unbind`, which returns a tuple of slices of the tensor along a given dimension.

In [42]:
torch.cat(torch.unbind(emb, 1), 1).shape

torch.Size([32, 6])

This can be done even more efficiently by reinterpreting the shape of the tensor directly with `view`, which avoids the new tensor that concatenation creates.

In [43]:
a = torch.arange(18)
a

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])

In [44]:
a.view(9, 2)

tensor([[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [ 8,  9],
        [10, 11],
        [12, 13],
        [14, 15],
        [16, 17]])

In [45]:
a.view(3, 3, 2)

tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])

As long as the arguments multiply to 18 in this case, the sequence can be represented with the given dimensions without any additional memory being used. Very efficient!
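
A small sketch of the "no additional memory" point: a view shares the original tensor's storage, so both start at the same memory address, and writing through one is visible in the other.

```python
import torch

a = torch.arange(18)
b = a.view(3, 3, 2)   # same underlying storage, only the shape metadata differs

same_memory = a.data_ptr() == b.data_ptr()
print(same_memory)

b[0, 0, 0] = 100      # writing through the view...
print(a[0])           # ...changes the original tensor as well
```

One caveat: `view` only works when the requested shape is compatible with the tensor's existing memory layout. `reshape` is the more permissive alternative, at the cost of possibly copying.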

In [46]:
# Same as concatenation before
emb.view(32, 6)

tensor([[ 1.5674, -0.2373,  1.5674, -0.2373,  1.5674, -0.2373],
        [ 1.5674, -0.2373,  1.5674, -0.2373, -0.4713,  0.7868],
        [ 1.5674, -0.2373, -0.4713,  0.7868,  2.4448, -0.6701],
        [-0.4713,  0.7868,  2.4448, -0.6701,  2.4448, -0.6701],
        [ 2.4448, -0.6701,  2.4448, -0.6701, -0.0274, -1.1008],
        [ 1.5674, -0.2373,  1.5674, -0.2373,  1.5674, -0.2373],
        [ 1.5674, -0.2373,  1.5674, -0.2373, -1.0725,  0.7276],
        [ 1.5674, -0.2373, -1.0725,  0.7276, -0.0707,  2.4968],
        [-1.0725,  0.7276, -0.0707,  2.4968,  0.6772, -0.8404],
        [-0.0707,  2.4968,  0.6772, -0.8404, -0.1158, -1.2078],
        [ 0.6772, -0.8404, -0.1158, -1.2078,  0.6772, -0.8404],
        [-0.1158, -1.2078,  0.6772, -0.8404, -0.0274, -1.1008],
        [ 1.5674, -0.2373,  1.5674, -0.2373,  1.5674, -0.2373],
        [ 1.5674, -0.2373,  1.5674, -0.2373, -0.0274, -1.1008],
        [ 1.5674, -0.2373, -0.0274, -1.1008, -0.1158, -1.2078],
        [-0.0274, -1.1008, -0.1158, -1.2

In [47]:
emb.view(32, 6) @ W1 + b1

tensor([[-2.5071,  3.5656, -2.4451,  ...,  2.5479, -0.8539,  3.7077],
        [-4.2930,  6.2880, -5.4224,  ...,  2.6233, -1.7728,  3.3300],
        [-2.5582, -0.1837,  1.6790,  ...,  2.6169, -0.0838,  2.6059],
        ...,
        [ 1.8194, -4.3532,  5.9360,  ...,  2.4439,  0.0743, -3.7715],
        [ 0.7255,  6.5159,  2.6143,  ...,  2.2033, -0.4564,  1.9066],
        [-3.7433,  3.3581, -2.3161,  ...,  0.6154, -1.1615,  4.6473]])

In [49]:
# Even better, have pytorch infer the dimensions
emb.view(-1, 6) @ W1 + b1

tensor([[-2.5071,  3.5656, -2.4451,  ...,  2.5479, -0.8539,  3.7077],
        [-4.2930,  6.2880, -5.4224,  ...,  2.6233, -1.7728,  3.3300],
        [-2.5582, -0.1837,  1.6790,  ...,  2.6169, -0.0838,  2.6059],
        ...,
        [ 1.8194, -4.3532,  5.9360,  ...,  2.4439,  0.0743, -3.7715],
        [ 0.7255,  6.5159,  2.6143,  ...,  2.2033, -0.4564,  1.9066],
        [-3.7433,  3.3581, -2.3161,  ...,  0.6154, -1.1615,  4.6473]])
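
The hard-coded 6 can also be avoided. The sketch below uses a random stand-in for `C[X]` and assumes the same block size and embedding size as these notes; it flattens everything after the batch dimension and confirms the result matches the concatenation approach.

```python
import torch

block_size, emb_dim = 3, 2                   # values used in these notes
emb = torch.randn(32, block_size, emb_dim)   # random stand-in for C[X]

# Keep the batch dimension, let PyTorch infer block_size * emb_dim
flat_view = emb.view(emb.shape[0], -1)
# Same values built by concatenating the per-position slices
flat_cat = torch.cat(torch.unbind(emb, 1), 1)

print(flat_view.shape, torch.equal(flat_view, flat_cat))
```

Written this way, changing `block_size` or the embedding size needs no edit to the flattening step (though `W1` still has to be created with a matching number of rows).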

In [51]:
h = torch.tanh(emb.view(-1, 6) @ W1 + b1)
h.shape

torch.Size([32, 100])

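Note the `+ b1`, which is the bias term. The product has shape (32, 100) and `b1` has shape (100). Broadcasting aligns shapes from the right, so `b1` is treated as (1, 100) and then copied down across all 32 rows, which means the same bias vector is added to every row. This is what we want, but it's always worth checking.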
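
A minimal sketch of that check, with random stand-ins for the real tensors: the broadcast sum should equal the sum with `b1` explicitly expanded to one copy per row.

```python
import torch

pre = torch.randn(32, 100)   # stand-in for emb.view(-1, 6) @ W1
b1 = torch.randn(100)

# Broadcasting: (32, 100) + (100,) is treated as (32, 100) + (1, 100)
out = pre + b1
# The same thing spelled out: add a leading dimension, then repeat across 32 rows
explicit = pre + b1.unsqueeze(0).expand(32, 100)

print(out.shape, torch.equal(out, explicit))
```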

## Output layer

Input is 100 neurons, output is 27 possible characters.

In [53]:
W2 = torch.randn((100, 27), generator = g)
b2 = torch.randn(27, generator = g)

In [54]:
logits = h @ W2 + b2
logits.shape

torch.Size([32, 27])

In [58]:
counts = logits.exp()
prob = counts / counts.sum(1, keepdims = True) # Sum along dimension 1 (across the 27 classes in each row)
prob.shape

torch.Size([32, 27])

In [60]:
prob[0].sum()

tensor(1.)

Rows sum to 1.

Still need to index into `prob` to pull out, for each row, the probability the model assigned to the correct next character given by `Y`.

In [61]:
prob[torch.arange(32), Y]

tensor([6.2200e-08, 1.1838e-04, 2.8906e-09, 2.6372e-12, 1.3740e-07, 4.0513e-22,
        3.9412e-06, 1.7937e-07, 9.5025e-11, 2.0370e-06, 3.2776e-09, 6.3124e-07,
        6.4844e-16, 3.7328e-06, 1.1113e-05, 5.4767e-07, 1.1513e-14, 3.1438e-14,
        7.4242e-09, 1.7710e-09, 5.0028e-01, 7.0436e-09, 1.7890e-11, 3.5921e-14,
        3.4840e-15, 8.1048e-17, 1.3126e-16, 1.1197e-07, 3.1022e-11, 1.4176e-08,
        5.4811e-09, 5.7530e-04])
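
How the paired indexing works is easier to see on a toy tensor. In this sketch the probabilities and labels are made up: `prob[torch.arange(n), Y]` pairs row `i` with column `Y[i]`, exactly like an explicit loop would.

```python
import torch

# Made-up probabilities for 3 examples over 3 classes (each row sums to 1)
prob = torch.tensor([[0.10, 0.70, 0.20],
                     [0.30, 0.30, 0.40],
                     [0.90, 0.05, 0.05]])
Y = torch.tensor([1, 2, 0])   # hypothetical correct class for each row

picked = prob[torch.arange(3), Y]                        # row i, column Y[i]
looped = torch.stack([prob[i, Y[i]] for i in range(3)])  # same thing, one row at a time

print(picked)
print(torch.equal(picked, looped))
```

Here `picked` holds 0.7, 0.4, and 0.9: the probability each row gave to its own label.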

Negative log likelihood is defined as before, and will be minimized.

In [63]:
loss = -prob[torch.arange(32), Y].log().mean()
loss

tensor(21.1617)

## Cleaned up code

In [64]:
X.shape, Y.shape

(torch.Size([32, 3]), torch.Size([32]))

In [77]:
g = torch.Generator().manual_seed(2147483647)
C = torch.randn((27, 2), generator = g)
W1 = torch.randn((6, 100), generator = g)
b1 = torch.randn(100, generator = g)
W2 = torch.randn((100, 27), generator = g)
b2 = torch.randn(27, generator = g)
parameters = [C, W1, b1, W2, b2]

In [78]:
sum(p.nelement() for p in parameters) # Total parameters

3481

In [79]:
# Should be the same as when we multiply out the tensor sizes
27 * 2 + 6 * 100 + 100 + 100 * 27 + 27

3481

In [80]:
# Loss function with current parameters
emb = C[X] # (32, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
logits = h @ W2 + b2 # (32, 27)
counts = logits.exp()
prob = counts / counts.sum(1, keepdims = True)
loss = -prob[torch.arange(32), Y].log().mean()
loss

tensor(17.7697)

Can be further simplified with the `F.cross_entropy` function, which takes the place of the `counts =`, `prob =`, and `loss =` lines.

In [81]:
# Loss function with current parameters
emb = C[X] # (32, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
logits = h @ W2 + b2 # (32, 27)
#counts = logits.exp()
#prob = counts / counts.sum(1, keepdims = True)
#loss = -prob[torch.arange(32), Y].log().mean()
loss = F.cross_entropy(logits, Y)
loss # Should be same as above

tensor(17.7697)

Prefer built-in operations like `F.cross_entropy`: PyTorch doesn't have to create the intermediate tensors (`counts`, `prob`), so the forward pass uses less memory. The backward pass is also more efficient, because the gradient of the combined operation has a much simpler expression than backpropagating through each step separately.

Additionally, `cross_entropy` can avoid various numerical issues. Consider what happens with extreme values when you roll your own implementation...

In [82]:
logits = torch.tensor([-100, -3, 0, 5])
counts = logits.exp()
counts / counts.sum()

tensor([0.0000e+00, 3.3311e-04, 6.6906e-03, 9.9298e-01])

This works fine with very negative values (they just become 0), but large positive values lead to numerical errors...

In [83]:
logits = torch.tensor([-100, -3, 0, 100]) # But large numbers lead to numerical errors
counts = logits.exp() # ...due to exp() which blows up with large positive numbers
counts / counts.sum()

tensor([0., 0., 0., nan])

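Internally, PyTorch subtracts the maximum logit from the entire vector so that no value is greater than 0. Softmax is unchanged by adding a constant to every logit, so the result is the same, and `exp()` can no longer overflow.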
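
The max-subtraction trick can be reproduced by hand. This is a sketch of the idea only, not PyTorch's actual internal code; the result is compared against `F.softmax` as a sanity check.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([-100.0, -3.0, 0.0, 100.0])

# Naive version: exp(100) overflows float32 to inf, and inf / inf is nan
naive = logits.exp() / logits.exp().sum()

# Stable version: shift so the largest logit is 0 and all others are negative
shifted = logits - logits.max()
stable = shifted.exp() / shifted.exp().sum()

print(naive)
print(stable)
print(torch.allclose(stable, F.softmax(logits, dim=0)))
```

The shift cancels out because the same factor $e^{-\max}$ appears in both the numerator and the denominator.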

# Training