# Andrej Karpathy's /makemore

#### Reference
https://www.youtube.com/watch?v=TCH_1BHY58I

https://github.com/karpathy/makemore

With the bigram char-level model, we only looked at two chars at a time, which gave us a 27x27 table. If we push this approach further to lower the loss and improve the model, the only avenue is to add more context, i.e. a 27x27x27 table for three chars. However, the table grows exponentially with the context length, so the amount of data and the number of parameters we need quickly explode.
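To make the blow-up concrete, here is a minimal sketch that counts the entries of such a table for growing context lengths. The helper name `count_table_entries` is made up for illustration; the only assumption is the 27-symbol vocabulary (26 letters plus '.') used in these notes.

```python
VOCAB_SIZE = 27  # 26 letters plus the '.' boundary token

def count_table_entries(context_len: int, vocab_size: int = VOCAB_SIZE) -> int:
    """Entries in a table indexed by `context_len` context chars plus 1 target char."""
    return vocab_size ** (context_len + 1)

# context_len=1 is the bigram case (27x27), context_len=2 is 27x27x27, and so on
for context_len in (1, 2, 3, 4):
    print(context_len, count_table_entries(context_len))
```

Each extra char of context multiplies the table size by 27, which is why a different kind of model is needed.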

Therefore, we need to explore a better model.

## Multi Layer Perceptron (MLP)
#### Bengio et al. 2003

We will build another char-level model that predicts the next char, although the paper itself works with word predictions.

The proposed approach is to take a vocabulary of 'w' words and associate each word with a feature vector of 'm' dimensions. In other words, each word is embedded in an 'm'-dimensional feature space. Initially these embeddings are random, but later we'll tune them using backpropagation.

To picture this approach, think about words that are similar or synonymous. After training, they will end up in the same part of the space, while words that are different will be far apart.

The modeling approach is similar to the NN approach for the bigram model. They use a multi-layer NN to predict the next word, given the previous words. To train the NN, they ```maximize the log-likelihood of the training data```.
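As a rough sketch of that objective (the symbols $T$, $n$, and $\theta$ are notation introduced here, not taken from these notes, and any regularization term is ignored): with $w_t$ the word at position $t$, a context of $n-1$ previous words, and $\theta$ all the trainable parameters, the average log-likelihood over $T$ training positions is

$$
L(\theta) = \frac{1}{T} \sum_{t} \log P(w_t \mid w_{t-1}, \dots, w_{t-n+1}; \theta)
$$

Maximizing $L(\theta)$ is the same as minimizing the average negative log-likelihood $-L(\theta)$, which is the loss we minimize during training.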

Let's look at an ```example``` of this approach. Assume the training data does not contain the sentence "A dog was running in a room". At test time we give the model "A dog was running in a ..." and expect it to fill in the blank. Since the model has never seen this exact sentence, we call it ```out of distribution```. However, the MLP doesn't need to have seen the exact sentence to predict 'room' for the blank. It might have seen "The dog was running in a room" and, during training, placed the embeddings of 'The' and 'A' near each other in the space. So when we ask it to fill the blank for "A dog was running in a ...", it can generalize from "The dog was running in a room". This is called ```knowledge transfer```.

Let's look at the ```architecture``` of this approach. 

Assume the NN takes the 3 previous words as input and outputs the fourth word. Each incoming word goes through a look-up table that returns the corresponding embedding (an 'm'-dimensional feature vector) for that word. So there will be $3 \times m$ neurons holding the 3 words.

Then we need to build a hidden layer. Its size is a ```hyper-parameter```, meaning that we have to find the right size by trial and error. All the input neurons are fully connected to the hidden layer, and a ```tanh``` function is applied for non-linearity.

The output layer is a huge one, because its number of neurons equals $w$, the number of words in the vocabulary. All the neurons in the hidden layer are connected to all the output neurons. That's why most of the params sit between these two layers, and why this part is computationally expensive. On top of the output layer we have ```softmax``` (exponentiate the logits and normalize, so that they sum up to 1). This way, we get a proper probability distribution for the next word in the sequence.
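To see where the parameters go, here is a small counting sketch. The function `mlp_param_count` and the word-level sizes (10,000 words, 30-dimensional embeddings) are hypothetical values chosen for illustration, and the count assumes only the layers described above: look-up table, one hidden layer, and the output layer.

```python
def mlp_param_count(vocab_size: int, emb_dim: int, context_len: int, hidden_size: int) -> dict:
    """Parameter count per component of the MLP described above."""
    return {
        "embedding_table": vocab_size * emb_dim,
        "hidden_weights": context_len * emb_dim * hidden_size,
        "hidden_bias": hidden_size,
        "output_weights": hidden_size * vocab_size,
        "output_bias": vocab_size,
    }

# char-level sizes used later in these notes: 27 chars, 2-d embeddings, 3-char context, 100 hidden units
char_counts = mlp_param_count(vocab_size=27, emb_dim=2, context_len=3, hidden_size=100)
char_total = sum(char_counts.values())

# hypothetical word-level sizes, only to show how the output layer dominates
word_counts = mlp_param_count(vocab_size=10_000, emb_dim=30, context_len=3, hidden_size=100)

print(char_counts, char_total)
print(word_counts)
```

With a large vocabulary, the `output_weights` entry (hidden size times vocabulary size) is by far the largest term, which matches the remark that the output layer is the expensive part.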

During training, because we have xs and ys, we get the probability the NN assigns to the correct y for each x, and we minimize the NN's loss by adjusting the parameters. The gradients for this optimization are again computed with ```backpropagation```.

In [5]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
import os

# names.txt lives in the makemore checkout under the parent of the current working directory
current_directory = os.getcwd()
parent_directory = os.path.dirname(current_directory)
file_path = os.path.join(parent_directory, 'opensource/makemore', 'names.txt')

words = open(file_path, 'r').read().splitlines()

words[:10]

['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']

In [7]:
# build the vocabulary of chars and mappings to/from integers
chars = sorted(list(set(''.join(words))))
s2i = {s:i+1 for i,s in enumerate(chars)}
s2i['.'] = 0
i2s = {i:s for s,i in s2i.items()}
print(i2s)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


#### build the dataset

In [9]:
# build the dataset

block_size = 3 # context length: how many chars we use to predict the next one (the 4th)
X, Y = [], []
for w in words[:5]: # the examples we can generate from the first 5 words
    print(w)
    context = [0] * block_size # start from an all-'.' context (index 0), so the first chars of a word have padding to condition on
    for ch in w + '.': # the trailing '.' marks the end of the word
        ix = s2i[ch]
        X.append(context)
        Y.append(ix)
        print(''.join(i2s[i] for i in context), '--->', i2s[ix])
        context = context[1:] + [ix] # crop and append
X = torch.tensor(X)
Y = torch.tensor(Y)

emma
... ---> e
..e ---> m
.em ---> m
emm ---> a
mma ---> .
olivia
... ---> o
..o ---> l
.ol ---> i
oli ---> v
liv ---> i
ivi ---> a
via ---> .
ava
... ---> a
..a ---> v
.av ---> a
ava ---> .
isabella
... ---> i
..i ---> s
.is ---> a
isa ---> b
sab ---> e
abe ---> l
bel ---> l
ell ---> a
lla ---> .
sophia
... ---> s
..s ---> o
.so ---> p
sop ---> h
oph ---> i
phi ---> a
hia ---> .


In [10]:
X

tensor([[ 0,  0,  0],
        [ 0,  0,  5],
        [ 0,  5, 13],
        [ 5, 13, 13],
        [13, 13,  1],
        [ 0,  0,  0],
        [ 0,  0, 15],
        [ 0, 15, 12],
        [15, 12,  9],
        [12,  9, 22],
        [ 9, 22,  9],
        [22,  9,  1],
        [ 0,  0,  0],
        [ 0,  0,  1],
        [ 0,  1, 22],
        [ 1, 22,  1],
        [ 0,  0,  0],
        [ 0,  0,  9],
        [ 0,  9, 19],
        [ 9, 19,  1],
        [19,  1,  2],
        [ 1,  2,  5],
        [ 2,  5, 12],
        [ 5, 12, 12],
        [12, 12,  1],
        [ 0,  0,  0],
        [ 0,  0, 19],
        [ 0, 19, 15],
        [19, 15, 16],
        [15, 16,  8],
        [16,  8,  9],
        [ 8,  9,  1]])

In [11]:
Y

tensor([ 5, 13, 13,  1,  0, 15, 12,  9, 22,  9,  1,  0,  1, 22,  1,  0,  9, 19,
         1,  2,  5, 12, 12,  1,  0, 19, 15, 16,  8,  9,  1,  0])

#### Build The Embeddings

The paper used a vocabulary of about 17,000 words, each embedded in a 30-dimensional space. We have only 27 chars, so we'll start with 2-dimensional embeddings.

In [12]:
C = torch.randn((27, 2))
C[5]

tensor([2.3131, 0.0140])

In [13]:
# retrieve embeddings with a list of lookups
C[[5,6,8]]

tensor([[ 2.3131,  0.0140],
        [-0.6978,  0.8964],
        [-0.5446,  1.6925]])
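Indexing into ```C``` can also be read as the first layer of the NN: multiplying a one-hot vector by ```C``` selects the same row. The sketch below checks this on a freshly sampled table (the seed and the values are arbitrary, so the numbers differ from the outputs above).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
C = torch.randn((27, 2))  # same shape as the look-up table above, different values

# one-hot encode index 5 as a float vector of length 27
one_hot = F.one_hot(torch.tensor(5), num_classes=27).float()

# the matmul zeroes out every row except row 5, which is what C[5] returns
via_matmul = one_hot @ C
via_index = C[5]
print(torch.allclose(via_matmul, via_index))
```

Plain indexing is used in the rest of the notes because it avoids building the one-hot vectors.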

In [15]:
# indexing with the whole (32, 3) tensor X also works: every integer in X is replaced by its 2-d embedding
emb = C[X]
emb

tensor([[[-1.4846, -0.9307],
         [-1.4846, -0.9307],
         [-1.4846, -0.9307]],

        [[-1.4846, -0.9307],
         [-1.4846, -0.9307],
         [ 2.3131,  0.0140]],

        [[-1.4846, -0.9307],
         [ 2.3131,  0.0140],
         [-0.4677, -0.6155]],

        [[ 2.3131,  0.0140],
         [-0.4677, -0.6155],
         [-0.4677, -0.6155]],

        [[-0.4677, -0.6155],
         [-0.4677, -0.6155],
         [ 1.7267,  0.5913]],

        [[-1.4846, -0.9307],
         [-1.4846, -0.9307],
         [-1.4846, -0.9307]],

        [[-1.4846, -0.9307],
         [-1.4846, -0.9307],
         [ 0.9399,  1.0661]],

        [[-1.4846, -0.9307],
         [ 0.9399,  1.0661],
         [-0.9957, -0.7736]],

        [[ 0.9399,  1.0661],
         [-0.9957, -0.7736],
         [-0.7113, -0.3796]],

        [[-0.9957, -0.7736],
         [-0.7113, -0.3796],
         [-0.1170, -0.3380]],

        [[-0.7113, -0.3796],
         [-0.1170, -0.3380],
         [-0.7113, -0.3796]],

        ... (output truncated)

#### hidden layer

In [16]:
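hidden_layer_hyperparameter_size = 100 # number of neurons in the hidden layer (a hyper-parameter)
num_of_words = 3 # context length, same value as block_size (chars in our case)
num_of_embeddings = 2 # embedding dimension of each char
num_of_inputs = num_of_words * num_of_embeddings # 3 * 2 = 6 inputs per example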

w1 = torch.randn((num_of_inputs, hidden_layer_hyperparameter_size))
b1 = torch.randn((hidden_layer_hyperparameter_size))



In [20]:
X.shape

torch.Size([32, 3])

In [21]:
C.shape

torch.Size([27, 2])

In [22]:
emb.shape

torch.Size([32, 3, 2])

In [17]:
w1.shape

torch.Size([6, 100])

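We want to set up the tensors' shapes in such a way that ```emb @ w1 + b1``` works. As it stands, ```emb``` has shape (32, 3, 2) and ```w1``` has shape (6, 100), so the matrix multiplication fails; ```emb``` first has to be reshaped to (32, 6).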

For background on how PyTorch stores tensors, and why reshaping with ```view``` is cheap, see http://blog.ezyang.com/2019/05/pytorch-internals/


In [25]:
x_size = emb.shape[0] # or use -1 for pytorch to figure it out
emb.view(x_size, num_of_inputs) @ w1 + b1

tensor([[-2.5943, -0.8644,  0.8524,  ...,  5.1608, -1.5512, -0.1240],
        [-0.1756,  6.6545, -4.5579,  ...,  1.7516,  2.5350,  0.3815],
        [-1.0892, -3.7104, -3.1708,  ...,  2.3125,  1.7632, -2.0210],
        ...,
        [ 1.3089,  4.0530,  1.7895,  ...,  6.5008, -0.4606,  1.8410],
        [ 2.3053,  1.1875, -0.6656,  ...,  6.1175, -0.4204,  2.4052],
        [ 2.9902,  6.7687, -2.5202,  ...,  1.9511, -1.7564,  2.0530]])
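As a sanity check on the reshape, the sketch below compares ```view``` with explicitly concatenating the three per-position embeddings. The tensor here is random stand-in data with the same shape as ```emb```, not the actual embeddings from above.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
emb = torch.randn((32, 3, 2))  # stand-in with the same shape as C[X]

# option 1: reinterpret the same memory as (32, 6), no copy
flat_view = emb.view(-1, 6)

# option 2: split along the context dimension and concatenate, which allocates a new tensor
flat_cat = torch.cat(torch.unbind(emb, 1), 1)

print(flat_view.shape, flat_cat.shape)
print(torch.equal(flat_view, flat_cat))
print(flat_view.data_ptr() == emb.data_ptr())  # view shares storage with emb
```

Both give the same (32, 6) values; ```view``` is preferable here because it only changes how the existing storage is interpreted.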

In [28]:
# hidden layer
h = torch.tanh(emb.view(-1, num_of_inputs) @ w1 + b1) # tanh adds the non-linearity and squashes every value into the range -1 to 1
h

tensor([[-0.9889, -0.6985,  0.6923,  ...,  0.9999, -0.9140, -0.1234],
        [-0.1738,  1.0000, -0.9998,  ...,  0.9416,  0.9875,  0.3640],
        [-0.7966, -0.9988, -0.9965,  ...,  0.9806,  0.9429, -0.9655],
        ...,
        [ 0.8640,  0.9994,  0.9457,  ...,  1.0000, -0.4306,  0.9509],
        [ 0.9803,  0.8298, -0.5821,  ...,  1.0000, -0.3973,  0.9838],
        [ 0.9950,  1.0000, -0.9871,  ...,  0.9604, -0.9421,  0.9676]])

#### output layer

In [30]:
# output layer
w2 = torch.randn((hidden_layer_hyperparameter_size, 27))
b2 = torch.randn((27))
logits = h @ w2 + b2
logits.shape

torch.Size([32, 27])

In [31]:
counts = logits.exp() # get fake counts
probs = counts / counts.sum(1, keepdim=True) # normalize to get the probabilities
probs.shape


torch.Size([32, 27])

In [33]:
# sanity check: every row of probs should sum up to 1
probs[0].sum()

tensor(1.0000)
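From these probabilities, the loss is the average negative log-likelihood of the correct next char. The sketch below is one way to compute it, using random stand-in logits and targets rather than the tensors above, and it compares the result with ```F.cross_entropy```, which computes the same quantity directly from the logits.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
logits = torch.randn((32, 27))   # stand-in for h @ w2 + b2
Y = torch.randint(0, 27, (32,))  # stand-in for the target char indices

# manual softmax, as in the cells above
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)

# pick the probability assigned to the correct char in each row, then average the negative logs
manual_loss = -probs[torch.arange(32), Y].log().mean()

# built-in equivalent that works on the logits directly
builtin_loss = F.cross_entropy(logits, Y)
print(manual_loss.item(), builtin_loss.item())
```

The built-in version is usually the better choice in practice, since it avoids materializing ```counts``` and ```probs``` and is less prone to overflow when the logits are large.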