## makemore - MLP

### Implementing [Bengio, et al. 2003: A Neural Probabilistic Language Model](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

This implementation introduces:
- learned character embeddings (a lookup table in place of one-hot inputs)
- hyperparameters
- learning rate tuning
- train/dev/test set split
- under- and overfitting

In [3]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

with open('data/names.txt', 'r') as f:
    names = f.read().splitlines()
print("Sample names: ", names[5:12])


# create char to index mapping for set (vocabulary) of chars in names
chars = sorted(list(set(''.join(names))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
print("Character mapping: ", itos)

Sample names:  ['charlotte', 'mia', 'amelia', 'harper', 'evelyn', 'abigail', 'emily']
Character mapping:  {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}


In [13]:
# create the dataset: (context, target) pairs
# the previous model fed one-hot vectors into the network; here the integer indices
# are used to look up rows of an embedding table, the idea behind modern vector embeddings

context_length = 3 # how many chars to consider to predict the next char
X, Y = [], []
for name in names[8:12]:

    print("Name: ", name)
    context = [0] * context_length # start with all dots (index 0) so the first characters of each name still get a full-length context
    for char in name + '.':
        ix = stoi[char]
        X.append(context)
        Y.append(ix)
        print(''.join(itos[i] for i in context), '----->', itos[ix])
        context = context[1:] + [ix]

X = torch.tensor(X)
Y = torch.tensor(Y)

print("X shape: ", X.shape, "   X.dtype: ", X.dtype) # X has shape (num_examples, context_length)
print("Y shape: ", Y.shape, "      Y.dtype: ", Y.dtype)

Name:  harper
... -----> h
..h -----> a
.ha -----> r
har -----> p
arp -----> e
rpe -----> r
per -----> .
Name:  evelyn
... -----> e
..e -----> v
.ev -----> e
eve -----> l
vel -----> y
ely -----> n
lyn -----> .
Name:  abigail
... -----> a
..a -----> b
.ab -----> i
abi -----> g
big -----> a
iga -----> i
gai -----> l
ail -----> .
Name:  emily
... -----> e
..e -----> m
.em -----> i
emi -----> l
mil -----> y
ily -----> .
X shape:  torch.Size([28, 3])    X.dtype:  torch.int64
Y shape:  torch.Size([28])       Y.dtype:  torch.int64
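
The loop above builds examples from only four names. A minimal sketch of how the same windowing could be wrapped in a function and applied to a train/dev/test split follows. The helper name `build_dataset`, the 80/10/10 ratios, and the fixed seed are assumptions for illustration, not something this notebook defines. The short name list is taken from the sample output above, and plain Python lists stand in for tensors.

```python
import random

# Sample names copied from the output above; the notebook itself reads data/names.txt.
names = ['charlotte', 'mia', 'amelia', 'harper', 'evelyn', 'abigail', 'emily']

# The vocabulary here covers only the letters in these seven names,
# so the indices differ from the full 27-character mapping.
chars = sorted(set(''.join(names)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0

def build_dataset(words, context_length=3):
    """Hypothetical helper: return (contexts, targets) for a list of words."""
    X, Y = [], []
    for word in words:
        context = [0] * context_length  # all-dot start context
        for char in word + '.':
            ix = stoi[char]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]  # slide the window by one character
    return X, Y

random.seed(42)  # arbitrary seed, only for repeatability
shuffled = names[:]
random.shuffle(shuffled)

# Assumed 80/10/10 split by name, so no name contributes to two splits.
n1 = int(0.8 * len(shuffled))
n2 = int(0.9 * len(shuffled))
Xtr, Ytr = build_dataset(shuffled[:n1])
Xdev, Ydev = build_dataset(shuffled[n1:n2])
Xte, Yte = build_dataset(shuffled[n2:])
print(len(Xtr), len(Xdev), len(Xte))
```

Splitting by name before building the examples keeps every context window of a given name inside a single split.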


In [12]:
C = torch.randn(27, 2) # embedding lookup table: 27 possible characters, each represented by a 2-dim vector

F.one_hot(torch.tensor(5), num_classes=27).float() @ C # one-hot vector for index 5 times C picks out row 5 of C, i.e. C[5]

tensor([1.0449, 0.5345])
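
Since the one-hot product only selects a row, the same result can be obtained by indexing `C` directly, and indexing also works for a whole batch of contexts at once. The sketch below checks this under assumptions: the seed is arbitrary, and the small `X` is a hand-written stand-in for the first three contexts of "harper" (`...`, `..h`, `.ha`), not the full dataset.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability
C = torch.randn(27, 2)  # embedding table: 27 characters, 2 dims each

# Row selection two ways: one-hot matrix product vs. plain indexing.
one_hot_row = F.one_hot(torch.tensor(5), num_classes=27).float() @ C
indexed_row = C[5]
same = torch.allclose(one_hot_row, indexed_row)

# Indexing with an integer tensor embeds every context position in one step.
# Toy contexts: '...', '..h', '.ha' with '.'=0, 'a'=1, 'h'=8.
X = torch.tensor([[0, 0, 0], [0, 0, 8], [0, 8, 1]])
emb = C[X]  # shape: (num_examples, context_length, embedding_dim)

print(same, emb.shape)
```

Indexing avoids building the 27-wide one-hot vectors and is the form that scales to a whole `(num_examples, context_length)` tensor.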

#### Exercises:
- [ ] E01: Tune the hyperparameters of the training to beat a validation loss of 2.2
- [ ] E02: (1) What is the loss you'd get if the predicted probabilities at initialization were perfectly uniform? What loss do we achieve? (2) Can you tune the initialization to get a starting loss that is much more similar to (1)?
- [ ] E03: Read the Bengio et al 2003 paper (link above), implement and try any idea from the paper. Did it work?
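
As a reference point for E02 part (1), the loss of a perfectly uniform prediction can be computed without any training. This is a small sketch, assuming the loss is the average negative log-likelihood (cross-entropy) over the 27-character vocabulary; the variable names are illustrative.

```python
import math

vocab_size = 27  # 26 letters plus the '.' token

# A uniform model assigns probability 1/27 to every character,
# so the negative log-likelihood of any target is -log(1/27).
uniform_prob = 1.0 / vocab_size
uniform_loss = -math.log(uniform_prob)

print(f"{uniform_loss:.4f}")  # 3.2958
```

An initial loss far above this value suggests the untrained network is confidently wrong rather than uncertain.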

### Exercise: 