<h1 style="text-align: center; font-weight: bold; font-size: 36px;">Character Level MLP - Torch Autograd</h1>

# Introduction

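Let's build a character-level **MLP** language model trained by **gradient descent**: an embedding table, one tanh hidden layer, and an output layer that scores the next character, with gradients computed by PyTorch autograd.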

Inspired by Andrej Karpathy's [Neural Networks: Zero-to-Hero](https://github.com/karpathy/nn-zero-to-hero).
We use the same [names.txt](https://github.com/karpathy/makemore/blob/master/names.txt) as in Zero to Hero so the results can be compared.

References:

- [Bengio et al. 2003 MLP language model paper](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

# Imports

In [1]:
import torch
import torch.nn.functional as F

# Build the Dataset

In [7]:
with open('../data/names.txt', 'r') as f:
    names = f.read().splitlines()
letters = sorted(set(''.join(names)))  # unique characters in the dataset, sorted
letters = ['.'] + letters              # '.' is the start/stop token and gets index 0

In [8]:
class Tokenizer:
    def __init__(self, vocab):
        assert isinstance(vocab, list)
        assert all(isinstance(v, str) for v in vocab)
        assert all(len(v) == 1 for v in vocab)
        self.stoi = {ch: i for i, ch in enumerate(vocab)}
        self.itos = {i: ch for i, ch in enumerate(vocab)}

    def encode(self, text):
        return [self.stoi[s] for s in text]

    def decode(self, sequence):
        assert isinstance(sequence, torch.Tensor)
        assert sequence.ndim in [0, 1]
        if sequence.ndim == 0:
            return self.itos[sequence.item()]  # one char
        else:
            return ''.join([self.itos[i.item()] for i in sequence])
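
To make the mapping concrete, here is a minimal sketch of the same character-to-index round trip using plain dicts. The four-character `vocab` is a hypothetical toy vocabulary chosen for illustration, not the one built from `names.txt`.

```python
# Hypothetical toy vocabulary; '.' sits at index 0 like in the real one.
vocab = ['.', 'a', 'e', 'm']

stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(vocab)}  # integer id -> character

ids = [stoi[ch] for ch in 'emma.']     # encode: one id per character
text = ''.join(itos[i] for i in ids)   # decode: back to the original string

print(ids)   # [2, 3, 3, 1, 0]
print(text)  # emma.
```

Decoding an encoded string returns the original string, provided every character is in the vocabulary.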

In [10]:
blok_size = 3  # context length

X, Y = [], []  # inputs and targets
tok = Tokenizer(vocab=letters)
for name in names:
    name = '.'*blok_size + name + '.'  # pad with start tokens and one stop token: '...emma.'
    for i in range(len(name) - blok_size):
        X.append(tok.encode(name[i:i+blok_size]))   # context: blok_size consecutive characters
        Y.append(tok.encode(name[i+blok_size])[0])  # target: next character; [0] keeps Y a 1-D tensor

X = torch.tensor(X)
Y = torch.tensor(Y)
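
The sliding window is easier to follow on a single name. The sketch below works on raw characters rather than token ids and uses a hypothetical variable `context_len` set to the same context length of 3. It lists the (context, target) pairs that `'emma'` contributes.

```python
context_len = 3  # hypothetical name; same value as the context length above
name = 'emma'    # example name used only for illustration

padded = '.' * context_len + name + '.'  # '...emma.'

# Slide a window of context_len characters; the target is the character after it.
pairs = [(padded[i:i + context_len], padded[i + context_len])
         for i in range(len(padded) - context_len)]

for context, target in pairs:
    print(f"{context} -> {target}")
# ... -> e
# ..e -> m
# .em -> m
# emm -> a
# mma -> .
```

A name of length `n` yields `n + 1` training examples: one per character plus one for the stop token.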

In [11]:
def verify_softmax_and_cross_entropy():
    def softmax(logits):
        """Numerically stable softmax"""
        max_ = torch.max(logits, dim=-1, keepdim=True)[0]
        exp = torch.exp(logits - max_)
        exp_sum = torch.sum(exp, dim=-1, keepdim=True)
        return exp / exp_sum

    def only_cross_entropy(y_hat, correct_target_idx):
        """Per-example cross-entropy: negative log-likelihood of the correct class."""
        target_class_prob = y_hat[torch.arange(len(y_hat)), correct_target_idx]    # n_batch
        ce_loss = -torch.log(target_class_prob)                                    # n_batch
        return ce_loss

    def fused_cross_entropy(logits, correct_target_idx):
        """Softmax followed by cross-entropy, averaged over the batch. Matches F.cross_entropy."""
        y_hat = softmax(logits)
        ce_loss = only_cross_entropy(y_hat, correct_target_idx)
        return ce_loss.mean()
    
    # Rand init
    torch.manual_seed(42)
    
    # Init Layers
    C = torch.randn((27, 2), requires_grad=True)     # n_vocab, n_emb (embeddings)
    W1 = torch.randn((6, 100), requires_grad=True)   # n_seq*n_emb, n_hid1
    b1 = torch.randn((1, 100), requires_grad=True)   # 1, n_hid1
    W2 = torch.randn((100, 27), requires_grad=True)  # n_hid1, n_out
    b2 = torch.randn((1, 27), requires_grad=True)    # 1, n_out

    # Mini batch:
    x_batch = X[:12]
    y_batch = Y[:12]

    # Embed inputs
    emb = C[x_batch]  # n_batch, n_seq, n_emb

    # First layer
    z1 = emb.view(-1, 6) @ W1 + b1  # n_batch, n_hid1
    h1 = torch.tanh(z1)             # n_batch, n_hid1

    # Output layer
    logits = h1 @ W2 + b2   # n_batch, n_vocab

    probs = softmax(logits)                # n_batch, n_vocab
    probs_2 = torch.softmax(logits, -1)    # Equivalently
    assert torch.allclose(probs, probs_2)

    loss = fused_cross_entropy(logits, y_batch)  # scalar
    loss_2 = F.cross_entropy(logits, y_batch)    # equivalent
    assert torch.allclose(loss, loss_2)

    print(loss)

    print("PyTorch softmax/cross_entropy seem correct! :)")

verify_softmax_and_cross_entropy()

tensor(19.1366, grad_fn=<MeanBackward0>)
PyTorch softmax/cross_entropy seem correct! :)
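
The `softmax` helper above subtracts the row maximum before exponentiating. The sketch below shows why, using deliberately large logits chosen for illustration (they are not values from the model). In float32, `exp(1000)` overflows to `inf`, so the unshifted version returns `nan`. The shifted version stays finite and matches `torch.softmax`.

```python
import torch

# Illustrative logits, large enough to overflow float32 when exponentiated.
logits = torch.tensor([[1000.0, 1001.0, 1002.0]])

# Naive softmax: exp overflows to inf, and inf / inf gives nan.
naive_exp = torch.exp(logits)
naive = naive_exp / naive_exp.sum(dim=-1, keepdim=True)

# Stable softmax: shifting by the row max leaves the result unchanged
# mathematically, but keeps every exponent <= 0.
shifted = logits - logits.max(dim=-1, keepdim=True).values
stable_exp = torch.exp(shifted)
stable = stable_exp / stable_exp.sum(dim=-1, keepdim=True)

print(naive)   # all nan
print(stable)  # finite probabilities that sum to 1
print(torch.allclose(stable, torch.softmax(logits, dim=-1)))  # True
```

The shift is safe because multiplying numerator and denominator by the same constant `exp(-max)` does not change the ratio.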
