
# Generating names by training a neural network on bigrams

In the first solution, we trained the model by counting how often each bigram appears. A different approach is to train a neural network to predict the next character.
- The network is fed the current character and returns an estimate of the probability of each possible next character.
- Since we have a loss function, we can evaluate and compare different configurations of the network.

In [None]:
import torch
import matplotlib.pyplot as plt

words = open('data/names.txt', 'r').read().splitlines()
all_chars = ['.'] + sorted(list(set("".join(words))))
itos = {idx: v for idx, v in enumerate(all_chars)}
stoi = {v: k for k, v in itos.items()}


In [None]:
# First, let's create the training set
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = ch1, ch2
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)
        
xs = torch.tensor(xs)
ys = torch.tensor(ys)

In [None]:
xs[:10], ys[:10]
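
As a rough illustration of how the pairs above are built, the following sketch extracts the (input, target) indices for a single word in plain Python. The names `demo_chars`, `demo_stoi` and `bigram_pairs` are hypothetical, the word "emma" is only an example, and the alphabet is assumed to be '.' plus the 26 lowercase letters.

```python
import string

# Assumption: the alphabet is '.' followed by the 26 lowercase letters.
demo_chars = ['.'] + list(string.ascii_lowercase)
demo_stoi = {ch: i for i, ch in enumerate(demo_chars)}

def bigram_pairs(word):
    """Return the (input index, target index) pairs for one word."""
    chs = ['.'] + list(word) + ['.']
    return [(demo_stoi[a], demo_stoi[b]) for a, b in zip(chs, chs[1:])]

pairs = bigram_pairs("emma")  # illustrative word
xs_demo = [x for x, _ in pairs]
ys_demo = [y for _, y in pairs]
print(xs_demo)  # [0, 5, 13, 13, 1]
print(ys_demo)  # [5, 13, 13, 1, 0]
```

Each target is simply the input shifted by one position, with '.' marking both the start and the end of the word.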

In [None]:
# In order to feed the input layer of the network, we need to encode the characters
# We will use one-hot encoding

import torch.nn.functional as F
xenc = F.one_hot(xs[:5], num_classes=len(all_chars))
xenc

In [None]:
xenc.shape

In [None]:
plt.imshow(xenc)

In [None]:
xenc.dtype

In [None]:
# We need to change the data type to float in order to feed the network
xenc = F.one_hot(xs[:5], num_classes=len(all_chars)).float()
xenc.dtype

Now we will create a layer of 27 neurons with no bias and no activation function.
- We use 27 neurons so that the layer returns one score for each of the 27 possible output characters; these scores will later be turned into probabilities.

In [None]:
W = torch.randn((27, 27))
xenc @ W
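
One way to see what this product does: multiplying a one-hot row vector by a matrix just selects one row of that matrix. The sketch below shows this in plain Python with a made-up 3x3 matrix; `one_hot`, `vec_mat` and `W_demo` are hypothetical names used only for this illustration.

```python
def one_hot(index, num_classes):
    """Return a list with 1.0 at position `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def vec_mat(v, M):
    """Multiply a row vector v by a matrix M given as a list of rows."""
    return [sum(v[k] * M[k][j] for k in range(len(v))) for j in range(len(M[0]))]

# Made-up weights, standing in for the 27x27 matrix W.
W_demo = [[0.1, -0.2, 0.3],
          [1.5, 0.0, -1.0],
          [-0.7, 0.4, 0.9]]

row = vec_mat(one_hot(1, 3), W_demo)
print(row)  # [1.5, 0.0, -1.0], the row of W_demo at index 1
```

So, in this bias-free single layer, row i of W holds the scores for every character that may follow character i.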

We can interpret the 27 outputs for each input as scores that grow with how likely each output character is to follow the input character.

Since these scores can be positive or negative, we transform them with the exponential function.

In [None]:
(xenc @ W).exp()

Now all results are positive, and since the exponential is monotonic, it preserves the order of the results. These values can be understood as the "counts".

To turn the values to probabilities, we perform the same transformation we did with the counts in the original solution.

In [None]:
logits = (xenc @ W) # log-counts 
counts = logits.exp() # equivalent to the M metric
probs = counts / counts.sum(1, keepdims=True)
probs

In [None]:
probs[0].sum()

In [None]:
probs.shape

**probs** contains, for each of the first 5 input characters (rows), the probability of generating each of the 27 candidate characters (columns).

Note: All these operations are differentiable, so we can backpropagate through them.

We only need a loss. Let's check the current result for the first character.

In [None]:
probs[0]

In [None]:
ys[0].item(), probs[0][ys[0].item()].item()

Since the results are still far from the expected ones, we need to find (using backpropagation and gradient descent) good values for W, so that the largest probabilities are assigned to the correct next characters in the sequences.

Let's make a small summary.

In [None]:
[(itos[c1.item()], itos[c2.item()]) for c1, c2 in zip(xs[:5], ys[:5])]

In [None]:
xs[:5]

In [None]:
ys[:5]

In [None]:
# randomly initialize 27 neurons, each one receives 27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.rand((27, 27), generator=g)

In [None]:
xenc = F.one_hot(xs[:5], num_classes=len(all_chars)).float()
logits = (xenc @ W) # log-counts 
counts = logits.exp() # equivalent to the M metric
probs = counts / counts.sum(1, keepdims=True)

The last two lines are known as "softmax", which transforms the outputs of a layer into probabilities.

In [None]:
torch.allclose(probs, logits.softmax(dim=1))
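
For reference, softmax can be sketched in a few lines of plain Python. The hypothetical helper `softmax_demo` below subtracts the largest logit before exponentiating, a common trick that leaves the result unchanged but avoids overflow for large inputs; it is an illustration and not necessarily how torch implements it.

```python
import math

def softmax_demo(logits):
    """Exponentiate and normalize, shifting by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax_demo([2.0, 1.0, 0.1])        # made-up logits
shifted = softmax_demo([102.0, 101.0, 100.1])
print(p)
print(sum(p))  # sums to 1 up to rounding
```

Adding the same constant to every logit does not change the probabilities, which is why `p` and `shifted` agree.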

In [None]:
xenc = F.one_hot(xs[:5], num_classes=len(all_chars)).float()
logits = (xenc @ W) # log-counts 
probs = logits.softmax(dim=1)

nlls = torch.zeros(5)
for i in range(5):
    x = xs[i].item()
    y = ys[i].item()
    p = probs[i, y]  # probability assigned by the network to real output
    logp = torch.log(p)
    nll = -logp
    nlls[i] = nll 
print(nlls.mean().item())

We can experiment with different randomly generated values of W.

Now it is time to train the neural network, but first we need to define the loss function. We will use the same measure as before: the negative log-likelihood.
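
To get a feeling for the scale of this loss, it can help to compute it for a model that knows nothing and assigns probability 1/27 to every character. The sketch below uses a hypothetical helper `mean_nll`, which takes the probabilities a model assigned to the correct targets.

```python
import math

def mean_nll(probs_of_targets):
    """Average negative log-likelihood of the probabilities given to the true targets."""
    return -sum(math.log(p) for p in probs_of_targets) / len(probs_of_targets)

# A uniform model gives 1/27 to every target, whatever the input is.
baseline = mean_nll([1.0 / 27] * 5)
print(round(baseline, 4))  # 3.2958, which is log(27)
```

A trained model should score below this value; a perfect model that always gives probability 1 to the correct character would reach a loss of 0.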

In [None]:
xs[:5]

In [None]:
ys[:5]

In [None]:
probs.shape

In [None]:
probs[0, 5], probs[1, 13], probs[2, 13], probs[3, 1], probs[4, 0]

PyTorch offers a convenient way of selecting these elements. If we index a tensor with a pair of index tensors, one for the rows and one for the columns, the result contains the elements at the corresponding (row, column) positions.

In [None]:
probs[torch.arange(5), ys[:5]]
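
The same selection can be written as a list comprehension in plain Python, which may make the pairing of row and column indices easier to see. The values in `table` and `targets` are made up for this sketch.

```python
# Each row is a probability distribution over 3 classes (made-up numbers).
table = [[0.1, 0.7, 0.2],
         [0.3, 0.3, 0.4],
         [0.6, 0.1, 0.3]]
targets = [1, 2, 0]  # the correct class for each row

# Pick, for row i, the entry in the column given by targets[i].
picked = [table[i][y] for i, y in enumerate(targets)]
print(picked)  # [0.7, 0.4, 0.6]
```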

In [None]:
# negative log-likelihood
-probs[torch.arange(5), ys[:5]].log()

In [None]:
# since we need a single number, we will use the mean
loss = -probs[torch.arange(5), ys[:5]].log().mean()
loss 

This loss, combined with softmax, is available in torch as CrossEntropyLoss.
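
The reason the two steps can be merged is that the negative log of a softmax probability simplifies to "log-sum-exp of the logits minus the logit of the target". The sketch below checks this identity for one example in plain Python; `cross_entropy_demo` and `softmax_then_nll` are hypothetical names, and the logits are made up.

```python
import math

def cross_entropy_demo(logits, target):
    """Loss computed directly from logits: logsumexp(logits) - logits[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def softmax_then_nll(logits, target):
    """The two-step version: softmax first, then the negative log of the target probability."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

direct = cross_entropy_demo([2.0, 1.0, 0.1], 0)
two_step = softmax_then_nll([2.0, 1.0, 0.1], 0)
print(direct, two_step)  # the two values agree up to rounding
```

With its default settings, `nn.CrossEntropyLoss` takes raw logits and averages this per-example loss over the batch, so the logits must not be passed through softmax first.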

Now we have all the ingredients to train the neural network to minimize the loss function. 

Let's write the training code in torch, which will be pretty similar to the one in micrograd.

In [None]:
import torch.optim as optim
import torch.nn as nn

W = torch.rand((27, 27), requires_grad=True)

# Define the loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam([W], lr=0.1)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):

    xenc = F.one_hot(xs, num_classes=len(all_chars)).float()
    logits = (xenc @ W) # log-counts 
    loss = loss_fn(logits, ys)

    optimizer.zero_grad()
    loss.backward()
        
    # update
    optimizer.step()

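    if epoch % (num_epochs // 10) == 0:
        # report the scalar loss value every tenth of the training run
        print(f"epoch {epoch}: loss = {loss.item():.4f}")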

print("Training complete")

**Note**: The bigram-based model achieved a loss of 2.4541.

Our network is slowly converging to the model we created by counting, which achieves the lowest training loss that any bigram model can reach.
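
A small experiment can make this convergence plausible. The sketch below, in plain Python, runs gradient descent on the logits of a single row and compares the learned probabilities with the normalized counts. The counts are made up, the names are hypothetical, and it relies on the fact that the gradient of the mean negative log-likelihood with respect to the logits is the predicted distribution minus the observed frequencies.

```python
import math

def softmax_demo(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up counts of three possible next characters after some input character.
counts_demo = [6, 3, 1]
freqs = [c / sum(counts_demo) for c in counts_demo]  # the count-based model

logits_demo = [0.0, 0.0, 0.0]
lr = 1.0
for _ in range(2000):
    p = softmax_demo(logits_demo)
    # gradient of the mean NLL with respect to each logit: p_k - freq_k
    logits_demo = [z - lr * (pk - fk) for z, pk, fk in zip(logits_demo, p, freqs)]

learned = softmax_demo(logits_demo)
print(learned)  # approaches [0.6, 0.3, 0.1]
```

The gradient vanishes exactly when the predicted probabilities equal the observed frequencies, so under these assumptions the trained row ends up reproducing the normalized counts.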

Let's generate some names with the trained network.

In [None]:
for _ in range(20):
    ix = 0
    result = []
    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=len(all_chars)).float()
        logits = (xenc @ W) # log-counts 
        p = logits.softmax(dim=1)
        
        ix = torch.multinomial(p,  num_samples=1, replacement=True, generator=g)
        ix = ix.item()
        if ix == 0:
            break
        result.append(itos[ix])
    print("".join(result))
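
Conceptually, the sampling step in the loop above draws one index according to the probabilities in `p`. A minimal sketch of such a draw in plain Python, using the cumulative sum of the probabilities, is shown below. The helper `sample_index` is hypothetical and is not meant to describe how `torch.multinomial` is implemented.

```python
import random

def sample_index(probs, rng):
    """Draw one index i with probability probs[i], using the cumulative sum."""
    r = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against rounding error in the sum

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_index([0.2, 0.5, 0.3], rng) for _ in range(10000)]
share_of_1 = draws.count(1) / len(draws)
print(share_of_1)  # close to 0.5
```

Sampling instead of always taking the most likely character is what makes each generated name different.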