Train a trigram language model, i.e. take two characters as input to predict the third one. Feel free to use either counting or a neural net. Evaluate the loss: did it improve over a bigram model?

In [2]:
words = open('../../data/names.txt', 'r').read().splitlines()

In [6]:
import torch
import torch.nn.functional as F  # F.one_hot is used further down

Create string-to-int and int-to-string dictionaries.

In [17]:
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}

Populate a 3-dimensional tensor to count trigram occurrences.

In [23]:
N = torch.zeros(27, 27, 27)

for w in words:
    chs = ['.'] + list(w) + ['.'] # add special start and end chars
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        N[ix1, ix2, ix3] += 1

Normalizing the matrix to get probabilities.

In [36]:
# Model smoothing: add 1 to every count so that no trigram ends up with
# 0 occurrences (that is, probability = 0).
P = (N+1).float()
# Normalize over the third dimension (index 2) so each context row sums to 1.
P /= P.sum(2, keepdim=True)
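
As a small sanity check of the smoothing and normalization step, the sketch below applies the same two operations to a toy count tensor. The tensor `N_toy` and its values are made up for illustration; the point is only that every context row sums to 1 and that a context with no counts ends up uniform.

```python
import torch

# Hypothetical toy counts: 2 x 2 contexts, 3 possible next tokens.
N_toy = torch.zeros(2, 2, 3)
N_toy[0, 1] = torch.tensor([3.0, 0.0, 1.0])  # one observed context
# Context (0, 0) is left at zero on purpose (never observed).

P_toy = N_toy + 1                            # add-one smoothing
P_toy = P_toy / P_toy.sum(2, keepdim=True)   # normalize over the last dimension

print(P_toy[0, 1])  # tensor([0.5714, 0.1429, 0.2857])
print(P_toy[0, 0])  # tensor([0.3333, 0.3333, 0.3333])
```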

Let's try to generate some names!

In [58]:
g = torch.Generator().manual_seed(2147483647)

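for i in range(10):
    out = []
    # start from the context (., .); the counting loop above never sees this
    # context, so the first character comes from the smoothed (uniform) row
    ix1 = 0
    ix2 = 0
    while True:
        p = P[ix1, ix2]
        ix3 = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix3])
        if ix3 == 0:
            break
        # slide the two-character context window forward
        ix1 = ix2
        ix2 = ix3

    print(''.join(out))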

ce.
za.
zogh.
uriana.
kaydnevonimittain.
luwak.
ka.
da.
samiyah.
javer.
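
Because each word is padded with a single `.`, the start context `(0, 0)` used by the sampler has no counts of its own. One possible variant, sketched below on a tiny made-up word list, pads every word with two start tokens so that the first character is also learned from the data. The names `toy_words`, `chars2`, `stoi2` and `N2` are illustrative and not part of the notebook.

```python
import torch

toy_words = ['emma', 'ava', 'eve']           # hypothetical stand-in for names.txt
chars2 = sorted(set(''.join(toy_words)))
stoi2 = {s: i + 1 for i, s in enumerate(chars2)}
stoi2['.'] = 0
V = len(stoi2)

N2 = torch.zeros(V, V, V)
for w in toy_words:
    chs = ['.', '.'] + list(w) + ['.']       # two start tokens, one end token
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        N2[stoi2[ch1], stoi2[ch2], stoi2[ch3]] += 1

# The start context (., .) now has real counts for first characters.
print(N2[0, 0, stoi2['e']].item())  # 2.0 (emma, eve)
print(N2[0, 0, stoi2['a']].item())  # 1.0 (ava)
```

With this padding there is one extra training example per word, so a loss computed this way is not directly comparable with the figures reported in this notebook.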


Now, let's take a step back and calculate the loss function (average negative log likelihood).

In [62]:
log_likelihood = 0
n = 0

for w in words:
    chs = ['.'] + list(w) + ['.'] # add special start and end chars
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        prob = P[ix1, ix2, ix3]
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n}') # this is our loss function! the lower the better

log_likelihood=tensor(-410414.9688)
nll=tensor(410414.9688)
2.092747449874878
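
The loop above can also be written as a single indexing operation. The sketch below checks, on a small made-up probability table, that both forms give the same average negative log likelihood. `P_demo`, `ctx` and `tgt` are hypothetical stand-ins for `P` and the trigram indices.

```python
import torch

torch.manual_seed(0)
P_demo = torch.rand(4, 4, 4)
P_demo = P_demo / P_demo.sum(2, keepdim=True)   # hypothetical probability table

# Hypothetical trigrams: a context pair and the target index for each.
ctx = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 0]])
tgt = torch.tensor([2, 3, 0, 1])

# Loop version, in the style of the cell above.
ll = 0.0
for (i1, i2), i3 in zip(ctx.tolist(), tgt.tolist()):
    ll += torch.log(P_demo[i1, i2, i3])
loss_loop = -ll / len(tgt)

# Vectorized version: index all trigrams at once, then average.
loss_vec = -P_demo[ctx[:, 0], ctx[:, 1], tgt].log().mean()

print(torch.isclose(loss_loop, loss_vec))  # tensor(True)
```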


Let's now try to do the same thing with a neural network. The first thing that comes to mind is that each input has to be a single integer before one-hot encoding: I could map every two-character context to a single int. That would give 27*27 = 729 possible values, right?

What about using a 2D tensor instead, with one row per example holding the two character indices?
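
To make the two options concrete, here is a minimal sketch of both encodings for a single context. The indices 5 and 13 correspond to `e` and `m` under the `stoi` mapping above; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

ix1, ix2 = 5, 13   # the context "em"

# Option 1: one integer per context, 27*27 = 729 possible values.
ctx_id = ix1 * 27 + ix2
enc_single = F.one_hot(torch.tensor(ctx_id), num_classes=27 * 27).float()

# Option 2: keep the pair, one-hot each index, then concatenate.
enc_pair = F.one_hot(torch.tensor([ix1, ix2]), num_classes=27).float().view(-1)

print(ctx_id)            # 148
print(enc_single.shape)  # torch.Size([729])
print(enc_pair.shape)    # torch.Size([54])
```

The trade-off, as far as I can tell: option 1 needs a 729x27 weight matrix but can represent any trigram table, while option 2 needs only 54x27 weights and can only add up one contribution per character position.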

In [173]:
# create training set

xs = []
ys = []

for w in words:
    chs = ['.'] + list(w) + ['.'] # add special start and end chars
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        xs.append((ix1, ix2))
        ys.append(ix3)

xs = torch.tensor(xs)
ys = torch.tensor(ys)

In [208]:
xs

tensor([[ 0,  5],
        [ 5, 13],
        [13, 13],
        ...,
        [26, 25],
        [25, 26],
        [26, 24]])

In [175]:
W = torch.randn((27*2, 27), requires_grad=True) # create a tensor filled with random numbers from a normal distribution

In [178]:
xenc = F.one_hot(xs, num_classes=27).float()
xenc.shape

torch.Size([196113, 2, 27])

In [206]:
xenc_view = xenc.view(-1, 27*2) # flatten each (2, 27) pair of one-hot vectors into one 54-dim row, compatible with the multiplication by W

In [209]:
xenc_view.shape

torch.Size([196113, 54])

In [210]:
W.shape

torch.Size([54, 27])
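
It may help to see what multiplying the flattened one-hot input by a 54x27 matrix actually computes: each row of logits is the sum of two rows of the weight matrix, one selected by the first character and one by the second. The sketch below checks this on three contexts with a randomly initialized `W_demo` (a stand-in with the same shape as `W`, not the trained weights).

```python
import torch
import torch.nn.functional as F

g_demo = torch.Generator().manual_seed(0)
W_demo = torch.randn((54, 27), generator=g_demo)   # same shape as W above

xs_demo = torch.tensor([[0, 5], [5, 13], [13, 13]])
x_flat = F.one_hot(xs_demo, num_classes=27).float().view(-1, 54)

# Matrix multiplication with the concatenated one-hot vectors...
logits_matmul = x_flat @ W_demo
# ...is the same as adding one row of W per character position.
logits_lookup = W_demo[xs_demo[:, 0]] + W_demo[27 + xs_demo[:, 1]]

print(torch.allclose(logits_matmul, logits_lookup))  # True
```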

In [207]:
for i in range(10):
    # forward pass
    logits = xenc_view @ W # log-counts
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True) # softmax over the 27 output chars
    loss = -probs[torch.arange(len(ys)), ys].log().mean() # average negative log likelihood

    print(loss.item())

    # backward pass
    W.grad = None # reset the gradient (None is treated as zero)
    loss.backward()

    # update: gradient descent step with learning rate 10
    W.data += -10 * W.grad

2.2441227436065674
2.244115114212036
2.244107723236084
2.244100570678711
2.2440929412841797
2.2440857887268066
2.2440783977508545
2.2440712451934814
2.2440640926361084
2.244056463241577
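
The exp / normalize / log / mean sequence in the forward pass is the cross-entropy loss written out by hand. As a hedged sketch, the snippet below compares it with `F.cross_entropy` on made-up logits and targets (`logits_demo` and `ys_demo` are illustrative, not the notebook's tensors).

```python
import torch
import torch.nn.functional as F

g_demo = torch.Generator().manual_seed(0)
logits_demo = torch.randn((8, 27), generator=g_demo)      # hypothetical logits
ys_demo = torch.randint(0, 27, (8,), generator=g_demo)    # hypothetical targets

# Manual version, as in the training loop above.
counts_demo = logits_demo.exp()
probs_demo = counts_demo / counts_demo.sum(1, keepdim=True)
loss_manual = -probs_demo[torch.arange(len(ys_demo)), ys_demo].log().mean()

# Built-in version: takes raw logits and integer class targets.
loss_ce = F.cross_entropy(logits_demo, ys_demo)

print(loss_manual.item(), loss_ce.item())
```

The built-in version is usually preferable in practice because it avoids overflow in `exp()` for large logits.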
