# In this Jupyter notebook we solve the exercises from lesson 2 of Andrej Karpathy's YouTube series on LLMs


## **E01**: Train a trigram language model, i.e. take two characters as input to predict the third one. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over a bigram model?

We will use the counting approach.

We begin by loading the data:

In [5]:
words = open('../names.txt', 'r').read().splitlines()

Next we import PyTorch:

In [6]:
import torch

We create the character-to-integer and integer-to-character dictionaries:

In [7]:
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
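
As a quick illustration of how the two dictionaries work together, the sketch below builds them for a hypothetical three-name sample (not `names.txt`) and round-trips one padded name. All names prefixed with `sample_` are made up for this example.

```python
# Hypothetical three-name sample instead of names.txt.
sample_words = ["emma", "olivia", "ava"]
sample_chars = sorted(set(''.join(sample_words)))
sample_stoi = {s: i + 1 for i, s in enumerate(sample_chars)}
sample_stoi['.'] = 0  # index 0 is reserved for the boundary token
sample_itos = {i: s for s, i in sample_stoi.items()}

# Encode a padded name to integers and decode it back.
encoded = [sample_stoi[c] for c in ".emma."]
decoded = ''.join(sample_itos[i] for i in encoded)
print(encoded, decoded)
```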

This is where the trigram model starts to differ from the bigram model. Instead of a two-dimensional matrix of counts we use a three-dimensional tensor. The first two indices correspond to the first two characters of a triple, and the third index to the character that follows them. The context used to predict a character is therefore two characters long, whereas the bigram model used only one.

We initialise the count tensor. Each dimension has size 27: the 26 letters plus the boundary token `.`.

In [8]:
N = torch.zeros((27, 27, 27), dtype=torch.int32)

We fill the tensor with the trigram counts:

In [9]:
for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    ix3 = stoi[ch3]
    N[ix1, ix2, ix3] += 1
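
To make the sliding window concrete, here is a small sketch of the triples that the loop above extracts for a single name. The word `emma` is only an illustrative choice.

```python
# Illustrative only: list the (context, target) triples for one example word.
w = "emma"
chs = ['.'] + list(w) + ['.']

# zip stops at the shortest sequence, so this yields len(chs) - 2 triples.
triples = list(zip(chs, chs[1:], chs[2:]))
for ch1, ch2, ch3 in triples:
    print(f"{ch1}{ch2} -> {ch3}")
```

With a single leading `.`, the first letter of a name only ever appears as context and is never a prediction target.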

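Now we convert the 3D count tensor `N` into a probability distribution over the third character for each two-character context, i.e. we normalise along the last dimension. We also add 1 to every count (add-one smoothing) so that every character has a non-zero probability of following any two characters, even if that trigram never occurs in the data.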
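
Written out, the smoothed estimate looks as follows. The symbols $i, j, k$ are notation introduced here for illustration: $N_{ijk}$ stands for the entry `N[i, j, k]`, and 27 is the vocabulary size.

$$
P(k \mid i, j) = \frac{N_{ijk} + 1}{\sum_{k'=0}^{26} N_{ijk'} + 27}
$$

For a context $(i, j)$ that never occurs in the data every count is zero, so the estimate reduces to the uniform value $1/27$. This is why some of the smoothed row sums are exactly 27.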

In [18]:
P = (N+1).float()
P.sum(2, keepdim = True).shape


torch.Size([27, 27, 1])

In [19]:
P.sum(2, keepdim = True)

tensor([[[  27.],
         [4437.],
         [1333.],
         [1569.],
         [1717.],
         [1558.],
         [ 444.],
         [ 696.],
         [ 901.],
         [ 618.],
         [2449.],
         [2990.],
         [1599.],
         [2565.],
         [1173.],
         [ 421.],
         [ 542.],
         [ 119.],
         [1666.],
         [2082.],
         [1335.],
         [ 105.],
         [ 403.],
         [ 334.],
         [ 161.],
         [ 562.],
         [ 956.]],

        [[  27.],
         [ 583.],
         [ 568.],
         [ 497.],
         [1069.],
         [ 719.],
         [ 161.],
         [ 195.],
         [2359.],
         [1677.],
         [ 202.],
         [ 595.],
         [2555.],
         [1661.],
         [5465.],
         [  90.],
         [ 109.],
         [  87.],
         [3291.],
         [1145.],
         [ 714.],
         [ 408.],
         [ 861.],
         [ 188.],
         [ 209.],
         [2077.],
         [ 462.]],

        [[  27.],
         ...

In [20]:
P = (N+1).float()
P /= P.sum(2, keepdim = True)
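
The role of `keepdim=True` can be checked on a small example. The sketch below uses a hypothetical 3-symbol vocabulary (the names `N_toy` and `P_toy` are made up here) and verifies that every context row sums to 1 after normalisation.

```python
import torch

# Toy counts with a hypothetical vocabulary of 3 symbols (not the real 27).
N_toy = torch.tensor([[[0, 2, 1], [4, 0, 0], [0, 0, 0]],
                      [[1, 1, 1], [0, 3, 0], [2, 2, 0]],
                      [[0, 0, 5], [1, 0, 1], [0, 0, 0]]], dtype=torch.int32)

P_toy = (N_toy + 1).float()
# keepdim=True keeps the sum at shape (3, 3, 1), so the division
# broadcasts along the last axis only.
P_toy /= P_toy.sum(2, keepdim=True)

# Every [i, j] row should now be a probability distribution.
row_sums = P_toy.sum(2)
print(row_sums)
```

The all-zero context `N_toy[0, 2]` ends up uniform, which mirrors what happens to unseen contexts in the full tensor.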

Now we sample a few names from the model:

In [21]:
g = torch.Generator().manual_seed(2147483647)

for i in range(5):
  
  out = []
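  # Start from the context ('.', '.'). This context never occurs in the counts
  # (names are padded with a single '.'), so the first letter is drawn from the
  # smoothed uniform row P[0, 0].
  ix1, ix2 = 0, 0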

  while True:
    p = P[ix1, ix2]  # Get the probability distribution for the next character
    ix3 = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    out.append(itos[ix3])  # Convert index back to character
    
    if ix3 == 0:  # Stop when the end token '.' is generated
      break

    # Shift indices: move (ix2 -> ix1) and (ix3 -> ix2) for the next iteration
    ix1, ix2 = ix2, ix3

  print(''.join(out))


junide.
ilyasid.
prelay.
ocin.
fairritoper.
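
The start context `('.', '.')` is never observed when names are padded with a single `.`. One possible variant, sketched here on a tiny hypothetical word list rather than `names.txt`, pads each word with two start tokens so that the first letter is learned from data. This is an alternative under stated assumptions, not what the cells above do, and using it would change the loss computed on the full dataset.

```python
import torch

# Hypothetical toy data standing in for names.txt.
toy_words = ["ava", "ana", "eva"]
toy_chars = sorted(set(''.join(toy_words)))
toy_stoi = {s: i + 1 for i, s in enumerate(toy_chars)}
toy_stoi['.'] = 0
V = len(toy_stoi)  # vocabulary size of the toy data

N2 = torch.zeros((V, V, V), dtype=torch.int32)
for w in toy_words:
    # Two leading '.' tokens: the first triple is ('.', '.', first letter).
    chs = ['.', '.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        N2[toy_stoi[ch1], toy_stoi[ch2], toy_stoi[ch3]] += 1

# The row for the start context now holds first-letter counts.
print(N2[0, 0])
```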


Now we compute the normalised negative log-likelihood, i.e. the average negative log-probability the model assigns to each trigram in the data, and compare it to the bigram model:
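
As a formula, with notation introduced here for illustration ($n$ is the number of trigrams and $(c_1^{(t)}, c_2^{(t)}, c_3^{(t)})$ is the $t$-th trigram in the data):

$$
\mathcal{L} = -\frac{1}{n} \sum_{t=1}^{n} \log P\left(c_3^{(t)} \mid c_1^{(t)}, c_2^{(t)}\right)
$$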

In [23]:
log_likelihood = 0.0
n = 0

for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1, ix2, ix3 = stoi[ch1], stoi[ch2], stoi[ch3]      
    prob = P[ix1, ix2, ix3]
    logprob = torch.log(prob)
    log_likelihood += logprob
    n += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n}')

log_likelihood=tensor(-410414.9688)
nll=tensor(410414.9688)
2.092747449874878
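
The same average can also be computed without accumulating in a Python loop, by gathering all probabilities at once with index tensors. This is a minimal sketch on a toy probability tensor. The names `P_toy`, `ix1`, `ix2`, `ix3` and the data are hypothetical.

```python
import torch

torch.manual_seed(0)
# Hypothetical normalised probabilities for a 4-symbol vocabulary.
P_toy = torch.rand(4, 4, 4)
P_toy /= P_toy.sum(2, keepdim=True)

# Hypothetical index triples (context 1, context 2, target), one per example.
ix1 = torch.tensor([0, 1, 2, 3])
ix2 = torch.tensor([1, 2, 3, 1])
ix3 = torch.tensor([2, 3, 1, 0])

# Advanced indexing picks P_toy[ix1[n], ix2[n], ix3[n]] for every n at once.
nll_vec = -P_toy[ix1, ix2, ix3].log().mean()

# Reference: an explicit loop in the style of the cell above.
nll_loop = -sum(torch.log(P_toy[a, b, c]) for a, b, c in zip(ix1, ix2, ix3)) / len(ix1)
print(nll_vec.item(), nll_loop.item())
```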


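The average negative log-likelihood of about 2.09 is lower than the roughly 2.45 that the counting-based bigram model reaches in the lecture, so the trigram model fits the training data better.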