# In this Jupyter notebook we will solve the exercises from lesson 2 of Andrej Karpathy's YouTube series on LLMs


## E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; did it improve over a bigram model?

We will use counting first


We begin by loading all the data

In [None]:
words = open('../names.txt', 'r').read().splitlines()

And importing PyTorch

In [None]:
import torch

We create our character-to-integer and integer-to-character dictionaries

In [None]:
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}

Here the trigram model departs from the bigram one. Instead of a two-dimensional matrix, we store the counts in a three-dimensional tensor: the first two indices correspond to the two context characters, and the third index to the character being predicted. The context is now two characters, whereas the bigram model used only one.

We initialise our count matrix

In [None]:
N = torch.zeros((27, 27, 27), dtype=torch.int32)

We load the information into the matrix

In [None]:
for w in words:
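  # pad with two start tokens so the first character gets its own ('.', '.') context,
  # which is the context the sampling loop starts from
  chs = ['.', '.'] + list(w) + ['.']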
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    ix3 = stoi[ch3]
    N[ix1, ix2, ix3] += 1
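
To make the windowing concrete, here is a small sketch of the triples that `zip(chs, chs[1:], chs[2:])` produces for one name. The helper `trigrams` and its `n_start` parameter are hypothetical and not part of the notebook. They only illustrate that the number of leading `'.'` tokens decides whether the first character of a name gets a context of its own.

```python
def trigrams(word, n_start=1):
    """Return the (ch1, ch2, ch3) triples for `word`, padded with '.' tokens."""
    chs = ['.'] * n_start + list(word) + ['.']
    return list(zip(chs, chs[1:], chs[2:]))

single = trigrams('emma')             # one start token
double = trigrams('emma', n_start=2)  # two start tokens

print(single)
print(double)
```

With one start token, the first triple already uses the first two letters as context, so no triple has the first letter as its target. With two start tokens there is an extra triple `('.', '.', 'e')`.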

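Now we convert the 3D count tensor `N` into probabilities: for each pair of context characters (the first two dimensions), we normalise over the third dimension so that the entries sum to 1. We also add 1 to every count (add-one smoothing) so that every character has a non-zero probability of following any two-character context, even one never seen in the data.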
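
Written out, with $N$ the count tensor and 27 the size of the alphabet (26 letters plus the `.` token), the add-one smoothed estimate computed in the next cells is:

$$
P(c_3 \mid c_1, c_2) = \frac{N[c_1, c_2, c_3] + 1}{\sum_{c} N[c_1, c_2, c] + 27}
$$

The 27 in the denominator comes from adding 1 to each of the 27 entries along the last axis.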

In [None]:
P = (N+1).float()
P.sum(2, keepdim = True).shape


In [None]:
P.sum(2, keepdim = True)

In [None]:
P = (N+1).float()
P /= P.sum(2, keepdim = True)
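
As a sanity check on the broadcasting, the same normalisation can be tried on a tiny tensor. This is a minimal sketch with a made-up 3-symbol alphabet. The names `N_toy` and `P_toy` and the count values are illustrative assumptions, not data from the notebook.

```python
import torch

# Toy counts with a 3-symbol alphabet (the notebook uses 27).
N_toy = torch.tensor([[[0, 2, 1], [3, 0, 0], [1, 1, 1]],
                      [[0, 0, 0], [5, 1, 0], [2, 2, 0]],
                      [[1, 0, 0], [0, 0, 4], [0, 3, 3]]], dtype=torch.int32)

P_toy = (N_toy + 1).float()                 # add-one smoothing
P_toy = P_toy / P_toy.sum(2, keepdim=True)  # (3, 3, 3) / (3, 3, 1): broadcast over the last axis

row_sums = P_toy.sum(2)  # one sum per (first, second) context
print(row_sums)          # every entry should be 1
print(P_toy[1, 0])       # a context with zero counts becomes uniform
```

Without `keepdim=True` the sum would have shape `(3, 3)` and would be broadcast against the wrong axes, so it is worth checking that each context really sums to 1.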

Now we want to generate some examples

In [None]:
g = torch.Generator().manual_seed(2147483647)

for i in range(5):
  
  out = []
  ix1, ix2 = 0, 0  # Start with the beginning token ('.', '.')

  while True:
    p = P[ix1, ix2]  # Get the probability distribution for the next character
    ix3 = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    out.append(itos[ix3])  # Convert index back to character
    
    if ix3 == 0:  # Stop when the end token '.' is generated
      break

    # Shift indices: move (ix2 -> ix1) and (ix3 -> ix2) for the next iteration
    ix1, ix2 = ix2, ix3

  print(''.join(out))


Now we compute the average negative log-likelihood (the total divided by the number of trigrams) and compare it with the bigram model's

In [None]:
log_likelihood = 0.0
n = 0

for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1, ix2, ix3 = stoi[ch1], stoi[ch2], stoi[ch3]      
    prob = P[ix1, ix2, ix3]
    logprob = torch.log(prob)
    log_likelihood += logprob
    n += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n}')

The average negative log-likelihood printed above is lower than the one obtained with the bigram model, so the trigram model is an improvement
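
The same average can be computed without a Python loop over probabilities by indexing `P` with three index tensors at once. The sketch below is self-contained and uses a toy word list (`words_toy` is an assumption, not the notebook's `names.txt`). It compares the vectorised result with the explicit loop.

```python
import torch

words_toy = ['emma', 'ava', 'mia']  # toy data, not the notebook's names.txt
chars = sorted(set(''.join(words_toy)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0
V = len(stoi)  # alphabet size including '.'

# Gather all trigram indices once.
idx = []
for w in words_toy:
    chs = ['.'] + list(w) + ['.']
    for c1, c2, c3 in zip(chs, chs[1:], chs[2:]):
        idx.append((stoi[c1], stoi[c2], stoi[c3]))
idx = torch.tensor(idx)

# Counts, add-one smoothing, normalisation over the last axis.
N = torch.zeros((V, V, V), dtype=torch.int32)
for i1, i2, i3 in idx.tolist():
    N[i1, i2, i3] += 1
P = (N + 1).float()
P /= P.sum(2, keepdim=True)

# Vectorised: look up every trigram probability in one indexing operation.
probs = P[idx[:, 0], idx[:, 1], idx[:, 2]]
nll_vec = -probs.log().mean()

# Reference: the explicit loop.
nll_loop = -sum(torch.log(P[i1, i2, i3]) for i1, i2, i3 in idx.tolist()) / len(idx)
print(nll_vec.item(), nll_loop.item())
```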

---------------------
---------------------
We now build the same trigram model as a one-layer neural network

In [None]:
# create the training set of trigrams: input (ch1, ch2) and output ch3
xs, ys = [], []

for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1, ix2, ix3 = stoi[ch1], stoi[ch2], stoi[ch3]    
    xs.append([ix1, ix2])
    ys.append(ix3)
    
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num_examples = xs.shape[0]

In [None]:
xs.shape


In [None]:
import torch.nn.functional as F

# Combine the two context indices (ix1, ix2) into a single flat index in [0, 728].
bigram_idx = xs[:, 0] * 27 + xs[:, 1]  # shape: (num_examples,)

# One-hot encode the flat indices into 729-dimensional vectors.
xenc = F.one_hot(bigram_idx, num_classes=27*27).float()
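
Two properties of this encoding are worth checking. The flat index `ix1 * 27 + ix2` can be inverted to recover the pair, and multiplying a one-hot row by a weight matrix simply selects one row of that matrix. The sketch below illustrates both. The context pairs in `ctx` and the matrix `W_toy` are made-up values for illustration.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
W_toy = torch.randn((27 * 27, 27), generator=g)

ctx = torch.tensor([[0, 5], [5, 13], [13, 13]])  # hypothetical (ix1, ix2) pairs
flat = ctx[:, 0] * 27 + ctx[:, 1]                # values in [0, 728]

logits_onehot = F.one_hot(flat, num_classes=27 * 27).float() @ W_toy
logits_lookup = W_toy[flat]                      # plain row indexing

# The flat index can be inverted to recover the original pair.
recovered = torch.stack([flat // 27, flat % 27], dim=1)

print(flat.tolist())
print(torch.allclose(logits_onehot, logits_lookup))
```

Because the two results agree, row indexing is a memory-friendly alternative to materialising a `(num_examples, 729)` one-hot matrix.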

In [None]:
# randomly initialize 27 neurons' weights. each neuron receives 27*27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27*27, 27), generator=g, requires_grad=True)

In [None]:
# gradient descent
for k in range(1):
  
  # forward pass
  logits = xenc @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(num_examples), ys].log().mean() + 0.01*(W**2).mean()
  print(loss.item())
  
  # backward pass
  W.grad = None # set to zero the gradient
  loss.backward()
  
  # update
  W.data += -50 * W.grad
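
The data term of the loss above (exponentiate, normalise, take the mean negative log-probability of the targets) is the cross-entropy. As a hedged sketch on random toy logits, it can be checked against PyTorch's built-in `F.cross_entropy`. The regularisation term `0.01*(W**2).mean()` is not included in this comparison.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
logits = torch.randn((8, 27), generator=g)         # toy logits for 8 examples
targets = torch.randint(0, 27, (8,), generator=g)  # toy target indices

# Manual softmax followed by the mean negative log-likelihood.
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(8), targets].log().mean()

# Built-in equivalent, which works on the logits directly.
builtin = F.cross_entropy(logits, targets)

print(manual.item(), builtin.item())
```

The built-in version avoids forming `exp(logits)` explicitly, which helps when logits become large.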




## E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

We do this for the trigram neural-network model only. We begin by shuffling the list of words so that the splits are random.

In [None]:
import random
random.seed(42)  # arbitrary fixed seed so the shuffle (and hence the split) is reproducible
random.shuffle(words)

In [None]:
# create the training set of trigrams: input (ch1, ch2) and output ch3
xs, ys = [], []

for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1, ix2, ix3 = stoi[ch1], stoi[ch2], stoi[ch3]    
    xs.append([ix1, ix2])
    ys.append(ix3)
    
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num_examples = xs.shape[0]
# Combine the two context indices (ix1, ix2) into a single flat index in [0, 728].
bigram_idx = xs[:, 0] * 27 + xs[:, 1]  # shape: (num_examples,)

# One-hot encode the flat indices into 729-dimensional vectors.
import torch.nn.functional as F
xenc = F.one_hot(bigram_idx, num_classes=27*27).float()

# calculate the split boundaries (80% / 10% / 10% of the trigram examples)
a = int(num_examples*0.8)
b = int(num_examples*0.9)

# create the train, dev and test sets after the encoding has been done
xs_tr, xs_dev, xs_te = xenc[:a], xenc[a:b], xenc[b:]
ys_tr, ys_dev, ys_te = ys[:a], ys[a:b], ys[b:]
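
The cell above slices the list of trigram examples, so a name that falls on a boundary can contribute trigrams to two splits. One alternative, sketched below, is to split the word list itself and build the examples for each split afterwards. The helper `split_words`, its `seed` parameter and the toy names are hypothetical, not part of the notebook.

```python
import random

def split_words(words, seed=42):
    """Shuffle a copy of `words` and cut it into 80% / 10% / 10% lists."""
    shuffled = list(words)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # local RNG, so the split is reproducible
    n = len(shuffled)
    a = (8 * n) // 10                      # integer arithmetic avoids float rounding
    b = (9 * n) // 10
    return shuffled[:a], shuffled[a:b], shuffled[b:]

toy = [f'name{i}' for i in range(100)]
train, dev, test = split_words(toy)
print(len(train), len(dev), len(test))
```

Each of the three lists could then be passed through the same trigram-building loop to obtain its own `xs` and `ys`.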


In [None]:
# randomly initialize 27 neurons' weights. each neuron receives 27*27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27*27, 27), generator=g, requires_grad=True)

In [None]:
# gradient descent
for k in range(500):
  
  # forward pass
  logits = xs_tr @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(a), ys_tr].log().mean() + 0.01*(W**2).mean()
  print(loss.item())
  
  # backward pass
  W.grad = None # set to zero the gradient
  loss.backward()
  
  # update
  W.data += -50 * W.grad
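
The exercise also asks for the loss on the dev and test splits. A minimal sketch of such an evaluation follows, under stated assumptions: it takes flat context indices rather than the one-hot tensors used above, it reports the unregularised loss, and it runs on random toy data with untrained weights. The names `split_loss`, `x_all`, `y_all` and `W_toy` are hypothetical.

```python
import torch
import torch.nn.functional as F

def split_loss(W, x_idx, y):
    """Average negative log-likelihood of `y` given flat context indices `x_idx`."""
    with torch.no_grad():   # evaluation only, no gradients needed
        logits = W[x_idx]   # same result as one_hot(x_idx) @ W
        return F.cross_entropy(logits, y).item()

# Toy stand-ins for the notebook's tensors (random data, for illustration only).
g = torch.Generator().manual_seed(0)
x_all = torch.randint(0, 27 * 27, (1000,), generator=g)
y_all = torch.randint(0, 27, (1000,), generator=g)
a, b = 800, 900
W_toy = torch.zeros((27 * 27, 27))  # untrained weights give uniform predictions

for name, lo, hi in [('train', 0, a), ('dev', a, b), ('test', b, 1000)]:
    print(name, split_loss(W_toy, x_all[lo:hi], y_all[lo:hi]))
```

With the notebook's own tensors, the equivalent calls would be `F.cross_entropy(xs_dev @ W, ys_dev)` and `F.cross_entropy(xs_te @ W, ys_te)`, wrapped in `torch.no_grad()`. If the dev and test losses are close to the training loss, the model is not overfitting the training split.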