# In this Jupyter notebook we solve the exercises from lesson 2 of Andrej Karpathy's YouTube series on LLMs


## E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; did it improve over a bigram model?

We use the counting approach first.

We begin by loading the data.

In [75]:
words = open('../names.txt', 'r').read().splitlines()

Next we import PyTorch.

In [76]:
import torch

We create the character-to-integer and integer-to-character dictionaries.

In [77]:
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
itos = {i:s for s,i in stoi.items()}
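
To make the mapping concrete, here is a minimal sketch on a made-up two-word list (`toy_words` is a hypothetical stand-in for the contents of `names.txt`). It only illustrates that index 0 is reserved for the boundary token `.` and that encoding followed by decoding returns the original string.

```python
# Toy word list standing in for names.txt (hypothetical data).
toy_words = ["emma", "ava"]

chars = sorted(set("".join(toy_words)))        # ['a', 'e', 'm', 'v']
stoi = {s: i + 1 for i, s in enumerate(chars)}  # letters start at index 1
stoi["."] = 0                                   # index 0 is the boundary token
itos = {i: s for s, i in stoi.items()}

encoded = [stoi[c] for c in ".emma."]
decoded = "".join(itos[i] for i in encoded)
print(encoded)  # [0, 2, 3, 3, 1, 0]
print(decoded)  # .emma.
```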

Here the difference from the bigram model begins. Instead of a two-dimensional matrix we store the counts in a three-dimensional tensor, since we are building a trigram model. The first two indices correspond to the first two characters of a triple, and the third index to the character that follows them. The context used to predict the next character is therefore two characters, as opposed to one in the bigram model.

We initialise the count tensor.

In [78]:
N = torch.zeros((27, 27, 27), dtype=torch.int32)

We fill the tensor with the trigram counts.

In [79]:
for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    ix3 = stoi[ch3]
    N[ix1, ix2, ix3] += 1
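
The loop above slides a window of three characters over each padded word. A small sketch on a single made-up word shows which triples the `zip` produces. Note that with a single leading `.` the context `('.', '.')` never appears among them.

```python
word = "emma"  # hypothetical example word
chs = ["."] + list(word) + ["."]

# zip stops at the shortest sequence, so we get len(word) trigrams.
trigrams = list(zip(chs, chs[1:], chs[2:]))
for t in trigrams:
    print(t)
# ('.', 'e', 'm')
# ('e', 'm', 'm')
# ('m', 'm', 'a')
# ('m', 'a', '.')
```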

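Now we convert the 3D tensor `N` of counts into probabilities by normalising along the third dimension, so that for each pair of context characters `P[ix1, ix2]` is a distribution over the next character. We also add 1 to every count (smoothing) so that every character has non-zero probability after any two characters, even if it is unlikely.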

In [80]:
P = (N+1).float()
P.sum(2, keepdim = True).shape


torch.Size([27, 27, 1])

In [81]:
P.sum(2, keepdim = True)

tensor([[[  27.],
         [4437.],
         [1333.],
         [1569.],
         [1717.],
         [1558.],
         [ 444.],
         [ 696.],
         [ 901.],
         [ 618.],
         [2449.],
         [2990.],
         [1599.],
         [2565.],
         [1173.],
         [ 421.],
         [ 542.],
         [ 119.],
         [1666.],
         [2082.],
         [1335.],
         [ 105.],
         [ 403.],
         [ 334.],
         [ 161.],
         [ 562.],
         [ 956.]],

        [[  27.],
         [ 583.],
         [ 568.],
         [ 497.],
         [1069.],
         [ 719.],
         [ 161.],
         [ 195.],
         [2359.],
         [1677.],
         [ 202.],
         [ 595.],
         [2555.],
         [1661.],
         [5465.],
         [  90.],
         [ 109.],
         [  87.],
         [3291.],
         [1145.],
         [ 714.],
         [ 408.],
         [ 861.],
         [ 188.],
         [ 209.],
         [2077.],
         [ 462.]],

        [[  27.],
      

In [82]:
P = (N+1).float()
P /= P.sum(2, keepdim = True)
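
As a sanity check on the normalisation, the following sketch repeats the same two lines on a tiny made-up tensor (a vocabulary of 3 symbols instead of 27, with hypothetical counts). With `keepdim=True` the sum has shape `(V, V, 1)`, so the division broadcasts over the last dimension and every `P[i, j]` sums to 1. A context with no observed counts ends up uniform.

```python
import torch

V = 3  # hypothetical tiny vocabulary size (the notebook uses 27)
N = torch.zeros((V, V, V), dtype=torch.int32)
N[0, 1, 2] = 3  # made-up counts for the context (0, 1)
N[0, 1, 0] = 1

P = (N + 1).float()            # add-one smoothing
P /= P.sum(2, keepdim=True)    # (V, V, V) divided by (V, V, 1)

print(P[0, 1])  # smoothed counts [2, 1, 4] divided by 7
print(P[0, 0])  # unseen context: uniform over the 3 symbols
```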

Now we sample a few names from the model.

In [83]:
g = torch.Generator().manual_seed(2147483647)

for i in range(5):
  
  out = []
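  # Start from the context ('.', '.'). Words are padded with a single '.', so this
  # context never occurs in the counts and the first character is drawn uniformly
  # (only the +1 smoothing contributes).
  ix1, ix2 = 0, 0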

  while True:
    p = P[ix1, ix2]  # Get the probability distribution for the next character
    ix3 = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    out.append(itos[ix3])  # Convert index back to character
    
    if ix3 == 0:  # Stop when the end token '.' is generated
      break

    # Shift indices: move (ix2 -> ix1) and (ix3 -> ix2) for the next iteration
    ix1, ix2 = ix2, ix3

  print(''.join(out))


junide.
ilyasid.
prelay.
ocin.
fairritoper.
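
The first entry of the row sums printed earlier is 27, which means the context `('.', '.')` has no observed counts and the first character of each sample is drawn uniformly. One possible variant, which is an assumption and not what this notebook does, is to pad every word with two start tokens so that the model also learns the distribution of first characters. A minimal sketch on a made-up word list (`toy_words` and `N2` are hypothetical names):

```python
import torch

toy_words = ["emma", "ava", "eva"]  # hypothetical stand-in for names.txt
chars = sorted(set("".join(toy_words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
V = len(stoi)

N2 = torch.zeros((V, V, V), dtype=torch.int32)
for w in toy_words:
    chs = [".", "."] + list(w) + ["."]  # two start tokens, one end token
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        N2[stoi[ch1], stoi[ch2], stoi[ch3]] += 1

# The context ('.', '.') now holds the counts of first characters.
print(N2[0, 0])
```

With this padding, sampling can still start from `ix1, ix2 = 0, 0`, and the loss would also score the first character of every name.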


Now we compute the average negative log-likelihood (the negative log-likelihood divided by the number of trigrams) and compare it to the bigram model.

In [84]:
log_likelihood = 0.0
n = 0

for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1, ix2, ix3 = stoi[ch1], stoi[ch2], stoi[ch3]      
    prob = P[ix1, ix2, ix3]
    logprob = torch.log(prob)
    log_likelihood += logprob
    n += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n}')

log_likelihood=tensor(-410414.9688)
nll=tensor(410414.9688)
2.092747449874878
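
The same quantity can be computed without a Python loop by indexing the probability tensor with three index vectors at once. The sketch below uses a made-up 2-symbol model and three made-up trigrams, so the numbers are purely illustrative.

```python
import torch

# Hypothetical 2-symbol model: P[i, j] is the distribution over the third symbol.
P = torch.tensor([[[0.5, 0.5], [0.25, 0.75]],
                  [[0.9, 0.1], [0.5, 0.5]]])

# Three made-up trigrams, stored column-wise as (ix1, ix2, ix3).
ix1 = torch.tensor([0, 0, 1])
ix2 = torch.tensor([0, 1, 0])
ix3 = torch.tensor([1, 1, 0])

probs = P[ix1, ix2, ix3]          # one probability per trigram: 0.5, 0.75, 0.9
avg_nll = -probs.log().mean()     # average negative log-likelihood
print(avg_nll.item())
```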


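The average negative log-likelihood is about 2.09, lower than the roughly 2.45 that the bigram counting model reaches in the lecture, so the trigram model improves on the bigram model. Note that the trigram loop never scores the first character of a name, so the two averages are not computed over exactly the same predictions.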

---------------------
We now build the same model as a one-layer neural network.

In [85]:
# create the training set of trigrams: input (ch1, ch2) and output ch3
xs, ys = [], []

for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1, ix2, ix3 = stoi[ch1], stoi[ch2], stoi[ch3]    
    xs.append([ix1, ix2])
    ys.append(ix3)
    
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num_examples = xs.shape[0]

In [86]:
xs.shape


torch.Size([196113, 2])

In [87]:
# Combine the two context indices into a single index in 0..728, one per character pair.
bigram_idx = xs[:, 0] * 27 + xs[:, 1]  # shape: (num_examples,)

# One-hot encode the bigram indices into a 729-dimensional vector.
import torch.nn.functional as F
xenc = F.one_hot(bigram_idx, num_classes=27*27).float()
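
Multiplying a one-hot vector by `W` simply selects one row of `W`, so the combined index can also be used to look the logits up directly. The sketch below checks this on three made-up context pairs. The seed and the pairs are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)  # arbitrary seed for this sketch
W = torch.randn((27 * 27, 27), generator=g)

xs = torch.tensor([[0, 5], [5, 13], [13, 13]])  # made-up (ix1, ix2) pairs
idx = xs[:, 0] * 27 + xs[:, 1]                  # unique index in 0..728

logits_onehot = F.one_hot(idx, num_classes=27 * 27).float() @ W
logits_lookup = W[idx]                          # same rows, no 729-wide matrix

print(idx.tolist())  # [5, 148, 364]
print(torch.allclose(logits_onehot, logits_lookup))
```

The lookup avoids materialising a `(num_examples, 729)` float matrix, which may matter for memory on the full dataset.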

In [88]:
# randomly initialize 27 neurons' weights. each neuron receives 27*27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27*27, 27), generator=g, requires_grad=True)

In [95]:
# gradient descent
for k in range(1):
  
  # forward pass
  logits = xenc @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(num_examples), ys].log().mean() + 0.01*(W**2).mean()
  print(loss.item())
  
  # backward pass
  W.grad = None # reset the gradient
  loss.backward()
  
  # update
  W.data += -50 * W.grad




2.1307287216186523
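
The negative log-likelihood part of the loss above (exponentiate, normalise, pick the probability of the target, take the mean of the negative log) is what `torch.nn.functional.cross_entropy` computes from the logits. The sketch below compares both on made-up logits and targets. It leaves out the L2 term, which would be added separately.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)         # arbitrary seed for this sketch
logits = torch.randn((4, 27), generator=g)   # made-up logits for 4 examples
ys = torch.tensor([0, 3, 26, 7])             # made-up target indices

# Manual softmax followed by the average negative log-likelihood.
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
manual = -probs[torch.arange(4), ys].log().mean()

# Built-in equivalent, computed in a numerically more stable way.
builtin = F.cross_entropy(logits, ys)

print(manual.item(), builtin.item())
```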


## E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

We do this only for the trigram model, starting by shuffling the list of words.

In [90]:
import random
random.shuffle(words)

In [91]:
# create the training set of trigrams: input (ch1, ch2) and output ch3
xs, ys = [], []

for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
    ix1, ix2, ix3 = stoi[ch1], stoi[ch2], stoi[ch3]    
    xs.append([ix1, ix2])
    ys.append(ix3)
    
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num_examples = xs.shape[0]
# Combine the two indices into a single index for a bigram.
bigram_idx = xs[:, 0] * 27 + xs[:, 1]  # shape: (N,)

# One-hot encode the bigram indices into a 729-dimensional vector.
import torch.nn.functional as F
xenc = F.one_hot(bigram_idx, num_classes=27*27).float()

# compute the split indices (80% / 10% / 10% of the trigram examples)
a = int(num_examples*0.8)
b = int(num_examples*0.9)

# creating training dev and test set after encoding has been done
xs_tr, xs_dev, xs_te = xenc[:a], xenc[a:b], xenc[b:]
ys_tr, ys_dev, ys_te = ys[:a], ys[a:b], ys[b:]
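
The split above is made over trigram examples, so a name that sits at a boundary can contribute examples to two splits. One possible alternative, which is an assumption and not what this notebook does, is to split the shuffled word list itself and build the examples for each split afterwards. A minimal sketch with a made-up word list and a fixed seed for reproducibility:

```python
import random

toy_words = [f"name{i}" for i in range(20)]  # hypothetical stand-in for names.txt

rng = random.Random(42)      # fixed seed so the split is reproducible
shuffled = toy_words[:]      # copy, so the original order is preserved
rng.shuffle(shuffled)

n1 = int(0.8 * len(shuffled))
n2 = int(0.9 * len(shuffled))
train_words = shuffled[:n1]
dev_words = shuffled[n1:n2]
test_words = shuffled[n2:]

print(len(train_words), len(dev_words), len(test_words))  # 16 2 2
```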


In [92]:
# randomly initialize 27 neurons' weights. each neuron receives 27*27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27*27, 27), generator=g, requires_grad=True)

In [94]:
# gradient descent
for k in range(500):
  
  # forward pass
  logits = xs_tr @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(a), ys_tr].log().mean() + 0.01*(W**2).mean()
  print(loss.item())
  
  # backward pass
  W.grad = None # reset the gradient
  loss.backward()
  
  # update
  W.data += -50 * W.grad

2.1671576499938965
2.166998863220215
2.1668405532836914
2.166682720184326
2.1665258407592773
2.1663694381713867
2.1662135124206543
2.166058301925659
2.1659035682678223
2.1657493114471436
2.165595531463623
2.165442705154419
2.165290117263794
2.1651382446289062
2.1649866104125977
2.1648359298706055
2.1646854877471924
2.1645357608795166
2.164386510848999
2.1642377376556396
2.1640896797180176
2.1639420986175537
2.163794994354248
2.1636481285095215
2.1635022163391113
2.1633565425872803
2.1632115840911865
2.163067102432251
2.1629230976104736
2.1627795696258545
2.1626367568969727
2.162493944168091
2.1623520851135254
2.162210702896118
2.162069320678711
2.16192889213562
2.1617889404296875
2.161649465560913
2.161510467529297
2.1613717079162598
2.161233901977539
2.1610960960388184
2.160959005355835
2.1608221530914307
2.1606860160827637
2.160550594329834
2.1604151725769043
2.160280227661133
2.1601459980010986
2.1600120067596436
2.159878730773926
2.159745454788208
2.1596131324768066
2.1594810485839

In [96]:
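def evaluate(xenc_eval, ys_eval, W):
    # Returns the same objective as in training: average negative log-likelihood
    # plus the 0.01 L2 term on W, so the value is slightly above the pure NLL.
    num_eval = xenc_eval.shape[0]
    logits_eval = xenc_eval @ W
    counts_eval = logits_eval.exp()
    probs_eval  = counts_eval / counts_eval.sum(1, keepdims=True)
    loss_eval = -probs_eval[torch.arange(num_eval), ys_eval].log().mean() + 0.01 * (W**2).mean()
    return loss_eval.item()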

dev_loss  = evaluate(xs_dev, ys_dev, W)
test_loss = evaluate(xs_te, ys_te, W)
print("Dev loss:", dev_loss)
print("Test loss:", test_loss)


Dev loss: 2.1583569049835205
Test loss: 2.14095139503479
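
The losses above include the L2 penalty on `W`. When comparing models or regularization strengths, it may be more convenient to report the negative log-likelihood alone. The helper below (`nll_only` is a hypothetical name, not part of the notebook) is a sketch of that, checked on a case where the answer is known: zero weights give a uniform prediction over 27 characters, so the loss should be `log(27)`.

```python
import math

import torch
import torch.nn.functional as F

def nll_only(xenc_eval, ys_eval, W):
    """Average negative log-likelihood, without any regularization term."""
    with torch.no_grad():  # evaluation only, no gradients needed
        return F.cross_entropy(xenc_eval @ W, ys_eval).item()

# Made-up check: all-zero weights predict every character with probability 1/27.
W0 = torch.zeros((27 * 27, 27))
x = F.one_hot(torch.tensor([0, 100, 728]), num_classes=27 * 27).float()
y = torch.tensor([1, 2, 3])

loss_uniform = nll_only(x, y, W0)
expected = math.log(27)
print(loss_uniform, expected)
```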


## E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?

We rewrite almost all the code so that the variables are reinitialised at the right moment.

In [None]:
import torch
import torch.nn.functional as F
import random

# Shuffle the words list so that your splits are random.
random.shuffle(words)

# Build the trigram dataset: inputs (bigram: ch1,ch2) and target (ch3)
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1, ix2, ix3 = stoi[ch1], stoi[ch2], stoi[ch3]
        xs.append([ix1, ix2])
        ys.append(ix3)

xs = torch.tensor(xs)  # shape: (N, 2)
ys = torch.tensor(ys)  # shape: (N,)
num_examples = xs.shape[0]

# Combine the two indices into a single unique index for each bigram.
bigram_idx = xs[:, 0] * 27 + xs[:, 1]  # shape: (N,)

# One-hot encode the bigram indices into a vector of size 27*27 = 729.
xenc = F.one_hot(bigram_idx, num_classes=27*27).float()

# Split the data into training (80%), dev (10%), and test (10%)
a = int(num_examples * 0.8)
b = int(num_examples * 0.9)
xs_tr, xs_dev, xs_te = xenc[:a], xenc[a:b], xenc[b:]
ys_tr, ys_dev, ys_te = ys[:a], ys[a:b], ys[b:]

# List of candidate regularization strengths (smoothing factors) to try.
reg_strengths = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5]

results = {}  # To store final train and dev losses for each regularization value

for reg in reg_strengths:
    # Reinitialize weights for each experiment.
    g = torch.Generator().manual_seed(2147483647)
    W = torch.randn((27*27, 27), generator=g, requires_grad=True)
    
    # Train the model on the training set for a fixed number of iterations.
    num_train = xs_tr.shape[0]
    learning_rate = 50  # you can adjust this if needed
    for k in range(500):
        # Forward pass on training set.
        logits = xs_tr @ W      # (num_train, 27)
        counts = logits.exp()
        probs = counts / counts.sum(1, keepdims=True)
        
        # Compute the loss:
        # - Negative log-likelihood of the correct target (first term).
        # - Plus the L2 regularization on W (weighted by 'reg').
        loss_train = -probs[torch.arange(num_train), ys_tr].log().mean() + reg * (W**2).mean()
        
        # Backward pass and weight update.
        W.grad = None
        loss_train.backward()
        W.data += -learning_rate * W.grad  # simple gradient descent update
    
    # After training, evaluate on the dev set.
    num_dev = xs_dev.shape[0]
    logits_dev = xs_dev @ W
    counts_dev = logits_dev.exp()
    probs_dev = counts_dev / counts_dev.sum(1, keepdims=True)
    loss_dev = -probs_dev[torch.arange(num_dev), ys_dev].log().mean() + reg * (W**2).mean()
    
    results[reg] = (loss_train.item(), loss_dev.item())
    print(f"Reg: {reg:>5} | Train loss: {loss_train.item():.4f} | Dev loss: {loss_dev.item():.4f}")

# Choose the best regularization strength based on dev loss.
best_reg = min(results, key=lambda r: results[r][1])
print("\nBest regularization strength based on dev set:", best_reg)

# Now, for the best setting, re-train (or reuse the trained model) and evaluate on the test set.
# Here, for clarity, we reinitialize and train with the best regularization setting.
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27*27, 27), generator=g, requires_grad=True)
num_train = xs_tr.shape[0]
for k in range(500):
    logits = xs_tr @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdims=True)
    loss_train = -probs[torch.arange(num_train), ys_tr].log().mean() + best_reg * (W**2).mean()
    W.grad = None
    loss_train.backward()
    W.data += -learning_rate * W.grad

# Evaluate on the test set.
num_test = xs_te.shape[0]
logits_test = xs_te @ W
counts_test = logits_test.exp()
probs_test = counts_test / counts_test.sum(1, keepdims=True)
loss_test = -probs_test[torch.arange(num_test), ys_te].log().mean() + best_reg * (W**2).mean()
print(f"Test loss with best regularization ({best_reg}): {loss_test.item():.4f}")


Reg: 0.001 | Train loss: 2.1533 | Dev loss: 2.1827
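
The exercise also mentions smoothing, which is the counterpart of regularization for the counting model. As a sketch of how the same dev-set sweep could look there, the code below fits trigram counts on a made-up training list and evaluates the dev loss for several smoothing constants `k`. The word lists, the helper names `trigrams` and `avg_nll`, and the candidate values are all assumptions for illustration.

```python
import math
from collections import Counter

def trigrams(words):
    """Yield (ch1, ch2, ch3) triples, padding each word with one '.' per side."""
    for w in words:
        chs = ["."] + list(w) + ["."]
        yield from zip(chs, chs[1:], chs[2:])

def avg_nll(counts, context_totals, data, k, vocab_size):
    """Average negative log-likelihood under add-k smoothed trigram counts."""
    total = 0.0
    for c1, c2, c3 in data:
        p = (counts[(c1, c2, c3)] + k) / (context_totals[(c1, c2)] + k * vocab_size)
        total -= math.log(p)
    return total / len(data)

train_words = ["emma", "ava", "eva", "mia"]  # hypothetical training split
dev_words = ["ema", "via"]                   # hypothetical dev split
vocab_size = len(set("".join(train_words + dev_words))) + 1  # letters plus '.'

counts = Counter(trigrams(train_words))
context_totals = Counter((c1, c2) for c1, c2, _ in trigrams(train_words))
dev_data = list(trigrams(dev_words))

dev_losses = {k: avg_nll(counts, context_totals, dev_data, k, vocab_size)
              for k in [0.01, 0.1, 1.0, 10.0]}
best_k = min(dev_losses, key=dev_losses.get)
for k, loss in dev_losses.items():
    print(f"k: {k:>5} | dev loss: {loss:.4f}")
print("best k on dev:", best_k)
```

As `k` grows, the predictions approach the uniform distribution and the loss approaches `log(vocab_size)`, which mirrors what a very strong L2 penalty does to the neural network.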
