# NanoGPT walkthrough

This notebook follows the [Let's build GPT: from scratch, in code, spelled out](https://www.youtube.com/watch?v=kCc8FmEb1nY&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=8) video.

Helpful resources on attention:
- [Attention in transformers - 3Blue1Brown](https://www.youtube.com/watch?v=eMlx5fFNoYc&t=269s)

## Get dataset

In [1]:
# Inspect dataset
with open('dataset/emma.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [2]:
# Dataset length
print(f"Dataset length (number of characters): {len(text)}")

Dataset length (number of characters): 880425


## Encoder/decoder

- Encoder: encode string into indices
- Decoder: decode indices into string

In [3]:
# Get vocabulary size (unique characters)
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"Vocab: {''.join(chars)}")
print(f"Vocab size: {vocab_size}")

Vocab: 
 !&(),-.01234678:;?ABCDEFGHIJKLMNOPQRSTUVWXY[]_abcdefghijklmnopqrstuvwxyzàéêï—‘’“”
Vocab size: 83


In [4]:
# Create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

print(encode("jane austen"))

[57, 48, 61, 52, 1, 48, 68, 66, 67, 52, 61]
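
A quick sanity check for a character-level tokenizer is that decoding an encoded string returns the original. The sketch below rebuilds the same mappings on a small stand-in string (`sample_text` is a hypothetical placeholder, not the Emma dataset) so that it runs on its own.

```python
# Stand-in corpus; in the notebook this would be `text` read from the dataset.
sample_text = "emma woodhouse, handsome, clever, and rich"

chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("emma")
round_trip = decode(ids)

# Characters outside the vocabulary raise KeyError: there is no fallback token.
try:
    encode("Z")
    unknown_ok = True
except KeyError:
    unknown_ok = False

print(ids, round_trip, unknown_ok)
```

Because the vocabulary is built from the dataset itself, any prompt passed to `encode` later on has to use only characters that occur in the text.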


## Encode dataset

In [6]:
# Encode the entire dataset and store it in a torch.Tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)

torch.Size([880425]) torch.int64


## Split dataset into `train` and `validation` sets

In [7]:
# First 90% of dataset will be `train`; the rest is `val`
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(f"Dataset size: {len(data)}")
print(f"Training set size: {len(train_data)}")
print(f"Validation set size: {len(val_data)}")

Dataset size: 880425
Training set size: 792382
Validation set size: 88043


## Context window

In [8]:
block_size = 8

In [9]:
# A block of 9 characters actually contains 8 examples

# Example:
x = train_data[:block_size]
y = train_data[1:block_size + 1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"Input {context} ==> Output {target}")

Input tensor([24]) ==> Output 60
Input tensor([24, 60]) ==> Output 60
Input tensor([24, 60, 60]) ==> Output 48
Input tensor([24, 60, 60, 48]) ==> Output 0
Input tensor([24, 60, 60, 48,  0]) ==> Output 0
Input tensor([24, 60, 60, 48,  0,  0]) ==> Output 49
Input tensor([24, 60, 60, 48,  0,  0, 49]) ==> Output 72
Input tensor([24, 60, 60, 48,  0,  0, 49, 72]) ==> Output 1


## Batching

In [10]:
# How many independent sequences will be processed in parallel
batch_size = 4

def get_batch(split='train', batch_size=4, block_size=8):
    """
    Generates a small batch of data with inputs x and targets y
    """
    data = train_data if split == 'train' else val_data

    # Sample `batch_size` random start offsets into `data`;
    # each offset marks where one block of length `block_size` begins
    ix = torch.randint(len(data) - block_size, (batch_size,))

    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

# Example batch:
xx, yy = get_batch('train')
print("=== Inputs ===")
print(f"Shape: {xx.shape}")
print(xx)
print("=== Targets ===")
print(f"Shape: {yy.shape}")
print(yy)
print()

print('=== What this means ===')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xx[b, :t+1]
        target = yy[b,t]
        print(f"  Input {context.tolist()} ==> target {target}")

=== Inputs ===
Shape: torch.Size([4, 8])
tensor([[66, 66,  1, 48, 61, 51,  1, 66],
        [55, 52, 48, 51,  1, 48,  1, 59],
        [55, 56, 60,  1, 61, 62, 67,  1],
        [ 1, 56, 67,  8,  1, 28,  1, 51]])
=== Targets ===
Shape: torch.Size([4, 8])
tensor([[66,  1, 48, 61, 51,  1, 66, 48],
        [52, 48, 51,  1, 48,  1, 59, 56],
        [56, 60,  1, 61, 62, 67,  1, 67],
        [56, 67,  8,  1, 28,  1, 51, 62]])

=== What this means ===
  Input [66] ==> target 66
  Input [66, 66] ==> target 1
  Input [66, 66, 1] ==> target 48
  Input [66, 66, 1, 48] ==> target 61
  Input [66, 66, 1, 48, 61] ==> target 51
  Input [66, 66, 1, 48, 61, 51] ==> target 1
  Input [66, 66, 1, 48, 61, 51, 1] ==> target 66
  Input [66, 66, 1, 48, 61, 51, 1, 66] ==> target 48
  Input [55] ==> target 52
  Input [55, 52] ==> target 48
  Input [55, 52, 48] ==> target 51
  Input [55, 52, 48, 51] ==> target 1
  Input [55, 52, 48, 51, 1] ==> target 48
  Input [55, 52, 48, 51, 1, 48] ==> target 1
  Input [55, 52, 4

## Bigram model

We start off with a simple bigram model.

In [11]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [12]:
# Some notes on dimensions
print(f"n = vocab_size = {vocab_size}")
print(f"B = batch_size = {batch_size} = how many independent sequences are being processed at once")
print(f"T = time = length of the running sequence")
print(f"C = channel = {vocab_size} = size of the feature vector at each position = embedding dimension")
print(f"Right now C = vocab_size")

n = vocab_size = 83
B = batch_size = 4 = how many independent sequences are being processed at once
T = time = length of the running sequence
C = channel = 83 = size of the feature vector at each position = embedding dimension
Right now C = vocab_size


In [13]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Create "embedding" table.
        # - Usually a token's embedding carries semantic meaning, but in a
        #   bigram model, it just predicts "what comes next".
        # - In this lookup table, each token gets mapped to the logits of
        #   the next token.
        # - The lookup table is of dimension (n,n).
        # - nn.Embedding initializes with random values
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, indices, targets=None):
        # `indices` and `targets` are both (B,T) tensors of integers

        # For each idx in `indices`, we fetch its corresponding logits;
        # this produces a (B,T,C) tensor
        logits = self.token_embedding_table(indices)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We want to flatten `logits` so that we have a total of B*T
            # feature vectors of length C.
            logits = logits.view(B*T, C)

            # Also flatten `targets` so that it contains B*T target outputs
            # for each of the feature vectors in `logits`.
            targets = targets.view(B*T)

            # Compute loss
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, indices, max_new_tokens):
        # `indices` is a (B,T) tensor of indices in the current context

        for _ in range(max_new_tokens):
            # Get predictions;
            # `logits` is (B,T,C)
            logits, loss = self(indices) # calls forward()

            # `logits` contains the logits for every index in `indices`,
            # but we actually only need the last time step in each batch
            logits = logits[:, -1, :] # becomes (B,C)

            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B,C)

            # Sample from the probability distribution
            next_idx = torch.multinomial(probs, num_samples=1) # (B,1)

            # Append sampled index to the context for each batch
            indices = torch.cat((indices, next_idx), dim=1) # (B,T+1)

        return indices

### Run the model now without training

In [14]:
# Example:
# Run the model and see what it generates right now (it's not trained)
m = BigramLanguageModel(vocab_size)
logits, loss = m(xx, yy) # recall that xx and yy are a batch in the training set
print("Logits shape:", logits.shape)
print("Loss:", loss)

# Generate some output, starting with [0]
gen = m.generate(indices = torch.zeros((1,1), dtype=torch.long), max_new_tokens=100)
print("Generated:")
print(decode(gen[0].tolist()))

Logits shape: torch.Size([32, 83])
Loss: tensor(5.1510, grad_fn=<NllLossBackward0>)
Generated:

NIïI”;t‘o)Q;cihE,Qp;)xdE-xtmk 4WYgIà—MjU;TU2jT,YuU]g’xb!-Yée‘U2YLcH4-éV3Bqs6P,8N(J;?;gQ”;cn)kw T&iVl
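
A useful reference point for the untrained loss: a model that assigned equal probability to every one of the 83 characters would score a cross-entropy of ln(83), about 4.42. Randomly initialised logits are not uniform, which is presumably why the measured 5.1510 sits above that value. A minimal check using only the standard library:

```python
import math

vocab_size = 83  # from the vocabulary cell above

# Cross-entropy of a uniform prediction: -ln(1 / vocab_size) = ln(vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(f"Loss of a uniform guess: {uniform_loss:.4f}")
```

A trained model should end up well below this number; a loss stuck near it suggests the model has learned nothing about the data.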


### Train the model

In [15]:
m = BigramLanguageModel(vocab_size)

# PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [16]:
# Use bigger batch size
batch_size = 32

# Train for some iterations
iterations = 50000
print_interval = 5000
for step in range(iterations):
    # Sample a batch of data
    xx, yy = get_batch('train', batch_size)

    # Evaluate loss
    logits, loss = m(xx, yy)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if step == 0 or step == iterations-1 or (step+1) % print_interval == 0:
        print(f"Loss at step {step+1}: {loss.item()}")

Loss at step 1: 5.069802761077881
Loss at step 5000: 2.678922176361084
Loss at step 10000: 2.4121274948120117
Loss at step 15000: 2.4237539768218994
Loss at step 20000: 2.4447739124298096
Loss at step 25000: 2.481267213821411
Loss at step 30000: 2.442058563232422
Loss at step 35000: 2.5376250743865967
Loss at step 40000: 2.493802785873413
Loss at step 45000: 2.3820197582244873
Loss at step 50000: 2.394484281539917
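
The values printed above are losses on a single random batch, so they jump around from one print to the next (2.41 at step 10000, 2.54 at step 35000). Averaging over several batches from both splits gives a steadier estimate. The sketch below assumes a model whose forward pass returns `(logits, loss)` like the one in this notebook; `estimate_loss`, `eval_iters`, `TinyBigram`, and the random data are illustrative stand-ins, not part of the original notebook.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)

# Stand-ins so the sketch runs on its own; the notebook would use its real
# vocab_size, train_data, val_data, get_batch, and model `m`.
vocab_size, block_size = 10, 8
train_data = torch.randint(vocab_size, (1000,))
val_data = torch.randint(vocab_size, (200,))

def get_batch(split, batch_size=32):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

class TinyBigram(nn.Module):
    """Minimal model with the same (logits, loss) interface as the notebook's."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, indices, targets=None):
        logits = self.table(indices)  # (B,T,C)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

@torch.no_grad()  # no gradients are needed for evaluation
def estimate_loss(model, eval_iters=50):
    """Average the loss over `eval_iters` random batches for each split."""
    model.eval()
    out = {}
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out

losses = estimate_loss(TinyBigram(vocab_size))
print(losses)
```

Reporting the `val` number alongside `train` also shows whether the model is starting to memorise the training text.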


### Generate some text

In [17]:
# Generate some output, starting with [0]
gen = m.generate(indices = torch.zeros((1,1), dtype=torch.long), max_new_tokens=500)
print("Generated:")
print(decode(gen[0].tolist()))

Generated:

cr swas. win Balddwhe w. ad. ar! icetist be podeshe s aste aly, iss (eat ithelerend d heverorodliloulsh!—Shenqucer
m, te ppad ghef I l to nk
Mr; Emongelouathathamood aldinerso ws l chitlelemusertlell. oread dlsth am algatire blkeman. ood our ees w my_ustonod tr ame hilin, vesth st, mul minofofeand s. condaren monen helive tcio o shil Fancr burertous d ales ald he
hacoty avef
 chitie a? orr; a I lffalapt Shath!—‘Wemawallof e peissby head tebee, mestilld indepeenthed oro Ing se spe whino ryout bth


## Introducing self-attention

We would like the tokens to start talking to each other.

Information only flows from previous context into the future. A token cannot talk to a future token.

In [18]:
# Toy example
B, T, C = 4, 8, 2 # batch, time, channels

x = torch.randn(B,T,C)
print(x.shape)
print(x[0])

torch.Size([4, 8, 2])
tensor([[ 0.2809, -0.6028],
        [-0.8396, -0.4434],
        [ 0.5945,  1.3260],
        [-1.2213, -0.6181],
        [-0.1123, -0.2627],
        [ 0.2901, -0.6130],
        [-0.8300,  0.0629],
        [ 0.0556,  0.2140]])


### Self-attention by taking the average of the context

In [19]:
# Let's start by taking just the *average* of all previous tokens + current token.
# i.e. xbow[b,t] = mean_{i<=t} x[b,i]

# xbow = x "bag of words"
# "bag of words" here means we simply average the token vectors

xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0) # average along `time` dimension => (C,)

print(xbow.shape)
print(xbow[0])

torch.Size([4, 8, 2])
tensor([[ 0.2809, -0.6028],
        [-0.2794, -0.5231],
        [ 0.0119,  0.0933],
        [-0.2964, -0.0846],
        [-0.2596, -0.1202],
        [-0.1680, -0.2023],
        [-0.2625, -0.1645],
        [-0.2228, -0.1172]])


### Trick using matrix multiplication

We can get the same averages with a single matrix multiplication: multiply `x` by a lower-triangular weight matrix `wei` whose rows each sum to 1.

In [20]:
wei = torch.tril(torch.ones(T, T))
wei

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

In [21]:
wei = wei / wei.sum(1, keepdims=True)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [22]:
xbow2 = wei @ x
xbow2[0]

# Note on wei @ x:
# - wei is (T,T) but x is (B,T,C)
# - matmul broadcasts wei across the batch dimension, as if it were (B,T,T)
# - the result is (B,T,C)

# xbow2 matches xbow (up to floating-point rounding)

tensor([[ 0.2809, -0.6028],
        [-0.2794, -0.5231],
        [ 0.0119,  0.0933],
        [-0.2964, -0.0846],
        [-0.2596, -0.1202],
        [-0.1680, -0.2023],
        [-0.2625, -0.1645],
        [-0.2228, -0.1172]])
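
The claim that the matrix form matches the explicit loop can be checked numerically rather than by comparing printed rows. The sketch below regenerates a random `x` of the same shape and compares the two results with `torch.allclose`, which tolerates the small floating-point differences between the two ways of summing; the `atol` value is an assumption chosen for float32.

```python
import torch

torch.manual_seed(0)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Version 1: explicit loop over batch and time
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(dim=0)

# Version 2: lower-triangular weights, each row normalised to sum to 1
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)
xbow2 = wei @ x  # (T,T) broadcasts against (B,T,C) -> (B,T,C)

same = torch.allclose(xbow, xbow2, atol=1e-5)
print(same)
```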

### Another way by using Softmax

In [23]:
# Start by initializing `wei` as all 0's
wei = torch.zeros((T,T))
wei

tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

In [24]:
tril = torch.tril(torch.ones(T,T))
tril

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

In [25]:
wei = wei.masked_fill(tril == 0, float('-inf'))
wei

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

In [26]:
wei = F.softmax(wei, dim=-1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [27]:
xbow3 = wei @ x
xbow3[0]

# xbow3 should be identical to xbow2 and xbow

tensor([[ 0.2809, -0.6028],
        [-0.2794, -0.5231],
        [ 0.0119,  0.0933],
        [-0.2964, -0.0846],
        [-0.2596, -0.1202],
        [-0.1680, -0.2023],
        [-0.2625, -0.1645],
        [-0.2228, -0.1172]])

**Note:**

We use softmax because, after `wei = wei.masked_fill(tril == 0, float('-inf'))`, each `-inf` entry says "this future token has no effect on the current token": softmax maps it to a weight of exactly 0. By extension, the unmasked entries don't all have to be 0. They can take on different values, so tokens can start talking to each other with different weights => self-attention!
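
To see both halves of that note in numbers, here is a hand-rolled softmax applied to one masked row. It uses only the standard library; the `learned_row` affinities are made-up values for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

neg_inf = float('-inf')

# Row for the third token of a length-5 context: positions 3 and 4 are in the future.
uniform_row = [0.0, 0.0, 0.0, neg_inf, neg_inf]
learned_row = [2.0, 0.5, 1.0, neg_inf, neg_inf]  # hypothetical affinities

uniform_weights = softmax(uniform_row)  # equal weights on the three visible tokens
learned_weights = softmax(learned_row)  # unequal weights, still zero on the future
print(uniform_weights)
print(learned_weights)
```

In both rows the masked positions receive exactly zero weight because `exp(-inf)` is 0, while the visible positions share the remaining probability mass according to their scores.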

### Self-attention!

Instead of just taking the average, we let tokens talk to each other.

In [28]:
B,T,C = 4,8,32
x = torch.randn(B,T,C)

In [29]:
# Previously we did
# wei = torch.zeros((T,T))

# But now we don't want this to be all uniform; instead,
# we want to be able to gather info from the past.

# Every single token at each position will emit 2 vectors: query + key.
# - Query: "what am I looking for"
# - Key: "what do I contain"
# We use dot product to get affinity between tokens,
# i.e. "my query" dot "your key" => this becomes wei

In [30]:
#
# A single head performing self-attention
#

head_size = 16

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B,T,16)
q = query(x) # (B,T,16)

value = nn.Linear(C, head_size, bias=False)
v = value(x) # (B,T,16)

# We can think of k, q, and v as follows:
# - k: "here's what I have"
# - q: "here's what I'm interested in"
# - v: "if you find me interesting, here's what I will communicate to you"

# We can think of key(x) and query(x) as "mapping each embedding
# onto a query/key space." If the dot product of q and k is large,
# then the query and key are closely related.

In [31]:
# Dot product of q and k (need to transpose the last 2 dimensions of k);
# this is the "affinity" between tokens
wei = q @ k.transpose(-2, -1) # (B,T,T)
wei[0]

tensor([[-0.4602,  0.5454,  0.4547,  1.7431,  0.6014,  0.2491,  0.1107,  1.6866],
        [ 0.3623, -0.1856, -0.9913, -0.2403, -1.7746,  0.9635, -0.0194, -2.5452],
        [ 0.2473, -0.9395, -0.8748, -2.2320, -0.4213, -0.0547,  0.6652,  0.1099],
        [-0.3361,  1.7243, -0.0799,  1.2426, -0.2039,  1.9121, -0.8425, -0.5857],
        [ 0.3423, -0.3405, -0.4143, -0.5832, -0.7255,  0.9499, -0.6766, -0.9575],
        [-0.3074,  0.4796,  0.0699, -0.3749,  0.1905, -0.7777,  0.2963,  0.4356],
        [-0.2227,  2.2977, -1.1799, -0.9838, -1.4433,  0.7707, -0.1478, -0.0225],
        [-1.9678,  1.7853, -0.3128, -0.1258,  0.1161, -0.0888,  0.0289, -0.2981]],
       grad_fn=<SelectBackward0>)

In [32]:
tril = torch.tril(torch.ones(T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.6336, 0.3664, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.6132, 0.1871, 0.1997, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0667, 0.5236, 0.0862, 0.3235, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3684, 0.1861, 0.1729, 0.1460, 0.1266, 0.0000, 0.0000, 0.0000],
        [0.1272, 0.2795, 0.1855, 0.1189, 0.2093, 0.0795, 0.0000, 0.0000],
        [0.0545, 0.6773, 0.0209, 0.0254, 0.0161, 0.1471, 0.0587, 0.0000],
        [0.0121, 0.5173, 0.0635, 0.0765, 0.0975, 0.0794, 0.0893, 0.0644]],
       grad_fn=<SelectBackward0>)

In [33]:
# Instead of:
# out = wei @ x

# here we do:
out = wei @ v
out.shape

torch.Size([4, 8, 16])

**Notes:**
- Attention is a **communication mechanism**. It can be seen as nodes in a directed graph looking at each other, where each node aggregates information as a weighted sum over all nodes that point to it, with data-dependent weights.
- Each example across the batch dimension is processed completely independently.
- What we have here is a "decoder" attention block because it has triangular masking; this is usually used in autoregressive settings, like language modeling. There is also an "encoder" attention block, which allows all tokens to communicate and is used in situations like sentiment analysis. For an "encoder" block, just remove the `masked_fill` line that applies the `tril` mask.
- "Self-attention" just means that the keys and values are produced from the same source as the queries (i.e. `x` in our case). In "cross-attention", the queries still get produced from `x`, but the keys and values come from some other external source (e.g. an encoder module).
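
The decoder/encoder distinction can be made concrete with a small experiment: with the mask, changing the last token leaves the outputs at earlier positions untouched; without it, they change. This is a sketch with random weights, and the `attend` helper and its `causal` flag are hypothetical names introduced here.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C, head_size = 1, 6, 8, 4

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

def attend(x, causal):
    """Single attention head; `causal=True` applies the triangular mask."""
    k, q, v = key(x), query(x), value(x)
    wei = q @ k.transpose(-2, -1)  # (B,T,T)
    if causal:
        tril = torch.tril(torch.ones(T, T))
        wei = wei.masked_fill(tril == 0, float('-inf'))
    wei = F.softmax(wei, dim=-1)
    return wei @ v  # (B,T,head_size)

x = torch.randn(B, T, C)
x_changed = x.clone()
x_changed[:, -1] = torch.randn(C)  # perturb only the last token

with torch.no_grad():
    # Decoder-style: outputs for the first T-1 positions ignore the last token.
    causal_same = torch.allclose(attend(x, True)[:, :-1],
                                 attend(x_changed, True)[:, :-1])
    # Encoder-style: every position sees the changed token.
    full_same = torch.allclose(attend(x, False)[:, :-1],
                               attend(x_changed, False)[:, :-1])

print(causal_same, full_same)
```

This independence from future tokens is what allows a decoder block to be trained on all positions of a sequence at once and then used to generate one token at a time.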

### Scaled self-attention

In "scaled" self-attention, we further multiply `wei` by **1/sqrt(head_size)**. That way, when the inputs Q and K have unit variance, `wei` has unit variance too. This keeps softmax diffuse at initialization, so it does not saturate (converge towards a one-hot vector).
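
The 1/sqrt(head_size) factor follows from a short variance calculation. It rests on an assumption added here for the derivation: the components of a query `q` and a key `k` are independent, with zero mean and unit variance. Writing `d` for `head_size`:

$$
\operatorname{Var}(q \cdot k)
= \operatorname{Var}\Big(\sum_{i=1}^{d} q_i k_i\Big)
= \sum_{i=1}^{d} \operatorname{Var}(q_i k_i)
= \sum_{i=1}^{d} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
= d
$$

$$
\operatorname{Var}\Big(\frac{q \cdot k}{\sqrt{d}}\Big) = \frac{1}{d}\operatorname{Var}(q \cdot k) = 1
$$

So the raw dot products have variance close to `head_size`, and dividing by its square root brings the variance back to about 1.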

In [34]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1)

print(k.var())
print(q.var())
print(wei.var()) # roughly head_size (16)

tensor(0.9121)
tensor(1.0475)
tensor(16.3956)


In [35]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

print(k.var())
print(q.var())
print(wei.var()) # roughly 1

tensor(1.0559)
tensor(0.9348)
tensor(0.9975)


In [36]:
# Why is low variance good?

# low variance
print(torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1))

# high variance:
# softmax gets too peaky and converges towards a one-hot vector
print(torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]) * 8, dim=-1))

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])
tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])


## Add self-attention to our model

### Add self-attention module

In [37]:
class Head(nn.Module):
    """
    One head of self-attention.
    """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        # `tril` is a "buffer", i.e. it's not a parameter of the module.
        # We have to call register_buffer on it.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)

        # Compute attention scores ("affinities"), scaled by 1/sqrt(head_size).
        # (Scaling by C**-0.5 is only correct when head_size == n_embd.)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B,T,head_size) @ (B,head_size,T) -> (B,T,T)

        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)

        # Perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v     # (B,T,T) @ (B,T,head_size) -> (B,T,head_size)
        return out

### Updating our model

This is built on our `BigramLanguageModel`, but since it has ceased to be a bigram model, we will call it `GPTLanguageModel`.

In [38]:
# Introduce new variable: number of embedding dimensions
n_embd = 32

In [39]:
# Some notes on dimensions again
print(f"n = vocab_size = {vocab_size}")
print(f"B = batch_size = {batch_size} = how many independent sequences are being processed at once")
print(f"T = time = length of the running sequence")
print(f"C = channel = {n_embd} = size of the feature vector at each position = embedding dimension")
print(f"** C will no longer be equal to vocab_size; it will be n_embd instead **")
print(f"n_embd = {n_embd} = number of embedding dimensions")

n = vocab_size = 83
B = batch_size = 32 = how many independent sequences are being processed at once
T = time = length of the running sequence
C = channel = 32 = size of the feature vector at each position = embedding dimension
** C will no longer be equal to vocab_size; it will be n_embd instead **
n_embd = 32 = number of embedding dimensions


In [40]:
class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Create "embedding" table.
        # - Maps each token in the vocabulary to an embedding of dimension `n_embd`
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

        # New: add position embedding table
        # - Each position in the block gets its own embedding vector
        self.position_embedding_table = nn.Embedding(block_size, n_embd)

        # New: add linear layer between embeddings (of dimension `n_embd`)
        # and the logits (dimension `vocab_size`)
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # New: self-attention head
        self.sa_head = Head(n_embd)

    def forward(self, indices, targets=None):
        # `indices` and `targets` are both (B,T) tensors of integers
        B, T = indices.shape

        # For each idx in `indices`, we need to fetch its corresponding logits:
        
        # (1) New: for each idx in `indices`, we first fetch its embedding
        token_emb = self.token_embedding_table(indices) # (B,T,C)
        
        # (2) New: create the position embedding for each position in the block
        pos_emb = self.position_embedding_table(torch.arange(T)) # (T,C)

        # (3) New: add the token embedding to position embedding
        # - this basically means we add the embedding for each idx in `indices` to
        #   the position embedding for its position in the block
        # - note the dimension broadcasting here
        x = token_emb + pos_emb  # (B,T,C)

        # (4) New: apply one head of self-attention
        x = self.sa_head(x)      # (B,T,C)
        
        # (5) New: we then fetch the logits using the lm_head layer
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We want to flatten `logits` so that we have a total of B*T
            # feature vectors of length C.
            logits = logits.view(B*T, C)

            # Also flatten `targets` so that it contains B*T target outputs
            # for each of the feature vectors in `logits`.
            targets = targets.view(B*T)

            # Compute loss
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, indices, max_new_tokens):
        # `indices` is a (B,T) tensor of indices in the current context

        for _ in range(max_new_tokens):
            # New: we need to crop the context; otherwise it won't
            # fit into our position_embedding_table
            indices_cropped = indices[:, -block_size:]
            
            # Get predictions;
            # `logits` is (B,T,C)
            logits, loss = self(indices_cropped) # calls forward()

            # `logits` contains the logits for every index in `indices`,
            # but we actually only need the last time step in each batch
            logits = logits[:, -1, :] # becomes (B,C)

            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B,C)

            # Sample from the probability distribution
            next_idx = torch.multinomial(probs, num_samples=1) # (B,1)

            # Append sampled index to the context for each batch
            indices = torch.cat((indices, next_idx), dim=1) # (B,T+1)

        return indices
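
The "dimension broadcasting" in step (3) is worth seeing in isolation: the `(T,C)` position embeddings are added to every sequence in the `(B,T,C)` batch. The sketch below uses freshly initialised embedding tables with the notebook's sizes; the random `indices` are a stand-in for a real batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 83, 8, 32
B, T = 4, 8

token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

indices = torch.randint(vocab_size, (B, T))           # stand-in batch of token ids
token_emb = token_embedding_table(indices)            # (B,T,C)
pos_emb = position_embedding_table(torch.arange(T))   # (T,C)

# (T,C) is treated as (1,T,C) and repeated across the batch dimension.
x = token_emb + pos_emb                                # (B,T,C)

# Every sequence in the batch receives the same positional offset.
same_offset = torch.allclose(x[0] - token_emb[0], x[1] - token_emb[1], atol=1e-6)
print(x.shape, same_offset)
```

Since the position table has only `block_size` rows, `T` can never exceed `block_size`, which is why `generate` crops the context before calling the model.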

### Train the model

In [41]:
m = GPTLanguageModel(vocab_size)

# PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [42]:
batch_size = 32

# Train for some iterations
iterations = 50000
print_interval = 5000
for step in range(iterations):
    # Sample a batch of data
    xx, yy = get_batch('train', batch_size)

    # Evaluate loss
    logits, loss = m(xx, yy)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if step == 0 or step == iterations-1 or (step+1) % print_interval == 0:
        print(f"Loss at step {step+1}: {loss.item()}")

Loss at step 1: 4.422560214996338
Loss at step 5000: 2.260261058807373
Loss at step 10000: 2.5313291549682617
Loss at step 15000: 2.241060495376587
Loss at step 20000: 2.279775381088257
Loss at step 25000: 2.36753249168396
Loss at step 30000: 2.323002338409424
Loss at step 35000: 2.1705868244171143
Loss at step 40000: 2.1151444911956787
Loss at step 45000: 2.1987550258636475
Loss at step 50000: 2.2804505825042725


### Generate some text

In [44]:
# Generate some output, starting with [0]
gen = m.generate(indices = torch.zeros((1,1), dtype=torch.long), max_new_tokens=500)
print("Generated:")
print(decode(gen[0].tolist()))

Generated:

Mire veracnt poinot hanthan?—cerisa thilooff faind rofand ther!” se
vo dr agid wely and sth ccam cthad whit do
utte, ur.”

Mit hawis ay silet frean fo obeedive ononf sito the’s.—She,
Mr.


wanwe tes hot ind t
gthatencond cimbeea Mr efeel
aber hes wa, in inout
and junconqua out owothen yer! I Herarse sat _he derant s it odu sereand Msreder esh fto whour line lfu men an ethe selil, woro,
bered ando yere hallld d me,” fivas che nd rint will onowsey merallewa ttheal—nay on igfe rsa ouldins. Weeaeneq


## Adding multi-head self-attention

In [66]:
class Head(nn.Module):
    """
    One head of self-attention.
    """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        # `tril` is a "buffer", i.e. it's not a parameter of the module.
        # We have to call register_buffer on it.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        
        # Compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        
        # Perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v     # (B,T,T) @ (B,T,head_size) -> (B,T,head_size)
        return out

In [67]:
class MultiHeadAttention(nn.Module):
    """
    Multiple heads of self-attention in parallel.
    """
    def __init__(self, num_heads, head_size):
        super().__init__()
        # Create multiple heads
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # Concatenate the result of each head
        return torch.cat([h(x) for h in self.heads], dim=-1) # concatenate over the channel dimension
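
A shape check helps show why the head size is chosen as `n_embd // num_heads`: each head outputs `head_size` channels, and concatenating them restores `n_embd` channels, which is what the following `lm_head` layer expects. The sketch below repeats a compact version of `Head` so that it runs on its own; the random input `x` is a stand-in for real embeddings.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
n_embd, block_size = 32, 8
num_heads = 4
head_size = n_embd // num_heads  # 8

class Head(nn.Module):
    """Compact copy of the notebook's single attention head."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)  # each (B,T,head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B,T,T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                         # (B,T,head_size)

heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
x = torch.randn(4, block_size, n_embd)                        # (B,T,n_embd)

per_head = [h(x) for h in heads]       # 4 tensors of shape (B,T,8)
out = torch.cat(per_head, dim=-1)      # (B,T,32)

print(per_head[0].shape)
print(out.shape)
```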

In [68]:
class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Create "embedding" table.
        # - Maps each token in the vocabulary to an embedding of dimension `n_embd`
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

        # Add position embedding table
        # - Each position in the block gets its own embedding vector
        self.position_embedding_table = nn.Embedding(block_size, n_embd)

        # Add linear layer between embeddings (of dimension `n_embd`)
        # and the logits (dimension `vocab_size`)
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # New: self-attention heads
        # i.e. 4 heads of 8-dimensional self-attention
        self.sa_heads = MultiHeadAttention(4, n_embd // 4)

    def forward(self, indices, targets=None):
        # `indices` and `targets` are both (B,T) tensor of integers
        B, T = indices.shape

        # For each idx in `indices`, we need to fetch its corresponding logits:
        
        # (1) For each idx in `indices`, we first fetch its embedding
        token_emb = self.token_embedding_table(indices) # (B,T,C)
        
        # (2) Create the position embedding for each position in the block
        pos_emb = self.position_embedding_table(torch.arange(T)) # (T,C)

        # (3) Add the token embedding to position embedding
        # - this basically means we add the embedding for each idx in `indices` to
        #   the position embedding for its position in the block
        # - note the dimension broadcasting here
        x = token_emb + pos_emb  # (B,T,C)

        # (4) New: apply multi-head self-attention
        x = self.sa_heads(x)      # (B,T,C)
        
        # (5) We then fetch the logits using the lm_head layer
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We want to flatten `logits` so that we have a total of B*T
            # feature vectors of length C.
            logits = logits.view(B*T, C)

            # Also flatten `targets` so that it contains B*T target outputs
            # for each of the feature vectors in `logits`.
            targets = targets.view(B*T)

            # Compute loss
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, indices, max_new_tokens):
        # `indices` is a (B,T) tensor of indices in the current context

        for _ in range(max_new_tokens):
            # We need to crop the context; otherwise it won't
            # fit into our position_embedding_table
            indices_cropped = indices[:, -block_size:]
            
            # Get predictions;
            # `logits` is (B,T,C)
            logits, loss = self(indices_cropped) # calls forward()

            # `logits` contains the logits for every index in `indices`,
            # but we actually only need the last time step in each batch
            logits = logits[:, -1, :] # becomes (B,C)

            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B,C)

            # Sample from the probability distribution
            next_idx = torch.multinomial(probs, num_samples=1) # (B,1)

            # Append sampled index to the context for each batch
            indices = torch.cat((indices, next_idx), dim=1) # (B,T+1)

        return indices
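
The last steps of `generate()` (take the final time step, apply softmax, sample) can be tried in isolation. This is a minimal sketch with made-up logits for a 3-token vocabulary; the values and the variable names `samples`, `counts` and `greedy` are assumptions for illustration, not part of the notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up logits for one sequence (B=1) over a vocabulary of 3 tokens (C=3)
logits = torch.tensor([[2.0, 0.0, -2.0]])

# Softmax turns logits into a probability distribution over the vocabulary
probs = F.softmax(logits, dim=-1)  # (B,C)

# Sampling: draw many tokens to see how often each one is picked
samples = torch.multinomial(probs, num_samples=1000, replacement=True)  # (B,1000)
counts = torch.bincount(samples[0], minlength=3)

# For contrast, greedy decoding would always pick the most likely token
greedy = torch.argmax(probs, dim=-1)  # (B,)

print(probs)
print(counts)
print(greedy)
```

Sampling keeps some randomness in the output, while always taking the argmax would produce the same continuation every time for a given context.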

### Train the model

In [69]:
m = GPTLanguageModel(vocab_size)

# PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [70]:
batch_size = 32

# Train for some iterations
iterations = 50000
print_interval = 5000
for step in range(iterations):
    # Sample a batch of data
    xx, yy = get_batch('train', batch_size)

    # Evaluate loss
    logits, loss = m(xx, yy)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if step == 0 or step == iterations-1 or (step+1) % print_interval == 0:
        print(f"Loss at step {step+1}: {loss.item()}")

Loss at step 1: 4.453753471374512
Loss at step 5000: 2.2575976848602295
Loss at step 10000: 2.086601972579956
Loss at step 15000: 2.221362829208374
Loss at step 20000: 2.1340878009796143
Loss at step 25000: 2.0068166255950928
Loss at step 30000: 1.95281183719635
Loss at step 35000: 1.8485158681869507
Loss at step 40000: 1.9005964994430542
Loss at step 45000: 2.0300536155700684
Loss at step 50000: 2.085979461669922


### Generate some text

In [71]:
# Generate some output, starting with [0]
gen = m.generate(indices = torch.zeros((1,1), dtype=torch.long), max_new_tokens=500)
print("Generated:")
print(decode(gen[0].tolist()))

Generated:

willed.”

“You might mort. Wet hour,
sholuld.

Them consicould fan ligh it the chiet a munt havior a mon!
such her dif frome for not his it non dight ken
the Hartull, and you
aboketfion fa, I had woul holds medead stlaoken’s wifeland at is a cart!

pay forettlec with be andiinding her, a hat all sought crapper was lind, and bebkem. Your eveir, thad sencetleld himuch tup i very the father bodived a be tre wreet’s whobut se, bout in Batiettrighice sto-tate ashm.


“OLI had in surchy Mrs. Wing of M


## Adding feed-forward layer (MLP)

In [72]:
class FeedFoward(nn.Module):
    """
    A linear layer followed by a non-linearity
    """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd), # linear layer
            nn.ReLU()                  # non-linearity
        )

    def forward(self, x):
        return self.net(x)

### Transformer block: self-attention + feed-forward

In [73]:
class Block(nn.Module):
    """
    A Transformer block:
    communication (self-attention) followed by computation (feed-forward)
    """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)

    def forward(self, x):
        x = self.sa(x)      # (B,T,C)
        x = self.ffwd(x)    # (B,T,C)
        return x

In [74]:
class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Create "embedding" table.
        # - Maps each token in the vocabulary to an embedding of dimension `n_embd`
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

        # Add position embedding table
        # - Each position in the block gets its own embedding vector
        self.position_embedding_table = nn.Embedding(block_size, n_embd)

        # Add linear layer between embeddings (of dimension `n_embd`)
        # and the logits (dimension `vocab_size`)
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # New: add transformer blocks
        self.blocks = nn.Sequential(
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
            Block(n_embd, n_head=4),
        )

    def forward(self, indices, targets=None):
        # `indices` and `targets` are both (B,T) tensors of integers
        B, T = indices.shape

        # For each idx in `indices`, we need to fetch its corresponding logits:
        
        # (1) For each idx in `indices`, we first fetch its embedding
        token_emb = self.token_embedding_table(indices) # (B,T,C)
        
        # (2) Create the position embedding for each position in the block
        pos_emb = self.position_embedding_table(torch.arange(T)) # (T,C)

        # (3) Add the token embedding to position embedding
        # - this basically means we add the embedding for each idx in `indices` to
        #   the position embedding for its position in the block
        # - note the dimension broadcasting here
        x = token_emb + pos_emb  # (B,T,C)

        # (4) New: apply transformer block: self-attention + feed-forward
        x = self.blocks(x)       # (B,T,C)
        
        # (5) We then fetch the logits using the lm_head layer
        logits = self.lm_head(x)  # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We want to flatten `logits` so that we have a total of B*T
            # feature vectors of length C.
            logits = logits.view(B*T, C)

            # Also flatten `targets` so that it contains B*T target outputs
            # for each of the feature vectors in `logits`.
            targets = targets.view(B*T)

            # Compute loss
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, indices, max_new_tokens):
        # `indices` is a (B,T) tensor of indices in the current context

        for _ in range(max_new_tokens):
            # We need to crop the context; otherwise it won't
            # fit into our position_embedding_table
            indices_cropped = indices[:, -block_size:]
            
            # Get predictions;
            # `logits` is (B,T,C)
            logits, loss = self(indices_cropped) # calls forward()

            # `logits` contains the logits for every index in `indices`,
            # but we actually only need the last time step in each batch
            logits = logits[:, -1, :] # becomes (B,C)

            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B,C)

            # Sample from the probability distribution
            next_idx = torch.multinomial(probs, num_samples=1) # (B,1)

            # Append sampled index to the context for each batch
            indices = torch.cat((indices, next_idx), dim=1) # (B,T+1)

        return indices

The model above won't give very good results: it is now a fairly deep neural network, and deep networks suffer from optimization issues.

We introduce two optimizations to help with the depth:
- residual connections: "ADD"
- layer normalization: "NORM"

Also: dropout layer
- randomly zeroes activations during training, which helps prevent the neural net from overfitting
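
To make "ADD" and "NORM" concrete, here is a minimal sketch on a random tensor. The sizes and the stand-in `sublayer` (a plain linear layer in place of attention or feed-forward) are assumptions for illustration, not the notebook's modules.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_embd = 32  # assumed embedding size for this sketch
x = torch.randn(2, 8, n_embd) * 5 + 3  # (B,T,C) with an arbitrary mean and scale

# "NORM": LayerNorm normalizes each token's feature vector independently
ln = nn.LayerNorm(n_embd)
y = ln(x)
token_mean = y.mean(dim=-1)                # (B,T), close to 0
token_std = y.std(dim=-1, unbiased=False)  # (B,T), close to 1

# "ADD": the sublayer's output is added back to its input
sublayer = nn.Linear(n_embd, n_embd)  # stand-in for self-attention / feed-forward
nn.init.zeros_(sublayer.weight)       # zeroed only to expose the skip path
nn.init.zeros_(sublayer.bias)
out = x + sublayer(ln(x))             # with a zero sublayer, out equals x

print(token_mean.abs().max(), token_std.mean())
print(torch.equal(out, x))
```

With the sublayer zeroed, the block is exactly the identity. This illustrates the direct path that the addition gives the signal (and its gradient) through a deep stack of blocks.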

### Optimizations for the transformer block

In [75]:
dropout = 0.0

class Head(nn.Module):
    """
    One head of self-attention.
    """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        # `tril` is a "buffer", i.e. it's not a parameter of the module.
        # We have to call register_buffer on it.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        # New: dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        
        # Compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)

        # New: dropout layer
        wei = self.dropout(wei)
        
        # Perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v     # (B,T,T) @ (B,T,head_size) -> (B,T,head_size)
        return out
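
The masking and softmax steps in `Head.forward` can be checked on a small example. This is a sketch with random scores standing in for `q @ k.transpose(-2,-1) * head_size**-0.5`; the size `T = 4` is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T = 4  # assumed sequence length for this sketch
scores = torch.randn(T, T)  # stand-in for the scaled query-key affinities

# Lower-triangular mask: position t may only attend to positions <= t
tril = torch.tril(torch.ones(T, T))
masked = scores.masked_fill(tril == 0, float('-inf'))

# Softmax turns each row into weights; -inf entries become exactly 0
wei = F.softmax(masked, dim=-1)  # (T,T)

print(wei)
```

Each row of `wei` sums to 1, everything above the diagonal is 0, and the first row puts all of its weight on position 0 because that token has no earlier context.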

In [76]:
class MultiHeadAttention(nn.Module):
    """
    Multiple heads of self-attention in parallel.
    """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

        # New: projection
        # - it will mix/weigh the outputs from each head
        self.proj = nn.Linear(n_embd, n_embd)

        # New: dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Each head outputs (B, T, head_size)
        # After torch.cat: out = (B, T, num_heads * head_size) = (B, T, n_embd)
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        # New: projection layer
        out = self.proj(out)
        # New: dropout layer
        out = self.dropout(out)
        return out
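
The shape bookkeeping in `MultiHeadAttention.forward` can be verified with stand-in heads. In this sketch each "head" is just a linear layer that maps `n_embd` to `head_size`; the sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

B, T = 2, 8
n_embd, num_heads = 32, 4  # assumed sizes for this sketch
head_size = n_embd // num_heads

x = torch.randn(B, T, n_embd)

# Stand-ins for attention heads: each maps (B,T,n_embd) -> (B,T,head_size)
heads = nn.ModuleList(
    [nn.Linear(n_embd, head_size, bias=False) for _ in range(num_heads)]
)
proj = nn.Linear(num_heads * head_size, n_embd)

head_outs = [h(x) for h in heads]
cat = torch.cat(head_outs, dim=-1)  # (B,T,num_heads * head_size)
out = proj(cat)                     # (B,T,n_embd)

print(head_outs[0].shape, cat.shape, out.shape)
```

Note that `num_heads * head_size` equals `n_embd` only when `n_embd` is divisible by `num_heads`, which is what allows the projection to be written as `nn.Linear(n_embd, n_embd)`.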

In [77]:
class FeedFoward(nn.Module):
    """
    A linear layer, a non-linearity, a projection back to `n_embd`, and dropout
    """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), # linear layer
            nn.ReLU(),                     # non-linearity
            nn.Linear(4 * n_embd, n_embd), # New: projection layer
            nn.Dropout(dropout),           # New: dropout layer

            # Also note the multiplier of 4: this follows the "Attention Is All You Need"
            # paper, where the inner feed-forward layer is 4x wider than the embedding
        )

    def forward(self, x):
        return self.net(x)
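
Since `dropout = 0.0` is used here, the dropout layers currently do nothing. The sketch below shows how `nn.Dropout` behaves for a non-zero rate; `p=0.2` is an assumed value chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.ones(1000)

drop = nn.Dropout(p=0.2)  # p=0.2 is an assumed value for illustration

# Training mode: entries are zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 1.25
drop.train()
y_train = drop(x)

# Eval mode: dropout is the identity
drop.eval()
y_eval = drop(x)

# p=0.0 never drops anything, even in training mode
no_drop = nn.Dropout(p=0.0)
no_drop.train()
y_zero = no_drop(x)

print((y_train == 0).float().mean())  # fraction of dropped entries, near p
print(torch.equal(y_eval, x), torch.equal(y_zero, x))
```

Because dropout is only active in training mode, a model trained with a non-zero rate should be switched with `m.eval()` before generating text.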

In [78]:
class Block(nn.Module):
    """
    A Transformer block:
    communication (self-attention) followed by computation (feed-forward)
    """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)

        # New: LayerNorm
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # x: (B,T,C)
        # recall that C = n_embd
        x = x + self.sa(self.ln1(x))     # residual connection: add the attention output back to x
        x = x + self.ffwd(self.ln2(x))   # residual connection: add the feed-forward output back to x
        return x

In [79]:
# Number of transformer blocks
n_layer = 4

# Number of self-attention heads
n_head = 4

class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Create "embedding" table.
        # - Maps each token in the vocabulary to an embedding of dimension `n_embd`
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

        # Add position embedding table
        # - Each position in the block gets its own embedding vector
        self.position_embedding_table = nn.Embedding(block_size, n_embd)

        # Add linear layer between embeddings (of dimension `n_embd`)
        # and the logits (dimension `vocab_size`)
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # Add transformer blocks
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])

        # New: LayerNorm
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm

    def forward(self, indices, targets=None):
        # `indices` and `targets` are both (B,T) tensors of integers
        B, T = indices.shape

        # For each idx in `indices`, we need to fetch its corresponding logits:
        
        # (1) For each idx in `indices`, we first fetch its embedding
        token_emb = self.token_embedding_table(indices) # (B,T,C)
        
        # (2) Create the position embedding for each position in the block
        pos_emb = self.position_embedding_table(torch.arange(T)) # (T,C)

        # (3) Add the token embedding to position embedding
        # - this basically means we add the embedding for each idx in `indices` to
        #   the position embedding for its position in the block
        # - note the dimension broadcasting here
        x = token_emb + pos_emb  # (B,T,C)

        # (4) Apply the transformer blocks: self-attention + feed-forward
        x = self.blocks(x)       # (B,T,C)

        # (5) New: apply LayerNorm
        x = self.ln_f(x)         # (B,T,C)
        
        # (6) We then fetch the logits using the lm_head layer
        logits = self.lm_head(x)  # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            # We want to flatten `logits` so that we have a total of B*T
            # feature vectors of length C.
            logits = logits.view(B*T, C)

            # Also flatten `targets` so that it contains B*T target outputs
            # for each of the feature vectors in `logits`.
            targets = targets.view(B*T)

            # Compute loss
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, indices, max_new_tokens):
        # `indices` is a (B,T) tensor of indices in the current context

        for _ in range(max_new_tokens):
            # We need to crop the context; otherwise it won't
            # fit into our position_embedding_table
            indices_cropped = indices[:, -block_size:]
            
            # Get predictions;
            # `logits` is (B,T,C)
            logits, loss = self(indices_cropped) # calls forward()

            # `logits` contains the logits for every index in `indices`,
            # but we actually only need the last time step in each batch
            logits = logits[:, -1, :] # becomes (B,C)

            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B,C)

            # Sample from the probability distribution
            next_idx = torch.multinomial(probs, num_samples=1) # (B,1)

            # Append sampled index to the context for each batch
            indices = torch.cat((indices, next_idx), dim=1) # (B,T+1)

        return indices

### Train the model

In [80]:
m = GPTLanguageModel(vocab_size)

# PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [81]:
batch_size = 32

# Train for some iterations
iterations = 100000
print_interval = 5000
for step in range(iterations):
    # Sample a batch of data
    xx, yy = get_batch('train', batch_size)

    # Evaluate loss
    logits, loss = m(xx, yy)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if step == 0 or step == iterations-1 or (step+1) % print_interval == 0:
        print(f"Loss at step {step+1}: {loss.item()}")

Loss at step 1: 4.669200420379639
Loss at step 5000: 1.7074384689331055
Loss at step 10000: 1.7004404067993164
Loss at step 15000: 1.6498773097991943
Loss at step 20000: 1.713290810585022
Loss at step 25000: 1.658354640007019
Loss at step 30000: 1.7331057786941528
Loss at step 35000: 1.7596385478973389
Loss at step 40000: 1.6325125694274902
Loss at step 45000: 1.7235300540924072
Loss at step 50000: 1.6771314144134521
Loss at step 55000: 1.6490967273712158
Loss at step 60000: 1.6554733514785767
Loss at step 65000: 1.656504511833191
Loss at step 70000: 1.5700922012329102
Loss at step 75000: 1.5170801877975464
Loss at step 80000: 1.5542258024215698
Loss at step 85000: 1.5349010229110718
Loss at step 90000: 1.6250503063201904
Loss at step 95000: 1.6151597499847412
Loss at step 100000: 1.5942085981369019
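
Each loss printed above comes from a single random batch of 32 sequences, so the values bounce around from one print to the next. Averaging over several batches gives a steadier estimate. The sketch below assumes a model that returns `(logits, loss)` and a batch function with the same call shape as `get_batch(split, batch_size)`; `estimate_loss`, `TinyModel` and `toy_get_batch` are hypothetical names, and the toy model and data are stand-ins so the example runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def estimate_loss(model, get_batch, split, batch_size, eval_iters=50):
    """Average the loss over `eval_iters` random batches (hypothetical helper)."""
    model.eval()  # disable dropout while evaluating
    losses = torch.zeros(eval_iters)
    for i in range(eval_iters):
        xb, yb = get_batch(split, batch_size)
        _, loss = model(xb, yb)
        losses[i] = loss.item()
    model.train()  # restore training mode
    return losses.mean().item()

# --- Stand-ins so the sketch runs on its own (not the notebook's objects) ---
class TinyModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, indices, targets=None):
        logits = self.table(indices)  # (B,T,vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1)
            )
        return logits, loss

torch.manual_seed(0)
toy_data = torch.randint(0, 10, (1000,))  # random "tokens" from a vocabulary of 10

def toy_get_batch(split, batch_size, block_size=8):
    # `split` is ignored here; the real helper would pick train or val data
    ix = torch.randint(len(toy_data) - block_size, (batch_size,))
    x = torch.stack([toy_data[i:i + block_size] for i in ix])
    y = torch.stack([toy_data[i + 1:i + block_size + 1] for i in ix])
    return x, y

toy_model = TinyModel(vocab_size=10)
avg_loss = estimate_loss(toy_model, toy_get_batch, 'val', batch_size=32)
print(f"Average loss over 50 batches: {avg_loss}")
```

Calling a helper like this on both `'train'` and `'val'` at each print interval would also show whether the model is starting to overfit.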


### Generate some text

In [83]:
# Generate some output, starting with [0]
gen = m.generate(indices = torch.zeros((1,1), dtype=torch.long), max_new_tokens=1000)
print("Generated:")
print(decode(gen[0].tolist()))

Generated:

justicular
imaginell the Frank
Colong.
Hence, you nother best, by the agreably extralled, and
seceshamed Jane Fairfax at all only thoughth, it was scruptions, and _weather passionable that I am pain to you this declination?”—and worth him spiritéing servaking away, and she did I should never
most desome once bethlunted it drequest anxion of
farther, who concew any recoxcepte,
in steased, one brought,” createst-Miss Bates’s were slight conshaded
what is quite
from any body’s one bring? who half in the great he kind; and
solution, that to
Mr. Elton issurple, his middles than understanding. He
had
only living us. She few the little, you short placle some
days could her own used him; and all there
is quite for so.
Colongut sometingsfore her
most to be disappoitations what is her
father’s enjictely did spodering the nobody that she disgue, Jane, if he us, suffulnuicate Box to convise present it thought another. Mr. Cole, which she than Harreated to
her worthing
advarer.”

“She s