# Building GPT from scratch in code

---

## overview

[reference video](https://www.youtube.com/watch?v=kCc8FmEb1nY)

### goal
train a small-scale GPT (generative pretrained transformer) to generate Shakespeare-like text one character at a time.

### input dataset
Tiny Shakespeare: 40k lines from a variety of Shakespeare plays

---

In [2]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-11-23 20:22:39--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.2’


2025-11-23 20:22:39 (113 MB/s) - ‘input.txt.2’ saved [1115394/1115394]



## getting familiar w the dataset

In [3]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [4]:
print(f'length of dataset in characters {len(text)}')

length of dataset in characters 1115394


### sample first 1000 characters

In [5]:
print(text[0:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [6]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(f'vocab size: {vocab_size}')


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65


## Tokenizing

### construct a simple tokenizer

nothing complicated here, just mapping each of the 65 unique characters to an int in 0 to 64

In [7]:
# mapping of char -> index
stoi = {ch:i for i,ch in enumerate(chars)}
# also need a reverse mapping to translate gpt outputs into characters (decoder):
itos = {i:ch for i,ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s] # encoder: string of chars -> list of ints (token ids)
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: list of token ids (e.g. model output) -> readable string

In [8]:
print(encode('hello'))
print(decode(encode('hello')))

[46, 43, 50, 50, 53]
hello
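
One thing to keep in mind with a lookup-table tokenizer like this: `stoi` only knows characters that appear in the text, so `encode` fails on anything else. A minimal sketch of that behavior, using a hypothetical three-character vocabulary instead of the real 65-character one (the `def` versions of `encode`/`decode` are equivalent to the lambdas above):

```python
# Hypothetical toy vocabulary (the notebook's real one has 65 characters).
chars = sorted(set("abc"))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return ''.join(itos[i] for i in ids)

# Round trip: decoding an encoding gives back the original string.
round_trip = decode(encode("cab"))
print(round_trip)

# A character outside the vocabulary has no entry in stoi, so encode raises KeyError.
try:
    encode("abz")
    unknown_ok = True
except KeyError as err:
    unknown_ok = False
    print(f"cannot encode unseen character: {err}")
```

For Tiny Shakespeare this is not a problem during training, since every character in the dataset is in the vocabulary by construction. It only matters if a prompt contains a character the dataset never used.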


### tokenize the entire training dataset

In [9]:
import torch

In [10]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print('first 1000 characters tokenized:', data[0:1000])

torch.Size([1115394]) torch.int64
first 1000 characters tokenized: tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 5

## separate data into train/validation set

In [11]:
n = int(0.9*len(data)) # truncate to a whole number bc index can only be an int

train_data = data[0:n] # first 90% of data will be used to train
val_data = data[n:] # remaining 10% will be held out for validation
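
A quick sanity check on the split sizes. This sketch uses a stand-in tensor of zeros with the same length as the tokenized dataset (1,115,394), so it runs without the text file. Note the split is contiguous, not shuffled: validation is simply the last 10% of the text.

```python
import torch

# Stand-in for the tokenized dataset: same length as Tiny Shakespeare, contents irrelevant here.
data = torch.zeros(1115394, dtype=torch.long)

n = int(0.9 * len(data))  # int() drops the fractional part
train_data = data[:n]
val_data = data[n:]

print(len(train_data), len(val_data))

# The two slices cover the data exactly once, with no overlap and no gap.
assert len(train_data) + len(val_data) == len(data)
```

That works out to roughly 1.0M training tokens and about 112k validation tokens.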

### generating x's and y's

In supervised classification there's a clear x input and y label. For example, in sentiment analysis x could be a movie review and y a label saying whether the review is positive or negative. In language generation, the training data isn't labeled this way, so we make the labels ourselves: each character is a target y, and the characters that precede it are the input x. (The model here is character-level, so the tokens are characters, not words.) To get the transformer used to seeing contexts of different lengths, we let the context length vary from 1 up to a block_size parameter. This also means one chunk of block_size + 1 characters gives us block_size training examples.

In [12]:
block_size = 8
train_data[0:block_size+1] 

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [13]:
x = train_data[0:block_size]
y = train_data[1:block_size+1] # start @ 1 so that u predict the 2nd char given the 1st; +1 gives the 8th example a target y

In [14]:
for t in range(block_size): 
    context = x[0:t+1] # second index excluded
    target = y[t] # y starts @ the second character (see above cell)
    print(f'when input is {context} the target is {target}')

when input is tensor([18]) the target is 47
when input is tensor([18, 47]) the target is 56
when input is tensor([18, 47, 56]) the target is 57
when input is tensor([18, 47, 56, 57]) the target is 58
when input is tensor([18, 47, 56, 57, 58]) the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58
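
Another way to see the same thing: `y` is just `x` shifted by one position, which is why a chunk of `block_size + 1` tokens is enough for `block_size` examples. A small sketch using the nine token ids printed above (the `pairs` list is only for illustration, it is not part of the notebook):

```python
import torch

block_size = 8
# First block_size + 1 tokens of the training data, copied from the output above.
chunk = torch.tensor([18, 47, 56, 57, 58, 1, 15, 47, 58])

x = chunk[:block_size]
y = chunk[1:block_size + 1]

# y is x shifted left by one position...
assert torch.equal(x[1:], y[:-1])

# ...so one chunk yields block_size (context, target) pairs.
pairs = [(x[:t + 1].tolist(), y[t].item()) for t in range(block_size)]
print(len(pairs))
print(pairs[0])
print(pairs[-1])
```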


### next we batch
each block is a training snippet that yields multiple training examples. We process several snippets at once bc the gpu can handle them in parallel (**batching**). This is different from *epochs*: a batch is how many snippets go through the model in one step, while an epoch is one full pass over the training data.

In [25]:
torch.manual_seed(1337) # set random seed for reproducibility
batch_size = 4 # how many snippets we process in parallel
block_size = 8 # max length any x snippet can be (they'll range from 1 to block_size)

def get_batch(split):
    data = train_data if split == 'train' else val_data
    idx = torch.randint(0, len(data)-block_size, (batch_size,)) # generate batch_size number of random indices to pull snippets from
    x = torch.stack([data[i:i+block_size] for i in idx]) # torch.stack() takes in a list/tuple of tensors
    y = torch.stack([data[i+1:i+block_size+1] for i in idx]) # e.g. for x = snippet[0] y = snippet[1], hence +1
    return x,y



xbatch, ybatch = get_batch('train')
print('inputs:')
print(xbatch.shape)
print(xbatch)
print('targets:')
print(ybatch.shape)
print(ybatch)

print('-'*40)

for b in range(batch_size):
    for t in range(block_size): 
        context = xbatch[b, 0:t+1]
        target = ybatch[b, t]
        print(f'when input is \'{decode(context.tolist())}\' the model should predict: \'{decode(target.unsqueeze(0).tolist())}\' as next char')

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----------------------------------------
when input is 'L' the model should predict: 'e' as next char
when input is 'Le' the model should predict: 't' as next char
when input is 'Let' the model should predict: ''' as next char
when input is 'Let'' the model should predict: 's' as next char
when input is 'Let's' the model should predict: ' ' as next char
when input is 'Let's ' the model should predict: 'h' as next char
when input is 'Let's h' the model should predict: 'e' as next char
when input is 'Let's he' the model should predict: 'a' as next char
when input is 'f' the model should predict: 'o' as
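
To make the batch layout concrete, here is a small sketch of the same sampling logic on a hypothetical stand-in dataset (`torch.arange(100)`, so token values equal their positions and the shift is easy to see). The seed is arbitrary and is not the one used above.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, for a repeatable illustration only
batch_size, block_size = 4, 8
data = torch.arange(100)  # hypothetical stand-in for train_data

# randint's upper bound is exclusive, so the largest start index is
# len(data) - block_size - 1 and the target slice never runs past the end.
idx = torch.randint(0, len(data) - block_size, (batch_size,))
x = torch.stack([data[i:i + block_size] for i in idx])
y = torch.stack([data[i + 1:i + block_size + 1] for i in idx])

print(x.shape, y.shape)

# Each row contributes block_size (context, target) pairs.
examples_per_batch = batch_size * block_size
print(examples_per_batch)

# Targets are the inputs shifted by one position along the time dimension.
assert torch.equal(x[:, 1:], y[:, :-1])
```

So a single `(4, 8)` batch packs 32 individual next-character predictions into one forward pass.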

## define the transformer model class

In [28]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class Transformer(nn.Module):
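    def __init__(self, vocab_size, emb_dim=20): # emb_dim is a hyperparameter: the width of each token's embedding vector
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, emb_dim)
        # assumption: a linear head maps each embedding to vocab_size logits (forward needs logits from somewhere)
        self.lm_head = nn.Linear(emb_dim, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are (B, T) tensors of token indices
        tok_emb = self.token_embedding_table(idx) # (B, T, emb_dim)
        logits = self.lm_head(tok_emb) # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            # F.cross_entropy wants (N, C) logits and (N,) targets, so flatten batch and time
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))

        return logits, loss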

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :] # keep only the last time step, (B, T, C) -> (B, C): predictions for tokens alr in the prompt aren't needed, only the next new token
            probs = F.softmax(logits, dim=-1) # last dim is probability dist over vocab size
            idx_next = torch.multinomial(probs, num_samples=1) # sample the next char given the probability dist spit out by the model
            idx = torch.cat((idx, idx_next), dim=1) # idx is (B, T) array of indices, use dim=1 to append to time dim
        return idx    
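
The shapes inside one step of `generate` are easy to lose track of, so here is a sketch of a single sampling step in isolation. The random `logits` tensor is a stand-in for real model output, and the seed and sizes (`B=2`, `T=3`) are arbitrary choices for illustration.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)  # arbitrary seed for the illustration
B, T, vocab_size = 2, 3, 65

idx = torch.zeros((B, T), dtype=torch.long)  # prompt tokens, shape (B, T)
logits = torch.randn(B, T, vocab_size)       # stand-in for model output, shape (B, T, vocab_size)

last = logits[:, -1, :]                              # (B, vocab_size): last time step only
probs = F.softmax(last, dim=-1)                      # each row is a distribution over the vocab
idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1): one sampled token id per row
idx = torch.cat((idx, idx_next), dim=1)              # (B, T + 1): sequence grows by one token

print(last.shape, idx_next.shape, idx.shape)
```

Sampling with `torch.multinomial` instead of taking the argmax means the same prompt can produce different continuations on different runs.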

In [29]:
m = Transformer(vocab_size)
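
A useful reference number when reading the loss later: a model that spreads probability evenly over all 65 characters has a cross-entropy of ln(65), about 4.17. The sketch below checks this with all-zero logits, which softmax turns into a uniform distribution. A freshly initialized model will not be exactly uniform, so treat this as a rough baseline and not a prediction of the starting loss.

```python
import math
import torch
from torch.nn import functional as F

vocab_size = 65
uniform_loss = math.log(vocab_size)  # -ln(1/65)

# All-zero logits give a uniform softmax, so cross-entropy equals ln(vocab_size) for any target.
logits = torch.zeros(1, vocab_size)
target = torch.tensor([0])
loss = F.cross_entropy(logits, target).item()

print(f"{uniform_loss:.4f} {loss:.4f}")
```

A trained model should end up well below this value; a loss stuck near it suggests the model has not learned anything beyond guessing.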

In [30]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
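
An optimizer like this is typically driven by a loop of: sample a batch, forward pass, backward pass, step. The sketch below shows that pattern under loud assumptions: `TinyModel` is a hypothetical stand-in for the class above, the data is a toy repeating 0..9 pattern (not Shakespeare), and the learning rate, step count, and seed are arbitrary picks for a fast demo.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)  # arbitrary seed for the illustration

# Hypothetical toy setup: tokens repeat 0..9, so the next token is fully predictable.
toy_vocab, block_size, batch_size = 10, 8, 16
data = torch.arange(2000) % toy_vocab

class TinyModel(nn.Module):  # hypothetical stand-in for the notebook's Transformer
    def __init__(self, vocab_size, emb_dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.head = nn.Linear(emb_dim, vocab_size)

    def forward(self, idx, targets):
        logits = self.head(self.emb(idx))  # (B, T, vocab_size)
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

def get_batch():
    idx = torch.randint(0, len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in idx])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in idx])
    return x, y

model = TinyModel(toy_vocab)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

losses = []
for step in range(300):
    xb, yb = get_batch()                   # sample a fresh random batch
    _, loss = model(xb, yb)                # forward pass
    optimizer.zero_grad(set_to_none=True)  # clear gradients from the previous step
    loss.backward()                        # backward pass
    optimizer.step()                       # update parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Because batches are sampled at random, this loop counts steps, not epochs; there is no single point where the model has seen the whole dataset exactly once.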
