## Preparing a dataset for NanoGPT

Important: Make sure you run every cell in this notebook by using the "Play" button on the left-hand side of each cell before moving on to the next one.

If you have to restart the program for some reason, you will need to run the cells again from the top, because a restart clears all variables.

For any Machine Learning work, we need some data. Let's use the full text of the classic book "Alice in Wonderland" by Lewis Carroll.

Now, let's load the text and print some of it.

In [10]:
import torch
# Change 'alice.txt' below to load a different text file
with open('alice.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Let's see how many characters are in the text
print("length of dataset in characters: ", len(text))

# Let's print some of the text
print(text[180:1000])

length of dataset in characters:  148043
                            CHAPTER I

                      Down the Rabbit-Hole


  Alice was beginning to get very tired of sitting by her sister
on the bank, and of having nothing to do:  once or twice she had
peeped into the book her sister was reading, but it had no
pictures or conversations in it, `and what is the use of a book,'
thought Alice `without pictures or conversation?'

  So she was considering in her own mind (as well as she could,
for the hot day made her feel very sleepy and stupid), whether
the pleasure of making a daisy-chain would be worth the trouble
of getting up and picking the daisies, when suddenly a White
Rabbit with pink eyes ran close by her.

  There was nothing so VERY remarkable in that; nor did Alice
think it so VERY much out of the way to hear the Rabbit say to
itself, `Oh 


Since our language model will operate at the character level, let's start by listing all the unique characters that appear in the text.

In [12]:
# check whether a CUDA-capable graphics card is available
print(f"CUDA available: {torch.cuda.is_available()}")

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)


print("all the unique characters in the text: ", ''.join(chars))
print(str(vocab_size) + " different characters") 

CUDA available: False
all the unique characters in the text:  
 !"'()*,-.03:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]`abcdefghijklmnopqrstuvwxyz﻿
72 different characters
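
To see what `sorted(list(set(text)))` does on a small scale, here is a minimal sketch on a toy string. The names `sample`, `sample_chars` and `sample_vocab_size` are illustrative only and are not used elsewhere in the notebook.

```python
# Toy illustration: build a character vocabulary from a short string
sample = "hello world"

# set() keeps one copy of each character; sorted() puts them in a fixed order
sample_chars = sorted(set(sample))
sample_vocab_size = len(sample_chars)

print(sample_chars)        # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(sample_vocab_size)   # 8
```

The sorted order matters because a character's position in this list becomes its integer id in the next step.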


### Tokenization

Now we'll need a way to *tokenize* the text. Simply put, computers work more effectively with numbers than with letters or characters, so we convert the text into a sequence of numbers. This process is called *tokenization*, and it is an essential part of how language models operate. There are different ways to do this, and tokenization gets more complex with more advanced language models like ChatGPT.

However, as we are building a character-level language model, we can adopt a simple approach: we are going to convert each character into a unique integer (0, 1, 2, 3, ...).

Of course, we won't assign these numbers manually, as that would be way too much trouble. Python makes it easy for us to automate this process using a few functions.

*If you don't know anything about programming, don't panic: that isn't the focus here - just run the code using the "Play" button, observe the results and move on.*

In [3]:
stoi = { ch:i for i,ch in enumerate(chars) } # character to integer mapping (stoi is short for "string to integer")
itos = { i:ch for i,ch in enumerate(chars) } # integer to character mapping (itos is short for "integer to string")

# Function to convert a string into a list of integers (encoding)
def encode(text):
    result = []
    for character in text:
        result.append(stoi[character])
    return result

# Function to convert a list of integers back into a string (decoding)
def decode(integers):
    result = ''
    for index in integers:
        result += itos[index]
    return result

Let's test out the encoding function. Feel free to change the text inside `encode("")` to whatever you like and try it out. Note that our program only knows the characters that appear in the input text we loaded earlier, so using any other character (for example, an emoji or the digit 7) will crash the program with a `KeyError`.

In [4]:

encoded_letter_1 = encode("a")
encoded_letter_2 = encode("b")

# Note that uppercase letters are treated as different characters
encoded_letter_3 = encode("A")
encoded_letter_4 = encode("B")

encoded_text_1 = encode("Hello World!")

encoded_text_2 = encode("oldschool emotes are cooler than emojis xD")
print("encoded letter 'a': " + str(encoded_letter_1))
print("encoded letter 'b': " + str(encoded_letter_2))  
print("encoded letter 'A': " + str(encoded_letter_3))
print("encoded letter 'B': " + str(encoded_letter_4))

print("Hello World = " + str(encoded_text_1))
print("Emotes sentence = " + str(encoded_text_2))

encoded letter 'a': [45]
encoded letter 'b': [46]
encoded letter 'A': [16]
encoded letter 'B': [17]
Hello World = [23, 49, 56, 56, 59, 1, 38, 59, 62, 56, 48, 2]
Emotes sentence = [59, 56, 48, 63, 47, 52, 59, 59, 56, 1, 49, 57, 59, 64, 49, 63, 1, 45, 62, 49, 1, 47, 59, 59, 56, 49, 62, 1, 64, 52, 45, 58, 1, 49, 57, 59, 54, 53, 63, 1, 68, 19]
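
Because `stoi` only contains characters from the training text, looking up anything else raises a `KeyError`. Below is a minimal sketch of one way to guard against that, using a toy vocabulary. `safe_encode`, `toy_stoi` and `toy_itos` are hypothetical names, and replacing unknown characters with a space is just one possible policy.

```python
# Toy vocabulary standing in for the notebook's `stoi` / `itos` (assumption for illustration)
toy_chars = sorted(set("hello world"))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def safe_encode(text, fallback=" "):
    """Encode text, substituting `fallback` for characters missing from the vocabulary."""
    return [toy_stoi.get(ch, toy_stoi[fallback]) for ch in text]

ids = safe_encode("hello, world")   # the comma is not in the toy vocabulary
decoded = "".join(toy_itos[i] for i in ids)

print(ids)      # [3, 2, 4, 4, 5, 0, 0, 7, 5, 6, 4, 1]
print(decoded)  # hello  world  (the comma became a space)
```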


Let's also test out the decoding function. You can try changing some of the numbers inside `decode([])` to see how it converts back to text. 

In [5]:
decoded_letter_1 = decode(encoded_letter_1) 

decoded_text_1 = decode(encoded_text_1)

decoded_text_2 = decode(encoded_text_2)

print(decoded_letter_1)
print(decoded_text_1)
print(decoded_text_2)

a
Hello World!
oldschool emotes are cooler than emojis xD



### Encoding the full dataset
Now, let's apply that same **encoding** logic to the entire dataset we loaded earlier. We're storing the data in a PyTorch **tensor**, which is a multi-dimensional array that can be processed efficiently on graphics cards. Fast tensor math on graphics cards is a large part of what makes modern AI as effective as it is.

In [7]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)

print("printing first 200 characters of the encoded text dataset")
print(data.shape, data.dtype)
print(data[:200]) # this is how the first 200 characters of the text look to the GPT
print("You have now successfully encoded the entire dataset into a tensor of integers.")

printing first 200 characters of the encoded text dataset
torch.Size([148043]) torch.int64
tensor([71, 16, 56, 53, 47, 49,  4, 63,  1, 16, 48, 66, 49, 58, 64, 65, 62, 49,
        63,  1, 53, 58,  1, 38, 59, 58, 48, 49, 62, 56, 45, 58, 48,  0,  0,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 16, 27, 24,
        18, 20,  4, 34,  1, 16, 19, 37, 20, 29, 35, 36, 33, 20, 34,  1, 24, 29,
         1, 38, 30, 29, 19, 20, 33, 27, 16, 29, 19,  0,  0,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1, 27, 49, 67, 53, 63,  1, 18, 45, 62, 62, 59, 56, 56,  0,  0,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 35, 23, 20,
         1, 28, 24, 27, 27, 20, 29, 29, 24, 36, 28,  1, 21, 36, 27, 18, 33, 36,
        28,  1, 20, 19, 24, 35, 24, 30, 29,  1, 12, 10, 11,  0,  0,  0,  0,  0,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1])
You have now

Soon it will be time to train our "baby" GPT model. However, we don't want to train on all of our data: we don't want the model to produce an exact copy of the input text, we want it to produce text that is similar in style. So we will withhold the last 10% of the dataset as *validation data*. The model never trains on this part, so we can use it to check how well the model predicts text it has not seen before.

In [8]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
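
With this dataset, `n = int(0.9 * 148043)` is 133238, so the training set holds 133,238 tokens and the validation set the remaining 14,805. The same slicing can be checked on a toy list; the `toy_` names below are illustrative only.

```python
toy_data = list(range(20))           # stand-in for the encoded dataset
n_toy = int(0.9 * len(toy_data))     # index where the validation data begins
toy_train, toy_val = toy_data[:n_toy], toy_data[n_toy:]

print(len(toy_train), len(toy_val))  # 18 2
print(toy_val)                       # [18, 19]
```

Note that the split is not shuffled: the validation data is simply the final part of the book.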

## Saving the files 
Now that we have processed our text data and split it into training and validation sets, we need to save this data for future use in our GPT model training. 

We're going to save our data in a format called "binary", which stores each number as raw bytes instead of as typed-out digits. This is important because:

- **Smaller files**: Binary format usually takes up less space on your computer than the same numbers written out as text
- **Faster loading**: When we train our AI, it can read these files much more quickly, because nothing has to be converted first
- **Better for math**: Earlier, we converted letters to numbers. Saving those numbers in binary keeps them in the form the computer already uses internally, so they can be loaded straight into memory for training

We'll also save a "dictionary" file (meta.pkl) that remembers which number represents which character. Without this, our AI wouldn't know how to translate its number predictions back into readable text.
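
A minimal sketch of what such a saving step could look like, using only Python's standard library on a toy text. The file names `train.bin` and `val.bin`, the 16-bit integer format, the layout of the `meta` dictionary and all `toy_` names are assumptions made for illustration; only `meta.pkl` is named in the text above.

```python
import os
import pickle
import tempfile
from array import array

# Toy stand-ins for the notebook's variables (assumptions for illustration)
toy_text = "hello world"
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}
toy_ids = [toy_stoi[ch] for ch in toy_text]

n_toy = int(0.9 * len(toy_ids))
out_dir = tempfile.mkdtemp()  # hypothetical output folder

# Write each split as raw unsigned 16-bit integers ('H'), i.e. 2 bytes per token
for name, ids in (("train.bin", toy_ids[:n_toy]), ("val.bin", toy_ids[n_toy:])):
    with open(os.path.join(out_dir, name), "wb") as f:
        array("H", ids).tofile(f)

# Save the vocabulary so token ids can be turned back into characters later
meta = {"vocab_size": len(toy_chars), "stoi": toy_stoi, "itos": toy_itos}
with open(os.path.join(out_dir, "meta.pkl"), "wb") as f:
    pickle.dump(meta, f)

# Round-trip check: read the training split back and decode it
loaded = array("H")
with open(os.path.join(out_dir, "train.bin"), "rb") as f:
    loaded.frombytes(f.read())
with open(os.path.join(out_dir, "meta.pkl"), "rb") as f:
    loaded_meta = pickle.load(f)
restored = "".join(loaded_meta["itos"][i] for i in loaded)
print(restored)  # hello wor
```

Two bytes per token is more than enough here, since a 72-character vocabulary only needs ids from 0 to 71.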

In [9]:
# Preview one training chunk: block_size input tokens plus the token that follows them
block_size = 8
train_data[:block_size+1]

tensor([71, 16, 56, 53, 47, 49,  4, 63,  1])
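
A chunk of `block_size + 1` tokens contains `block_size` separate training examples: at each position, the model sees everything up to that point and has to predict the next token. The sketch below spells this out for the nine token ids shown above, using plain Python lists; the names `chunk`, `context_len` and `pairs` are illustrative.

```python
# The nine token ids shown above: 8 inputs plus one extra for the last target
chunk = [71, 16, 56, 53, 47, 49, 4, 63, 1]
context_len = 8

x = chunk[:context_len]        # inputs
y = chunk[1:context_len + 1]   # targets: the same sequence shifted left by one

pairs = []
for t in range(context_len):
    context = x[:t + 1]        # everything the model has seen so far
    target = y[t]              # the token it should predict next
    pairs.append((context, target))
    print(f"when input is {context} the target is {target}")
```

This is the same one-position shift that `get_batch` applies later when it builds `x` from `data[i:i+block_size]` and `y` from `data[i+1:i+block_size+1]`.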

## Defining our model
We are using a text prediction model, `BabyGPT`, that we import from `model.py`.

In [14]:
import torch
import torch.nn as nn
from torch.nn import functional as F
from model import BabyGPT  # Import the model from separate file

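# hyperparameters
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # use the graphics card if one is available
batch_size = 32         # number of text chunks processed in parallel per training step
block_size = 64         # number of characters of context the model sees at once
eval_interval = 5       # seconds between evaluations during training
eval_iters = 100        # number of batches averaged for each loss estimate
dropout = 0.1           # each neuron output has a 10% chance of being set to zero during training
n_embd = 64             # size of the vector that represents each character
n_head = 4              # number of attention heads per layer
n_layer = 4             # number of transformer layers
learning_rate = 1.5e-3  # step size used by the optimizer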

print(f"Vocabulary size: {vocab_size} unique characters")
print(f"Training will run on: {device}")

# Initialize model with the vocabulary size
torch.manual_seed(1337)
model = BabyGPT(vocab_size, block_size, n_embd, n_head, n_layer, dropout)
m = model.to(device)
print(f"Model has {sum(p.numel() for p in m.parameters())/1e6:.2f}M parameters")

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

Vocabulary size: 72 unique characters
Training will run on: cpu
Model has 0.21M parameters
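
As a plausibility check on the reported size, the parameter count can be estimated by hand. `model.py` is not shown here, so this sketch assumes a standard GPT layout (token and position embeddings, roughly `12 * n_embd**2` weights per transformer layer, and an output projection) and ignores biases and layer-norm parameters. The real `BabyGPT` may differ in detail.

```python
# Rough parameter estimate, ASSUMING a standard GPT layout (model.py is not shown)
vocab_size, block_size, n_embd, n_layer = 72, 64, 64, 4

token_embedding = vocab_size * n_embd      # one vector per character
position_embedding = block_size * n_embd   # one vector per position in the context
per_layer = 12 * n_embd ** 2               # ~4*d^2 for attention + ~8*d^2 for the feed-forward part
output_head = n_embd * vocab_size          # final projection back to character scores

approx_total = token_embedding + position_embedding + n_layer * per_layer + output_head
print(f"{approx_total / 1e6:.2f}M parameters (approximate)")  # 0.21M parameters (approximate)
```

Almost all of the parameters sit in the transformer layers; the embeddings and output head together contribute only a few thousand.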


## Run this to begin training

In [11]:
import time

# Set training time in seconds (e.g., 5 minutes = 300 seconds)
max_train_time = 300  

# Create PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Track metrics for visualization later
train_losses = []
val_losses = []
timestamps = []

# Get starting time
start_time = time.time()
iter_count = 0
next_eval_time = eval_interval  # elapsed seconds at which the next evaluation is due

# Train until time limit is reached
while (time.time() - start_time) < max_train_time:
    # Check if it's time to evaluate (at most once per eval_interval seconds)
    elapsed = time.time() - start_time
    if elapsed >= next_eval_time:

        # Evaluate model
        losses = estimate_loss()
        print(f"Time: {elapsed:.1f}s, Iteration: {iter_count}, Training prediction error: {losses['train']:.4f}, Generalization error: {losses['val']:.4f}")
        
        # Store metrics for potential visualization
        timestamps.append(elapsed)
        train_losses.append(losses['train'])
        val_losses.append(losses['val'])

        # Schedule the next evaluation, skipping any slots that passed while evaluating
        while next_eval_time <= time.time() - start_time:
            next_eval_time += eval_interval

    
    # Regular training iteration
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    iter_count += 1

# Final evaluation
elapsed = time.time() - start_time
losses = estimate_loss()
print(f"Final results after {elapsed:.1f} seconds ({iter_count} iterations):")
print(f"Training data prediction error: {losses['train']:.4f}, New text prediction error: {losses['val']:.4f}")

print(f"Your system processed {iter_count / elapsed:.2f} iterations per second")

Time: 5.0s, Iteration: 64, Training prediction error: 2.5811, Generalization error: 2.6050
Time: 10.0s, Iteration: 102, Training prediction error: 2.4724, Generalization error: 2.5009
Time: 15.0s, Iteration: 141, Training prediction error: 2.4190, Generalization error: 2.4394
Time: 20.0s, Iteration: 173, Training prediction error: 2.3600, Generalization error: 2.3747
Time: 25.1s, Iteration: 206, Training prediction error: 2.3303, Generalization error: 2.3518
Time: 30.0s, Iteration: 238, Training prediction error: 2.2862, Generalization error: 2.3086
Time: 35.1s, Iteration: 272, Training prediction error: 2.2110, Generalization error: 2.2305
Time: 40.0s, Iteration: 303, Training prediction error: 2.1516, Generalization error: 2.1604
Time: 45.0s, Iteration: 335, Training prediction error: 2.0828, Generalization error: 2.1084
Time: 50.0s, Iteration: 368, Training prediction error: 2.0153, Generalization error: 2.0506
Time: 55.0s, Iteration: 399, Training prediction error: 1.9618, Generali

- **Train loss (training prediction error)**: how well the model predicts the next character in text it has seen during training
- **Validation loss (generalization error)**: how well the model predicts the next character in text it hasn't seen before

### Rough conversion to accuracy
Lower values mean better predictions. The figures below are only a rough rule of thumb:

- A loss of 1.0 ≈ 60-70% accuracy in next-character prediction
- A loss of 2.0 ≈ 40-50% accuracy
- A loss of 3.0 ≈ 20-30% accuracy
- Guessing uniformly at random among the 72 characters would give a loss of about 4.28 (the natural logarithm of 72, assuming the usual cross-entropy loss), with ~1.4% accuracy
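
The random-guessing baseline can be worked out directly. This sketch assumes the loss is the usual cross-entropy measured in nats (as PyTorch's `F.cross_entropy` computes it); `model.py` is not shown, so that is an assumption. Under it, `exp(-loss)` is the typical (geometric-mean) probability the model assigns to the correct character, which is related to accuracy but not the same thing.

```python
import math

vocab_size = 72  # number of unique characters found earlier

# Cross-entropy loss of a model that gives every character the same probability
uniform_loss = math.log(vocab_size)
print(f"uniform-guess loss: {uniform_loss:.2f}")  # uniform-guess loss: 4.28

# exp(-loss) is the typical probability assigned to the correct next character
for loss in (3.0, 2.0, 1.0):
    print(f"loss {loss:.1f} -> probability of the correct character: {math.exp(-loss):.1%}")
```

A model whose loss is well below 4.28 has therefore learned something about the text, even if its output still looks rough.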


## Function to generate text based on a prompt

In [None]:
def generate_text(prompt_text, max_new_tokens=500):
    """
    Generate text from a plain text prompt
    
    Parameters:
    - prompt_text: The starting text for generation
    - max_new_tokens: How many characters to generate
    
    Returns:
    - The generated text including the original prompt
    """
    # Check if all characters in the prompt are in our vocabulary
    for char in prompt_text:
        if char not in stoi:
            print(f"Warning: Character '{char}' not in vocabulary. Replacing with a space.")
            prompt_text = prompt_text.replace(char, " ")
    
    # Encode the prompt text into tokens
    tokens = encode(prompt_text)
    
    # Convert to tensor and move to the correct device
    context = torch.tensor([tokens], dtype=torch.long, device=device)
    
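    # Generate new tokens (eval mode turns dropout off; no_grad skips gradient tracking)
    m.eval()
    with torch.no_grad():
        generated_tokens = m.generate(context, max_new_tokens=max_new_tokens)[0].tolist()
    m.train()  # restore training mode in case training continues afterwards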
    
    # Decode back to text
    generated_text = decode(generated_tokens)
    
    return generated_text

# Example usage - replace "Alice was" with any starting text
print(generate_text("Alice was", max_new_tokens=1000))
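
The `generate` method lives in `model.py` and is not shown here. Text generation of this kind usually works one character at a time: ask the model for the probabilities of the next character, sample one, append it, and repeat. The toy sketch below illustrates that loop with a hand-written probability table in place of a trained model; `toy_next_char_probs` and `toy_generate` are hypothetical names and do not reflect how `BabyGPT` is implemented.

```python
import random

random.seed(0)  # fixed seed so the toy run is repeatable

def toy_next_char_probs(context):
    """Hypothetical stand-in for a trained model: returns {character: probability}."""
    # After a vowel favor consonants or a space, otherwise favor vowels (illustration only)
    if context and context[-1] in "aeiou":
        return {"l": 0.5, "c": 0.3, " ": 0.2}
    return {"a": 0.4, "i": 0.3, "e": 0.3}

def toy_generate(prompt, max_new_tokens=20):
    text = prompt
    for _ in range(max_new_tokens):
        probs = toy_next_char_probs(text)
        chars, weights = zip(*probs.items())
        # Sample one character according to its probability, then feed it back in
        text += random.choices(chars, weights=weights, k=1)[0]
    return text

sample = toy_generate("Ali")
print(sample)
```

Because each new character is sampled rather than always taking the most likely one, running generation again without a fixed seed gives different text each time.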

### References
Adapted from Andrej Karpathy's video (https://www.youtube.com/watch?v=kCc8FmEb1nY) and accompanying notebook.