## Preparing a dataset for NanoGPT

Important: Run every cell in this notebook in order, using the "Play" button on the left-hand side of each cell, before moving on to the next one.

If you have to restart the notebook for some reason, run the cells again from the top, because all variables are lost on restart.

For any machine learning work, we need some data. Let's use the full text of the classic book "Alice in Wonderland" by Lewis Carroll.

Let's load the file and print the first 1,000 characters.

In [1]:
# Change 'alice.txt' below if your text file has a different name
with open('alice.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Let's see how many characters are in the text
print("length of dataset in characters: ", len(text))

# Let's print some of the text
print(text[:1000])

length of dataset in characters:  143833
﻿CHAPTER I.
Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”

So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit

Since our language model will operate at the character level, let's start by listing all the unique characters that appear in the text.

In [2]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)

print("all the unique characters in the text: ", ''.join(chars))
print(str(vocab_size) + " different characters") 

all the unique characters in the text:  
 !()*,-.:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]abcdefghijklmnopqrstuvwxyzù—‘’“”﻿
73 different characters


### Tokenization

Now we need a way to *tokenize* the text. Language models operate on numbers, not on letters, so the text must first be converted into a sequence of numbers. This conversion is called *tokenization*, and it is an essential part of how language models work. There are different ways to do it, and more advanced language models like ChatGPT use more complex tokenization schemes.

However, as we are building a character-level language model, we can adopt a simple approach: we convert each character into a unique integer (0, 1, 2, ...), based on its position in the sorted list of unique characters above.

Of course, we won't assign these numbers manually, as that would be way too much trouble. Python makes it easy to automate this with two dictionaries and two short functions.

*If you don't know anything about programming, don't panic: that isn't the focus here - just run the code using the "Play" button, observe the results and move on.*

In [3]:
stoi = { ch:i for i,ch in enumerate(chars) } # character to integer mapping (stoi is short for "string to integer")
itos = { i:ch for i,ch in enumerate(chars) } # integer to character mapping (itos is short for "integer to string")

# Function to convert a string into a list of integers (encoding)
def encode(text):
    result = []
    for character in text:
        result.append(stoi[character])
    return result

# Function to convert a list of integers back into a string (decoding)
def decode(integers):
    result = ''
    for index in integers:
        result += itos[index]
    return result

Let's test the encoding function. Feel free to change the text inside `encode("")` to whatever you like and try it out. Note that our program only knows the characters that appear in the input text we loaded earlier, so encoding any other character (for example a digit or an emoji) raises a `KeyError`.

In [4]:

encoded_letter_1 = encode("a")
encoded_letter_2 = encode("b")

# Note that uppercase letters are treated as different characters
encoded_letter_3 = encode("A")
encoded_letter_4 = encode("B")

encoded_text_1 = encode("Hello World!")

encoded_text_2 = encode("oldschool emotes are cooler than emojis xD")
print("encoded letter 'a': " + str(encoded_letter_1))
print("encoded letter 'b': " + str(encoded_letter_2))  
print("encoded letter 'A': " + str(encoded_letter_3))
print("encoded letter 'B': " + str(encoded_letter_4))

print("Hello World! = " + str(encoded_text_1))
print("oldschool emotes are cooler than emojis xD = " + str(encoded_text_2))

encoded letter 'a': [40]
encoded letter 'b': [41]
encoded letter 'A': [12]
encoded letter 'B': [13]
Hello World! = [19, 44, 51, 51, 54, 1, 34, 54, 57, 51, 43, 2]
oldschool emotes are cooler than emojis xD = [54, 51, 43, 58, 42, 47, 54, 54, 51, 1, 44, 52, 54, 59, 44, 58, 1, 40, 57, 44, 1, 42, 54, 54, 51, 44, 57, 1, 59, 47, 40, 53, 1, 44, 52, 54, 49, 48, 58, 1, 63, 15]
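
As noted above, `encode` only knows characters that appear in the training text. The sketch below uses a tiny toy vocabulary to show what happens when an unknown character is encoded, and one way to catch the error. The names `toy_chars`, `toy_stoi` and `toy_encode` are made up for this illustration and are not part of the notebook.

```python
# Toy vocabulary for illustration only; the notebook builds `stoi` from the full text
toy_chars = sorted(set("hello world"))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    # Same idea as the notebook's encode(): look up every character in the mapping
    return [toy_stoi[c] for c in s]

missing = None
try:
    toy_encode("hello 123")  # the digits are not in the toy vocabulary
except KeyError as err:
    # A dictionary lookup with an unknown key raises KeyError; err.args[0] is that key
    missing = err.args[0]
    print(f"Character {missing!r} is not in the vocabulary")
```

Encoding stops at the first unknown character, so only that character is reported.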


Let's also test out the decoding function. You can try changing some of the numbers inside `decode([])` to see how it converts back to text. 

In [5]:
decoded_letter_1 = decode(encoded_letter_1) 

decoded_text_1 = decode(encoded_text_1)

decoded_text_2 = decode(encoded_text_2)

print(decoded_letter_1)
print(decoded_text_1)
print(decoded_text_2)

a
Hello World!
oldschool emotes are cooler than emojis xD


### Encoding the full dataset
Now, let's apply the same **encoding** logic to the entire dataset we loaded earlier. We store the result in a PyTorch **tensor**, a multi-dimensional array that can be processed efficiently on graphics cards (GPUs). Fast tensor computation on GPUs is a large part of what makes modern AI practical.

In [7]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)

print("printing first 200 characters of the encoded text dataset")
print(data.shape, data.dtype)
print(data[:200]) # to the GPT, the first 200 characters of the text we looked at earlier look like this
print("You have now successfully encoded the entire dataset into a tensor of integers.")

printing first 200 characters of the encoded text dataset
torch.Size([143833]) torch.int64
tensor([72, 14, 19, 12, 27, 31, 16, 29,  1, 20,  8,  0, 15, 54, 62, 53,  1, 59,
        47, 44,  1, 29, 40, 41, 41, 48, 59,  7, 19, 54, 51, 44,  0, 12, 51, 48,
        42, 44,  1, 62, 40, 58,  1, 41, 44, 46, 48, 53, 53, 48, 53, 46,  1, 59,
        54,  1, 46, 44, 59,  1, 61, 44, 57, 64,  1, 59, 48, 57, 44, 43,  1, 54,
        45,  1, 58, 48, 59, 59, 48, 53, 46,  1, 41, 64,  1, 47, 44, 57,  1, 58,
        48, 58, 59, 44, 57,  1, 54, 53,  1, 59, 47, 44,  1, 41, 40, 53, 50,  6,
         1, 40, 53, 43,  1, 54, 45,  1, 47, 40, 61, 48, 53, 46,  1, 53, 54, 59,
        47, 48, 53, 46,  1, 59, 54,  1, 43, 54,  9,  1, 54, 53, 42, 44,  1, 54,
        57,  1, 59, 62, 48, 42, 44,  1, 58, 47, 44,  1, 47, 40, 43,  1, 55, 44,
        44, 55, 44, 43,  1, 48, 53, 59, 54,  1, 59, 47, 44,  1, 41, 54, 54, 50,
         1, 47, 44, 57,  1, 58, 48, 58, 59, 44, 57,  1, 62, 40, 58,  1, 57, 44,
        40, 43])
You have now successfully encoded the entire dataset into a tensor of integers.

Soon it will be time to train our "baby" GPT model. However, we don't want to train on all of our data. We don't want the model to memorize the input text; we want it to produce text that is similar in style. So we withhold the last 10% of the dataset as *validation data*: text the model never trains on, which we use to check how well it predicts text it hasn't seen before.

In [8]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

## Saving the files
Now that we have processed our text data and split it into training and validation sets, we can save this data for future use in our GPT model training.

We're going to save our data in a raw "binary" format: the numbers are written to disk directly as fixed-size integers rather than as human-readable text. This is useful because:

- **Smaller files**: each token takes a small, fixed number of bytes, whereas writing the numbers out as text would take several characters per token
- **Faster loading**: when we train our AI, the numbers can be read straight into memory without parsing any text
- **Ready for training**: the file already contains the integers from the encoding step, so nothing needs to be re-encoded when training starts

We'll also save a "dictionary" file (meta.pkl) that records which number represents which character. Without this, we couldn't translate the model's number predictions back into readable text.
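
The saving step described above could look roughly like the sketch below. It is not the notebook's own code. The file names `train.bin` and `val.bin`, the `uint16` data type, and the keys stored in `meta.pkl` are assumptions borrowed from common nanoGPT-style data preparation scripts. To keep the example self-contained it uses a short toy string and a temporary folder instead of the real dataset.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy stand-ins for the notebook's `text`, `stoi` and `itos` (illustration only)
toy_text = "alice was beginning to get very tired"
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}
toy_ids = [toy_stoi[ch] for ch in toy_text]

# Same 90/10 split as in the notebook
n = int(0.9 * len(toy_ids))
train_ids = np.array(toy_ids[:n], dtype=np.uint16)  # assumed dtype: 2 bytes per token
val_ids = np.array(toy_ids[n:], dtype=np.uint16)

# Write everything to a temporary folder (assumed file names)
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.bin")
val_path = os.path.join(out_dir, "val.bin")
meta_path = os.path.join(out_dir, "meta.pkl")

train_ids.tofile(train_path)  # raw binary dump of the integers
val_ids.tofile(val_path)

meta = {"vocab_size": len(toy_chars), "itos": toy_itos, "stoi": toy_stoi}
with open(meta_path, "wb") as f:
    pickle.dump(meta, f)

# Read the files back and turn the numbers into text again
loaded_ids = np.fromfile(train_path, dtype=np.uint16)
with open(meta_path, "rb") as f:
    loaded_meta = pickle.load(f)
restored = "".join(loaded_meta["itos"][int(i)] for i in loaded_ids)

print(restored)
print(os.path.getsize(train_path), "bytes for", n, "tokens")
```

With `uint16`, every token takes exactly two bytes on disk, which is enough for any vocabulary of up to 65,536 tokens.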

In [9]:
block_size = 8
train_data[:block_size+1]

tensor([72, 14, 19, 12, 27, 31, 16, 29,  1])
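
The cell above takes `block_size + 1` tokens rather than `block_size`. A likely reason, consistent with how `get_batch` builds its inputs and targets later in this notebook, is that a chunk of 9 tokens contains 8 separate next-character examples. The sketch below spells them out using plain Python lists; the variable names `chunk` and `examples` are only for illustration.

```python
# The first block_size + 1 tokens, copied from the output above
chunk = [72, 14, 19, 12, 27, 31, 16, 29, 1]
block_size = 8

x = chunk[:block_size]         # inputs: tokens 0..7
y = chunk[1:block_size + 1]    # targets: the same tokens shifted by one position

examples = []
for t in range(block_size):
    context = x[:t + 1]        # everything the model has seen so far
    target = y[t]              # the token it should predict next
    examples.append((context, target))
    print(f"when the input is {context} the target is {target}")
```

Each position in the chunk therefore contributes one training example, with contexts ranging from a single token up to the full block size.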

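## Defining our model
We use a small text-prediction model, `BabyGPT`, which we import from model.py. The cell below also sets the hyperparameters and defines two helper functions: `get_batch`, which samples random training examples, and `estimate_loss`, which averages the loss over several batches.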

In [10]:
import torch
import torch.nn as nn
from torch.nn import functional as F
from model import BabyGPT  # Import the model from separate file

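# hyperparameters
device = 'cuda' if torch.cuda.is_available() else 'cpu' # use the GPU if one is available
batch_size = 32        # number of sequences processed in parallel per training step
block_size = 64        # maximum context length (in characters) used for a prediction
eval_interval = 5      # seconds between loss evaluations during training
eval_iters = 100       # number of batches averaged for each loss estimate
dropout = 0.1          # during training, 10% of the activations are randomly set to zero
n_embd = 64            # size of each embedding vector
n_head = 4             # number of attention heads
n_layer = 4            # number of transformer layers
learning_rate = 1.5e-3 # step size used by the optimizer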

print(f"Vocabulary size: {vocab_size} unique characters")
print(f"Training will run on: {device}")

# Initialize model with the vocabulary size
torch.manual_seed(1337)
model = BabyGPT(vocab_size, block_size, n_embd, n_head, n_layer, dropout)
m = model.to(device)
print(f"Model has {sum(p.numel() for p in m.parameters())/1e6:.2f}M parameters")

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

Vocabulary size: 73 unique characters
Training will run on: cuda
Model has 0.21M parameters


## Run this to begin training

In [11]:
import time

# Set training time in seconds (e.g., 5 minutes = 300 seconds)
max_train_time = 300  

# Create PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Track metrics for visualization later
train_losses = []
val_losses = []
timestamps = []

# Get starting time
start_time = time.time()
iter_count = 0

# Train until time limit is reached
next_eval_time = eval_interval  # elapsed seconds at which the next evaluation is due
while (time.time() - start_time) < max_train_time:
    # Check if it's time to evaluate
    elapsed = time.time() - start_time
    if elapsed >= next_eval_time:

        # Evaluate model
        losses = estimate_loss()
        print(f"Time: {elapsed:.1f}s, Iteration: {iter_count}, Training prediction error: {losses['train']:.4f}, Generalization error: {losses['val']:.4f}")

        # Store metrics (as plain Python floats) for potential visualization
        timestamps.append(elapsed)
        train_losses.append(losses['train'].item())
        val_losses.append(losses['val'].item())

        # Schedule the next evaluation so that each interval is evaluated only once
        next_eval_time += eval_interval

    
    # Regular training iteration
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    iter_count += 1

# Final evaluation
elapsed = time.time() - start_time
losses = estimate_loss()
print(f"Final results after {elapsed:.1f} seconds ({iter_count} iterations):")
print(f"Training data prediction error: {losses['train']:.4f}, New text prediction error: {losses['val']:.4f}")

print(f"Your system processed {iter_count / elapsed:.2f} iterations per second")

Time: 5.0s, Iteration: 226, Training prediction error: 2.2707, Generalization error: 2.2903
Time: 10.0s, Iteration: 406, Training prediction error: 1.9795, Generalization error: 2.0152
Time: 15.0s, Iteration: 603, Training prediction error: 1.7185, Generalization error: 1.7833
Time: 20.0s, Iteration: 799, Training prediction error: 1.6011, Generalization error: 1.6768
Time: 25.0s, Iteration: 987, Training prediction error: 1.5244, Generalization error: 1.6206
Time: 30.0s, Iteration: 1156, Training prediction error: 1.4774, Generalization error: 1.5966
Time: 35.0s, Iteration: 1284, Training prediction error: 1.4459, Generalization error: 1.5638
Time: 40.0s, Iteration: 1311, Training prediction error: 1.4276, Generalization error: 1.5511
Time: 45.0s, Iteration: 1341, Training prediction error: 1.4251, Generalization error: 1.5470
Time: 50.0s, Iteration: 1363, Training prediction error: 1.4289, Generalization error: 1.5515
Time: 55.0s, Iteration: 1406, Training prediction error: 1.4083, G

- **Training loss (training prediction error)**: how well the model predicts the next character in text it has seen during training
- **Validation loss (generalization error)**: how well the model predicts the next character in text it hasn't seen before

**Rough conversion to accuracy**

Lower values mean better predictions. As a very rough guide:

- A loss of 1.0 ≈ 60-70% accuracy in next-character prediction
- A loss of 2.0 ≈ 40-50% accuracy
- A loss of 3.0 ≈ 20-30% accuracy
- Random guessing over our 73 characters would give a loss of about 4.3 (with ~1.4% accuracy), assuming the loss is the usual cross-entropy
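
The random-guessing figure can be checked with a few lines of arithmetic. The sketch below assumes that the loss reported by `BabyGPT` is the usual cross-entropy measured with natural logarithms; model.py is not shown here, so this is an assumption. Under that assumption, `exp(-loss)` is the typical (geometric-mean) probability the model gives to the correct character. That is not the same thing as accuracy, which counts how often the top guess is right and is usually higher.

```python
import math

vocab_size = 73  # number of unique characters reported earlier in this notebook

# Loss of a model that gives every character the same probability
random_guess_loss = math.log(vocab_size)
print(f"random guessing: loss = {random_guess_loss:.2f}")

def loss_to_probability(loss):
    """Typical (geometric-mean) probability given to the correct character."""
    return math.exp(-loss)

for loss in (1.0, 2.0, 3.0, random_guess_loss):
    prob = loss_to_probability(loss)
    print(f"loss {loss:.2f} -> correct character gets probability of about {prob:.1%}")
```

A useful sanity check: if the very first evaluation during training reports a loss far above `random_guess_loss`, something is likely wrong with the setup.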


## Function to generate text based on a prompt

In [12]:
def generate_text(prompt_text, max_new_tokens=500):
    """
    Generate text from a plain text prompt
    
    Parameters:
    - prompt_text: The starting text for generation
    - max_new_tokens: How many characters to generate
    
    Returns:
    - The generated text including the original prompt
    """
    # Check if all characters in the prompt are in our vocabulary
    for char in prompt_text:
        if char not in stoi:
            print(f"Warning: Character '{char}' not in vocabulary. Replacing with a space.")
            prompt_text = prompt_text.replace(char, " ")
    
    # Encode the prompt text into tokens
    tokens = encode(prompt_text)
    
    # Convert to tensor and move to the correct device
    context = torch.tensor([tokens], dtype=torch.long, device=device)
    
    # Generate new tokens (eval mode turns off dropout; no_grad skips gradient tracking)
    m.eval()
    with torch.no_grad():
        generated_tokens = m.generate(context, max_new_tokens=max_new_tokens)[0].tolist()
    
    # Decode back to text
    generated_text = decode(generated_tokens)
    
    return generated_text

# Example usage - replace "Alice was" with any starting text
print(generate_text("Alice was", max_new_tokens=1000))

Alice was groned.’ I curner happen it out.”

Then the footm-et of the work too histelf up to hisint tone choken the paoxactably of the way again, neverely to be of sombody. Histers if would lest over, and rest once the Pootmancing. Alice looked at once of that a goodlow under a tone she found it back she was noticed: and leave at the reye sat child saw reeabout to squite are had openir ohdsed—
Too say the spot down enibbly as clearn she could leady about, with the might be angan off to sin the glife this watcher of processing, yould far it da sand, which an rubbed at their find it over now, but in a it off hurighn shout for any of more again! I can look and room both the Catt much sirt of broomse-tirsts evered a mementeer nof roff oldowled on a time.” She said to herself likely again at oning the other the four—
He speakes, sading on the out, without which says that she was again exncs of thung a froser, so being then looked at herself, and rather all Tming usal can her as present matt

### References
Adapted from Andrej Karpathy's video (https://www.youtube.com/watch?v=kCc8FmEb1nY) and the accompanying notebook.