# Tutorial 1: nanoGPT - Word-Level Text Generation

Welcome to the nanoGPT tutorial! In this notebook, we will train a simple word-level language model to generate text.

**Goal:** Understand how to train a small transformer model for text generation on a CPU.

**Key Concepts:**
- Word-level tokenization
- Preparing data for language modeling
- Building a GPT-style transformer from scratch (simplified)
- A basic PyTorch training loop
- Generating text with the trained model

## 1. Setup and Imports

First, let's import the necessary libraries. We'll need `torch` for building and training our model.

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
from transformers import GPT2Config, GPT2LMHeadModel
import math
import random
import numpy as np

# For reproducibility
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)



<torch._C.Generator at 0x11b2fe450>

## 2. Configuration

We'll define some hyperparameters for our model and training process. Since we're running on CPU and want this to be quick for a tutorial, these values will be small.

In [2]:
# Hyperparameters
batch_size = 32       # How many independent sequences will we process in parallel 
block_size = 10      # What is the maximum context length for predictions?
max_iters = 5000      # Total training iterations
eval_interval = 500   # How often to evaluate on validation set
learning_rate = 5e-4  # Learning rate for the optimizer
device = 'cpu'        # Explicitly set to CPU
eval_iters = 100      # Number of iterations for evaluation
n_embd = 128          # Embedding dimension (reduced for CPU)
n_head = 4            # Number of attention heads (reduced for CPU)
n_layer = 4           # Number of transformer blocks (reduced for CPU)
dropout = 0.0          # Dropout rate>0 to prevent overfitting

## 3. Data Preparation

We'll use a small text file as our dataset. For this tutorial, let's use a snippet of Shakespeare's writings.

### 3.1 Load Data

In [3]:
# We'll use a small snippet of text for this tutorial.
# You can replace this with the path to a larger .txt file if you wish.
text = """To stale 't a little more.

First Citizen:
Well, I'll hear it, sir: yet you must not think to
fob off our disgrace with a tale: but, an 't please
you, deliver.

MENENIUS:
There was a time when all the body's members
Rebell'd against the belly, thus accused it:
That only like a gulf it did remain
I' the midst o' the body, idle and unactive,
Still cupboarding the viand, never bearing
Like labour with the rest, where the other instruments
Did see and hear, devise, instruct, walk, feel,
And, mutually participate, did minister
Unto the appetite and affection common
Of the whole body. The belly answer'd--

First Citizen:
Well, sir, what answer made the belly?

MENENIUS:
Sir, I shall tell you. With a kind of smile,
Which ne'er came from the lungs, but even thus--
For, look you, I may make the belly smile
As well as speak--it tauntingly replied
To the discontented members, the mutinous parts
That envied his receipt; even so most fitly
As you malign our senators for that
They are not such as you."""

### 3.2 Word-level Tokenization

Since this is a word-level model, our vocabulary will consist of all unique words present in the text. We'll create mappings from words to integers (encode) and integers to words (decode).

In [4]:
# Get all unique words in the text by regular expression
import re 

# Simple word tokenization - split on whitespace and punctuation
words = re.findall(r'\b\w+\b|[^\w\s]', text.lower())
unique_words = sorted(list(set(words)))
vocab_size = len(unique_words)

print("Sample words:", unique_words[:20])
print(f"Vocabulary size: {vocab_size}")

# Create a mapping from words to integers and vice-versa
stoi = {word: i for i, word in enumerate(unique_words)}
itos = {i: word for i, word in enumerate(unique_words)}

def encode(s):
    """Encoder: take a string, output a list of integers"""
    words = re.findall(r'\b\w+\b|[^\w\s]', s.lower())
    return [stoi[word] for word in words if word in stoi]

def decode(l):
    """Decoder: take a list of integers, output a string"""
    words = [itos[i] for i in l]
    # Simple reconstruction - join with spaces, handle punctuation
    result = ""
    for i, word in enumerate(words):
        if word in ".,!?;:":
            result += word
        elif i == 0:
            result += word
        else:
            result += " " + word
    return result

# Test encoding and decoding
test_text = "an answer"
encoded = encode(test_text)
decoded = decode(encoded)
print(f"Original: {test_text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")

Sample words: ["'", ',', '-', '.', ':', ';', '?', 'a', 'accused', 'affection', 'against', 'all', 'an', 'and', 'answer', 'appetite', 'are', 'as', 'bearing', 'belly']
Vocabulary size: 123
Original: an answer
Encoded: [12, 14]
Decoded: an answer


#### Explanation: Tokenizer and Tokens

A **tokenizer** is a crucial component in Natural Language Processing (NLP). Its primary job is to break down a piece of text (like a sentence or a paragraph) into smaller units called **tokens**. These tokens can be words, sub-words, or even individual characters, depending on the type of tokenizer used.

Think of it like this: computers don't understand words and sentences directly. They understand numbers. So, a tokenizer first segments the text and then, in conjunction with a vocabulary mapping (like our `stoi` and `itos` dictionaries), converts these tokens into numerical representations that a machine learning model can process.

In this notebook, we are using a **word-level tokenizer**. This means each token is a unique word or punctuation mark found in our text.
- The `re.findall(r'\b\w+\b|[^\w\s]', text.lower())` line is our simple tokenizer. It uses regular expressions to find sequences of word characters (`\w+`) or any character that is not a word character or whitespace (`[^\w\s]`). This helps us capture both words and punctuation marks as separate tokens.
- `stoi` (string to integer) maps each unique token (word/punctuation) to a unique integer.
- `itos` (integer to string) does the reverse, mapping integers back to their original tokens.

This process is fundamental because it allows the model to learn patterns and relationships between these numerical representations of words.

### 3.3 Create Training and Validation Splits

We'll split our dataset into a training set and a validation set. The model learns from the training set, and we use the validation set to check how well it's generalizing.

In [5]:
# Encode the entire text dataset and store it into a torch.Tensor
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Data shape: {data.shape}, Data type: {data.dtype}")
print(f"Sample encoded text: {data[:20].tolist()}")
print(f"Sample decoded text: {decode(data[:20].tolist())}")

# Split up the data into train and validation sets
n = int(0.5*len(data)) # first 50% will be train, rest val
train_data = data[:n]
val_data = data[n:]

print(f"Training data length: {len(train_data)} words")
print(f"Validation data length: {len(val_data)} words")

Data shape: torch.Size([238]), Data type: torch.int64
Sample encoded text: [108, 94, 0, 97, 7, 52, 64, 3, 36, 23, 4, 114, 1, 44, 0, 53, 42, 48, 1, 90]
Sample decoded text: to stale ' t a little more. first citizen: well, i ' ll hear it, sir
Training data length: 119 words
Validation data length: 119 words


### 3.4 Data Loader

We need a way to feed data to our model in batches. The `get_batch` function will randomly sample `batch_size` chunks of `block_size` length from the data.

In [6]:
#  Data loading function 
def get_batch(split):
    # Generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

# Example of a batch
xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print('targets:')
print(yb.shape)

print('----')
print("Word-level training examples:")
for b in range(min(2, batch_size)):  # Show 2 examples
    for t in range(min(8, block_size)):  # Show first 8 positions
        context_words = decode(xb[b, :t+1].tolist())
        target_word = itos[yb[b,t].item()]
        print(f"Context: '{context_words}' -> Target: '{target_word}'")
        if t == 7:  # Only show first 8 examples
            break
    print()
    if b == 1:  # Only show for first 2 batch items
        break

inputs:
torch.Size([32, 10])
targets:
torch.Size([32, 10])
----
Word-level training examples:
Context: 'instruments' -> Target: 'did'
Context: 'instruments did' -> Target: 'see'
Context: 'instruments did see' -> Target: 'and'
Context: 'instruments did see and' -> Target: 'hear'
Context: 'instruments did see and hear' -> Target: ','
Context: 'instruments did see and hear,' -> Target: 'devise'
Context: 'instruments did see and hear, devise' -> Target: ','
Context: 'instruments did see and hear, devise,' -> Target: 'instruct'

Context: 'only' -> Target: 'like'
Context: 'only like' -> Target: 'a'
Context: 'only like a' -> Target: 'gulf'
Context: 'only like a gulf' -> Target: 'it'
Context: 'only like a gulf it' -> Target: 'did'
Context: 'only like a gulf it did' -> Target: 'remain'
Context: 'only like a gulf it did remain' -> Target: 'i'
Context: 'only like a gulf it did remain i' -> Target: '''



#### Explanation: `get_batch` - Why X and Y have the same length

In the `get_batch` function, both the input `x` and the target `y` are sequences of `block_size` length. This might seem counterintuitive at first, as we are trying to predict the *next* word. Let's clarify:

The task of a language model is to predict the next token in a sequence, given the preceding tokens.
- `x` represents the **input context**: a sequence of words the model sees.
- `y` represents the **target output**: for each position in `x`, `y` contains the word that *immediately follows* it in the original text.

Consider a `block_size` of 10.
If `x` is `[word_1, word_2, ..., word_10]`,
then `y` will be `[word_2, word_3, ..., word_11]`.

So, for each element `x[i]` in the input sequence (at a specific time step `t` within the block), the corresponding `y[i]` is the actual next word that the model should have predicted.

For example, when the model sees:
- `x[0]` (i.e., `word_1`), it tries to predict `y[0]` (i.e., `word_2`).
- `x[0:1]` (i.e., `word_1, word_2`), it tries to predict `y[1]` (i.e., `word_3`).
- ...
- `x[0:9]` (i.e., `word_1, ..., word_10`), it tries to predict `y[9]` (i.e., `word_11`).

The model makes predictions for *every position* in the `block_size`. The loss function then compares all these predictions with the actual next words in `y`. This way, a single block of data provides `block_size` individual training examples for the model to learn from. The "context" for predicting `y[t]` is `x[0...t]`.

## 4. Model Definition (Simplified nanoGPT Style)

Now we'll build a simplified version of the GPT model.

In [7]:
# Define the GPT-2 model configuration using transformers library
config = GPT2Config(
    vocab_size=vocab_size,
    n_positions=block_size,  # Max sequence length
    n_embd=n_embd,
    n_layer=n_layer,
    n_head=n_head,
    resid_pdrop=dropout,
    embd_pdrop=dropout,
    attn_pdrop=dropout,
    bos_token_id=None,  # No beginning of sentence token for word model
    eos_token_id=None   # No end of sentence token for word model
)

# Instantiate the model
model = GPT2LMHeadModel(config)
model.to(device)

# Print the number of parameters in the model
print(f"{sum(p.numel() for p in model.parameters())/1e6:.2f} M parameters")

0.81 M parameters


## 5. Training the Model

### 5.1 Optimizer
We'll use the AdamW optimizer, a common choice for training transformer models.

In [8]:
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

### 5.2 Loss Estimation
We need a function to estimate the loss on the training and validation sets without calculating gradients, which is useful for monitoring training progress.

In [9]:
@torch.no_grad() # Decorator to disable gradient calculation
def estimate_loss():
    out = {}
    model.eval() # Set model to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            outputs = model(X, labels=Y)
            loss = outputs.loss
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # Set model back to training mode
    return out

### 5.3 Training Loop
This is the main loop where the model learns. For each iteration, we:
1. Get a batch of data.
2. Perform a forward pass (get model predictions).
3. Calculate the loss (how wrong the predictions are).
4. Perform a backward pass (calculate gradients).
5. Update the model's parameters using the optimizer.

In [10]:
print(f"Training on {device}...")

for iter_num in range(max_iters):

    # Every once in a while, evaluate the loss on train and val sets
    if iter_num % eval_interval == 0 or iter_num == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch of data
    xb, yb = get_batch('train')

    # Evaluate the loss
    outputs = model(xb, labels=yb)
    loss = outputs.loss
    
    # Backward pass and optimization step
    optimizer.zero_grad(set_to_none=True) # Zero out gradients from previous step
    loss.backward() # Compute gradients for the current batch
    optimizer.step() # Update model parameters

print("Training finished!")

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Training on cpu...
step 0: train loss 4.8432, val loss 4.8423
step 500: train loss 0.1071, val loss 6.8329
step 1000: train loss 0.0881, val loss 7.4115
step 1500: train loss 0.0837, val loss 7.8524
step 2000: train loss 0.0821, val loss 8.1622
step 2500: train loss 0.0795, val loss 8.4942
step 3000: train loss 0.0800, val loss 8.7935
step 3500: train loss 0.0789, val loss 8.9024
step 4000: train loss 0.0810, val loss 8.9835
step 4500: train loss 0.0770, val loss 9.2195
step 4999: train loss 0.0838, val loss 9.3415
Training finished!


#### Explanation: The Overfitting Issue

**Overfitting** is a common problem in machine learning. It occurs when a model learns the training data *too well*, including its noise and specific details, to the point where it performs poorly on new, unseen data (like our validation set or real-world data).

Imagine a student who memorizes the answers to all the questions in a specific textbook (the training data) but doesn't understand the underlying concepts. When given a new exam with slightly different questions (the validation or test data), the student would likely fail. This is analogous to an overfitted model.

In our case, since the training data (the first part of the text) is very different from the testing data (the second part), the LLM cannot learn to predict the testing data from training. 

**Signs of Overfitting:**
-   **Training loss continues to decrease, while validation loss starts to increase or plateaus.** This is a classic indicator. It means the model is getting better at fitting the training data but worse (or no better) at generalizing to unseen data.
-   The model might produce excellent results on examples it has seen during training but generate nonsensical or poor outputs for new inputs.

**(Optianal) How `dropout` helps (as a regularization technique):**
Dropout is a simple yet effective regularization technique to combat overfitting in neural networks. Here's how it works during training:
1.  At each training step, for every neuron in a layer where dropout is applied, it is "dropped out" (i.e., temporarily removed from the network) with a certain probability (`dropout` rate).
2.  This means the neuron, along with all its incoming and outgoing connections, is ignored during the forward pass and backward pass for that particular training iteration.
3.  The choice of which neurons to drop is random for each training iteration and each input example.

**Why does this help?**
-   **Prevents Co-adaptation:** Neurons become less reliant on specific other neurons because they can't be sure which ones will be active. This encourages each neuron to learn more robust features that are useful in conjunction with different random subsets of other neurons.
-   **Ensemble Effect (Implicit):** Training with dropout can be seen as training a large number of thinned networks (networks with different subsets of neurons). At test time (when dropout is turned off), using the full network can be viewed as an approximation of averaging the predictions of these many thinned networks.

In our configuration, we have `dropout = 0.0`. This means dropout is currently turned off. If you observe overfitting (e.g., validation loss increasing significantly while training loss is very low), you could try increasing the `dropout` rate (e.g., to 0.1 or 0.2) to see if it improves generalization. Note that dropout is typically applied during training only and is turned off during evaluation or inference (which `model.eval()` and `model.train()` handle for relevant layers in PyTorch).

## 6. Generating Text

Now that our model is trained, let's use it to generate some text. We'll start with a context (e.g., a newline word) and ask the model to predict the next words sequentially.

In [11]:
# Generation function - create a simple generation loop
print("Generating text...")
model.eval()

# Start with the first few words from our training data
start_words = block_size
text_test=" ".join(decode(train_data[:start_words].tolist()).split()[:start_words])
text_test="sir: yet you must not think to fob off our disgrace with a tale"
context = torch.tensor(encode(text_test), 
                      dtype=torch.long, device=device).unsqueeze(0)

print(f"Starting context: '{decode(context[0].tolist())}'")

# Simple generation loop
generated = context
with torch.no_grad():
    for _ in range(50):  # Generate 50 more words
        # Get predictions for the current context
        # Crop to last block_size tokens if context gets too long
        context_cropped = generated[:, -block_size:] if generated.size(1) > block_size else generated
        
        outputs = model(context_cropped)
        logits = outputs.logits
        
        # Get the logits for the last token
        logits = logits[:, -1, :]  # (batch_size, vocab_size)
        
        # Apply temperature for sampling
        temperature = 1
        logits = logits / temperature
        
        # Sample from the distribution
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        
        # Append to generated sequence
        generated = torch.cat([generated, next_token], dim=1)

# Decode the generated text
generated_text = decode(generated[0].tolist())
print("--- Generated Text ---")
print(generated_text)
print("--- Original Text ---")
print('''First Citizen:
Well, I'll hear it, sir: yet you must not think to
fob off our disgrace with a tale: but, an 't please
you, deliver.

MENENIUS:
There was a time when all the body's members
Rebell'd against the belly, thus accused it:
That only like a gulf it did remain''')
print("----------------------")

Generating text...
Starting context: 'sir: yet you must not think to fob off our disgrace with a tale'
--- Generated Text ---
sir: yet you must not think to fob off our disgrace with a tale but an t you deliver menenius there a more first: was time all body s rebell d the, ' against belly thus it remain ' midst ' accused:, accused:, ' hear,,,,,,,,: was ' hear,
--- Original Text ---
First Citizen:
Well, I'll hear it, sir: yet you must not think to
fob off our disgrace with a tale: but, an 't please
you, deliver.

MENENIUS:
There was a time when all the body's members
Rebell'd against the belly, thus accused it:
That only like a gulf it did remain
----------------------


#### Explanation: Temperature and Sampling in Text Generation

When our model generates text, it doesn't just pick the single most probable next word. Instead, it often uses a technique called **sampling**, sometimes influenced by a parameter called **temperature**.

1.  **Logits and Probabilities:**
    After processing the input context, the model outputs **logits**. These are raw, unnormalized scores for each word in the vocabulary. To turn these logits into probabilities (i.e., values between 0 and 1 that sum up to 1), we apply a **softmax function**. The word with the highest probability is the one the model thinks is most likely to come next.

2.  **Sampling:**
    Instead of always picking the word with the absolute highest probability (which is called greedy decoding and can lead to repetitive or boring text), we can *sample* from the probability distribution. This means a word that is slightly less probable might still be chosen, introducing randomness and creativity into the generated text.
    `torch.multinomial(probs, num_samples=1)` is the function that performs this sampling. It takes the probability distribution (`probs`) and picks one word based on these probabilities.

3.  **Temperature:**
    Temperature is a hyperparameter that controls the randomness of the sampling. It's applied to the logits *before* the softmax function.
    -   **Low Temperature (e.g., < 1.0, closer to 0):** Dividing logits by a small number makes the differences between them larger. When softmax is applied, the probability distribution becomes "sharper" or "spikier." The model becomes more confident and deterministic, tending to pick the most likely words. This can lead to more focused and coherent text, but potentially less creative.
    -   **High Temperature (e.g., > 1.0):** Dividing logits by a larger number makes the differences between them smaller. The resulting softmax probability distribution becomes "flatter" or "softer." The model becomes less confident, and less likely words have a higher chance of being selected. This increases randomness and creativity, but can also lead to more errors or nonsensical text.
    -   **Temperature = 1.0:** This is the standard setting where the original probabilities are used for sampling.

In our generation code:
`logits = logits / temperature`
`probs = F.softmax(logits, dim=-1)`

By adjusting the `temperature` value, you can control the trade-off between coherence and creativity in the generated text. For this tutorial, we use a temperature of 1.0, which means we sample directly from the model's learned probability distribution.