# NanoGPT Implementation - Training on Pride and Prejudice

This notebook implements a small character-level GPT model from scratch in PyTorch and trains it on Jane Austen's "Pride and Prejudice". Run all cells in order and the notebook will:
1. Download Pride and Prejudice from Project Gutenberg
2. Train a small GPT model (roughly 5-15 minutes on a GPU)
3. Generate Jane Austen-style text

No additional data or setup needed!

In [1]:
# Install required packages
!pip install torch requests



In [2]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import requests
from typing import List, Dict, Tuple

# Check if GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')

Using device: cuda


## Configuration - Optimized for Pride and Prejudice

In [3]:
class Config:
    # Data parameters
    batch_size = 32
    block_size = 128  # Increased for better context in novel text

    # Training parameters
    max_iters = 3000  # Increased for better quality
    eval_interval = 100
    learning_rate = 3e-4
    eval_iters = 100

    # Model parameters
    n_embd = 256     # Increased for richer representations
    n_head = 8       # More attention heads
    n_layer = 6      # More layers for complexity
    dropout = 0.2

    # Device configuration
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Dataset URL - Pride and Prejudice from Project Gutenberg
    dataset_url = 'https://www.gutenberg.org/files/1342/1342-0.txt'

config = Config()
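
As a rough sanity check on these settings, the derived quantities can be worked out by hand. The sketch below uses only values copied from `Config`; the names `head_size`, `tokens_per_step`, and `total_tokens` are illustrative and not part of the notebook.

```python
# Values copied from the Config class above
batch_size, block_size = 32, 128
max_iters = 3000
n_embd, n_head = 256, 8

# Each attention head works on an equal slice of the embedding,
# so n_embd must divide evenly by n_head.
assert n_embd % n_head == 0
head_size = n_embd // n_head

# Every optimizer step sees batch_size sequences of block_size characters.
tokens_per_step = batch_size * block_size
total_tokens = tokens_per_step * max_iters

print(f"head_size = {head_size}")
print(f"characters per step = {tokens_per_step}")
print(f"characters over training = {total_tokens}")
```

Since the training split is a few hundred thousand characters, 3000 steps amount to many randomly sampled passes over the same text, which is one reason a dropout of 0.2 is a plausible choice here.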

## Utility Functions

In [4]:
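def download_dataset(url: str) -> str:
    """Download dataset from URL and return the book text without Gutenberg boilerplate"""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail loudly on HTTP errors instead of training on an error page
    text = response.text

    # Find the start and end of the actual book content
    start = text.find("PRIDE AND PREJUDICE")
    end = text.find("End of the Project Gutenberg")

    # str.find returns -1 when a marker is missing; fall back to the full text on that side
    start = max(start, 0)
    if end == -1:
        end = len(text)

    # Extract just the book content
    return text[start:end]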

def create_vocab(text: str) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Create vocabulary mappings from text"""
    chars = sorted(list(set(text)))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    return stoi, itos

def encode(s: str, stoi: Dict[str, int]) -> List[int]:
    """Encode string to list of integers"""
    return [stoi[c] for c in s]

def decode(l: List[int], itos: Dict[int, str]) -> str:
    """Decode list of integers to string"""
    return ''.join([itos[i] for i in l])

@torch.no_grad()
def estimate_loss(model, get_batch, config):
    """Estimate loss on train and validation sets"""
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(config.eval_iters)
        for k in range(config.eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
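
A quick round-trip check of the vocabulary helpers above, on a toy string rather than the novel. This is a minimal sketch: `create_vocab` is re-declared in compact form so the snippet runs on its own, and the sample text is an assumption chosen for illustration.

```python
def create_vocab(text):
    # One integer id per distinct character, in sorted order
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

text = "It is a truth"
stoi, itos = create_vocab(text)

# Encode to ids, then decode back to the original string
ids = [stoi[c] for c in text]
restored = ''.join(itos[i] for i in ids)

print(len(stoi), "unique characters")
print(ids)
print(restored == text)

# Characters that never appeared in the text have no id,
# so encoding them would raise KeyError.
print('z' in stoi)
```

Because the vocabulary is built from the training text alone, any prompt passed to `encode` later must use only characters that occur in the book.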

## Data Module

In [5]:
class DataModule:
    def __init__(self, config):
        self.config = config
        self.train_data = None
        self.val_data = None
        self.stoi = None
        self.itos = None
        self.vocab_size = None

    def prepare_data(self):
        """Download and prepare the dataset"""
        print('Downloading Pride and Prejudice...')
        text = download_dataset(self.config.dataset_url)
        print(f'Dataset size: {len(text)} characters')

        # Show a sample of the text
        print('\nSample of the text:')
        print(text[:500], '...\n')

        # Create vocabulary
        self.stoi, self.itos = create_vocab(text)
        self.vocab_size = len(self.stoi)
        print(f'Vocabulary size: {self.vocab_size} unique characters')

        # Encode full text
        data = torch.tensor(encode(text, self.stoi), dtype=torch.long)

        # Split into train and validation
        n = int(0.9 * len(data))
        self.train_data = data[:n]
        self.val_data = data[n:]
        print(f'Split sizes: {len(self.train_data)} train, {len(self.val_data)} validation')

    def get_batch(self, split: str) -> Tuple[torch.Tensor, torch.Tensor]:
        """Generate a small batch of data of inputs x and targets y"""
        data = self.train_data if split == 'train' else self.val_data
        ix = torch.randint(len(data) - self.config.block_size, (self.config.batch_size,))
        x = torch.stack([data[i:i+self.config.block_size] for i in ix])
        y = torch.stack([data[i+1:i+self.config.block_size+1] for i in ix])
        x, y = x.to(self.config.device), y.to(self.config.device)
        return x, y
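
The relationship between `x` and `y` in `get_batch` is easiest to see on plain integers. The sketch below mirrors the same indexing with Python lists instead of tensors; the standalone `get_batch` signature and the toy `data` are assumptions made for illustration.

```python
import random

def get_batch(data, block_size, batch_size, rng):
    # Start offsets leave room for the target, which is shifted one step right.
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

data = list(range(20))  # stand-in for the encoded text
x, y = get_batch(data, block_size=4, batch_size=3, rng=random.Random(0))

for xi, yi in zip(x, y):
    # At every position t, the target yi[t] is the token that follows xi[t].
    print(xi, '->', yi)
```

Each row therefore supplies `block_size` training examples at once: the model predicts `y[t]` from `x[0..t]` for every `t`.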

## Model Architecture
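
Each `Head` in the cell below computes causal scaled dot-product attention. As a minimal sketch of that computation for a single sequence, here is a pure-Python version without batching, dropout, or learned projections. The function name `causal_attention` and the toy inputs are assumptions; the scores are scaled by `head_size ** -0.5`, the standard choice for scaled dot-product attention.

```python
import math

def causal_attention(q, k, v):
    """q, k, v: lists of T vectors, each a list of floats."""
    T, head_size = len(q), len(q[0])
    scale = head_size ** -0.5
    out = []
    for t in range(T):
        # Position t may only attend to positions 0..t (the lower-triangular mask).
        scores = [sum(a * b for a, b in zip(q[t], k[j])) * scale for j in range(t + 1)]
        # Numerically stable softmax over the visible positions
        top = max(scores)
        exps = [math.exp(sc - top) for sc in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the visible value vectors
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = causal_attention(q, k, v)
for row in result:
    print(row)
```

The first output row equals the first value vector, because position 0 can only attend to itself. Later rows are weighted averages over everything seen so far.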

In [6]:
class Head(nn.Module):
    """One head of self-attention"""
    def __init__(self, head_size: int, n_embd: int, block_size: int, dropout: float):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        # Compute attention scores, scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        # Weighted aggregation of values
        v = self.value(x)
        out = wei @ v
        return out

class MultiHeadAttention(nn.Module):
    """Multiple heads of self-attention in parallel"""
    def __init__(self, config, head_size):
        super().__init__()
        self.heads = nn.ModuleList([
            Head(head_size, config.n_embd, config.block_size, config.dropout)
            for _ in range(config.n_head)
        ])
        self.proj = nn.Linear(config.n_embd, config.n_embd)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedForward(nn.Module):
    """Position-wise MLP: linear layer, non-linearity, linear layer"""
    def __init__(self, n_embd: int, dropout: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: communication followed by computation"""
    def __init__(self, config):
        super().__init__()
        head_size = config.n_embd // config.n_head
        self.sa = MultiHeadAttention(config, head_size)
        self.ffwd = FeedForward(config.n_embd, config.dropout)
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.ln2 = nn.LayerNorm(config.n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class GPT(nn.Module):
    """The main GPT language model"""
    def __init__(self, config, vocab_size):
        super().__init__()
        self.config = config

        self.token_embedding_table = nn.Embedding(vocab_size, config.n_embd)
        self.position_embedding_table = nn.Embedding(config.block_size, config.n_embd)
        self.blocks = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.lm_head = nn.Linear(config.n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # Get token and position embeddings
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))
        x = tok_emb + pos_emb

        # Apply transformer blocks and final layer norm
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)

        # Compute loss if targets provided
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        """Generate new tokens from the model (call model.eval() first to disable dropout)"""
        for _ in range(max_new_tokens):
            # Crop context to block_size
            idx_cond = idx[:, -self.config.block_size:]
            # Get predictions
            logits, _ = self(idx_cond)
            # Focus on last time step
            logits = logits[:, -1, :]
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)
            # Sample from distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # Append sampled index to running sequence
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
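
The size of this architecture can be estimated by hand from the layer shapes defined above. The following is a sketch, not part of the notebook: `count_parameters` is a hypothetical helper, and `vocab_size=89` is taken from the vocabulary size the notebook prints for this text.

```python
def count_parameters(vocab_size, block_size, n_embd, n_head, n_layer):
    head_size = n_embd // n_head
    # key, query, and value projections per head, all without bias
    attn_heads = n_head * 3 * n_embd * head_size
    # output projection after concatenating the heads, with bias
    attn_proj = n_embd * n_embd + n_embd
    # two linear layers with biases, hidden width 4 * n_embd
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # two LayerNorms per block, each with a weight and a bias vector
    norms = 2 * 2 * n_embd
    per_block = attn_heads + attn_proj + ffwd + norms

    # token and position embedding tables
    embeddings = vocab_size * n_embd + block_size * n_embd
    final_norm = 2 * n_embd
    lm_head = n_embd * vocab_size + vocab_size
    return n_layer * per_block + embeddings + final_norm + lm_head

total = count_parameters(vocab_size=89, block_size=128, n_embd=256, n_head=8, n_layer=6)
print(f"{total / 1e6:.2f}M parameters")
```

The causal mask is registered as a buffer, not a parameter, so it does not count. Nearly all of the weights sit in the six transformer blocks; the embeddings and output head are small because the vocabulary is only characters.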

## Training

Training takes roughly 5-15 minutes on a GPU. Train and validation loss are printed every 100 iterations.

In [7]:
print('Preparing data...')
data_module = DataModule(config)
data_module.prepare_data()

print('\nInitializing model...')
model = GPT(config, data_module.vocab_size)
model = model.to(config.device)
print(f'Number of parameters: {sum(p.numel() for p in model.parameters())/1e6:.2f}M')

# Create optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)

print('\nStarting training...')
for step in range(config.max_iters):
    # Evaluate loss periodically
    if step % config.eval_interval == 0:
        losses = estimate_loss(model, data_module.get_batch, config)
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch of data
    xb, yb = data_module.get_batch('train')

    # Forward pass, backward pass, and parameter update
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print('Training complete!')

Preparing data...
Downloading Pride and Prejudice...
Dataset size: 708409 characters

Sample of the text:
PRIDE AND PREJUDICE·




Chapter I.]


It is a truth universally acknowledged, that a single man in possession
of a good fortune must be in want of a wife.

However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that he is considered as the rightful
property of some one or other of their daughters.

“My dear Mr. Bennet,” said his lady to him one day, “have you hea ...

Vocabulary size: 89 unique characters
Split sizes: 637568 train, 70841 validation

Initializing model...
Number of parameters: 4.81M

Starting training...
step 0: train loss 4.5917, val loss 4.5880
step 100: train loss 2.4525, val loss 2.4686
step 200: train loss 2.3902, val loss 2.4081
step 300: train loss 2.3446, val loss 2.3692
step 400: train loss 2.2723, val loss 2.2874
step 500: train loss 2.103
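
The step-0 loss in the log above is close to what an untrained model should score. As a small sketch of that check: a model that assigns equal probability to every character has a cross-entropy of ln(vocab_size), measured in nats because `F.cross_entropy` uses the natural logarithm. The `perplexity` helper is an illustrative addition, not part of the notebook.

```python
import math

vocab_size = 89  # from the "Vocabulary size" line in the output above

# Cross-entropy of a model that gives every character equal probability
uniform_loss = math.log(vocab_size)

def perplexity(loss):
    # exp(loss): the effective number of equally likely next characters
    return math.exp(loss)

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"perplexity at val loss 2.2874: {perplexity(2.2874):.2f}")
```

A starting loss slightly above the uniform baseline is expected with randomly initialized weights. The drop to about 2.29 by step 400 means the model has narrowed its guess from 89 characters to roughly ten equally likely candidates.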

## Generate Jane Austen-style Text

Let's see what our trained model can write in the style of Pride and Prejudice!

In [8]:
print('Generating text...\n')
model.eval()  # Disable dropout while sampling
# Start with "It is a truth universally acknowledged" as context
start_text = "It is a truth universally acknowledged"
context = torch.tensor([encode(start_text, data_module.stoi)], dtype=torch.long, device=config.device)

generated_text = decode(
    model.generate(context, max_new_tokens=500)[0].tolist(),
    data_module.itos
)
print(generated_text)

Generating text...

It is a truth universally acknowledged him but
       in his differens would not here; almost it, and, Mr. Darcy characted
her. HThe is verity were father, was less another expect to he lett in
leaving rapets. She purson such and slike of the countrue! I or part for
ever, Miss Darcy warmly to be continued her siters was childned on live
it; and “by sink, you be know hasten rese idensify.”

“The direct well,” said she she same inly my on morning, unly Georsible;
to have conbuded but paces.” Colonel Fitzwitzwish it was nortain
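
The sampling loop that produced this text can be sketched without PyTorch. In the example below, `toy_logits` is a hypothetical stand-in for the trained model, and `random.Random.choices` plays the role of `torch.multinomial`; the crop, softmax, sample, and append steps follow the same order as `generate`.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(context, vocab_size):
    # Hypothetical model: favours the id that follows the last token.
    last = context[-1]
    return [2.0 if i == (last + 1) % vocab_size else 0.0 for i in range(vocab_size)]

def generate(context, max_new_tokens, block_size, vocab_size, rng):
    out = list(context)
    for _ in range(max_new_tokens):
        cond = out[-block_size:]                      # crop context to block_size
        probs = softmax(toy_logits(cond, vocab_size))  # logits -> probabilities
        next_id = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        out.append(next_id)                           # extend the running sequence
    return out

rng = random.Random(0)
sample = generate([0, 1, 2], max_new_tokens=10, block_size=4, vocab_size=5, rng=rng)
print(sample)
```

Because each token is drawn from the full distribution instead of taking the most likely one, repeated runs give different continuations, and unlikely characters are occasionally chosen. That accounts for some of the misspellings in the sample above.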
