## End-to-End Transformer Language Model Implementation

### Introduction: From Tokens to Transformers for Text Generation

**Recap: The Role of Tokenization**

As explored previously with techniques like Byte Pair Encoding (BPE), the first step in processing text for machine learning models is **tokenization**. This breaks raw text into manageable units (tokens). For simplicity in this notebook, we will perform basic **character-level tokenization**, where each unique character in our text becomes a distinct token. These tokens are then mapped to unique numerical IDs. Our main focus here is on building the model that learns from sequences of these IDs: a Language Model.

**Goal: Building a Generative Transformer**

Our objective is to construct a basic Language Model using the **Transformer architecture**. Specifically, we will build a **Decoder-only Transformer**, similar in style to models like GPT. This model learns to predict the next token (character, in our case) in a sequence, given the preceding tokens. By repeatedly predicting the next token and feeding it back into the input, the model can generate new sequences of text, character by character.

**The Transformer Architecture: Key Concepts**

The Transformer, introduced in "Attention Is All You Need" (Vaswani et al., 2017), uses **attention mechanisms** to weigh the importance of different input tokens when processing a particular token, eliminating the need for recurrence used in RNNs/LSTMs. Key components we will implement inline:

1.  **Input Embeddings:** Converting numerical token (character) IDs into dense vector representations.
2.  **Positional Encoding:** Adding information about the position of tokens in the sequence.
3.  **Multi-Head Self-Attention (Masked):** Allowing the model to attend to previous tokens in the sequence to predict the next one. The mask prevents attending to future tokens.
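4.  **Add & Norm Layers:** Residual connections combined with Layer Normalization for stable training. The original paper normalizes after each residual addition (Post-LN); the code in this notebook normalizes before each sub-layer (Pre-LN).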
5.  **Position-wise Feed-Forward Networks:** Applying a non-linear transformation independently to each token representation.
6.  **Decoder Blocks:** Stacking these components multiple times.
7.  **Final Linear Layer & Softmax:** Mapping final representations back to vocabulary scores (logits) and then probabilities.

**This Notebook's Approach:**

We will implement this architecture step-by-step, inline, without defining functions or classes. Each conceptual part will be broken into minimal code blocks with extremely detailed explanations and mathematical formulations (using LaTeX). We will use a small text dataset and model configuration for transparency.

### Step 0: Setup - Libraries, Corpus, Tokenization, Hyperparameters

**Goal:** Prepare the environment by importing PyTorch, defining the text corpus, performing character-level tokenization, and setting the model's configuration (hyperparameters).

#### Step 0.1: Import Libraries

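**Explanation:** We need `torch` for tensor operations, `torch.nn` (`nn`) for layers, `torch.nn.functional` (`F`) for activation and softmax functions, `torch.optim` (`optim`) for the optimizer, `math` for the logarithm used when building the positional encoding, and `os` for creating the output directory when the model is saved.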

In [None]:
# Import necessary libraries
import torch
import torch.nn as nn
from torch.nn import functional as F
import torch.optim as optim
import math
import os

# For reproducibility (optional, but good practice)
torch.manual_seed(1337)

print(f"PyTorch version: {torch.__version__}")
print("Libraries imported.")

PyTorch version: 2.6.0+cu118
Libraries imported.


#### Step 0.2: Define the Training Corpus

**Explanation:** We'll use the same excerpt from "Alice's Adventures in Wonderland" as our training data. This provides a small but somewhat realistic text source.

In [2]:
# Define the raw text corpus for training
corpus_raw = """
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'
So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.
"""

print(f"Training corpus defined (length: {len(corpus_raw)} characters).")

Training corpus defined (length: 593 characters).


#### Step 0.3: Character-Level Tokenization

**Explanation:** We create a vocabulary consisting of all unique characters present in the corpus. We then create mappings: one from each character to a unique integer ID (`char_to_int`) and its inverse (`int_to_char`). The size of this unique character set determines our `vocab_size`.
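
To make the mapping concrete, here is a minimal sketch of the same idea on a toy string. The names `toy_text`, `toy_chars`, `toy_char_to_int`, and `toy_int_to_char` are illustrative assumptions and are not used elsewhere in the notebook.

```python
# Toy string (hypothetical); the notebook applies the same steps to corpus_raw.
toy_text = "hello world"

# The sorted set of unique characters forms the vocabulary.
toy_chars = sorted(set(toy_text))
toy_char_to_int = {ch: i for i, ch in enumerate(toy_chars)}
toy_int_to_char = {i: ch for i, ch in enumerate(toy_chars)}

# Encode the text to IDs, then decode the IDs back to text.
toy_ids = [toy_char_to_int[ch] for ch in toy_text]
toy_decoded = "".join(toy_int_to_char[i] for i in toy_ids)

print(toy_chars)                # 8 unique characters, space sorts first
print(toy_ids)
print(toy_decoded == toy_text)  # the round trip is lossless
```

Because every character in the text is in the vocabulary by construction, encoding followed by decoding returns the original string.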

In [3]:
# Find all unique characters in the raw corpus
chars = sorted(list(set(corpus_raw)))
vocab_size = len(chars)

# Create character-to-integer mapping (encoding)
char_to_int = { ch:i for i,ch in enumerate(chars) }

# Create integer-to-character mapping (decoding)
int_to_char = { i:ch for i,ch in enumerate(chars) }

print(f"Created character vocabulary of size: {vocab_size}")
print(f"Vocabulary: {''.join(chars)}")
# print(f"Char-to-Int mapping: {char_to_int}") # Optional

Created character vocabulary of size: 36
Vocabulary: 
 '(),-.:?ARSWabcdefghiklmnoprstuvwy


#### Step 0.4: Encode the Corpus

**Explanation:** Convert the entire raw text corpus into a sequence of integer token IDs using the `char_to_int` mapping. This numerical sequence is the actual input the model will process.

In [4]:
# Encode the entire corpus into a list of integer IDs
encoded_corpus = [char_to_int[ch] for ch in corpus_raw]

# Convert the list into a PyTorch tensor
full_data_sequence = torch.tensor(encoded_corpus, dtype=torch.long)

print(f"Encoded corpus into a tensor of shape: {full_data_sequence.shape}")
# print(f"First 100 encoded token IDs: {full_data_sequence[:100].tolist()}") # Optional

Encoded corpus into a tensor of shape: torch.Size([593])


#### Step 0.5: Define Hyperparameters

**Explanation:** Set the configuration for our model and training. We use the `vocab_size` determined from the corpus. Other values are kept small for demonstration.
*   `d_model`: Dimension of embeddings and internal representations.
*   `n_heads`: Number of parallel attention calculations.
*   `n_layers`: Number of decoder blocks.
*   `d_ff`: Hidden dimension in the feed-forward networks.
*   `block_size`: Maximum sequence length processed at once.
*   `learning_rate`, `batch_size`, `epochs`: Training control parameters.
*   `device`: Use GPU ('cuda') if available, otherwise CPU.

In [5]:
# Define Model Hyperparameters (using calculated vocab_size)
# vocab_size = vocab_size # Already defined from data
d_model = 64         # Embedding dimension (increased slightly for characters)
n_heads = 4          # Number of attention heads
n_layers = 3         # Number of Transformer blocks
d_ff = d_model * 4   # Dimension of the feed-forward inner layer
block_size = 32      # Maximum context length (sequence length)
# dropout_rate = 0.1 # Omitting dropout layers for inline simplicity

# Define Training Hyperparameters
learning_rate = 3e-4 # Slightly smaller LR often better for AdamW
batch_size = 16      # Process 16 sequences per step
epochs = 5000        # Increase epochs for character-level model to see learning
eval_interval = 500 # How often to print loss

# Device Configuration
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Ensure d_model is divisible by n_heads
assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
d_k = d_model // n_heads # Dimension of keys/queries/values per head

print(f"Hyperparameters defined:")
print(f"  vocab_size: {vocab_size}")
print(f"  d_model: {d_model}")
print(f"  n_heads: {n_heads}")
print(f"  d_k (dim per head): {d_k}")
print(f"  n_layers: {n_layers}")
print(f"  d_ff: {d_ff}")
print(f"  block_size: {block_size}")
print(f"  learning_rate: {learning_rate}")
print(f"  batch_size: {batch_size}")
print(f"  epochs: {epochs}")
print(f"  Using device: {device}")

Hyperparameters defined:
  vocab_size: 36
  d_model: 64
  n_heads: 4
  d_k (dim per head): 16
  n_layers: 3
  d_ff: 256
  block_size: 32
  learning_rate: 0.0003
  batch_size: 16
  epochs: 5000
  Using device: cuda


### Step 1: Data Preparation for Training

**Goal:** Structure the encoded data (`full_data_sequence`) into input (`x`) and target (`y`) pairs suitable for training the next-token prediction task.

#### Step 1.1: Create Input (x) and Target (y) Pairs

**Explanation:** The model needs to learn `P(token_i | token_0, ..., token_{i-1})`. We achieve this by creating sequences of length `block_size`. For each input sequence `x` taken from `full_data_sequence[i : i+block_size]`, the corresponding target sequence `y` holds the next token for each position in `x`, i.e., `full_data_sequence[i+1 : i+block_size+1]`. A single pair therefore supplies `block_size` training examples, one per position. We extract all possible overlapping sequences from our encoded corpus.
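
The shift-by-one relationship is easier to see on a tiny example. The sketch below uses plain characters instead of IDs and a hypothetical `toy_block_size` of 3; the slicing pattern is the same as the one described above.

```python
# Stand-in for the encoded corpus (characters instead of integer IDs).
toy_tokens = list("abcdefg")
toy_block_size = 3  # hypothetical small context length

pairs = []
# Stop early enough that every target slice has the full length.
for i in range(len(toy_tokens) - toy_block_size):
    x = toy_tokens[i : i + toy_block_size]
    y = toy_tokens[i + 1 : i + toy_block_size + 1]  # x shifted by one position
    pairs.append((x, y))

for x, y in pairs:
    print("".join(x), "->", "".join(y))
```

With 7 tokens and a block size of 3 this yields 4 pairs, which mirrors how a corpus of 593 tokens and a block size of 32 yields 561 pairs.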

In [6]:
# Create lists to hold all possible input (x) and target (y) sequences of length block_size
all_x = []
all_y = []

# Iterate through the encoded corpus tensor to extract overlapping sequences
# We need to stop early enough so that we can always get a target sequence of the same length
num_total_tokens = len(full_data_sequence)
for i in range(num_total_tokens - block_size):
    # Extract the input sequence chunk of length block_size
    x_chunk = full_data_sequence[i : i + block_size]
    # Extract the target sequence chunk (shifted one position to the right)
    y_chunk = full_data_sequence[i + 1 : i + block_size + 1]
    
    # Append the chunks to our lists
    all_x.append(x_chunk)
    all_y.append(y_chunk)

# Stack the lists of tensors into single large tensors
# train_x will have shape (num_sequences, block_size)
# train_y will have shape (num_sequences, block_size)
train_x = torch.stack(all_x)
train_y = torch.stack(all_y)

num_sequences_available = train_x.shape[0]
print(f"Created {num_sequences_available} overlapping input/target sequence pairs.")
print(f"Shape of train_x: {train_x.shape}")
print(f"Shape of train_y: {train_y.shape}")

# Optional: Display a sample input/target pair and decode it
# sample_idx = 0
# sample_x_ids = train_x[sample_idx].tolist()
# sample_y_ids = train_y[sample_idx].tolist()
# sample_x_chars = ''.join([int_to_char[id] for id in sample_x_ids])
# sample_y_chars = ''.join([int_to_char[id] for id in sample_y_ids])
# print(f"\nSample Input x[{sample_idx}] IDs:  {sample_x_ids}")
# print(f"Sample Target y[{sample_idx}] IDs: {sample_y_ids}")
# print(f"Sample Input x[{sample_idx}] Chars: '{sample_x_chars}'")
# print(f"Sample Target y[{sample_idx}] Chars:'{sample_y_chars}'")

Created 561 overlapping input/target sequence pairs.
Shape of train_x: torch.Size([561, 32])
Shape of train_y: torch.Size([561, 32])


#### Step 1.2: Batching Strategy (Simplified: Random Sampling)

**Explanation:** Instead of implementing a complex data loader, for each training step we will simply select `batch_size` random indices from our available sequences (`0` to `num_sequences_available - 1`). We then use these indices to grab the corresponding input (`xb`) and target (`yb`) sequences from `train_x` and `train_y`. This simulates drawing random batches from the dataset.

In [7]:
# Check if we have enough sequences for the desired batch size
if num_sequences_available < batch_size:
    print(f"Warning: Number of sequences ({num_sequences_available}) is less than batch size ({batch_size}). Adjusting batch size.")
    batch_size = num_sequences_available

print(f"Data ready for training. Will sample batches of size {batch_size} randomly.")

Data ready for training. Will sample batches of size 16 randomly.


### Step 2: Model Component Initialization

**Goal:** Initialize the learnable parameters for all layers of our Transformer model. Each layer is created as an instance of a `torch.nn` module and moved to the target `device`.

#### Step 2.1: Token Embedding Layer

**Explanation:** Maps integer token IDs (character IDs in our case) to dense vectors. Input `(B, T)` -> Output `(B, T, C)` where `B`=batch, `T`=time/sequence length, `C`=`d_model`.

In [8]:
# Initialize the token embedding table (lookup table)
token_embedding_table = nn.Embedding(vocab_size, d_model).to(device)

print(f"Initialized Token Embedding Layer (Vocab: {vocab_size}, Dim: {d_model}). Device: {device}")

Initialized Token Embedding Layer (Vocab: 36, Dim: 64). Device: cuda


#### Step 2.2: Positional Encoding Matrix

**Explanation:** Creates fixed (non-learned) vectors that encode position information using sine and cosine functions of varying frequencies. This matrix `(1, block_size, d_model)` will be added to the token embeddings.
Formulas:
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right) $$ 
$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right) $$
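
As a quick numerical check of the two formulas, the sketch below evaluates them directly in plain Python for a toy model dimension of 4. The names `toy_d_model`, `toy_positions`, and `pe` are assumptions made for this illustration only.

```python
import math

toy_d_model = 4    # hypothetical small embedding dimension
toy_positions = 3  # number of positions to encode

# pe[pos][j] holds the encoding value for position pos and dimension j.
pe = [[0.0] * toy_d_model for _ in range(toy_positions)]
for pos in range(toy_positions):
    for i in range(toy_d_model // 2):
        angle = pos / (10000 ** (2 * i / toy_d_model))
        pe[pos][2 * i] = math.sin(angle)      # even dimension
        pe[pos][2 * i + 1] = math.cos(angle)  # odd dimension

for row in pe:
    print([round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1, since $\sin(0) = 0$ and $\cos(0) = 1$. Higher dimension pairs use a larger denominator and therefore change more slowly with position.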

In [9]:
# Precompute the Sinusoidal Positional Encoding matrix
print("Step 2.2: Creating Positional Encoding matrix...")

# Matrix to store encodings: Shape (block_size, d_model)
positional_encoding = torch.zeros(block_size, d_model, device=device)

# Position indices (0 to block_size-1): Shape (block_size, 1)
position = torch.arange(0, block_size, dtype=torch.float, device=device).unsqueeze(1)

# Dimension indices (0, 2, 4, ...): Shape (d_model/2)
div_term_indices = torch.arange(0, d_model, 2, dtype=torch.float, device=device)
# Denominator term: 1 / (10000^(2i / d_model))
div_term = torch.exp(div_term_indices * (-math.log(10000.0) / d_model))

# Calculate sine for even dimensions
positional_encoding[:, 0::2] = torch.sin(position * div_term)

# Calculate cosine for odd dimensions
positional_encoding[:, 1::2] = torch.cos(position * div_term)

# Add batch dimension: Shape (1, block_size, d_model)
positional_encoding = positional_encoding.unsqueeze(0)

print(f"  Positional Encoding matrix created with shape: {positional_encoding.shape}. Device: {device}")

Step 2.2: Creating Positional Encoding matrix...
  Positional Encoding matrix created with shape: torch.Size([1, 32, 64]). Device: cuda


#### Step 2.3: Transformer Block Components Initialization

**Explanation:** Initialize all components needed for the `n_layers` decoder blocks. We store them in Python lists, where the index corresponds to the layer number (0 to `n_layers-1`).

In [10]:
print(f"Step 2.3: Initializing components for {n_layers} Transformer layers...")

# Lists to store layers for each Transformer block
layer_norms_1 = []      # LayerNorm applied before MHA (Pre-LN, see Step 4.3)
layer_norms_2 = []      # LayerNorm applied before FFN (Pre-LN, see Step 4.3)
mha_qkv_linears = []    # Combined Linear layer for Q, K, V projections
mha_output_linears = [] # Output Linear layer for MHA
ffn_linear_1 = []       # First linear layer in FFN
ffn_linear_2 = []       # Second linear layer in FFN

# Loop through the number of layers
for i in range(n_layers):
    # Layer Normalization 1 (applied to the block input before MHA)
    ln1 = nn.LayerNorm(d_model).to(device)
    layer_norms_1.append(ln1)

    # Multi-Head Attention: Combined QKV projection layer
    qkv_linear = nn.Linear(d_model, 3 * d_model, bias=False).to(device) # Often bias=False here
    mha_qkv_linears.append(qkv_linear)

    # Multi-Head Attention: Output projection layer
    output_linear = nn.Linear(d_model, d_model).to(device)
    mha_output_linears.append(output_linear)

    # Layer Normalization 2 (applied before the FFN)
    ln2 = nn.LayerNorm(d_model).to(device)
    layer_norms_2.append(ln2)
    
    # Position-wise Feed-Forward Network: First linear layer
    lin1 = nn.Linear(d_model, d_ff).to(device)
    ffn_linear_1.append(lin1)
    
    # Position-wise Feed-Forward Network: Second linear layer
    lin2 = nn.Linear(d_ff, d_model).to(device)
    ffn_linear_2.append(lin2)
    
    print(f"  Initialized components for Layer {i+1}/{n_layers}.")

print(f"Finished initializing components for {n_layers} layers.")

Step 2.3: Initializing components for 3 Transformer layers...
  Initialized components for Layer 1/3.
  Initialized components for Layer 2/3.
  Initialized components for Layer 3/3.
Finished initializing components for 3 layers.


#### Step 2.4: Final Layers Initialization

**Explanation:** Initialize the final Layer Normalization applied after the last block and the final Linear layer that maps the Transformer's output back to vocabulary logits.

In [11]:
print("Step 2.4: Initializing final LayerNorm and Output layers...")

# Final Layer Normalization
final_layer_norm = nn.LayerNorm(d_model).to(device)
print(f"  Initialized Final LayerNorm. Device: {device}")

# Final Linear Layer (language modeling head)
output_linear_layer = nn.Linear(d_model, vocab_size).to(device)
print(f"  Initialized Output Linear Layer (to vocab size {vocab_size}). Device: {device}")

Step 2.4: Initializing final LayerNorm and Output layers...
  Initialized Final LayerNorm. Device: cuda
  Initialized Output Linear Layer (to vocab size 36). Device: cuda


### Step 3: Defining the Forward Pass (Inline - Conceptual Blocks)

**Goal:** Detail the sequence of operations constituting the Transformer's forward pass. These conceptual blocks will be executed directly within the training and generation loops.

#### Step 3.1: Input Embedding + Positional Encoding

**Explanation:** Convert input token IDs `(B, T)` to embeddings `(B, T, C)` and add positional information. Output `x` has shape `(B, T, C)`.
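
The addition relies on broadcasting: the positional matrix has a batch dimension of 1 and is reused for every sequence in the batch. The following sketch shows the shapes involved, using toy sizes and a random stand-in for the sinusoidal matrix; all names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_vocab, toy_C, toy_block = 5, 8, 6    # hypothetical toy sizes
emb = nn.Embedding(toy_vocab, toy_C)
pos = torch.randn(1, toy_block, toy_C)   # stand-in for the positional matrix

ids = torch.tensor([[0, 3, 1, 4], [2, 2, 0, 1]])  # (B=2, T=4)
B, T = ids.shape

tok = emb(ids)           # (B, T, C): one vector per token ID
x = tok + pos[:, :T, :]  # (1, T, C) broadcasts across the batch dimension

print(tok.shape, pos[:, :T, :].shape, x.shape)
```

Slicing to `:T` lets the same precomputed matrix serve sequences shorter than the maximum context length, which matters during generation when the context starts with a single token.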

In [12]:
print("Conceptual Step 3.1 defined (Embedding + Positional Encoding). Executed in loops.")

Conceptual Step 3.1 defined (Embedding + Positional Encoding). Executed in loops.


#### Step 3.2: Transformer Blocks Loop (Conceptual)

**Explanation:** Outline the operations inside the loop that iterates `n_layers` times.

##### Step 3.2.1: Masked Multi-Head Self-Attention (Conceptual)

**Explanation:** Allows each token to attend to previous tokens (including itself). 
1. Project input `x` `(B, T, C)` to Q, K, V `(B, n_heads, T, d_k)`.
2. Compute scaled dot-product attention: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$ where $M$ is the causal mask, with $0$ on and below the diagonal and $-\infty$ strictly above it, so that position $t$ cannot attend to any later position.
3. Concatenate heads and project back to `(B, T, C)`.
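
The effect of the causal mask can be checked on a single head with random inputs. This is a minimal sketch with toy sizes, not the notebook's multi-layer code: the QKV projection and head concatenation are left out so that only the masking and softmax are visible.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_k = 4, 8  # hypothetical toy sequence length and head dimension
q = torch.randn(1, 1, T, d_k)
k = torch.randn(1, 1, T, d_k)
v = torch.randn(1, 1, T, d_k)

# Scaled dot-product scores: (1, 1, T, T)
scores = (q @ k.transpose(-2, -1)) / (d_k ** 0.5)

# Lower-triangular mask: 1 where attention is allowed, 0 for future positions.
mask = torch.tril(torch.ones(T, T)).view(1, 1, T, T)
scores = scores.masked_fill(mask == 0, float("-inf"))

weights = F.softmax(scores, dim=-1)  # each row sums to 1
out = weights @ v                    # (1, 1, T, d_k)

print(weights[0, 0])
```

Entries above the diagonal come out as exactly zero because $e^{-\infty} = 0$. The first position can only attend to itself, so its output equals its own value vector.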

In [13]:
print("Conceptual Step 3.2.1 defined (Multi-Head Attention). Executed in layer loop.")

Conceptual Step 3.2.1 defined (Multi-Head Attention). Executed in layer loop.


##### Step 3.2.2: Add & Norm 1 (Post-Attention) (Conceptual)

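**Explanation:** A residual connection (`x + AttentionOutput`) is combined with Layer Normalization. In the original Post-LN layout the normalization follows the addition; the loops in Steps 4.3 and 5.2 use the Pre-LN layout instead, normalizing `x` before attention and adding the attention output to the un-normalized `x`. $$ \text{LayerNorm}(x) = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $$ where $\mu, \sigma^2$ are mean/variance over the feature dimension $C$, and $\gamma, \beta$ are learnable scale/shift.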
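
The LayerNorm formula can be reproduced by hand and compared against `nn.LayerNorm`. The sketch below assumes a toy feature size of 8; note that the variance is the biased estimate (dividing by $C$), which is what `nn.LayerNorm` uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C = 8                     # hypothetical toy feature dimension
x = torch.randn(2, 3, C)  # (B, T, C)
ln = nn.LayerNorm(C)      # gamma starts at 1, beta at 0

# Statistics are computed per token, over the feature dimension only.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)

manual = ln.weight * (x - mu) / torch.sqrt(var + ln.eps) + ln.bias

print(torch.allclose(manual, ln(x), atol=1e-5))
```

Each token is normalized independently of the other tokens and of the other sequences in the batch, which is why LayerNorm behaves identically in training and evaluation mode.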

In [14]:
print("Conceptual Step 3.2.2 defined (Add & Norm 1). Executed in layer loop.")

Conceptual Step 3.2.2 defined (Add & Norm 1). Executed in layer loop.


##### Step 3.2.3: Position-wise Feed-Forward Network (FFN) (Conceptual)

**Explanation:** Two linear transformations with a non-linear activation (ReLU) in between, applied independently at each position `t`. $$ \text{FFN}(x) = \text{Linear}_2(\text{ReLU}(\text{Linear}_1(x))) $$
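
"Applied independently at each position" means that the result for position `t` depends only on the vector at position `t`. A small sketch with toy sizes (all names are illustrative) confirms this by comparing the full-tensor result with the result for a single position.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
C, toy_d_ff = 8, 32  # hypothetical toy dimensions
lin1 = nn.Linear(C, toy_d_ff)
lin2 = nn.Linear(toy_d_ff, C)

x = torch.randn(2, 5, C)  # (B, T, C)

full = lin2(F.relu(lin1(x)))              # FFN over the whole tensor
single = lin2(F.relu(lin1(x[:, 2, :])))   # FFN over position t=2 only

print(full.shape)
print(torch.allclose(full[:, 2, :], single, atol=1e-5))
```

Information is mixed across positions only in the attention step; the FFN transforms each token representation on its own.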

In [15]:
print("Conceptual Step 3.2.3 defined (Feed-Forward Network). Executed in layer loop.")

Conceptual Step 3.2.3 defined (Feed-Forward Network). Executed in layer loop.


##### Step 3.2.4: Add & Norm 2 (Post-FFN) (Conceptual)

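**Explanation:** Second residual connection around the FFN, again paired with Layer Normalization. In the Pre-LN layout used by the code, `layer_norms_2` is applied to the FFN input and the FFN output is added to the un-normalized input (`x + FFNOutput`). The result becomes the input `x` for the next layer.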

In [16]:
print("Conceptual Step 3.2.4 defined (Add & Norm 2). Executed in layer loop.")

Conceptual Step 3.2.4 defined (Add & Norm 2). Executed in layer loop.


#### Step 3.3: Final Layers (Conceptual)

**Explanation:** Apply final LayerNorm to the output of the last block, then project to vocabulary logits `(B, T, V)`.

In [17]:
print("Conceptual Step 3.3 defined (Final Layers). Executed after layer loop.")

Conceptual Step 3.3 defined (Final Layers). Executed after layer loop.


### Step 4: Training the Model (Inline Loop)

**Goal:** Iteratively adjust model parameters to minimize the prediction error (loss).

#### Step 4.1: Define Loss Function

**Explanation:** Use Cross-Entropy Loss, suitable for multi-class classification (predicting the next character ID). It requires logits `(N, V)` and targets `(N)`. We'll reshape `(B, T, V)` -> `(B*T, V)` and `(B, T)` -> `(B*T)`.
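
The reshape simply treats every position of every sequence as one classification example. The sketch below, on random toy tensors, checks that the flattened `nn.CrossEntropyLoss` equals the mean negative log-probability of the target tokens computed by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 3, 5  # hypothetical toy batch, sequence, and vocabulary sizes
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

# Flatten (B, T, V) -> (B*T, V) and (B, T) -> (B*T,) before the loss.
loss = nn.CrossEntropyLoss()(logits.view(B * T, V), targets.view(B * T))

# Manual version: pick the log-probability of each target and average.
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs.gather(-1, targets.unsqueeze(-1)).mean()

print(loss.item(), manual.item())
```

A model that assigns equal probability to all $V$ tokens has a loss of $\ln V$. For a vocabulary of 36 that is about 3.58, which gives a rough reference point for the loss of an untrained model.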

In [18]:
# Define the loss function
criterion = nn.CrossEntropyLoss()

print(f"Step 4.1: Loss function defined: {type(criterion).__name__}")

Step 4.1: Loss function defined: CrossEntropyLoss


#### Step 4.2: Define Optimizer

**Explanation:** Use AdamW optimizer. Gather *all* learnable parameters from the initialized layers into a single list for the optimizer to manage.
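
As a sanity check on what the optimizer receives, the number of parameter tensors and scalar parameters can be worked out from the hyperparameters in Step 0.5 and the layer definitions in Step 2. The arithmetic below is a sketch under those settings; the tuple layout `(tensors, scalars)` is just a convention chosen for this example.

```python
vocab_size, d_model, d_ff, n_layers = 36, 64, 256, 3

# Each entry is (number of parameter tensors, number of scalar parameters).
embedding = (1, vocab_size * d_model)
layer_norm = (2, 2 * d_model)                    # weight and bias
qkv_linear = (1, d_model * 3 * d_model)          # bias=False
out_linear = (2, d_model * d_model + d_model)
ffn_1 = (2, d_model * d_ff + d_ff)
ffn_2 = (2, d_ff * d_model + d_model)
lm_head = (2, d_model * vocab_size + vocab_size)

per_layer = [layer_norm, qkv_linear, out_linear, layer_norm, ffn_1, ffn_2]
outside_layers = [embedding, layer_norm, lm_head]  # includes the final LayerNorm

n_tensors = n_layers * sum(t for t, _ in per_layer) + sum(t for t, _ in outside_layers)
n_scalars = n_layers * sum(s for _, s in per_layer) + sum(s for _, s in outside_layers)

print(n_tensors)  # 38
print(n_scalars)  # 154148
```

The positional encoding does not appear in this count because it is a fixed tensor rather than a learnable parameter.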

In [19]:
# Gather all model parameters requiring gradients
all_model_parameters = list(token_embedding_table.parameters())
for i in range(n_layers):
    all_model_parameters.extend(list(layer_norms_1[i].parameters()))
    all_model_parameters.extend(list(mha_qkv_linears[i].parameters()))
    all_model_parameters.extend(list(mha_output_linears[i].parameters()))
    all_model_parameters.extend(list(layer_norms_2[i].parameters()))
    all_model_parameters.extend(list(ffn_linear_1[i].parameters()))
    all_model_parameters.extend(list(ffn_linear_2[i].parameters()))
all_model_parameters.extend(list(final_layer_norm.parameters()))
all_model_parameters.extend(list(output_linear_layer.parameters()))

# Define the AdamW optimizer
optimizer = optim.AdamW(all_model_parameters, lr=learning_rate)

print(f"Step 4.2: Optimizer defined: {type(optimizer).__name__}")
print(f"  Managing {len(all_model_parameters)} parameter groups/tensors.")

# Create the lower triangular mask for self-attention ONCE, outside the loop
# Shape: (1, 1, block_size, block_size)
causal_mask = torch.tril(torch.ones(block_size, block_size, device=device)).view(1, 1, block_size, block_size)

Step 4.2: Optimizer defined: AdamW
  Managing 38 parameter groups/tensors.


#### Step 4.3: The Training Loop

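**Explanation:** Iterate for `epochs` iterations. Each iteration here is a single optimization step on one randomly sampled batch, not a full pass over the dataset. In each step: select batch, perform forward pass (executing conceptual steps 3.1-3.3), calculate loss, zero gradients, backpropagate, update weights.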

In [20]:
print(f"\nStep 4.3: Starting Training Loop for {epochs} epochs...")

# List to track losses
losses = []

# Set layers to training mode (e.g., for potential dropout, though omitted here)
# This doesn't really do anything without dropout/batchnorm, but good practice
for i in range(n_layers):
    layer_norms_1[i].train()
    mha_qkv_linears[i].train()
    mha_output_linears[i].train()
    layer_norms_2[i].train()
    ffn_linear_1[i].train()
    ffn_linear_2[i].train()
final_layer_norm.train()
output_linear_layer.train()
token_embedding_table.train()

# Training loop
for epoch in range(epochs):
    
    # --- 1. Batch Selection --- 
    indices = torch.randint(0, num_sequences_available, (batch_size,))
    xb = train_x[indices].to(device) # Input batch shape: (B, T)
    yb = train_y[indices].to(device) # Target batch shape: (B, T)
    
    # --- 2. Forward Pass (Inline execution) --- 
    B, T = xb.shape # B = batch_size, T = block_size
    C = d_model     # Embedding dimension
    
    # Step 3.1: Embedding + Positional Encoding
    token_embed = token_embedding_table(xb) # (B, T, C)
    pos_enc_slice = positional_encoding[:, :T, :] # (1, T, C)
    x = token_embed + pos_enc_slice # (B, T, C)
    
    # Step 3.2: Transformer Blocks
    for i in range(n_layers):
        # Input to this block
        x_input_block = x 
        
        # --- MHA --- 
        # Apply LayerNorm *before* MHA (Pre-LN variant - common)
        x_ln1 = layer_norms_1[i](x_input_block)
        # QKV projection
        qkv = mha_qkv_linears[i](x_ln1) # (B, T, 3*C)
        # Split heads
        qkv = qkv.view(B, T, n_heads, 3 * d_k).permute(0, 2, 1, 3) # (B, n_heads, T, 3*d_k)
        q, k, v = qkv.chunk(3, dim=-1) # (B, n_heads, T, d_k)
        # Scaled dot-product attention
        attn_scores = (q @ k.transpose(-2, -1)) * (d_k ** -0.5) # (B, n_heads, T, T)
        # Apply Causal Mask (use the pre-computed mask sliced to T)
        attn_scores_masked = attn_scores.masked_fill(causal_mask[:,:,:T,:T] == 0, float('-inf'))
        attention_weights = F.softmax(attn_scores_masked, dim=-1) # (B, n_heads, T, T)
        # Attention output
        attn_output = attention_weights @ v # (B, n_heads, T, d_k)
        # Concatenate heads
        attn_output = attn_output.permute(0, 2, 1, 3).contiguous().view(B, T, C) # (B, T, C)
        # Output projection
        mha_result = mha_output_linears[i](attn_output) # (B, T, C)
        # Add & Norm 1 (Residual connection adds output to original input)
        x = x_input_block + mha_result # Residual connection 1
        # Note: We moved LN1 to *before* MHA (Pre-LN)
        
        # --- FFN --- 
        # Input to FFN
        x_input_ffn = x 
        # Apply LayerNorm *before* FFN (Pre-LN variant)
        x_ln2 = layer_norms_2[i](x_input_ffn)
        # FFN layers
        ffn_hidden = ffn_linear_1[i](x_ln2) # (B, T, d_ff)
        ffn_activated = F.relu(ffn_hidden)
        ffn_output = ffn_linear_2[i](ffn_activated) # (B, T, C)
        # Add & Norm 2 (Residual connection adds output to FFN input)
        x = x_input_ffn + ffn_output # Residual connection 2
        # Note: We moved LN2 to *before* FFN (Pre-LN)
        # Output 'x' of this block becomes input 'x_input_block' for the next block
        
    # Step 3.3: Final Layers (After loop)
    # Apply final LayerNorm (Pre-LN style, applied before final projection)
    final_norm_output = final_layer_norm(x) # (B, T, C)
    logits = output_linear_layer(final_norm_output) # (B, T, vocab_size)
    
    # --- 3. Calculate Loss --- 
    B_loss, T_loss, V_loss = logits.shape
    logits_for_loss = logits.view(B_loss * T_loss, V_loss) 
    targets_for_loss = yb.view(B_loss * T_loss)
    loss = criterion(logits_for_loss, targets_for_loss)
    
    # --- 4. Zero Gradients --- 
    optimizer.zero_grad()
    
    # --- 5. Backward Pass --- 
    loss.backward()
    
    # --- 6. Update Parameters --- 
    optimizer.step()
    
    # --- Logging --- 
    current_loss = loss.item()
    losses.append(current_loss)
    if epoch % eval_interval == 0 or epoch == epochs - 1:
        print(f"  Epoch {epoch+1}/{epochs}, Loss: {current_loss:.4f}")

print("--- Training Loop Completed ---")


Step 4.3: Starting Training Loop for 5000 epochs...
  Epoch 1/5000, Loss: 3.6902
  Epoch 501/5000, Loss: 0.4272
  Epoch 1001/5000, Loss: 0.1480
  Epoch 1501/5000, Loss: 0.1461
  Epoch 2001/5000, Loss: 0.1226
  Epoch 2501/5000, Loss: 0.1281
  Epoch 3001/5000, Loss: 0.1337
  Epoch 3501/5000, Loss: 0.1288
  Epoch 4001/5000, Loss: 0.1178
  Epoch 4501/5000, Loss: 0.1292
  Epoch 5000/5000, Loss: 0.1053
--- Training Loop Completed ---


### Step 5: Text Generation (Inline)

**Goal:** Use the trained model parameters to generate new text, character by character, starting from a seed context.

#### Step 5.1: Set Generation Seed and Parameters

**Explanation:** Define the starting character(s) for generation and how many characters to generate. We convert the seed character ('t' in this case) to its token ID.

In [21]:
print("\n--- Step 5: Text Generation ---")

# Seed character(s)
seed_chars = "t"
# Convert seed characters to token IDs
seed_ids = [char_to_int[ch] for ch in seed_chars]

# Create the initial context tensor
# Shape: (1, len(seed_ids)) -> Batch dimension = 1
generated_sequence = torch.tensor([seed_ids], dtype=torch.long, device=device)
print(f"Initial seed sequence: '{seed_chars}' -> {generated_sequence.tolist()}")

# Define how many new tokens (characters) to generate
num_tokens_to_generate = 200 
print(f"Generating {num_tokens_to_generate} new tokens...")


--- Step 5: Text Generation ---
Initial seed sequence: 't' -> [[31]]
Generating 200 new tokens...


#### Step 5.2: Generation Loop

**Explanation:** Iterate `num_tokens_to_generate` times. In each iteration:
1. Prepare the current context (last `block_size` tokens).
2. Perform a forward pass using the *trained* model parameters. The layers are switched to evaluation mode, and the loop runs under `torch.no_grad()` so that no gradients are tracked, which saves memory and time.
3. Get the logits for the *last* time step.
4. Apply softmax to get probabilities.
5. Sample the next token ID based on probabilities.
6. Append the new token ID to the `generated_sequence`.
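
Steps 1 and 3 to 6 can be tried in isolation by replacing the model's forward pass with a fixed logits vector. In the sketch below, `logits_last` is a hand-written stand-in for the model output, and the greedy `argmax` choice is shown only for comparison; the notebook itself samples with `torch.multinomial`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
toy_block = 4                             # hypothetical context length
seq = torch.tensor([[1, 2, 0, 3, 2, 1]])  # (1, 6) tokens generated so far

# 1. Keep only the last toy_block tokens as context.
context = seq[:, -toy_block:]             # (1, 4)

# 3. Stand-in for the logits at the last time step (vocabulary of 4).
logits_last = torch.tensor([[2.0, 0.5, -1.0, 0.0]])

# 4. Softmax turns logits into a probability distribution.
probs = F.softmax(logits_last, dim=-1)

# 5. Sample one token ID according to those probabilities.
next_token = torch.multinomial(probs, num_samples=1)  # (1, 1)
greedy_token = probs.argmax(dim=-1, keepdim=True)     # deterministic alternative

# 6. Append the sampled token to the running sequence.
seq = torch.cat((seq, next_token), dim=1)

print(context.tolist(), next_token.item(), greedy_token.item(), seq.shape)
```

Sampling lets lower-probability characters appear occasionally, so repeated runs can produce different text, whereas always taking the `argmax` would make generation deterministic.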

In [22]:
# Set layers to evaluation mode (important if dropout/batchnorm were used).
# This model has no dropout, so it is a no-op here; we call eval() on each
# component because there is no single nn.Module wrapping them.
for i in range(n_layers):
    layer_norms_1[i].eval()
    mha_qkv_linears[i].eval()
    mha_output_linears[i].eval()
    layer_norms_2[i].eval()
    ffn_linear_1[i].eval()
    ffn_linear_2[i].eval()
final_layer_norm.eval()
output_linear_layer.eval()
token_embedding_table.eval()

# Disable gradient calculations for generation
with torch.no_grad():
    # Loop to generate tokens one by one
    for _ in range(num_tokens_to_generate):
        # --- 1. Prepare Input Context --- 
        # Take the last block_size tokens as context
        current_context = generated_sequence[:, -block_size:] # Shape: (1, min(current_len, block_size))
        B_gen, T_gen = current_context.shape 
        C_gen = d_model
        
        # --- 2. Forward Pass --- 
        # Embedding + Positional Encoding
        token_embed_gen = token_embedding_table(current_context) # (B_gen, T_gen, C_gen)
        pos_enc_slice_gen = positional_encoding[:, :T_gen, :] 
        x_gen = token_embed_gen + pos_enc_slice_gen # (B_gen, T_gen, C_gen)
        
        # Transformer Blocks
        for i in range(n_layers):
            x_input_block_gen = x_gen
            # Pre-LN MHA
            x_ln1_gen = layer_norms_1[i](x_input_block_gen)
            qkv_gen = mha_qkv_linears[i](x_ln1_gen)
            qkv_gen = qkv_gen.view(B_gen, T_gen, n_heads, 3 * d_k).permute(0, 2, 1, 3)
            q_gen, k_gen, v_gen = qkv_gen.chunk(3, dim=-1)
            attn_scores_gen = (q_gen @ k_gen.transpose(-2, -1)) * (d_k ** -0.5)
            # Use the pre-computed mask sliced to the current context length T_gen
            attn_scores_masked_gen = attn_scores_gen.masked_fill(causal_mask[:,:,:T_gen,:T_gen] == 0, float('-inf'))
            attention_weights_gen = F.softmax(attn_scores_masked_gen, dim=-1)
            attn_output_gen = attention_weights_gen @ v_gen
            attn_output_gen = attn_output_gen.permute(0, 2, 1, 3).contiguous().view(B_gen, T_gen, C_gen)
            mha_result_gen = mha_output_linears[i](attn_output_gen)
            x_gen = x_input_block_gen + mha_result_gen # Residual 1
            # Pre-LN FFN
            x_input_ffn_gen = x_gen
            x_ln2_gen = layer_norms_2[i](x_input_ffn_gen)
            ffn_hidden_gen = ffn_linear_1[i](x_ln2_gen)
            ffn_activated_gen = F.relu(ffn_hidden_gen)
            ffn_output_gen = ffn_linear_2[i](ffn_activated_gen)
            x_gen = x_input_ffn_gen + ffn_output_gen # Residual 2
            
        # Final Layers
        final_norm_output_gen = final_layer_norm(x_gen)
        logits_gen = output_linear_layer(final_norm_output_gen) # (B_gen, T_gen, vocab_size)
        
        # --- 3. Get Logits for Last Time Step --- 
        logits_last_token = logits_gen[:, -1, :] # Shape: (B_gen, vocab_size)
        
        # --- 4. Apply Softmax --- 
        probs = F.softmax(logits_last_token, dim=-1) # Shape: (B_gen, vocab_size)
        
        # --- 5. Sample Next Token --- 
        next_token = torch.multinomial(probs, num_samples=1) # Shape: (B_gen, 1)
        
        # --- 6. Append Sampled Token --- 
        generated_sequence = torch.cat((generated_sequence, next_token), dim=1)

print("\n--- Generation Complete ---")


--- Generation Complete ---


#### Step 5.3: Decode Generated Sequence

**Explanation:** Convert the sequence of generated token IDs back into human-readable characters using the `int_to_char` mapping.

In [23]:
# Get the generated sequence for the first (and only) batch item
final_generated_ids = generated_sequence[0].tolist()

# Decode the list of IDs back into a string
decoded_text = ''.join(int_to_char[token_id] for token_id in final_generated_ids)

print("\nFinal Generated Text (including seed):")
print(decoded_text)


Final Generated Text (including seed):
the
book her sister was reading, but it had no pictures or conversations in
in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'
So she was considerinf her wad fe f
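
As a quick sanity check on the decoding step, the sketch below builds character mappings for a short toy string and confirms that encoding followed by decoding returns the original text. The string `"hello alice"` is an arbitrary example, and the mappings are rebuilt here in the way a character-level tokenizer typically defines them, which is an assumption about the earlier setup cell.

```python
# Toy corpus (illustrative only)
text = "hello alice"

# Character-level vocabulary: one ID per unique character, in sorted order
chars = sorted(set(text))
char_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_char = {i: ch for ch, i in char_to_int.items()}

# Encode to IDs, then decode back to a string
encoded_ids = [char_to_int[ch] for ch in text]
decoded_text = ''.join(int_to_char[token_id] for token_id in encoded_ids)

print(encoded_ids)
print(decoded_text)
```

Since every ID maps to exactly one character, decoding is lossless: any sequence the model generates can always be turned back into text, whether or not that text is meaningful.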


### Step 6: Save the model state (optional)

Since our transformer model is implemented "inline" with separate component variables rather than as a PyTorch `nn.Module`, there is no single `model.state_dict()` to call. Instead, we manually collect each component's `state_dict()` into one dictionary before saving.

In [24]:
# Create a directory to store the model (if it doesn't exist)
os.makedirs('saved_models', exist_ok=True)

# Create a state dictionary to hold all model parameters
state_dict = {
    'token_embedding_table': token_embedding_table.state_dict(),
    'positional_encoding': positional_encoding,  # This is not a parameter, just a tensor
    'layer_norms_1': [ln.state_dict() for ln in layer_norms_1],
    'mha_qkv_linears': [linear.state_dict() for linear in mha_qkv_linears],
    'mha_output_linears': [linear.state_dict() for linear in mha_output_linears],
    'layer_norms_2': [ln.state_dict() for ln in layer_norms_2],
    'ffn_linear_1': [linear.state_dict() for linear in ffn_linear_1],
    'ffn_linear_2': [linear.state_dict() for linear in ffn_linear_2],
    'final_layer_norm': final_layer_norm.state_dict(),
    'output_linear_layer': output_linear_layer.state_dict(),
    # Save hyperparameters for model reconstruction
    'config': {
        'vocab_size': vocab_size,
        'd_model': d_model,
        'n_heads': n_heads,
        'n_layers': n_layers,
        'd_ff': d_ff,
        'block_size': block_size
    },
    # Save tokenizer info for text generation
    'tokenizer': {
        'char_to_int': char_to_int,
        'int_to_char': int_to_char
    }
}

# Save the state dictionary
torch.save(state_dict, 'saved_models/transformer_model.pt')
print("Model saved successfully to 'saved_models/transformer_model.pt'")

Model saved successfully to 'saved_models/transformer_model.pt'
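
To see what each per-layer entry in the saved dictionary holds, the following sketch inspects the `state_dict()` of a few standalone layers. The dimensions are toy values chosen for illustration, not the notebook's hyperparameters.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only
vocab_size, d_model = 10, 8

embedding = nn.Embedding(vocab_size, d_model)
layer_norm = nn.LayerNorm(d_model)
qkv_linear = nn.Linear(d_model, 3 * d_model, bias=False)

# state_dict() maps parameter names to tensors
embedding_keys = list(embedding.state_dict().keys())
layer_norm_keys = list(layer_norm.state_dict().keys())
qkv_keys = list(qkv_linear.state_dict().keys())
qkv_weight_shape = tuple(qkv_linear.state_dict()['weight'].shape)

print(embedding_keys)    # ['weight']
print(layer_norm_keys)   # ['weight', 'bias']
print(qkv_keys)          # ['weight'] (no bias because bias=False)
print(qkv_weight_shape)  # (24, 8): nn.Linear stores weights as (out_features, in_features)
```

Saving these small dictionaries, rather than the layer objects themselves, keeps the checkpoint independent of how the layers are later reconstructed.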


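To load the model later, rebuild each component from the saved configuration and then load its saved weights. Each layer must be recreated with the same shape and bias settings used at initialization; otherwise `load_state_dict` raises an error.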

In [25]:
# Load the saved state dictionary
loaded_state_dict = torch.load('saved_models/transformer_model.pt', map_location=device)

# Extract configuration and tokenizer info
config = loaded_state_dict['config']
vocab_size = config['vocab_size']
d_model = config['d_model']
n_heads = config['n_heads']
n_layers = config['n_layers']
d_ff = config['d_ff']
block_size = config['block_size']
d_k = d_model // n_heads

char_to_int = loaded_state_dict['tokenizer']['char_to_int']
int_to_char = loaded_state_dict['tokenizer']['int_to_char']

# Recreate the model components
token_embedding_table = nn.Embedding(vocab_size, d_model).to(device)
token_embedding_table.load_state_dict(loaded_state_dict['token_embedding_table'])

positional_encoding = loaded_state_dict['positional_encoding'].to(device)

# Initialize the layer lists
layer_norms_1 = []
mha_qkv_linears = []
mha_output_linears = []
layer_norms_2 = []
ffn_linear_1 = []
ffn_linear_2 = []

# Load each layer's components
for i in range(n_layers):
    # Layer norm 1
    ln1 = nn.LayerNorm(d_model).to(device)
    ln1.load_state_dict(loaded_state_dict['layer_norms_1'][i])
    layer_norms_1.append(ln1)
    
    # MHA QKV linear
    qkv_linear = nn.Linear(d_model, 3 * d_model, bias=False).to(device)
    qkv_linear.load_state_dict(loaded_state_dict['mha_qkv_linears'][i])
    mha_qkv_linears.append(qkv_linear)
    
    # MHA output linear
    output_linear = nn.Linear(d_model, d_model).to(device)
    output_linear.load_state_dict(loaded_state_dict['mha_output_linears'][i])
    mha_output_linears.append(output_linear)
    
    # Layer norm 2
    ln2 = nn.LayerNorm(d_model).to(device)
    ln2.load_state_dict(loaded_state_dict['layer_norms_2'][i])
    layer_norms_2.append(ln2)
    
    # FFN linear 1
    lin1 = nn.Linear(d_model, d_ff).to(device)
    lin1.load_state_dict(loaded_state_dict['ffn_linear_1'][i])
    ffn_linear_1.append(lin1)
    
    # FFN linear 2
    lin2 = nn.Linear(d_ff, d_model).to(device)
    lin2.load_state_dict(loaded_state_dict['ffn_linear_2'][i])
    ffn_linear_2.append(lin2)

# Final layer norm
final_layer_norm = nn.LayerNorm(d_model).to(device)
final_layer_norm.load_state_dict(loaded_state_dict['final_layer_norm'])

# Output linear layer
output_linear_layer = nn.Linear(d_model, vocab_size).to(device)
output_linear_layer.load_state_dict(loaded_state_dict['output_linear_layer'])

print("Model loaded successfully!")

Model loaded successfully!
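
One way to gain confidence in a save/load cycle like the one above is to check that a reloaded component reproduces the original outputs. The sketch below does this for a toy feed-forward pair written to a temporary directory; the sizes, the file name `toy_model.pt`, and the reduced checkpoint layout are assumptions made for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "inline" model: one feed-forward pair kept in lists, as in the notebook
d_model, d_ff = 8, 16
ffn_linear_1 = [nn.Linear(d_model, d_ff)]
ffn_linear_2 = [nn.Linear(d_ff, d_model)]

x = torch.randn(2, d_model)
with torch.no_grad():
    expected = ffn_linear_2[0](torch.relu(ffn_linear_1[0](x)))

# Collect state dicts and config by hand, then save to a temporary file
checkpoint = {
    'ffn_linear_1': [lin.state_dict() for lin in ffn_linear_1],
    'ffn_linear_2': [lin.state_dict() for lin in ffn_linear_2],
    'config': {'d_model': d_model, 'd_ff': d_ff},
}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pt')  # hypothetical file name
    torch.save(checkpoint, path)
    loaded = torch.load(path, map_location='cpu')

# Rebuild fresh layers from the saved config and load the saved weights
cfg = loaded['config']
new_lin1 = nn.Linear(cfg['d_model'], cfg['d_ff'])
new_lin1.load_state_dict(loaded['ffn_linear_1'][0])
new_lin2 = nn.Linear(cfg['d_ff'], cfg['d_model'])
new_lin2.load_state_dict(loaded['ffn_linear_2'][0])

with torch.no_grad():
    restored = new_lin2(torch.relu(new_lin1(x)))

outputs_match = torch.allclose(expected, restored)
print(outputs_match)
```

The same comparison can be applied to the full model by running one fixed input batch before saving and again after loading.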


### Step 7: Conclusion

This notebook provided a detailed, step-by-step, inline implementation of a character-level Decoder-only Transformer Language Model. By avoiding functions and classes, we exposed the individual operations involved in both training and text generation.

We covered:
1.  **Setup & Tokenization:** Preparing the environment, defining a text corpus, performing character-level tokenization (creating character mappings and encoding the corpus), and setting hyperparameters.
2.  **Data Preparation:** Structuring the encoded corpus into input/target pairs for next-token prediction.
3.  **Model Initialization:** Creating instances of all necessary `torch.nn` layers (Embeddings, Linears, LayerNorms) and precomputing Positional Encodings.
4.  **Forward Pass (Inline):** Detailing and executing the flow through embeddings, positional encoding, multiple Pre-LN Transformer blocks (Masked Multi-Head Self-Attention and Feed-Forward networks, each preceded by Layer Normalization and wrapped in a residual connection), and the final output layers.
5.  **Training:** Implementing the training loop with batch sampling, forward pass execution, Cross-Entropy Loss calculation, backpropagation, and parameter updates via the AdamW optimizer.
6.  **Text Generation:** Demonstrating autoregressive generation: starting from a seed, running forward passes on the growing context inside `torch.no_grad()`, sampling the next token from the output probabilities, appending it to the sequence, and finally decoding the generated IDs back to text.

While highly verbose, this approach clearly illustrates the fundamental mechanics and data flow within a Transformer LM. Real-world implementations would heavily utilize functions and classes for modularity, reusability, and readability, but this inline method serves as a detailed educational breakdown.
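
As a rough illustration of what the modular version might look like, the sketch below wraps one Pre-LN block in an `nn.Module`, mirroring the inline operations from the forward pass. The class name `PreLNBlock`, its attribute names, and the toy dimensions are hypothetical, and the causal mask is assumed to be a lower-triangular buffer as in the inline code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreLNBlock(nn.Module):
    """One Pre-LN decoder block (hypothetical class mirroring the inline steps)."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, block_size: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff1 = nn.Linear(d_model, d_ff)
        self.ff2 = nn.Linear(d_ff, d_model)
        # Buffers move with the module across devices but are not trained
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("causal_mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        # Masked multi-head self-attention on the normalized input
        qkv = self.qkv(self.ln1(x))
        qkv = qkv.view(B, T, self.n_heads, 3 * self.d_k).permute(0, 2, 1, 3)
        q, k, v = qkv.chunk(3, dim=-1)
        scores = (q @ k.transpose(-2, -1)) * (self.d_k ** -0.5)
        scores = scores.masked_fill(self.causal_mask[:, :, :T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        attn = (weights @ v).permute(0, 2, 1, 3).contiguous().view(B, T, C)
        x = x + self.proj(attn)  # Residual 1
        # Position-wise feed-forward network on the normalized result
        x = x + self.ff2(F.relu(self.ff1(self.ln2(x))))  # Residual 2
        return x


torch.manual_seed(0)
block = PreLNBlock(d_model=16, n_heads=4, d_ff=32, block_size=8)
x = torch.randn(2, 5, 16)
with torch.no_grad():
    y = block(x)
    # Perturb only the last position; earlier outputs should not change
    x_changed = x.clone()
    x_changed[:, -1, :] += 1.0
    y_changed = block(x_changed)

print(y.shape)
print(torch.allclose(y[:, :-1], y_changed[:, :-1], atol=1e-6))
```

With a class like this, `block.state_dict()` would gather all parameters automatically, so the manual collection in Step 6 would no longer be needed.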