# LLM Framework Quick Start

This notebook demonstrates the core capabilities of the LLM framework:

1. **Building a Model** - Create a Decoder-only Transformer
2. **Training** - Train on synthetic data
3. **Inference** - Generate text
4. **Advanced Features** - Gradient checkpointing, GQA, SwiGLU, MoE
5. **E2E Pipeline** - Run train, evaluate, and inference in one call

## Setup

First, let's import the necessary modules and set up our environment.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

from llm.inference import generate, stream_generate
from llm.models.decoder import DecoderModel
from llm.tokenization.simple_tokenizer import SimpleCharacterTokenizer

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 1. Building a Model

Let's build a character-level tokenizer from a few sample sentences, then create a small Decoder-only Transformer sized to its vocabulary.

In [None]:
# Create a simple tokenizer from sample text
sample_text = [
    "hello world",
    "the quick brown fox jumps over the lazy dog",
    "machine learning is amazing",
    "transformers are powerful models",
]
tokenizer = SimpleCharacterTokenizer(sample_text)
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Sample encoding: 'hello' -> {tokenizer.encode('hello')}")
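
To make the encoding step concrete, here is a minimal sketch of how a character-level tokenizer can work. `CharTokenizerSketch` is a hypothetical stand-in written for illustration, not the framework's `SimpleCharacterTokenizer`, which may order its vocabulary differently or reserve IDs for special tokens.

```python
class CharTokenizerSketch:
    """Hypothetical stand-in: maps each distinct character to an integer ID."""

    def __init__(self, texts):
        # A sorted set of characters gives a deterministic vocabulary.
        chars = sorted(set("".join(texts)))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self):
        return len(self.stoi)

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)


sketch = CharTokenizerSketch(["hello world"])
ids = sketch.encode("hello")
print(ids, "->", sketch.decode(ids))
```

Because the vocabulary is built only from the sample text, this sketch raises a `KeyError` for any character it has not seen.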

In [None]:
# Create the model
model = DecoderModel(
    vocab_size=tokenizer.vocab_size,
    hidden_size=128,
    num_layers=2,
    num_heads=4,
    max_seq_len=64,
).to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
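
As a rough sanity check on the printed totals, the parameter count of a decoder-only Transformer can be estimated by hand. The sketch below assumes a textbook block (four `hidden_size x hidden_size` attention projections and a two-layer MLP with 4x expansion), tied input and output embeddings, and it ignores biases, normalization layers, and positional embeddings. The function name and the vocabulary size of 30 are illustrative assumptions, so the framework's real count will likely differ somewhat.

```python
def estimate_decoder_params(vocab_size, hidden_size, num_layers,
                            mlp_ratio=4, tied_embeddings=True):
    """Back-of-the-envelope parameter estimate (hypothetical helper)."""
    attention = 4 * hidden_size * hidden_size             # Q, K, V, output
    mlp = 2 * hidden_size * (mlp_ratio * hidden_size)     # up and down projections
    embeddings = vocab_size * hidden_size
    if not tied_embeddings:
        embeddings *= 2                                   # separate output head
    return embeddings + num_layers * (attention + mlp)


estimate = estimate_decoder_params(vocab_size=30, hidden_size=128, num_layers=2)
print(f"Estimated parameters: {estimate:,}")
```

At this scale the Transformer blocks dominate: the character-level embedding table is only a small fraction of the total.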

## 2. Training

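Let's train the model for a few steps on a single fixed batch of random token IDs. Because the labels are random too, any drop in loss comes from memorizing this batch; the cell checks that the training loop runs, not that the model learns language.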

In [None]:
# Generate synthetic training data
batch_size = 8
seq_len = 32

input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, seq_len), device=device)
labels = torch.randint(0, tokenizer.vocab_size, (batch_size, seq_len), device=device)

print(f"Input shape: {input_ids.shape}")
print(f"Labels shape: {labels.shape}")
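
For real text, language-model targets are usually the inputs shifted by one position rather than independent random labels. The sketch below shows that shift on a plain Python list; the token IDs are arbitrary placeholders, and whether the framework performs this shift internally is not shown in this notebook.

```python
# Toy token IDs for one sequence (values are arbitrary placeholders).
token_ids = [3, 2, 4, 4, 5]

# Next-token prediction: the target at position t is the token at t + 1.
inputs = token_ids[:-1]
targets = token_ids[1:]

for x, y in zip(inputs, targets):
    print(f"given {x} -> predict {y}")
```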

In [None]:
# Training loop
model.train()
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

losses = []
for step in range(20):
    optimizer.zero_grad()

    # Forward pass
    logits = model(input_ids)

    # Compute loss
    loss = criterion(logits.view(-1, tokenizer.vocab_size), labels.view(-1))

    # Backward pass
    loss.backward()
    optimizer.step()

    losses.append(loss.item())
    if step % 5 == 0:
        print(f"Step {step}: Loss = {loss.item():.4f}")

print(f"\nTraining complete! Loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
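
A useful reference point when reading these numbers is the loss of a model that assigns equal probability to every token, which is `ln(vocab_size)`. An untrained model typically starts close to this value. The vocabulary size below is a hypothetical placeholder; in the notebook, use `tokenizer.vocab_size`.

```python
import math

vocab_size = 30  # hypothetical; use tokenizer.vocab_size in the notebook
uniform_loss = math.log(vocab_size)
print(f"Uniform-guess loss for V={vocab_size}: {uniform_loss:.4f}")
```

A final loss well below this baseline on random labels indicates memorization of the batch rather than generalization.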

In [None]:
# Plot training loss
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.plot(losses)
plt.xlabel("Step")
plt.ylabel("Loss")
plt.title("Training Loss")
plt.grid(True, alpha=0.3)
plt.show()

## 3. Inference

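Now let's generate some text using our trained model. Since the model was trained on random tokens, expect mostly random characters; the goal here is to exercise the generation API.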

In [None]:
# Non-streaming generation
model.eval()

prompt = "hello"
generated = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_new_tokens=20,
    temperature=0.8,
)
print(f"Prompt: '{prompt}'")
print(f"Generated: '{generated}'")
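
The `temperature` argument controls how sharply sampling favors the most likely token. The sketch below shows the usual formulation, sampling from `softmax(logits / temperature)`, in plain Python. `sample_with_temperature` is a hypothetical helper, and the assumption that `generate` applies temperature in exactly this way is not confirmed by this notebook.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    # Subtract the max for numerical stability before exponentiating.
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


logits = [2.0, 1.0, 0.1]
_, sharp = sample_with_temperature(logits, temperature=0.5)
_, flat = sample_with_temperature(logits, temperature=2.0)
print("T=0.5:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in flat])
```

Lower temperatures concentrate probability on the top token; higher temperatures flatten the distribution and produce more varied output.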

In [None]:
# Streaming generation: print each token as soon as it is produced
print(f"Streaming: {prompt}", end="")
for token in stream_generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_new_tokens=20,
    temperature=0.8,
):
    print(token, end="", flush=True)
print()
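
Streaming APIs of this kind are commonly built as Python generators that yield each new token instead of returning the finished string. The sketch below illustrates the pattern with a toy next-token function; `stream_tokens_sketch` and `repeat_last` are hypothetical names, and the internals of `stream_generate` may differ.

```python
def stream_tokens_sketch(next_token_fn, prompt, max_new_tokens):
    """Yield one new token at a time instead of returning the full string."""
    text = prompt
    for _ in range(max_new_tokens):
        token = next_token_fn(text)
        text += token
        yield token


# Toy "model": always repeats the last character of the text so far.
def repeat_last(text):
    return text[-1]


pieces = list(stream_tokens_sketch(repeat_last, "hello", max_new_tokens=3))
print(pieces)
```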

## 4. Advanced Features

### 4.1 Gradient Checkpointing

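Reduce peak memory during training by discarding intermediate activations in the forward pass and recomputing them during the backward pass. The trade-off is extra compute per step.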

In [None]:
# Create model with gradient checkpointing
model_gc = DecoderModel(
    vocab_size=tokenizer.vocab_size,
    hidden_size=128,
    num_layers=4,
    num_heads=4,
    max_seq_len=64,
    gradient_checkpointing=True,  # Enable checkpointing
).to(device)

print(f"Gradient checkpointing enabled: {model_gc.gradient_checkpointing}")

# Can also toggle dynamically
model_gc.disable_gradient_checkpointing()
print(f"After disable: {model_gc.gradient_checkpointing}")

model_gc.enable_gradient_checkpointing()
print(f"After enable: {model_gc.gradient_checkpointing}")
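
For intuition about what the flag does, the sketch below applies PyTorch's own `torch.utils.checkpoint.checkpoint` to a small stand-alone block. This illustrates the general technique; how `DecoderModel` wires checkpointing into its layers is an assumption not shown in this notebook.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
x = torch.randn(4, 16, requires_grad=True)

# Regular forward: intermediate activations are kept for the backward pass.
y_plain = block(x)

# Checkpointed forward: activations inside `block` are recomputed in backward.
y_ckpt = checkpoint(block, x, use_reentrant=False)

y_ckpt.sum().backward()
print(torch.allclose(y_plain, y_ckpt))
```

Both paths produce the same outputs and gradients; only the memory and compute profile changes.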

### 4.2 Grouped Query Attention (GQA)

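Share each key/value (KV) head across a group of query heads. Fewer KV heads mean smaller key/value projections and a smaller KV cache at inference time. In the example below, 8 query heads share 2 KV heads, so each KV head serves 4 query heads.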

In [None]:
# Create model with GQA
model_gqa = DecoderModel(
    vocab_size=tokenizer.vocab_size,
    hidden_size=128,
    num_layers=2,
    num_heads=8,
    num_kv_heads=2,  # Use 2 KV heads instead of 8
    max_seq_len=64,
).to(device)

# Test forward pass
test_input = torch.randint(0, tokenizer.vocab_size, (1, 10), device=device)
output = model_gqa(test_input)
print(f"GQA model output shape: {output.shape}")
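
The head grouping can be sketched with simple index arithmetic. The mapping below assumes contiguous groups (query heads 0 to 3 share KV head 0, and so on), which is a common convention but an assumption about this framework.

```python
num_heads = 8
num_kv_heads = 2
group_size = num_heads // num_kv_heads  # query heads per KV head

# Query head i reads keys/values from KV head i // group_size.
kv_head_for_query = [q // group_size for q in range(num_heads)]
print(kv_head_for_query)

# The KV cache scales with the number of KV heads.
cache_ratio = num_kv_heads / num_heads
print(f"KV cache size relative to full multi-head attention: {cache_ratio:.2f}x")
```

Setting `num_kv_heads` equal to `num_heads` recovers standard multi-head attention, and setting it to 1 gives multi-query attention.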

### 4.3 SwiGLU Activation

Use SwiGLU in the MLP layers. SwiGLU is a gated linear unit that uses SiLU (Swish) as its gate activation.

In [None]:
# Create model with SwiGLU
model_swiglu = DecoderModel(
    vocab_size=tokenizer.vocab_size,
    hidden_size=128,
    num_layers=2,
    num_heads=4,
    max_seq_len=64,
    use_glu=True,  # Enable SwiGLU
).to(device)

output = model_swiglu(test_input)
print(f"SwiGLU model output shape: {output.shape}")
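
The computation behind a SwiGLU MLP can be written out in a few lines of plain Python. The sketch below follows the usual formulation, `down(silu(gate(x)) * up(x))`, without biases. The weight names `w_gate`, `w_up`, and `w_down` and the toy values are hypothetical, and whether `use_glu=True` matches this layout exactly is an assumption.

```python
import math


def silu(v):
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def swiglu_mlp(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(w_down, hidden)


x = [1.0, -1.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 x 2
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]   # 3 x 2
w_down = [[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]]   # 2 x 3
out = swiglu_mlp(x, w_gate, w_up, w_down)
print([round(v, 4) for v in out])
```

Compared with a plain two-matrix MLP, the gated form uses three weight matrices, so implementations often shrink the intermediate size to keep the parameter count comparable.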

### 4.4 Mixture of Experts (MoE)

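Use sparse expert routing to increase model capacity: each token is processed by only `top_k` of the `num_experts` experts, so the parameter count grows faster than the per-token compute.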

In [None]:
# Create model with MoE
model_moe = DecoderModel(
    vocab_size=tokenizer.vocab_size,
    hidden_size=128,
    num_layers=2,
    num_heads=4,
    max_seq_len=64,
    use_moe=True,
    num_experts=4,
    top_k=2,
).to(device)

output = model_moe(test_input)
print(f"MoE model output shape: {output.shape}")

# Compare parameter counts
moe_params = sum(p.numel() for p in model_moe.parameters())
base_params = sum(p.numel() for p in model.parameters())
print(f"\nBase model params: {base_params:,}")
print(f"MoE model params: {moe_params:,}")
print(f"Ratio: {moe_params / base_params:.2f}x")
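
The routing step itself can be sketched in plain Python: a router scores every expert for a token, the `top_k` highest-scoring experts are kept, and their weights are renormalized. `route_top_k` is a hypothetical helper, and renormalizing after selection is one common choice; the framework's router may differ, for example by adding a load-balancing loss.

```python
import math


def route_top_k(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k experts."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


routes = route_top_k([0.1, 2.0, -1.0, 1.0], top_k=2)
print(routes)
```

The token's output is then the weighted sum of the selected experts' outputs, so the other experts cost no compute for that token.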

## 5. Using the E2E Pipeline

Run a complete train -> evaluate -> inference workflow.

In [None]:
from llm.utils.e2e import E2EConfig, run_e2e_pipeline

# Configure and run E2E pipeline
config = E2EConfig(
    hidden_size=64,
    num_layers=2,
    num_heads=2,
    epochs=3,
    num_samples=100,
    prompt="hello",
)

result = run_e2e_pipeline(config, device)

print(f"Training: {result.initial_loss:.4f} -> {result.final_loss:.4f}")
print(f"Perplexity: {result.perplexity:.2f}")
print(f"Generated: '{result.generated_text}'")
print(f"All checks passed: {result.all_passed}")
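
Perplexity is conventionally the exponential of the mean cross-entropy loss, so the two printed metrics are closely related. The sketch below shows the conversion for a hypothetical loss value; it assumes the loss is measured in nats per token and that `result.perplexity` follows this standard definition.

```python
import math

mean_loss = 2.0  # hypothetical mean cross-entropy in nats per token
perplexity = math.exp(mean_loss)
print(f"loss={mean_loss:.2f} -> perplexity={perplexity:.2f}")
```

A perplexity equal to the vocabulary size corresponds to uniform guessing, so lower values indicate a better fit.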

## Next Steps

- **Training on real data**: See `scripts/train_simple_decoder.py`
- **Using HuggingFace tokenizers**: See `docs/usage.md`
- **Serving with an API**: Run `llm-serve` and use the OpenAI SDK
- **Read the docs**: See `docs/tutorial-cpu-llm.md` for a comprehensive guide