# LLM Reproducibility with Activation Functions

This notebook runs reproducibility experiments on a character-level language model trained on the Tiny Shakespeare dataset.

**Experiment Design:**
- Train multiple models with identical hyperparameters but different activation functions
- Test: SmeLU (β=0.5, 1.0), ReLU, GELU, Swish
- Measure reproducibility using Relative Prediction Difference (PD)
- Compare accuracy vs reproducibility trade-offs
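
For reference, SmeLU is usually defined piecewise: zero for x ≤ -β, the identity for x ≥ β, and the quadratic (x + β)² / (4β) in between, so both the function and its slope are continuous at the two joins. The sketch below is a scalar, pure-Python illustration of that definition. It is not the implementation behind `get_activation`, which lives in `activations.py` and is not shown here; the function name `smelu` is an assumption made for this example.

```python
def smelu(x: float, beta: float = 1.0) -> float:
    """Scalar SmeLU sketch (hypothetical helper, not the notebook's own code).

    0                        for x <= -beta
    (x + beta)^2 / (4*beta)  for -beta < x < beta
    x                        for x >= beta
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if x <= -beta:
        return 0.0
    if x >= beta:
        return float(x)
    return (x + beta) ** 2 / (4.0 * beta)


# Evaluate both beta values used in this notebook on a few sample inputs
for beta in (0.5, 1.0):
    values = [round(smelu(x, beta), 4) for x in (-2.0, -0.5, 0.0, 0.5, 2.0)]
    print(f"beta={beta}: {values}")
```

A smaller β narrows the quadratic region, so `smelu(x, 0.5)` stays closer to ReLU than `smelu(x, 1.0)` does.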

**Expected Runtime:** ~2-3 hours for all experiments on M4 Pro CPU

## Setup

In [1]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import json
from pathlib import Path

# Local imports
from config import Config
from prepare_data import load_shakespeare, prepare_data
from tokenizer import CharTokenizer
from activations import get_activation
from model import CharLM
from train import run_experiment, set_seed

# Set up matplotlib
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

print(f"PyTorch version: {torch.__version__}")
print(f"Device: {torch.device('cpu')}")

PyTorch version: 2.9.1
Device: cpu


## Configuration

In [2]:
# Initialize config
config = Config()

# Display configuration
print("Experiment Configuration:")
print(f"  Model: {config.n_layer} layers, {config.n_embd} hidden dim, {config.n_head} heads")
print(f"  Context length: {config.block_size} characters")
print(f"  Training iterations: {config.max_iters}")
print(f"  Batch size: {config.batch_size}")
print(f"  Learning rate: {config.learning_rate}")
print(f"  Trials per activation: {config.trials_per_activation}")
print(f"  Activation functions: {list(config.activation_functions.keys())}")
print(f"\n{config}")

Experiment Configuration:
  Model: 6 layers, 384 hidden dim, 6 heads
  Context length: 256 characters
  Training iterations: 5000
  Batch size: 64
  Learning rate: 0.0003
  Trials per activation: 3
  Activation functions: ['smelu_05', 'smelu_1', 'relu', 'gelu', 'swish']

Config(model=6L-384H, iters=5000, batch=64)


## Quick Test: Single Model Training

First, test the setup by training a single model with ReLU activation (~5-10 minutes).

In [3]:
# Test with single model
print("Testing setup with single model...\n")

# Load data
text = load_shakespeare()
train_text, val_text = prepare_data(text, config.train_split)
tokenizer = CharTokenizer(text)
config.vocab_size = len(tokenizer)

train_data = tokenizer.encode(train_text)
val_data = tokenizer.encode(val_text)

print(f"\nVocabulary size: {config.vocab_size}")
print(f"Train data: {len(train_data):,} characters")
print(f"Val data: {len(val_data):,} characters")

Testing setup with single model...

[INFO] Downloading Shakespeare dataset from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
[INFO] Downloaded 1,115,394 characters to data/shakespeare.txt
[INFO] Train size: 1,003,854 characters
[INFO] Val size: 111,540 characters
[INFO] Vocabulary size: 65
[INFO] Characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

Vocabulary size: 65
Train data: 1,003,854 characters
Val data: 111,540 characters


In [None]:
# Create and train a test model
from train import train_model

set_seed(config.seed_base)
test_model = CharLM(
    vocab_size=config.vocab_size,
    n_embd=config.n_embd,
    n_head=config.n_head,
    n_layer=config.n_layer,
    block_size=config.block_size,
    activation=get_activation('relu'),
    dropout=config.dropout
).to(config.device)

test_results = train_model(test_model, train_data, val_data, config, trial_id=0)
print("\n✓ Test training completed successfully!")

[INFO] Model initialized with 10,795,776 parameters

Training Trial 0
Step     0 | Train loss: 4.3444 | Val loss: 4.3433 | Time: 417.0s
Step   500 | Train loss: 1.7497 | Val loss: 1.9022 | Time: 78114.4s
Step  1000 | Train loss: 1.4091 | Val loss: 1.6333 | Time: 131029.1s


In [None]:
# Test text generation
set_seed(42)
test_model.eval()

# Generate from a prompt
prompt = "ROMEO:"
context = torch.tensor(tokenizer.encode(prompt), dtype=torch.long).unsqueeze(0).to(config.device)
generated = test_model.generate(context, max_new_tokens=200, temperature=0.8)
generated_text = tokenizer.decode(generated[0].tolist())

print("Generated text:")
print("=" * 60)
print(generated_text)
print("=" * 60)

## Full Experiments: All Activation Functions

Now run the complete experiment suite. This will:
1. Train 3 models for each activation function (5 activations × 3 trials = 15 models)
2. Calculate reproducibility metrics (Relative PD)
3. Save all results
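
The notebook does not show how `run_experiment` computes Relative PD, so the following is only a sketch of one common formulation: for two trained models, take the probability each assigns to the same target, divide the absolute difference by the mean of the two, average over examples, and then average over all model pairs. The helper names `relative_pd` and `avg_pairwise_relative_pd`, and the toy probabilities, are hypothetical.

```python
from itertools import combinations


def relative_pd(p_a, p_b, eps=1e-12):
    """Mean of |p_a - p_b| / ((p_a + p_b) / 2) over paired predictions."""
    if len(p_a) != len(p_b) or len(p_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, b in zip(p_a, p_b):
        # eps guards against division by zero when both probabilities are 0
        total += abs(a - b) / max((a + b) / 2.0, eps)
    return total / len(p_a)


def avg_pairwise_relative_pd(trials):
    """Average relative PD over every unordered pair of trials."""
    pairs = list(combinations(trials, 2))
    if not pairs:
        raise ValueError("need at least two trials")
    return sum(relative_pd(a, b) for a, b in pairs) / len(pairs)


# Toy data: probability each of 3 trials assigns to the true next character
# at two evaluation positions (made-up numbers for illustration only)
trials = [
    [0.5, 0.2],
    [0.5, 0.2],
    [0.3, 0.2],
]
print(f"Average pairwise relative PD: {avg_pairwise_relative_pd(trials):.6f}")
```

With 3 trials per activation there are 3 model pairs, so each reported average would rest on only three pairwise comparisons under this formulation.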

**Estimated time: 2-3 hours**

In [None]:
# Run experiments for all activation functions
all_results = {}
all_models = {}

activations_to_test = ['smelu_05', 'smelu_1', 'relu', 'gelu', 'swish']

for activation_name in activations_to_test:
    results, models, tokenizer = run_experiment(config, activation_name)
    all_results[activation_name] = results
    all_models[activation_name] = models
    
    # Print a visual separator between experiments
    print("\n" + "="*60 + "\n")

print("\n✓ All experiments completed!")

## Results Analysis

In [None]:
# Compile summary statistics
import pandas as pd

summary = []
for activation_name, results in all_results.items():
    summary.append({
        'Activation': activation_name,
        'Avg Val Loss': f"{results['avg_val_loss']:.4f}",
        'Std Val Loss': f"{results['std_val_loss']:.4f}",
        'Avg Relative PD': f"{results['avg_relative_pd']:.6f}",
        'Avg Time (s)': f"{results['avg_training_time']:.1f}"
    })

summary_df = pd.DataFrame(summary)
print("\nExperiment Summary:")
print("="*80)
print(summary_df.to_string(index=False))
print("="*80)

In [None]:
# Plot: Validation Loss vs Relative PD
fig, ax = plt.subplots(figsize=(10, 6))

activations = list(all_results.keys())
val_losses = [all_results[act]['avg_val_loss'] for act in activations]
relative_pds = [all_results[act]['avg_relative_pd'] for act in activations]

# Create scatter plot
colors = plt.cm.viridis(np.linspace(0, 1, len(activations)))
for i, act in enumerate(activations):
    ax.scatter(relative_pds[i], val_losses[i], s=200, c=[colors[i]], 
               label=act.upper(), alpha=0.7, edgecolors='black', linewidth=1.5)

ax.set_xlabel('Relative Prediction Difference (Lower = More Reproducible)', fontsize=12)
ax.set_ylabel('Validation Loss (Lower = Better)', fontsize=12)
ax.set_title('Reproducibility vs Accuracy Trade-off\nCharacter-Level LM on Shakespeare', 
             fontsize=14, fontweight='bold')
ax.legend(loc='best', fontsize=10)
ax.grid(True, alpha=0.3)

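plt.tight_layout()
# Make sure the output directory exists before saving (no-op if it already does)
Path('results').mkdir(parents=True, exist_ok=True)
plt.savefig('results/reproducibility_vs_accuracy.png', dpi=150, bbox_inches='tight')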
plt.show()

print("Plot saved to results/reproducibility_vs_accuracy.png")

In [None]:
# Plot: Relative PD comparison
fig, ax = plt.subplots(figsize=(10, 6))

x = np.arange(len(activations))
bars = ax.bar(x, relative_pds, color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)

ax.set_xlabel('Activation Function', fontsize=12)
ax.set_ylabel('Average Relative Prediction Difference', fontsize=12)
ax.set_title('Reproducibility Comparison Across Activation Functions\n(Lower is Better)', 
             fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([act.upper() for act in activations], rotation=45)
ax.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for bar, val in zip(bars, relative_pds):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{val:.6f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.savefig('results/relative_pd_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("Plot saved to results/relative_pd_comparison.png")

In [None]:
# Plot: Validation loss comparison
fig, ax = plt.subplots(figsize=(10, 6))

std_losses = [all_results[act]['std_val_loss'] for act in activations]
bars = ax.bar(x, val_losses, yerr=std_losses, color=colors, alpha=0.7, 
              edgecolor='black', linewidth=1.5, capsize=5)

ax.set_xlabel('Activation Function', fontsize=12)
ax.set_ylabel('Average Validation Loss', fontsize=12)
ax.set_title('Validation Loss Comparison Across Activation Functions\n(Lower is Better)', 
             fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([act.upper() for act in activations], rotation=45)
ax.grid(True, alpha=0.3, axis='y')

# Add value labels
for bar, val in zip(bars, val_losses):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{val:.4f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.savefig('results/val_loss_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("Plot saved to results/val_loss_comparison.png")

## Sample Text Generation

Generate text samples from each activation function to qualitatively compare outputs.
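
The cells in this notebook pass `temperature=0.8` to `generate`. How `CharLM.generate` applies it is not shown here; a common convention, sketched below as an assumption, is to divide the logits by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The function names and the toy logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed stably."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-character vocabulary
for t in (0.8, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")

rng = random.Random(42)  # fixed seed so repeated runs draw the same index
print("Sampled index:", sample_index(toy_logits, 0.8, rng))
```

Because sampling is stochastic, the notebook reseeds with `set_seed(42)` before each generation so that differences between samples reflect the models rather than the random draws.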

In [None]:
# Generate samples from each activation function
prompts = ["ROMEO:", "JULIET:", "To be or not to be,"]

for prompt in prompts:
    print(f"\n{'='*60}")
    print(f"Prompt: {prompt}")
    print(f"{'='*60}\n")
    
    for activation_name, models in all_models.items():
        model = models[0]  # Use first model from each activation
        model.eval()
        set_seed(42)
        
        context = torch.tensor(tokenizer.encode(prompt), dtype=torch.long).unsqueeze(0).to(config.device)
        generated = model.generate(context, max_new_tokens=100, temperature=0.8)
        generated_text = tokenizer.decode(generated[0].tolist())
        
        print(f"[{activation_name.upper()}]")
        print(generated_text)
        print()

## Key Findings

Analyze the results to answer:
1. Which activation function provides the best reproducibility?
2. Is there a trade-off between accuracy and reproducibility?
3. Do smooth activations (SmeLU, Swish, GELU) show better reproducibility than ReLU?

In [None]:
# Find best activation for reproducibility
best_reproducibility = min(all_results.items(), key=lambda x: x[1]['avg_relative_pd'])
print(f"Best Reproducibility: {best_reproducibility[0].upper()}")
print(f"  Relative PD: {best_reproducibility[1]['avg_relative_pd']:.6f}")
print(f"  Val Loss: {best_reproducibility[1]['avg_val_loss']:.4f}")

# Find best activation for accuracy
best_accuracy = min(all_results.items(), key=lambda x: x[1]['avg_val_loss'])
print(f"\nBest Accuracy: {best_accuracy[0].upper()}")
print(f"  Val Loss: {best_accuracy[1]['avg_val_loss']:.4f}")
print(f"  Relative PD: {best_accuracy[1]['avg_relative_pd']:.6f}")

# Compare smooth vs non-smooth activations
smooth = ['smelu_05', 'smelu_1', 'gelu', 'swish']
non_smooth = ['relu']

smooth_avg_pd = np.mean([all_results[act]['avg_relative_pd'] for act in smooth if act in all_results])
non_smooth_avg_pd = np.mean([all_results[act]['avg_relative_pd'] for act in non_smooth if act in all_results])

print(f"\nSmooth Activations Avg PD: {smooth_avg_pd:.6f}")
print(f"Non-Smooth Activations Avg PD: {non_smooth_avg_pd:.6f}")
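# Report which group had the lower average PD in this run (a tie favors neither)
if smooth_avg_pd < non_smooth_avg_pd:
    print("\nSmooth activations are MORE reproducible!")
elif smooth_avg_pd > non_smooth_avg_pd:
    print("\nReLU is MORE reproducible!")
else:
    print("\nSmooth activations and ReLU have the same average PD.")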

## Save Final Summary

In [None]:
# Save complete summary
final_summary = {
    'config': config.__dict__,
    'results': dict(all_results),
    'best_reproducibility': {
        'activation': best_reproducibility[0],
        'relative_pd': best_reproducibility[1]['avg_relative_pd'],
        'val_loss': best_reproducibility[1]['avg_val_loss']
    },
    'best_accuracy': {
        'activation': best_accuracy[0],
        'val_loss': best_accuracy[1]['avg_val_loss'],
        'relative_pd': best_accuracy[1]['avg_relative_pd']
    },
    'smooth_vs_nonsmooth': {
        'smooth_avg_pd': smooth_avg_pd,
        'nonsmooth_avg_pd': non_smooth_avg_pd
    }
}

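with open('results/final_summary.json', 'w') as f:
    # default=str is a fallback for values json cannot encode natively
    # (for example a torch.device stored on the config, if it holds one)
    json.dump(final_summary, f, indent=2, default=str)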

print("Final summary saved to results/final_summary.json")
print("\n✓ Experiment complete!")