# MNIST Classification Analysis

This notebook demonstrates how to use the research project template for MNIST classification analysis. It shows how to run experiments, analyze results, and create visualizations.

## Overview

The notebook covers:
1. Setting up the experiment environment
2. Running experiments with different models
3. Analyzing and comparing results
4. Creating visualizations
5. Interpreting model performance


## Setup

First, let's import the necessary libraries and set up the environment.

In [None]:
import sys
import os
import yaml
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Add src to path to import our modules
sys.path.append('../src')

from experiment import ExperimentConfig, Experiment
from experiment.schemas import load_config

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Create directories if they don't exist
Path('../results/plots').mkdir(parents=True, exist_ok=True)
Path('../cache').mkdir(parents=True, exist_ok=True)

print("Environment setup complete!")

## 1. Single Experiment Example

Let's start by running a single experiment with the linear model.

In [None]:
# Load configuration
config_path = Path('../configs/local.yaml')
config = load_config(config_path)

print("Configuration loaded:")
print(f"- Experiment: {config.experiment.name}")
print(f"- Model: {config.model.name}")
print(f"- Epochs: {config.training.epochs}")
print(f"- Batch size: {config.data.batch_size}")
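
The value of loading a validated configuration object, rather than a raw dictionary, is that bad settings fail at load time instead of halfway through training. The following is only a sketch of that idea using standard-library dataclasses; `TrainingSection` and `parse_training` are hypothetical names, and the checks are illustrative assumptions, not the template's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSection:
    """Hypothetical stand-in for the training part of a config."""
    epochs: int
    learning_rate: float
    optimizer: str = "adam"

    def __post_init__(self):
        # Reject values that would otherwise only fail deep inside training.
        if self.epochs < 1:
            raise ValueError(f"epochs must be >= 1, got {self.epochs}")
        if not 0 < self.learning_rate < 1:
            raise ValueError(
                f"learning_rate must be in (0, 1), got {self.learning_rate}"
            )


def parse_training(raw: dict) -> TrainingSection:
    """Build a validated section from a plain dict, e.g. one parsed from YAML."""
    return TrainingSection(**raw)


training = parse_training({"epochs": 5, "learning_rate": 0.001})
print(training.optimizer)  # adam
```

An unknown key in `raw` raises `TypeError` from the generated constructor, so typos in field names are caught as well.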

In [None]:
# Create and run experiment
experiment = Experiment.from_config(config)

print("Running experiment...")
print("This may take a few minutes for the first run (downloading MNIST data)")

# Run the experiment
results = experiment.run()

print("\nExperiment completed!")
print(f"Results saved to: {results['results_path']}")

In [None]:
# Display results
print("\n=== EXPERIMENT RESULTS ===")
print(f"Experiment time: {results['experiment_time']:.2f} seconds")

# Show metrics for each split
for key, value in results.items():
    if key.endswith('_metrics') and isinstance(value, dict):
        split_name = key.replace('_metrics', '').title()
        print(f"\n{split_name} Metrics:")
        for metric_name, metric_value in value.items():
            if isinstance(metric_value, float):
                print(f"  {metric_name}: {metric_value:.4f}")

## 2. Comparing Multiple Models

Now let's run experiments with different models and compare their performance.

In [None]:
# Define configurations for different models
model_configs = {
    'linear': {
        "name": "linear",
        "input_size": 784,
        "num_classes": 10
    },
    'mlp': {
        "name": "mlp",
        "input_size": 784,
        "hidden_size": 128,
        "num_layers": 2,
        "dropout": 0.2
    },
    'cnn': {
        "name": "cnn",
        "input_channels": 1,
        "channels": [32, 64],
        "kernel_size": 3,
        "dropout": 0.2
    }
}

# Base configuration (same for all models)
base_config = {
    "data": {
        "dataset": "mnist",
        "batch_size": 32,
        "validation_split": 0.1
    },
    "training": {
        "epochs": 5,  # Reduced for demonstration
        "learning_rate": 0.001,
        "optimizer": "adam"
    },
    "evaluation": {
        "metrics": ["accuracy", "f1_macro"]
    },
    "experiment": {
        "random_seed": 42,
        "device": "auto",
        "save_model": False  # Don't save models in demo
    },
    "logging": {
        "log_every_n_steps": 50
    }
}

print("Model comparison configurations created.")

In [None]:
# Run experiments for all models
import copy

model_results = {}

for model_name, model_config in model_configs.items():
    print(f"\nRunning experiment with {model_name.upper()} model...")
    
    # Create full configuration; deep-copy so the nested sections of
    # base_config are not shared with (and mutated through) full_config
    full_config = copy.deepcopy(base_config)
    full_config['model'] = model_config
    full_config['experiment']['name'] = f'mnist_{model_name}_comparison'
    
    # Create experiment config object
    experiment_config = ExperimentConfig(**full_config)
    
    # Run experiment
    experiment = Experiment.from_config(experiment_config)
    results = experiment.run()
    
    # Store results
    model_results[model_name] = results
    
    print(f"{model_name.upper()} completed in {results['experiment_time']:.2f}s")

print("\nAll experiments completed!")
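
One detail matters when per-model configurations are derived from a shared base dictionary: `dict.copy()` is shallow, so nested sections such as `experiment` remain shared with the original, and writing into them changes the base too. A minimal sketch of a safer composition step follows; `build_config` is a hypothetical helper, not part of the template.

```python
import copy


def build_config(base: dict, model: dict, name: str) -> dict:
    """Return a new config; `base` and `model` are left untouched."""
    config = copy.deepcopy(base)  # nested dicts are copied, not shared
    config["model"] = copy.deepcopy(model)
    config.setdefault("experiment", {})["name"] = name
    return config


base = {"experiment": {"random_seed": 42}, "training": {"epochs": 5}}
cfg = build_config(base, {"name": "mlp"}, "mnist_mlp_comparison")

print(cfg["experiment"]["name"])     # mnist_mlp_comparison
print("name" in base["experiment"])  # False
```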

## 3. Results Analysis and Visualization

Now let's analyze and visualize the results from all models.

In [None]:
# Extract metrics for comparison
comparison_data = []

for model_name, results in model_results.items():
    # Get test metrics
    test_metrics = results.get('test_metrics', {})
    
    row = {
        'Model': model_name.upper(),
        'Test Accuracy': test_metrics.get('accuracy', 0),
        'Test F1 Macro': test_metrics.get('f1_macro', 0),
        'Training Time (s)': results.get('experiment_time', 0)
    }
    comparison_data.append(row)

# Create DataFrame for easy analysis
comparison_df = pd.DataFrame(comparison_data)

print("Model Comparison Results:")
print(comparison_df.to_string(index=False, float_format='%.4f'))
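
For reading the F1 Macro column: macro F1 is usually defined as the unweighted mean of the per-class F1 scores, so every digit counts equally regardless of how often it appears. The sketch below computes it in plain Python under that usual definition; whether the template's `f1_macro` metric matches it exactly is an assumption, and the labels are toy values, not MNIST results.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no true or predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy labels for illustration only (not MNIST results).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred, num_classes=3), 4))  # 0.6556
```

Here the per-class scores are 0.5, 0.8, and about 0.667, while plain accuracy is 4/6; the two metrics diverge more as class frequencies become unbalanced.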

In [None]:
# Create visualization comparing model performance
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Accuracy comparison
axes[0].bar(comparison_df['Model'], comparison_df['Test Accuracy'], 
           color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
axes[0].set_title('Test Accuracy by Model')
axes[0].set_ylabel('Accuracy')
axes[0].set_ylim(0, 1.05)  # headroom so value labels near 1.0 stay inside the axes
for i, v in enumerate(comparison_df['Test Accuracy']):
    axes[0].text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom')

# F1 Score comparison
axes[1].bar(comparison_df['Model'], comparison_df['Test F1 Macro'], 
           color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
axes[1].set_title('Test F1 Macro Score by Model')
axes[1].set_ylabel('F1 Macro Score')
axes[1].set_ylim(0, 1.05)  # headroom so value labels near 1.0 stay inside the axes
for i, v in enumerate(comparison_df['Test F1 Macro']):
    axes[1].text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom')

# Training time comparison
axes[2].bar(comparison_df['Model'], comparison_df['Training Time (s)'], 
           color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
axes[2].set_title('Training Time by Model')
axes[2].set_ylabel('Time (seconds)')
for i, v in enumerate(comparison_df['Training Time (s)']):
    axes[2].text(i, v + 1, f'{v:.1f}s', ha='center', va='bottom')

plt.tight_layout()
plt.savefig('../results/plots/model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("Model comparison plot saved to results/plots/model_comparison.png")

## 4. Training History Analysis

Let's analyze the training history for one of the models to understand the learning process.

In [None]:
# Load the cached training history.
# NOTE: nothing below selects a model, so check which run the 'training_history'
# artifact belongs to before reading these curves as the MLP's.
mlp_results = model_results['mlp']

from experiment.infra import create_infra

infra = create_infra(Path('../cache'), 'analysis')
training_history = infra.load_artifact('training_history')

print("Training history loaded:")
print(f"Epochs trained: {len(training_history['train_loss'])}")
print(f"Final train loss: {training_history['train_loss'][-1]:.4f}")
print(f"Final train accuracy: {training_history['train_accuracy'][-1]:.4f}")
if 'val_loss' in training_history:
    print(f"Final val loss: {training_history['val_loss'][-1]:.4f}")
    print(f"Final val accuracy: {training_history['val_accuracy'][-1]:.4f}")

In [None]:
# Plot training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

epochs = range(1, len(training_history['train_loss']) + 1)

# Loss curves
ax1.plot(epochs, training_history['train_loss'], 'o-', label='Training Loss', linewidth=2)
if 'val_loss' in training_history and len(training_history['val_loss']) > 0:
    ax1.plot(epochs, training_history['val_loss'], 's-', label='Validation Loss', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Validation Loss')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Accuracy curves
ax2.plot(epochs, training_history['train_accuracy'], 'o-', label='Training Accuracy', linewidth=2)
if 'val_accuracy' in training_history and len(training_history['val_accuracy']) > 0:
    ax2.plot(epochs, training_history['val_accuracy'], 's-', label='Validation Accuracy', linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training and Validation Accuracy')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../results/plots/training_curves.png', dpi=300, bbox_inches='tight')
plt.show()

print("Training curves saved to results/plots/training_curves.png")
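
Beyond eyeballing the curves, two numbers summarize them well: the epoch with the lowest validation loss, and the gap between validation and training loss at the end of training (a growing gap suggests overfitting). A small sketch, assuming a history dictionary with the `train_loss` and `val_loss` keys used above; `summarize_history` is a hypothetical helper and the values are made up for illustration.

```python
def summarize_history(history):
    """Report the epoch with the lowest validation loss and the final train/val gap."""
    val_loss = history.get("val_loss") or []
    if not val_loss:
        return None  # no validation split was recorded
    best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best_idx + 1,  # epochs are 1-based, as in the plots
        "best_val_loss": val_loss[best_idx],
        "final_gap": val_loss[-1] - history["train_loss"][-1],
    }


# Made-up numbers for illustration only.
toy_history = {
    "train_loss": [0.60, 0.35, 0.25, 0.18, 0.12],
    "val_loss": [0.50, 0.34, 0.30, 0.31, 0.33],
}
summary = summarize_history(toy_history)
print(summary["best_epoch"])  # 3
```

In this toy history the validation loss bottoms out at epoch 3 while the training loss keeps falling, which is the pattern early stopping is meant to catch.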

## 5. Model Analysis and Insights

Let's create a summary of our findings and provide insights about the different models.

In [None]:
# Create a detailed analysis report
print("=== MNIST CLASSIFICATION ANALYSIS REPORT ===")
print("\n1. MODEL PERFORMANCE SUMMARY")
print("-" * 50)

# Find best performing model
best_accuracy_idx = comparison_df['Test Accuracy'].idxmax()
best_model = comparison_df.loc[best_accuracy_idx]

print(f"Best performing model: {best_model['Model']}")
print(f"  - Test Accuracy: {best_model['Test Accuracy']:.4f}")
print(f"  - Test F1 Macro: {best_model['Test F1 Macro']:.4f}")
print(f"  - Training Time: {best_model['Training Time (s)']:.2f}s")

print("\n2. MODEL COMPARISON INSIGHTS")
print("-" * 50)

for _, row in comparison_df.iterrows():
    model_name = row['Model']
    accuracy = row['Test Accuracy']
    train_time = row['Training Time (s)']  # avoid shadowing the stdlib `time` module
    
    if model_name == 'LINEAR':
        print(f"📊 {model_name}:")
        print(f"   - Simplest model with {accuracy:.3f} accuracy")
        print(f"   - Training time: {train_time:.1f}s (usually the fastest of the three)")
        print("   - Good baseline for simple classification tasks")
        
    elif model_name == 'MLP':
        print(f"🧠 {model_name}:")
        print(f"   - Neural network with {accuracy:.3f} accuracy")
        print(f"   - Moderate complexity and training time ({train_time:.1f}s)")
        print("   - Good balance of accuracy and model complexity")
        
    elif model_name == 'CNN':
        print(f"🖼️ {model_name}:")
        print(f"   - Convolutional model with {accuracy:.3f} accuracy")
        print(f"   - Designed for image data ({train_time:.1f}s training)")
        print("   - Can learn spatial features and patterns")

print("\n3. RECOMMENDATIONS")
print("-" * 50)

if best_model['Model'] == 'CNN':
    print("✅ CNN performs best - recommended for image classification tasks")
    print("✅ Consider using CNN for production deployment")
elif best_model['Model'] == 'MLP':
    print("✅ MLP provides good balance of accuracy and speed")
    print("✅ Consider MLP for moderate-complexity tasks")
else:
    print("✅ Linear model is surprisingly effective")
    print("✅ Consider linear model for fast inference requirements")

print("\n4. NEXT STEPS")
print("-" * 50)
print("📈 Try hyperparameter tuning with the best model")
print("📊 Collect more training data if accuracy is insufficient")
print("🔄 Experiment with data augmentation techniques")
print("⚡ Consider ensemble methods for better performance")
print("🎯 Analyze misclassified examples to understand failure modes")
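
As a starting point for the misclassification analysis suggested above, it helps to count which (true, predicted) label pairs occur most often among the errors. The sketch below does this with the standard library only; `most_confused` is a hypothetical helper, and the labels are toy values rather than model output.

```python
from collections import Counter


def most_confused(y_true, y_pred, top_k=3):
    """Return the most frequent (true, predicted) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top_k)


# Toy labels for illustration only (not MNIST results).
y_true = [4, 9, 4, 9, 3, 5, 7, 4]
y_pred = [9, 9, 9, 4, 5, 5, 7, 4]
print(most_confused(y_true, y_pred, top_k=1))  # [((4, 9), 2)]
```

The pairs at the top of this list are the ones worth inspecting as images first.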

## 6. Experiment Configuration Management

This section demonstrates how to manage and create different experiment configurations programmatically.

In [None]:
# Create a custom configuration for hyperparameter tuning
tuning_config = {
    "data": {
        "dataset": "mnist",
        "batch_size": 64,
        "validation_split": 0.2
    },
    "model": {
        "name": "mlp",
        "input_size": 784,
        "hidden_size": 256,  # Larger hidden size
        "num_layers": 3,     # More layers
        "dropout": 0.3,      # More dropout
        "activation": "relu"
    },
    "training": {
        "epochs": 10,
        "learning_rate": 0.001,
        "optimizer": "adam",
        "weight_decay": 0.0001,
        "scheduler": {
            "name": "reduce_lr_on_plateau",
            "factor": 0.5,
            "patience": 3,
            "min_lr": 1e-6
        },
        "early_stopping": {
            "patience": 5,
            "min_delta": 0.001
        }
    },
    "evaluation": {
        "metrics": ["accuracy", "f1_macro", "f1_micro"],
        "save_predictions": True,
        "save_confusion_matrix": True
    },
    "experiment": {
        "name": "mnist_tuned_mlp",
        "description": "Hyperparameter tuned MLP for MNIST",
        "random_seed": 42,
        "device": "auto",
        "save_model": True
    },
    "logging": {
        "log_every_n_steps": 100,
        "log_gradients": False
    }
}

# Save this configuration for future use
tuned_config_path = Path('../configs/tuned_mlp.yaml')
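with open(tuned_config_path, 'w') as f:
    # sort_keys=False keeps the section order defined above (PyYAML sorts keys by default)
    yaml.dump(tuning_config, f, default_flow_style=False, indent=2, sort_keys=False)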

print(f"Tuned configuration saved to: {tuned_config_path}")
print("\nConfiguration preview:")
print(f"  Model: {tuning_config['model']['name']} with {tuning_config['model']['hidden_size']} hidden units")
print(f"  Layers: {tuning_config['model']['num_layers']}")
print(f"  Dropout: {tuning_config['model']['dropout']}")
print(f"  Learning rate scheduling: {tuning_config['training']['scheduler']['name']}")
print(f"  Early stopping: enabled with patience {tuning_config['training']['early_stopping']['patience']}")
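
Creating configurations programmatically pays off most when sweeping several hyperparameters at once. The following sketch expands a base dictionary into one configuration per combination of values; `expand_grid` and the `(section, field)` key convention are assumptions made for this example, not part of the template.

```python
import copy
import itertools


def expand_grid(base: dict, grid: dict) -> list:
    """Return one config per combination of values in `grid`.

    Keys in `grid` are (section, field) pairs, e.g. ("model", "hidden_size").
    """
    keys = list(grid)
    configs = []
    for values in itertools.product(*(grid[k] for k in keys)):
        config = copy.deepcopy(base)  # never mutate the shared base
        for (section, field), value in zip(keys, values):
            config.setdefault(section, {})[field] = value
        configs.append(config)
    return configs


base = {
    "model": {"name": "mlp", "hidden_size": 256},
    "training": {"learning_rate": 0.001},
}
grid = {
    ("model", "hidden_size"): [128, 256],
    ("training", "learning_rate"): [0.01, 0.001, 0.0001],
}
configs = expand_grid(base, grid)
print(len(configs))  # 6
```

Each resulting dictionary could then be given its own experiment name and written to a YAML file in the same way as the tuned configuration above.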

## 7. Caching and Reproducibility

This section demonstrates the caching capabilities and how to ensure reproducible experiments.

In [None]:
# Examine cache structure
cache_dir = Path('../cache')
if cache_dir.exists():
    print("Cache directory contents:")
    for item in cache_dir.rglob('*'):
        if item.is_file():
            size_mb = item.stat().st_size / (1024 * 1024)
            print(f"  {item.relative_to(cache_dir)}: {size_mb:.2f} MB")
else:
    print("No cache directory found.")

# Show how to clean cache if needed
print("\nTo clean cache, you can use:")
print("  python -m src.cli clean --cache-dir cache --keep 3")
print("  (keeps 3 most recent files of each type)")
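
To make the "keep the N most recent files of each type" policy concrete, here is a sketch of how such pruning could work. It is not the CLI's implementation: `prune_cache` is a hypothetical function, and treating the file suffix as the "type" and the modification time as the recency measure are both assumptions.

```python
from collections import defaultdict
from pathlib import Path


def prune_cache(cache_dir: Path, keep: int = 3) -> list:
    """Delete all but the `keep` newest files per suffix; return deleted paths."""
    by_suffix = defaultdict(list)
    for path in cache_dir.rglob("*"):
        if path.is_file():
            by_suffix[path.suffix].append(path)

    deleted = []
    for paths in by_suffix.values():
        # Newest first, by modification time.
        paths.sort(key=lambda p: p.stat().st_mtime, reverse=True)
        for old in paths[keep:]:
            old.unlink()
            deleted.append(old)
    return deleted
```

Calling `prune_cache(Path('../cache'), keep=3)` would delete files permanently, so it is worth trying on a copy of the cache first.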

In [None]:
# Demonstrate reproducibility by running the same experiment twice
print("Testing reproducibility...")

# Create a simple config for reproducibility test
repro_config = {
    "data": {"dataset": "mnist", "batch_size": 16, "download": False},
    "model": {"name": "linear", "input_size": 784, "num_classes": 10},
    "training": {"epochs": 1, "learning_rate": 0.01, "optimizer": "sgd"},
    "evaluation": {"metrics": ["accuracy"]},
    "experiment": {
        "name": "reproducibility_test",
        "random_seed": 123,  # Fixed seed
        "device": "cpu",
        "save_model": False
    },
    "logging": {"log_every_n_steps": 10}
}

results_run1 = []
results_run2 = []

# Run 1
config1 = ExperimentConfig(**repro_config)
exp1 = Experiment.from_config(config1)
res1 = exp1.run()
results_run1.append(res1.get('test_metrics', {}).get('accuracy', 0))

# Run 2 (same configuration)
config2 = ExperimentConfig(**repro_config)
exp2 = Experiment.from_config(config2)
res2 = exp2.run()
results_run2.append(res2.get('test_metrics', {}).get('accuracy', 0))

print(f"Run 1 accuracy: {results_run1[0]:.6f}")
print(f"Run 2 accuracy: {results_run2[0]:.6f}")
print(f"Difference: {abs(results_run1[0] - results_run2[0]):.6f}")

if abs(results_run1[0] - results_run2[0]) < 1e-5:
    print("✅ Results are reproducible!")
else:
    print("❌ Results differ - check random seed settings")
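
The check above only passes if every source of randomness in a run is driven by the configured seed. The sketch below shows the principle with Python's standard `random` module alone; `noisy_run` is a hypothetical stand-in for a training run. A real run would also need its NumPy, framework, and data-shuffling generators seeded, and some GPU operations can remain nondeterministic even then.

```python
import random


def noisy_run(seed: int, n: int = 5) -> list:
    """Stand-in for a training run whose only randomness comes from `rng`."""
    rng = random.Random(seed)  # local generator, independent of global state
    return [rng.random() for _ in range(n)]


print(noisy_run(123) == noisy_run(123))  # True
```

Using a local generator rather than the module-level functions keeps the result independent of whatever other code has already consumed random numbers.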

## 8. Summary and Conclusions

This notebook demonstrated the key features of the research project template:

### ✅ **What we accomplished:**
1. **Configuration Management**: Loaded and validated experiment configurations
2. **Multiple Models**: Compared Linear, MLP, and CNN models on MNIST
3. **Automated Pipeline**: Used ExCa caching for efficient experiment runs
4. **Results Analysis**: Created visualizations and performance comparisons
5. **Reproducibility**: Checked that repeated runs with a fixed random seed give matching results
6. **Extensibility**: Showed how to create custom configurations

### 🔑 **Key Benefits of this Template:**
- **Type Safety**: Pydantic ensures configuration validation
- **Caching**: ExCa speeds up repeated experiments
- **Reproducibility**: Fixed seeds make runs repeatable on the same hardware and software stack
- **Modularity**: Easy to swap models, optimizers, and datasets
- **Testing**: Comprehensive test suite ensures code quality
- **CI/CD**: Automated testing on every commit

### 🚀 **Next Steps for Research:**
1. **Hyperparameter Tuning**: Use the tuned configuration we created
2. **Data Augmentation**: Add rotation, scaling, and other transformations
3. **Advanced Models**: Implement ResNet, Vision Transformer, etc.
4. **Different Datasets**: Extend to CIFAR-10, ImageNet, or custom datasets
5. **Ensemble Methods**: Combine multiple models for better performance
6. **Analysis Tools**: Add confusion matrices, t-SNE visualizations, etc.

### 📚 **For New Lab Members:**
- Clone this template for your own projects
- Follow the development workflow (feature branches → dev → main)
- Write tests for new functionality
- Use the CLI for running experiments: `python -m src.cli run configs/local.yaml`
- Check the README.md for detailed setup instructions

**Happy researching!** 🔬