# Training CIFAR-100 Model (Local Mac)
This notebook trains a model on CIFAR-100 locally on a Mac using the **new modular codebase**.

**Training Command:**
```bash
python train.py --epochs 50 --batch-size 256 --model resnet50 --scheduler onecycle --lr-finder
```
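
The flags in this command could be parsed along the following lines. This is only a sketch: the flag names come from the command above, but the defaults, the `choices` list, and the `build_parser` helper are assumptions made for illustration and may not match the actual `train.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in this notebook."""
    parser = argparse.ArgumentParser(description="Train a CIFAR-100 model")
    # Defaults below are illustrative assumptions, not values read from train.py
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("--batch-size", type=int, default=256)
    parser.add_argument("--model", default="resnet50",
                        choices=["resnet50", "wideresnet28-10"])
    parser.add_argument("--scheduler", default="onecycle")
    # store_true: the flag takes no value; its presence enables the LR finder
    parser.add_argument("--lr-finder", action="store_true")
    return parser


# Parse the same arguments that the training command passes
args = build_parser().parse_args(
    ["--epochs", "50", "--batch-size", "256", "--model", "resnet50",
     "--scheduler", "onecycle", "--lr-finder"]
)
print(args.epochs, args.batch_size, args.model, args.scheduler, args.lr_finder)
```

Note that argparse exposes `--batch-size` and `--lr-finder` as `args.batch_size` and `args.lr_finder`, replacing hyphens with underscores.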

**Modular Structure:**
- Datasets in `datasets/` - Easy to add new datasets
- Models in `models/` - Clean separation of architectures  
- Training components in `training/` - Reusable optimizer, scheduler, LR finder
- Utilities in `utils/` - Checkpointing, metrics, HuggingFace upload

## Check Python Environment

In [2]:
import sys
import platform
import os

print("="*70)
print("ENVIRONMENT INFORMATION")
print("="*70)
print(f"Python Version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"Processor: {platform.processor()}")
print(f"Working Directory: {os.getcwd()}")
print("="*70)

ENVIRONMENT INFORMATION
Python Version: 3.12.3 (main, Oct  7 2025, 19:27:29) [Clang 17.0.0 (clang-1700.0.13.5)]
Platform: macOS-15.5-arm64-arm-64bit
Processor: arm
Working Directory: /Users/pandurang/projects/pandurang/erav4-backpropbay/session8-more-trials


## Check GPU/MPS Availability

In [3]:
try:
    import torch
    
    print("\n" + "="*70)
    print("PYTORCH & GPU DETECTION")
    print("="*70)
    print(f"PyTorch Version: {torch.__version__}")
    
    # Check for CUDA
    if torch.cuda.is_available():
        print(f"✓ CUDA is available")
        print(f"  GPU: {torch.cuda.get_device_name(0)}")
        print(f"  CUDA Version: {torch.version.cuda}")
        device = torch.device('cuda')
    # Check for MPS (Apple Silicon)
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        print(f"✓ Apple MPS (Metal Performance Shaders) is available")
        print(f"  This Mac has Apple Silicon GPU acceleration")
        device = torch.device('mps')
    else:
        print(f"⚠ No GPU detected. Training will use CPU")
        device = torch.device('cpu')
    
    print(f"✓ Using device: {device}")
    print("="*70)
    
except ImportError:
    print("⚠ PyTorch not installed. Will install dependencies in next step.")


PYTORCH & GPU DETECTION
PyTorch Version: 2.9.0
✓ Apple MPS (Metal Performance Shaders) is available
  This Mac has Apple Silicon GPU acceleration
✓ Using device: mps


## Install Dependencies

In [4]:
# Install dependencies from requirements.txt
!pip install -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Import and Verify Libraries

In [5]:
import torch
import torchvision
import torchsummary
import torchinfo
import tqdm
import matplotlib
import numpy
import plotille
import albumentations

print("\n" + "="*70)
print("LIBRARY VERSIONS")
print("="*70)
print(f"PyTorch: {torch.__version__}")
print(f"TorchVision: {torchvision.__version__}")
print(f"NumPy: {numpy.__version__}")
print(f"Matplotlib: {matplotlib.__version__}")
print(f"Albumentations: {albumentations.__version__}")
print(f"TQDM: {tqdm.__version__}")
print("="*70)
print("✓ All dependencies successfully imported")
print("="*70)




LIBRARY VERSIONS
PyTorch: 2.9.0
TorchVision: 0.24.0
NumPy: 2.2.6
Matplotlib: 3.10.7
Albumentations: 2.0.8
TQDM: 4.67.1
✓ All dependencies successfully imported


## Verify Training Files

In [6]:
import os

print("\n" + "="*70)
print("VERIFYING TRAINING FILES AND MODULAR STRUCTURE")
print("="*70)

# Check required files
required_files = [
    'train.py',
    'config.json',
    'requirements.txt'
]

print("\nRequired Files:")
files_ok = True
for file in required_files:
    exists = os.path.exists(file)
    status = "✓" if exists else "✗"
    print(f"{status} {file}")
    if not exists:
        files_ok = False

# Check modular directories
required_dirs = ['datasets', 'models', 'training', 'utils']
print("\nModular Directories:")
dirs_ok = True
for dir_name in required_dirs:  # avoid shadowing the built-in dir()
    exists = os.path.isdir(dir_name)
    status = "✓" if exists else "✗"
    print(f"{status} {dir_name}/")
    if not exists:
        dirs_ok = False

print("="*70)
if files_ok and dirs_ok:
    print("✓ All required files and modular structure verified!")
else:
    print("⚠ Some files or directories are missing. Please check your directory.")
print("="*70)


VERIFYING TRAINING FILES AND MODULAR STRUCTURE

Required Files:
✓ train.py
✓ config.json
✓ requirements.txt

Modular Directories:
✓ datasets/
✓ models/
✓ training/
✓ utils/
✓ All required files and modular structure verified!


## Review Training Configuration

The training will use the following configuration with the **new modular codebase**:

- **Model**: resnet50 (from `models/resnet50.py`)
  - Alternative: `wideresnet28-10` (from `models/wideresnet.py`)
- **Epochs**: 50
- **Batch Size**: 256
- **Scheduler**: OneCycle Learning Rate Policy
- **LR Finder**: Enabled (will automatically find optimal learning rate)

The LR Finder will:
1. Run a learning rate range test before training
2. Automatically determine the best `max_lr` and `base_lr` for OneCycle scheduler
3. Save the LR finder plot to `checkpoint_N/lr_finder_plot.png`
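
The range test in step 1 can be sketched without any framework. The version below assumes an exponential sweep between the two bounds and a "steepest loss drop" rule for the suggestion; both are common choices but are assumptions here, as are the helper names `lr_sweep` and `suggest_lr`. How the codebase derives `max_lr` and `base_lr` from the result is not shown in this notebook, so that step is left out. The loss curve is synthetic and only exercises the logic.

```python
import math


def lr_sweep(min_lr: float, max_lr: float, num_steps: int) -> list[float]:
    """Learning rates spaced evenly on a log scale from min_lr to max_lr."""
    ratio = max_lr / min_lr
    return [min_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


def suggest_lr(lrs: list[float], losses: list[float]) -> float:
    """Return the LR at which the loss falls fastest per unit of log10(lr)."""
    slopes = [
        (losses[i + 1] - losses[i]) / (math.log10(lrs[i + 1]) - math.log10(lrs[i]))
        for i in range(len(lrs) - 1)
    ]
    steepest = min(range(len(slopes)), key=slopes.__getitem__)
    return lrs[steepest]


# 3 epochs x 196 batches, sweeping 1e-6 -> 1.0 (bounds taken from the training log)
lrs = lr_sweep(1e-6, 1.0, 3 * 196)

# Synthetic loss for illustration only: flat, then falling, then diverging
losses = [
    4.6
    - 2.0 / (1 + math.exp(-4 * (math.log10(lr) + 3)))  # drop centered near lr = 1e-3
    + 5.0 * max(0.0, math.log10(lr) + 1) ** 2           # blow-up above lr = 1e-1
    for lr in lrs
]
print(f"steps={len(lrs)}, suggested lr ~ {suggest_lr(lrs, losses):.1e}")
```

On real data the loss is noisy, so implementations typically smooth it (for example with an exponential moving average) before taking slopes.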

**Available Models:**
- `resnet50` - ResNet50 (23.5M parameters)
- `wideresnet28-10` - WideResNet-28-10 (36.5M parameters)
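
Selecting a model by name is often implemented with a small registry. The sketch below shows one way a factory such as `get_model` could work; the registry, the `register_model` decorator, and the stub classes are hypothetical and stand in for whatever `models/` actually contains.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a model name to its constructor
_MODEL_REGISTRY: Dict[str, Callable[..., object]] = {}


def register_model(name: str):
    """Decorator that records a constructor under the given name."""
    def decorator(constructor):
        _MODEL_REGISTRY[name] = constructor
        return constructor
    return decorator


def get_model(name: str, num_classes: int = 100):
    """Look up a registered constructor and build the model."""
    if name not in _MODEL_REGISTRY:
        available = ", ".join(sorted(_MODEL_REGISTRY))
        raise ValueError(f"Unknown model '{name}'. Available: {available}")
    return _MODEL_REGISTRY[name](num_classes=num_classes)


# Stand-in classes; real ones would be torch.nn.Module subclasses
@register_model("resnet50")
class ResNet50Stub:
    def __init__(self, num_classes: int):
        self.num_classes = num_classes


@register_model("wideresnet28-10")
class WideResNetStub:
    def __init__(self, num_classes: int):
        self.num_classes = num_classes


model = get_model("resnet50", num_classes=100)
print(type(model).__name__, model.num_classes)
```

With this pattern, adding an architecture only requires defining the class and decorating it; the lookup code stays unchanged.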

## Start Training

**Note:** This will take a significant amount of time depending on your hardware.

Training progress will be displayed below with:
- Real-time loss and accuracy metrics
- Learning rate schedule visualization
- Checkpoint saving at key epochs
- Early stopping if no improvement

**Expected behavior:**
1. LR Finder will run first (3 epochs of range testing)
2. LR Finder will suggest optimal learning rates
3. Main training will begin with the suggested learning rates
4. Model checkpoints will be saved to `checkpoint_N/` folder

**Using the new modular codebase** - models loaded from `models/` directory!
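
For intuition about step 3, the shape of a one-cycle schedule can be computed in plain Python. This sketch is loosely modeled on the two-phase cosine variant of PyTorch's `torch.optim.lr_scheduler.OneCycleLR` and reuses its default `pct_start`, `div_factor`, and `final_div_factor`; whether the codebase uses that class or those values is an assumption, and `max_lr=0.1` is a placeholder rather than a value suggested by the LR Finder.

```python
import math


def one_cycle_lr(step: int, total_steps: int, max_lr: float,
                 pct_start: float = 0.3, div_factor: float = 25.0,
                 final_div_factor: float = 1e4) -> float:
    """Learning rate at `step` for a two-phase, cosine-annealed one-cycle policy."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1  # last step of the warm-up phase

    def cosine(start: float, end: float, pct: float) -> float:
        # Moves from start (pct=0) to end (pct=1) along half a cosine wave
        return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

    if step <= warmup_end:
        return cosine(initial_lr, max_lr, step / warmup_end)
    pct = (step - warmup_end) / (total_steps - 1 - warmup_end)
    return cosine(max_lr, min_lr, pct)


total_steps = 50 * 196  # epochs x train batches per epoch
schedule = [one_cycle_lr(s, total_steps, max_lr=0.1) for s in range(total_steps)]
print(f"start={schedule[0]:.4g} peak={max(schedule):.4g} end={schedule[-1]:.4g}")
```

The learning rate starts at `max_lr / div_factor`, climbs to `max_lr` over roughly the first 30% of steps, then decays to a value far below the starting rate.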

In [7]:
# Run training with the specified configuration
# Note: Using modular structure - model loaded from models/resnet50.py
!python train.py --epochs 50 --batch-size 256 --model resnet50 --scheduler onecycle --lr-finder

✓ Loaded config from: ./config.json

TRAINING CONFIGURATION
Model: resnet50
Dataset: cifar100
Epochs: 50
Batch Size: 256
Optimizer: sgd
Scheduler: onecycle
Augmentation: strong
MixUp: True (alpha=0.2)
Label Smoothing: 0.1
Mixed Precision: True
Gradient Clipping: 1.0
LR Finder: True


GPU DETECTION AND CONFIGURATION
✓ Apple MPS (Metal Performance Shaders) is available
✓ Using device: mps
✓ PyTorch Version: 2.9.0


📊 Dataset: CIFAR-100
   Classes: 100
   Train samples: 50000
   Test samples: 10000

Loading datasets...
✓ Train batches: 196
✓ Test batches: 40

Creating model: resnet50
✓ Model created

✓ Optimizer: sgd

✓ Scheduler: onecycle

📁 Checkpoint folder: ./checkpoint_1

STARTING LR FINDER

LEARNING RATE FINDER - RANGE TEST
Testing learning rates from 1.00e-06 to 1.00e+00
Running for 3 epochs
Note: Using clean data (no MixUp, no Label Smoothing)

LR Finder Epoch 1/3:   0%|                              | 0/196
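
The batch counts in the log follow directly from the dataset sizes and the batch size. A quick check, assuming the data loaders keep the final partial batch (that is, they do not drop it):

```python
import math

train_samples, test_samples, batch_size = 50_000, 10_000, 256

# The last, partially filled batch still counts, hence the ceiling
train_batches = math.ceil(train_samples / batch_size)
test_batches = math.ceil(test_samples / batch_size)

# The LR Finder runs 3 epochs over the training batches
lr_finder_epochs = 3
lr_finder_steps = lr_finder_epochs * train_batches

print(train_batches, test_batches, lr_finder_steps)  # 196 40 588
```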

## Training Complete - View Results

After training completes, you can view the results below.

In [None]:
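# List checkpoint directories
import glob
import json
import os

# Sort by run number, newest first (a plain string sort would rank checkpoint_9 above checkpoint_10)
checkpoint_dirs = sorted(
    (d for d in glob.glob('checkpoint_*') if d.rsplit('_', 1)[-1].isdigit()),
    key=lambda d: int(d.rsplit('_', 1)[-1]),
    reverse=True,
)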

if checkpoint_dirs:
    latest_checkpoint = checkpoint_dirs[0]
    print(f"\n{'='*70}")
    print(f"LATEST CHECKPOINT: {latest_checkpoint}")
    print(f"{'='*70}\n")
    
    # Load and display metrics
    metrics_file = os.path.join(latest_checkpoint, 'metrics.json')
    if os.path.exists(metrics_file):
        with open(metrics_file, 'r') as f:
            metrics = json.load(f)
        
        print(f"Best Test Accuracy: {metrics['best_test_accuracy']:.2f}%")
        print(f"Best Epoch: {metrics['best_epoch']}")
        print(f"Total Epochs Trained: {len(metrics['epochs'])}")
        print(f"\nFinal Metrics:")
        print(f"  - Train Accuracy: {metrics['train_accuracies'][-1]:.2f}%")
        print(f"  - Test Accuracy: {metrics['test_accuracies'][-1]:.2f}%")
        print(f"  - Train Loss: {metrics['train_losses'][-1]:.4f}")
        print(f"  - Test Loss: {metrics['test_losses'][-1]:.4f}")
    
    # List saved files
    print(f"\nSaved Files in {latest_checkpoint}:")
    for file in sorted(os.listdir(latest_checkpoint)):
        file_path = os.path.join(latest_checkpoint, file)
        if os.path.isfile(file_path):
            file_size = os.path.getsize(file_path) / (1024 * 1024)  # MB
            print(f"  - {file} ({file_size:.2f} MB)")
    
    print(f"\n{'='*70}")
else:
    print("No checkpoint directories found. Training may not have completed successfully.")
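
The per-epoch lists in `metrics.json` can also be summarized directly. The helper below is a hypothetical sketch that relies only on the `train_accuracies` and `test_accuracies` keys read by the cell above; the numbers in `example` are made up to exercise the function and are not results from this run.

```python
def summarize(metrics: dict) -> dict:
    """Derive the best epoch position and the final train/test accuracy gap."""
    test_acc = metrics["test_accuracies"]
    best_index = max(range(len(test_acc)), key=test_acc.__getitem__)
    return {
        "best_index": best_index,  # 0-based position in the lists
        "best_test_accuracy": test_acc[best_index],
        "final_gap": metrics["train_accuracies"][-1] - test_acc[-1],
    }


# Made-up values for illustration only
example = {
    "train_accuracies": [10.0, 30.0, 50.0],
    "test_accuracies": [12.0, 33.0, 31.0],
}
print(summarize(example))
```

A large positive `final_gap` alongside a best epoch well before the last one is a typical sign of overfitting.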

## View Training Curves

In [None]:
from IPython.display import Image, display

if checkpoint_dirs:
    latest_checkpoint = checkpoint_dirs[0]
    
    # Display training curves
    curves_path = os.path.join(latest_checkpoint, 'training_curves.png')
    if os.path.exists(curves_path):
        print("Training Curves:")
        display(Image(filename=curves_path))
    else:
        print("Training curves not found.")
    
    # Display LR Finder plot
    lr_finder_path = os.path.join(latest_checkpoint, 'lr_finder_plot.png')
    if os.path.exists(lr_finder_path):
        print("\nLR Finder Plot:")
        display(Image(filename=lr_finder_path))
    else:
        print("LR Finder plot not found.")

## Load and Test Best Model

You can load the best saved model and use it for inference or further testing.

In [None]:
import torch

if checkpoint_dirs:
    latest_checkpoint = checkpoint_dirs[0]
    best_model_path = os.path.join(latest_checkpoint, 'best_model.pth')
    
    if os.path.exists(best_model_path):
        # PyTorch 2.6+ defaults to weights_only=True; weights_only=False allows full
        # unpickling of the saved metadata, so use it only on checkpoints you trust
        checkpoint = torch.load(best_model_path, map_location='cpu', weights_only=False)
        
        print(f"\n{'='*70}")
        print("BEST MODEL CHECKPOINT INFORMATION")
        print(f"{'='*70}")
        print(f"Epoch: {checkpoint['epoch']}")
        print(f"Train Accuracy: {checkpoint['train_accuracy']:.2f}%")
        print(f"Test Accuracy: {checkpoint['test_accuracy']:.2f}%")
        print(f"Train Loss: {checkpoint['train_loss']:.4f}")
        print(f"Test Loss: {checkpoint['test_loss']:.4f}")
        print(f"Timestamp: {checkpoint['timestamp']}")
        
        print(f"\nModel Configuration:")
        for key, value in checkpoint['config'].items():
            print(f"  - {key}: {value}")
        
        print(f"{'='*70}\n")
        
        # Load model using the new modular structure
        from models import get_model
        
        model_name = checkpoint['config'].get('model', 'resnet50')
        model = get_model(model_name, num_classes=100)
        model.load_state_dict(checkpoint['model_state_dict'])
        model.eval()
        print(f"✓ Model '{model_name}' loaded successfully from modular structure")
        print("✓ Model ready for inference")
    else:
        print("⚠ Best model checkpoint not found.")

## Summary

Once training with the **new modular codebase** completes, the following artifacts are saved:

### 📁 Checkpoint Files:
- **Best Model**: `checkpoint_N/best_model.pth` - The model with the best test accuracy
- **Training Curves**: `checkpoint_N/training_curves.png` - Visualization of training progress
- **LR Finder Plot**: `checkpoint_N/lr_finder_plot.png` - Learning rate range test results
- **Metrics**: `checkpoint_N/metrics.json` - Complete training history
- **Config**: `checkpoint_N/config.json` - Training configuration
- **Model Card**: `checkpoint_N/README.md` - Detailed model documentation

### 🏗️ Modular Structure Benefits:
- **Datasets** (`datasets/`) - Easy to add CIFAR-10, ImageNet, etc.
- **Models** (`models/`) - Clean separation of architectures
- **Training** (`training/`) - Reusable optimizer, scheduler, LR finder
- **Utils** (`utils/`) - Checkpointing, metrics, HuggingFace upload

### 🎯 Model Usage (New Modular Way):
```python
import torch
from models import get_model

# Load checkpoint (PyTorch 2.6+ defaults to weights_only=True; pass False only for trusted files)
checkpoint = torch.load('checkpoint_N/best_model.pth', 
                       map_location='cpu', weights_only=False)

# Get model using modular factory
model = get_model('resnet50', num_classes=100)  # or 'wideresnet28-10'
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
```
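
Once the model is in eval mode, the logits it returns for an image still need to be converted into class probabilities. The framework-free sketch below shows the softmax and top-k step on a plain list of floats; with real tensors `torch.softmax` and `torch.topk` would do the same job. The `top_k` helper and the logits are illustrative assumptions.

```python
import heapq
import math


def top_k(logits: list[float], k: int = 5) -> list[tuple[int, float]]:
    """Return the k most probable (class_index, probability) pairs."""
    # Subtract the max logit before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return heapq.nlargest(k, enumerate(probs), key=lambda pair: pair[1])


# Illustrative logits for a 100-class output; real values would come from model(x)
logits = [0.0] * 100
logits[17], logits[42] = 3.0, 2.0
for index, prob in top_k(logits, k=2):
    print(index, round(prob, 3))
```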

### 🆕 Available Models:
- `resnet50` - ResNet50 (23.5M parameters)
- `wideresnet28-10` - WideResNet-28-10 (36.5M parameters)
- More models can easily be added to the `models/` directory!

All checkpoints are stored in `checkpoint_N/` directories, where N is the run number.

---

**The modular codebase is easy to extend and maintain!**