# Training CIFAR-100 Model (Local Mac)
This notebook trains a model on CIFAR-100 locally on a Mac, using the training command below.

**Training Command:**
```bash
python train.py --epochs 50 --batch-size 256 --model resnet50 --scheduler onecycle --lr-finder
```

## Check Python Environment

In [7]:
import sys
import platform
import os

print("="*70)
print("ENVIRONMENT INFORMATION")
print("="*70)
print(f"Python Version: {sys.version}")
print(f"Platform: {platform.platform()}")
print(f"Processor: {platform.processor()}")
print(f"Working Directory: {os.getcwd()}")
print("="*70)

ENVIRONMENT INFORMATION
Python Version: 3.12.3 (main, Oct  7 2025, 19:27:29) [Clang 17.0.0 (clang-1700.0.13.5)]
Platform: macOS-15.5-arm64-arm-64bit
Processor: arm
Working Directory: /Users/pandurang/projects/pandurang/erav4-backpropbay/session8-more-trials


## Check GPU/MPS Availability

In [8]:
try:
    import torch
    
    print("\n" + "="*70)
    print("PYTORCH & GPU DETECTION")
    print("="*70)
    print(f"PyTorch Version: {torch.__version__}")
    
    # Check for CUDA
    if torch.cuda.is_available():
        print("✓ CUDA is available")
        print(f"  GPU: {torch.cuda.get_device_name(0)}")
        print(f"  CUDA Version: {torch.version.cuda}")
        device = torch.device('cuda')
    # Check for MPS (Apple Silicon)
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        print("✓ Apple MPS (Metal Performance Shaders) is available")
        print("  This Mac has Apple Silicon GPU acceleration")
        device = torch.device('mps')
    else:
        print("⚠ No GPU detected. Training will use CPU")
        device = torch.device('cpu')
    
    print(f"✓ Using device: {device}")
    print("="*70)
    
except ImportError:
    print("⚠ PyTorch not installed. Will install dependencies in next step.")


PYTORCH & GPU DETECTION
PyTorch Version: 2.9.0
✓ Apple MPS (Metal Performance Shaders) is available
  This Mac has Apple Silicon GPU acceleration
✓ Using device: mps


## Install Dependencies

In [9]:
# Install dependencies from requirements.txt
!pip install -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Import and Verify Libraries

In [10]:
import torch
import torchvision
import torchsummary
import torchinfo
import tqdm
import matplotlib
import numpy
import plotille
import albumentations

print("\n" + "="*70)
print("LIBRARY VERSIONS")
print("="*70)
print(f"PyTorch: {torch.__version__}")
print(f"TorchVision: {torchvision.__version__}")
print(f"NumPy: {numpy.__version__}")
print(f"Matplotlib: {matplotlib.__version__}")
print(f"Albumentations: {albumentations.__version__}")
print(f"TQDM: {tqdm.__version__}")
print("="*70)
print("✓ All dependencies successfully imported")
print("="*70)


LIBRARY VERSIONS
PyTorch: 2.9.0
TorchVision: 0.24.0
NumPy: 2.2.6
Matplotlib: 3.10.7
Albumentations: 2.0.8
TQDM: 4.67.1
✓ All dependencies successfully imported


## Verify Training Files

In [11]:
import os

print("\n" + "="*70)
print("VERIFYING TRAINING FILES")
print("="*70)

required_files = [
    'train.py',
    'model.py',
    'config.json',
    'requirements.txt'
]

all_exist = True
for file in required_files:
    exists = os.path.exists(file)
    status = "✓" if exists else "✗"
    print(f"{status} {file}")
    if not exists:
        all_exist = False

print("="*70)
if all_exist:
    print("✓ All required files found")
else:
    print("⚠ Some files are missing. Please check your directory.")
print("="*70)


VERIFYING TRAINING FILES
✓ train.py
✓ model.py
✓ config.json
✓ requirements.txt
✓ All required files found


## Review Training Configuration

The training will use the following configuration:

- **Model**: resnet50
- **Epochs**: 50
- **Batch Size**: 256
- **Scheduler**: OneCycle Learning Rate Policy
- **LR Finder**: Enabled (will automatically find optimal learning rate)
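
The shape of a one-cycle schedule can be sketched in plain Python. This is a minimal sketch, not the code in `train.py`: it assumes the two-phase cosine variant with the OneCycleLR parameters reported in the training log (`max_lr=0.1`, `pct_start=0.3`, `div_factor=25`, `final_div_factor=10000`), the standard 50,000-image CIFAR-100 training split, and that the final partial batch is kept. The function names `cos_anneal` and `one_cycle_lr` are hypothetical.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

def one_cycle_lr(step, total_steps, max_lr=0.1, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at a given step of a two-phase cosine one-cycle schedule."""
    initial_lr = max_lr / div_factor          # LR at step 0
    min_lr = initial_lr / final_div_factor    # LR at the final step
    warmup_end = pct_start * total_steps - 1  # last step of the warm-up phase
    last_step = total_steps - 1
    if step <= warmup_end:
        # Phase 1: ramp up from initial_lr to max_lr
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    # Phase 2: anneal from max_lr down to min_lr
    return cos_anneal(max_lr, min_lr, (step - warmup_end) / (last_step - warmup_end))

# Assumption: 50,000 training images, batch size 256, last partial batch kept
steps_per_epoch = math.ceil(50_000 / 256)  # 196
total_steps = 50 * steps_per_epoch         # 9800

for step in (0, total_steps // 4, total_steps // 2, total_steps - 1):
    print(f"step {step:5d}: lr = {one_cycle_lr(step, total_steps):.7f}")
```

Under these assumptions the schedule starts at `max_lr / div_factor`, peaks at `max_lr` about 30% of the way through training, and then decays towards `max_lr / (div_factor * final_div_factor)`.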

The LR Finder will:
1. Run a learning rate range test before training
2. Automatically determine the best `max_lr` and `base_lr` for the OneCycle scheduler
3. Save the LR finder plot to `checkpoint_N/lr_finder_plot.png`
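
The idea behind a learning rate range test can be illustrated with a small, self-contained sketch. It is not the LR Finder in `train.py`, whose selection rule is not shown in this notebook. The sketch assumes one common heuristic: sweep the learning rate exponentially, smooth the recorded losses, and pick the rate at which the smoothed loss falls fastest. All function names and the loss values are hypothetical.

```python
import math

def lr_sweep(start_lr, end_lr, num_steps):
    """Exponentially spaced learning rates from start_lr to end_lr."""
    ratio = (end_lr / start_lr) ** (1 / (num_steps - 1))
    return [start_lr * ratio ** i for i in range(num_steps)]

def smooth(values, beta=0.9):
    """Bias-corrected exponential moving average of a loss sequence."""
    avg, out = 0.0, []
    for i, value in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * value
        out.append(avg / (1 - beta ** i))
    return out

def suggest_lr(lrs, losses):
    """Return the LR where the smoothed loss drops fastest per decade of LR.

    Returns None if the smoothed loss never decreases.
    """
    smoothed = smooth(losses)
    best_index, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        slope = (smoothed[i] - smoothed[i - 1]) / (
            math.log10(lrs[i]) - math.log10(lrs[i - 1]))
        if slope < best_slope:
            best_index, best_slope = i, slope
    return lrs[best_index] if best_index is not None else None

# Made-up losses for illustration: flat, then falling, then diverging
lrs = lr_sweep(1e-5, 1.0, 6)
losses = [4.6, 4.6, 4.0, 3.0, 3.5, 9.0]
print(f"Suggested LR: {suggest_lr(lrs, losses)}")
```

In a real range test each loss would come from one training step (or a few) at the corresponding learning rate, and the sweep would use many more points than the six shown here.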

## Start Training

**Note:** Training for 50 epochs can take a long time; the exact duration depends on your hardware.

Training progress will be displayed below with:
- Real-time loss and accuracy metrics
- Learning rate schedule visualization
- Checkpoint saving at key epochs
- Early stopping if no improvement
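
Early stopping is usually implemented as a patience counter. The sketch below shows that pattern under stated assumptions: it monitors test accuracy and stops after `patience` consecutive epochs without improvement. The class name, the `patience` and `min_delta` parameters, and the accuracy values are hypothetical; the criteria `train.py` actually uses are not shown in this notebook.

```python
class EarlyStopping:
    """Stop training once accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta   # minimum gain that counts as improvement
        self.best = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0          # consecutive epochs without improvement

    def step(self, epoch, accuracy):
        """Record one epoch's accuracy; return True when training should stop."""
        if accuracy > self.best + self.min_delta:
            self.best, self.best_epoch, self.bad_epochs = accuracy, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up accuracies for illustration
stopper = EarlyStopping(patience=3)
for epoch, acc in enumerate([10.0, 20.0, 25.0, 24.9, 25.0, 24.0], start=1):
    if stopper.step(epoch, acc):
        print(f"Stopping at epoch {epoch}; best was epoch {stopper.best_epoch}")
        break
```

A tie with the best value counts as no improvement here; raising `min_delta` makes the criterion stricter.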

**Expected behavior:**
1. LR Finder will run first (3 epochs of range testing)
2. LR Finder will suggest optimal learning rates
3. Main training will begin with the suggested learning rates
4. Model checkpoints will be saved to `checkpoint_N/` folder

In [12]:
# Run training with the specified configuration
!python train.py --epochs 50 --batch-size 256 --model resnet50 --scheduler onecycle --lr-finder

✓ Loaded LR Finder config from: ./config.json

GPU DETECTION AND CONFIGURATION
✓ Apple MPS (Metal Performance Shaders) is available
✓ Using device: mps
✓ PyTorch Version: 2.9.0

✓ Loaded OneCycleLR config from: ./config.json
  Parameters: {'max_lr': 0.1, 'pct_start': 0.3, 'anneal_strategy': 'cos', 'div_factor': 25.0, 'final_div_factor': 10000.0, 'three_phase': False}
📁 Creating checkpoint folder: ./checkpoint_10
Training resnet50 for 50 epochs
Configuration:
  - Model: WideResNet-28-10 (36.5M parameters)
  - Dataset: CIFAR-100
  - Batch Size: 256
  - Epochs: 50
  - Scheduler: ONECYCLE
    • OneCycleLR config: defaults
  - MixUp: True (alpha=0.2)
  - Label Smoothing: 0.1
  - Mixed Precision: True
  - Gradient Clipping: 1.0
  - Checkpoint Epochs: [10, 20, 25, 30, 40, 50, 60, 75, 90]
  - HuggingFace Upload: DISABLED (no token or repo provided)
  ℹ Models will be saved locally to ./checkpoints/

Model Architecture S

## Training Complete - View Results

After training completes, you can view the results below.

In [None]:
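# List checkpoint directories, newest run first
import glob
import json
import os
import re

def run_number(path):
    """Extract N from a 'checkpoint_N' name (-1 if there is no numeric suffix)."""
    match = re.fullmatch(r'checkpoint_(\d+)', os.path.basename(path))
    return int(match.group(1)) if match else -1

# Sort numerically: a plain string sort would rank checkpoint_9 above checkpoint_10
checkpoint_dirs = sorted(glob.glob('checkpoint_*'), key=run_number, reverse=True)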

if checkpoint_dirs:
    latest_checkpoint = checkpoint_dirs[0]
    print(f"\n{'='*70}")
    print(f"LATEST CHECKPOINT: {latest_checkpoint}")
    print(f"{'='*70}\n")
    
    # Load and display metrics
    metrics_file = os.path.join(latest_checkpoint, 'metrics.json')
    if os.path.exists(metrics_file):
        with open(metrics_file, 'r') as f:
            metrics = json.load(f)
        
        print(f"Best Test Accuracy: {metrics['best_test_accuracy']:.2f}%")
        print(f"Best Epoch: {metrics['best_epoch']}")
        print(f"Total Epochs Trained: {len(metrics['epochs'])}")
        print(f"\nFinal Metrics:")
        print(f"  - Train Accuracy: {metrics['train_accuracies'][-1]:.2f}%")
        print(f"  - Test Accuracy: {metrics['test_accuracies'][-1]:.2f}%")
        print(f"  - Train Loss: {metrics['train_losses'][-1]:.4f}")
        print(f"  - Test Loss: {metrics['test_losses'][-1]:.4f}")
    
    # List saved files
    print(f"\nSaved Files in {latest_checkpoint}:")
    for file in sorted(os.listdir(latest_checkpoint)):
        file_path = os.path.join(latest_checkpoint, file)
        if os.path.isfile(file_path):
            file_size = os.path.getsize(file_path) / (1024 * 1024)  # MB
            print(f"  - {file} ({file_size:.2f} MB)")
    
    print(f"\n{'='*70}")
else:
    print("No checkpoint directories found. Training may not have completed successfully.")

## View Training Curves

In [None]:
from IPython.display import Image, display

if checkpoint_dirs:
    latest_checkpoint = checkpoint_dirs[0]
    
    # Display training curves
    curves_path = os.path.join(latest_checkpoint, 'training_curves.png')
    if os.path.exists(curves_path):
        print("Training Curves:")
        display(Image(filename=curves_path))
    else:
        print("Training curves not found.")
    
    # Display LR Finder plot
    lr_finder_path = os.path.join(latest_checkpoint, 'lr_finder_plot.png')
    if os.path.exists(lr_finder_path):
        print("\nLR Finder Plot:")
        display(Image(filename=lr_finder_path))
    else:
        print("LR Finder plot not found.")

## Load and Test Best Model

You can load the best saved model and use it for inference or further testing.

In [None]:
import torch
import importlib

if checkpoint_dirs:
    latest_checkpoint = checkpoint_dirs[0]
    best_model_path = os.path.join(latest_checkpoint, 'best_model.pth')
    
    if os.path.exists(best_model_path):
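        # Load the checkpoint. PyTorch 2.6+ defaults to weights_only=True; if loading
        # fails on non-tensor entries, pass weights_only=False (trusted files only).
        checkpoint = torch.load(best_model_path, map_location='cpu')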
        
        print(f"\n{'='*70}")
        print("BEST MODEL CHECKPOINT INFORMATION")
        print(f"{'='*70}")
        print(f"Epoch: {checkpoint['epoch']}")
        print(f"Train Accuracy: {checkpoint['train_accuracy']:.2f}%")
        print(f"Test Accuracy: {checkpoint['test_accuracy']:.2f}%")
        print(f"Train Loss: {checkpoint['train_loss']:.4f}")
        print(f"Test Loss: {checkpoint['test_loss']:.4f}")
        print(f"Timestamp: {checkpoint['timestamp']}")
        
        print(f"\nModel Configuration:")
        for key, value in checkpoint['config'].items():
            print(f"  - {key}: {value}")
        
        print(f"{'='*70}\n")
        
        # Optional: Load model for inference
        # Uncomment the following lines if you want to load the model
        # model_module = importlib.import_module('resnet50')
        # model = model_module.Net()
        # model.load_state_dict(checkpoint['model_state_dict'])
        # model.eval()
        # print("✓ Model loaded successfully and ready for inference")
    else:
        print("Best model checkpoint not found.")

## Summary

After a successful run, the following artifacts are saved:

- **Best Model**: `checkpoint_N/best_model.pth` - The model with the best test accuracy
- **Training Curves**: `checkpoint_N/training_curves.png` - Visualization of training progress
- **LR Finder Plot**: `checkpoint_N/lr_finder_plot.png` - Learning rate range test results
- **Metrics**: `checkpoint_N/metrics.json` - Complete training history
- **Config**: `checkpoint_N/config.json` - Training configuration
- **Model Card**: `checkpoint_N/README.md` - Detailed model documentation

Each run writes to its own `checkpoint_N/` directory, where N is the run number.