# CIFAR-10 Optimizer Benchmark — Run Layout & Reproducibility

> **Note:** We created a results folder because training was launched **outside this notebook**.  
> There isn’t a single, continuous terminal log here. Each run wrote its own logs and summaries
> into a dedicated subfolder. This notebook will **read/plot the saved artifacts** rather than
> re-train everything inline.

## What the training script does
- Loads **CIFAR-10** with strong augmentation (RandomCrop/Flip, TrivialAugmentWide, RandomErasing) and normalization.
- Builds the model:
  - `--model resnet` → ResNet-18–style CNN.
  - `--model vit` → small, CIFAR-friendly Vision Transformer (pre-norm, patch size 4).
- Picks the optimizer:
  - `--opt adam`  → AdamW baseline.
  - `--opt muon`  → Newton–Schulz orthonormal updates on 2D weights; AdamW on IO/scalars.
  - `--opt scion` → spectral step for hidden matrices; ℓ∞ step for IO/scalars.
  - `--opt dion`  → low-rank orthonormal update (with error-feedback) + AdamW on IO/scalars.
- Trains for `--epochs 80` with **AMP** (`--amp`), logs **loss/accuracy/time** per epoch, and saves:
  - per-run CSVs (metrics by epoch),
  - a summary JSON,
  - optional plots (if enabled in the script).
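To make the "Newton–Schulz orthonormal updates" in the optimizer list more concrete, here is a minimal sketch of the classic cubic Newton–Schulz iteration in plain Python. It is an illustration under stated assumptions, not the code in `training.py`: the helper names, the step count, and the cubic coefficients (1.5 and -0.5) are choices made for this example, and practical Muon implementations typically run a tuned higher-order polynomial on GPU tensors.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g (illustrative only).

    Uses the classic cubic iteration X <- 1.5*X - 0.5*(X X^T) X.
    Assumes g has at most as many rows as columns and full row rank.
    """
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rp)] for rx, rp in zip(x, xxt_x)]
    return x


random.seed(0)
# Stand-in for the gradient of a 4x6 hidden weight matrix.
grad = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]
update = newton_schulz_orthogonalize(grad)
gram = matmul(update, transpose(update))  # should be close to the 4x4 identity
```

Each step pushes every singular value of `x` toward 1 while leaving the singular vectors unchanged, so the result keeps the "direction" of the gradient but equalizes its scale across directions. The extra matrix products per step are the added per-step cost mentioned for Muon.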

### Key flags
- `--model {resnet|vit}`: choose architecture  
- `--opt {adam|muon|scion|dion}`: choose optimizer  
- `--epochs 80`: training length  
- `--amp`: enable mixed precision  
- `--out_dir ./results/<name>`: where logs/CSVs/summaries are saved
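
As a rough illustration of how these flags could be parsed, the sketch below uses `argparse`. It is hypothetical: the actual script defines an `Args` dataclass, and the defaults shown here (other than the 80 epochs used in this benchmark) are assumptions rather than values taken from `training.py`.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the flags listed above."""
    p = argparse.ArgumentParser(description="CIFAR-10 optimizer benchmark (sketch)")
    p.add_argument("--model", choices=["resnet", "vit"], default="resnet")
    p.add_argument("--opt", choices=["adam", "muon", "scion", "dion"], default="adam")
    p.add_argument("--epochs", type=int, default=80)
    p.add_argument("--amp", action="store_true", help="enable mixed precision")
    p.add_argument("--out_dir", default="./results/run")  # assumed default
    return p


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(
    ["--model", "vit", "--opt", "muon", "--amp", "--out_dir", "./results/vit_muon"]
)
print(args.model, args.opt, args.epochs, args.amp, args.out_dir)
```

Restricting `--model` and `--opt` with `choices` makes a mistyped optimizer name fail at startup instead of partway through a long run.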

## Exact commands used
These are the exact commands used to train the models; the full set took about 2 hours on an RTX 5090. To change other defaults, either edit the `Args` dataclass in `training.py` or pass additional command-line arguments that follow the naming used in `training.py`.
```bash
# ResNet18 + Optimizers
python training.py --model resnet --opt adam  --epochs 80 --amp --out_dir ./results/resnet_adam
python training.py --model resnet --opt dion  --epochs 80 --amp --out_dir ./results/resnet_dion
python training.py --model resnet --opt muon  --epochs 80 --amp --out_dir ./results/resnet_muon
python training.py --model resnet --opt scion --epochs 80 --amp --out_dir ./results/resnet_scion

# MiniViT (ViT-small for CIFAR) + Optimizers
python training.py --model vit --opt adam  --epochs 80 --amp --out_dir ./results/vit_adam
python training.py --model vit --opt dion  --epochs 80 --amp --out_dir ./results/vit_dion
python training.py --model vit --opt muon  --epochs 80 --amp --out_dir ./results/vit_muon
python training.py --model vit --opt scion --epochs 80 --amp --out_dir ./results/vit_scion
```


The script below collects the outcomes of **all training runs** stored under `./results/`
(e.g., `resnet_adam/`, `vit_dion/`, …) and builds a **single, unified view** of the experiment.

What it does
1. **Scan** each run folder in `./results/`.
2. **Load metrics** from per-run CSV/JSON (epoch-wise loss, accuracy, epoch time, best/final accuracy).
3. **Assemble a summary table** (`summary_results.csv`) with one row per (model, optimizer) run.
4. **Generate comparison plots**:
   - `train_loss_vs_epoch.png`
   - `val_acc_vs_epoch.png`
   - `val_acc_vs_time.png` (accuracy vs. cumulative training time)
   - `best_val_acc_bar.png` (best accuracy across runs)
   - `time_to_60_bar.png`
   - `time_to_70_bar.png`
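
The scan-and-summarize steps above can be sketched roughly as follows. This is not the code in `visualizations.py`: the function name `summarize_runs`, the per-run file name `metrics.csv`, and the `val_acc` column are assumptions made for illustration. Only the `<model>_<optimizer>` folder naming comes from the commands listed earlier.

```python
import csv
import tempfile
from pathlib import Path


def summarize_runs(results_dir, metrics_name="metrics.csv"):
    """Return one summary dict per run folder named <model>_<optimizer>.

    Assumptions (not confirmed by the source): each run folder holds a
    per-epoch CSV called `metrics_name` with a `val_acc` column.
    """
    summary = []
    for run_dir in sorted(Path(results_dir).iterdir()):
        csv_path = run_dir / metrics_name
        if not csv_path.is_file():
            continue  # skip entries without per-epoch metrics
        model, _, optimizer = run_dir.name.partition("_")
        with csv_path.open(newline="") as fh:
            accs = [float(row["val_acc"]) for row in csv.DictReader(fh)]
        if not accs:
            continue  # skip empty metric files
        summary.append({
            "model": model,
            "optimizer": optimizer,
            "best_val_acc": max(accs),
            "final_val_acc": accs[-1],
        })
    return summary


# Tiny synthetic example so the sketch runs without real training output.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "resnet_adam"
    run.mkdir()
    (run / "metrics.csv").write_text("epoch,val_acc\n1,55.0\n2,71.5\n3,70.0\n")
    rows = summarize_runs(tmp)
```

Keeping best and final accuracy as separate fields matters here because the two can differ, as in the synthetic run above where accuracy dips in the last epoch.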

```python
from visualizations import main

# Save plots as PNG and data as CSV
main()
```

```text
✅ Done.
 - Summary: ./report\summary_results.csv
 - Per-epoch: ./report\epoch_metrics_long.csv
 - Graphs in: ./report
```


## Conclusions
Across all figures, the ResNet experiments show that Adam, Muon, and Scion converge to a narrow band of high final accuracy, while Dion trails by several points. 
In the “best validation accuracy” bar plot, the three leading methods for ResNet cluster in the low-90% range, with Scion and Muon marginally ahead and Adam essentially tied within visual uncertainty; Dion is clearly lower. 

The milestone-time plots (“time to 60%/70%”) indicate that Scion and Adam reach useful accuracy fastest on ResNet, with Muon requiring more wall-clock time and Dion the slowest.
These patterns are consistent with the Scion design (spectral control on hidden layers and $\ell_\infty$ scaling on I/O layers), which the paper argues should stabilize early updates and enable aggressive learning rates. They are also consistent with Muon’s Newton–Schulz orthonormalization, which improves conditioning but adds per-step computation, often yielding slower early-phase progress yet competitive or superior late-epoch accuracy on models with substantial 2D linear structure. 
Dion’s underperformance on ResNet is expected in this configuration: the method is primarily targeted at distributed training and does not natively support convolutional kernels.
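
For reference, the "time to 60%/70%" milestone discussed in this paragraph can be computed along the following lines. This is a sketch under assumptions: the function name is hypothetical, the numbers are synthetic, and it assumes the milestone is the cumulative per-epoch time at the first epoch whose validation accuracy meets the threshold, which may differ in detail from what `visualizations.py` does.

```python
from itertools import accumulate


def time_to_threshold(val_acc, epoch_time, threshold):
    """Cumulative seconds until val_acc first reaches threshold, else None."""
    for acc, elapsed in zip(val_acc, accumulate(epoch_time)):
        if acc >= threshold:
            return elapsed
    return None  # the run never reached the threshold


# Synthetic numbers for illustration only; not measurements from these runs.
val_acc = [42.0, 58.5, 63.1, 69.4, 72.8]
epoch_time = [10.0, 10.0, 10.0, 10.0, 10.0]

t60 = time_to_threshold(val_acc, epoch_time, 60.0)  # reached at epoch 3
t70 = time_to_threshold(val_acc, epoch_time, 70.0)  # reached at epoch 5
t90 = time_to_threshold(val_acc, epoch_time, 90.0)  # never reached
```

Because the metric uses wall-clock time rather than epoch count, an optimizer with costlier steps can rank lower here even when its per-epoch accuracy curve looks similar.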

For the Vision Transformer, the ordering differs: the “best validation accuracy” plot places Adam first (low-80% range) with Scion a close second, Muon lower, and Dion substantially lower. 
The milestone-time figures show that Scion tends to reach 60% and 70% sooner than Adam on the ViT, but the final accuracy favors Adam under the present recipe, a result aligned with common practice on small/medium ViTs where Adam(W) with cosine scheduling and moderate regularization remains a strong baseline.

Taken together, the findings align with the central claims of the referenced works once scale and architectural fit are considered.
- Scion demonstrates the expected early-epoch acceleration and retains competitive final accuracy, particularly on ResNet. 
- Muon exhibits the anticipated trade-off—more expensive updates but good conditioning—and would be expected to realize larger gains on longer runs or larger, linear-heavy models after modest sweeps of learning rate and Newton–Schulz iterations.
- Dion is optimized for synchronous, communication-efficient training at scale, which isn’t fully exercised on one GPU. Its lack of native convolution support also makes it a poor fit for CIFAR-10/ResNet, which is consistent with it trailing the other optimizers in these runs.
- Adam remains a strong general-purpose baseline, especially for small ViTs under standard training protocols.