AMUSE: Anytime Muon with Stable Gradient Evaluation
Jueun Kim* · Baekrok Shin* · Jihun Yun · Beomhan Baek · Minhak Song · Chulhee Yun
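AMUSE combines Muon with Schedule-Free updates by maintaining three sequences: the fast base sequence, the stable averaged sequence, and the gradient evaluation point that interpolates between them.
The interpolation coefficient is time-varying and increases after warmup, so gradients are evaluated near the fast sequence early in training and progressively closer to the averaged sequence later.
For matrix-valued hidden parameters, the fast sequence is updated with Muon, using gradients evaluated at the interpolated point.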
Thus, AMUSE preserves Muon's rapid progress in early training while gradually stabilizing the trajectory through Schedule-Free averaging, which reduces valley-wall oscillations and enables schedule-free, anytime training.
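The three-sequence structure can be sketched on a toy scalar problem. This is a minimal illustration that assumes the standard Schedule-Free recursion with uniform averaging; the names `z`, `x`, `y` follow Schedule-Free notation and may differ from the symbols used in the paper. The function `interpolation_coefficient` is a hypothetical placeholder ramp, not the schedule defined by AMUSE, and `base_update` stands in for whatever base optimizer step (Muon for matrices) acts on the gradient.

```python
def interpolation_coefficient(step, warmup_steps=100, beta_max=0.9):
    """Placeholder ramp, NOT the schedule from the paper.

    Returns 0 through warmup, then rises monotonically toward beta_max.
    """
    if step <= warmup_steps:
        return 0.0
    return beta_max * (1.0 - warmup_steps / step)


def three_sequence_step(z, x, step, grad_fn, base_update, lr):
    """One Schedule-Free-style step; `step` is 1-indexed.

    z: fast base sequence, x: averaged sequence.
    Returns the updated (z, x) and the point y where the gradient was taken.
    """
    beta = interpolation_coefficient(step)
    y = (1.0 - beta) * z + beta * x      # gradient evaluation point
    g = grad_fn(y)
    z_next = z - lr * base_update(g)     # base step; Muon would act here on matrices
    c = 1.0 / (step + 1)                 # uniform averaging weight
    x_next = (1.0 - c) * x + c * z_next  # running average of the fast sequence
    return z_next, x_next, y


# Toy steep direction: f(w) = 5 * w**2, so grad f(w) = 10 * w.
# With lr = 0.19 the fast sequence flips sign every step during warmup.
z = x = 1.0
for t in range(1, 51):
    z, x, _ = three_sequence_step(z, x, t, lambda w: 10.0 * w, lambda g: g, lr=0.19)
print(f"fast z = {z:+.4f}, averaged x = {x:+.4f}")
```

With the coefficient at zero the gradient is taken at the fast sequence itself; as the coefficient grows, the evaluation point moves toward the averaged sequence, which is the behavior the description above attributes to AMUSE.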
Full paper abstract:
Modern deep learning commonly relies on AdamW with prescribed learning rate schedules, but recent works challenge both components: Schedule-Free optimization removes explicit schedules via iterate averaging, and Muon improves the update geometry by orthogonalizing momentum for matrix parameters. Despite Muon's strong empirical performance, its underlying mechanism remains partially understood. We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace, while high-curvature dominant directions form steep valley walls that induce oscillations. We empirically show that while Muon's orthogonalization accelerates river progress by increasing the bulk component, it also amplifies dominant-direction noise, causing oscillatory trajectories. Building on this, we propose Anytime MUon with Stable gradient Evaluation (AMUSE), which integrates Muon's rapid bulk progress with the stabilizing effect of Schedule-Free averaging. AMUSE uses a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts toward the stable averaged sequence to suppress valley-wall oscillations. As a result, AMUSE requires no learning rate schedules and supports anytime training. Across vision tasks and large language model pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon.
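To make "orthogonalizing momentum" concrete, the sketch below replaces a momentum matrix with its nearest semi-orthogonal matrix, that is, it sets every singular value to 1 while keeping the singular vectors. It assumes NumPy and uses the classical cubic Newton-Schulz iteration for readability; Muon implementations typically use a Newton-Schulz iteration with tuned coefficients, and nothing here is claimed to match the code in `src/optim/`. The function name `newton_schulz_orthogonalize` and the example matrix are hypothetical.

```python
import numpy as np


def newton_schulz_orthogonalize(m, steps=20):
    """Approximate the orthogonal polar factor U @ Vt of m = U @ diag(s) @ Vt.

    Uses the classical cubic Newton-Schulz iteration, chosen for clarity;
    it is not claimed to match the iteration used in the AMUSE code.
    """
    x = m / (np.linalg.norm(m) + 1e-12)  # Frobenius scaling puts singular values in [0, 1]
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)  # maps each singular value s to 1.5*s - 0.5*s**3
    return x


momentum = np.array([[2.0, 1.0, 0.0],
                     [0.0, 1.0, 3.0]])
q = newton_schulz_orthogonalize(momentum)

# Reference answer from an explicit SVD.
u, _, vt = np.linalg.svd(momentum, full_matrices=False)
print(np.allclose(q, u @ vt, atol=1e-8))
```

Because every singular value is pushed to 1, small components of the momentum are scaled up relative to large ones. This is one way to read the abstract's claim that orthogonalization increases the bulk component while also amplifying noise in the dominant directions.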
amuse/
├── src/lm/ # language model pretraining experiments
├── src/image/ # vision/image experiments
├── src/optim/ # AMUSE and optimizer implementations
├── scripts/ # launch scripts
└── assets/ # figures and result plots
conda create -n amuse python=3.10
conda activate amuse
pip install -r requirements.txt
For language model pretraining, run AMUSE on a 124M Llama-style model with:
bash scripts/lm/124m/amuse.sh
Set YOUR_DATASET_DIR in the script to the root directory used by the FineWeb-100B loader.
For image classification, run AMUSE on CIFAR-10 with:
bash scripts/image/cifar10/amuse.sh
Other image experiments are available through:
bash scripts/image/cifar100/amuse.sh
bash scripts/image/svhn/amuse.sh
bash scripts/image/imagenet/amuse.sh
For ImageNet, set YOUR_DATASET_DIR in the corresponding script.
See src/lm/README.md and src/image/README.md for task-specific optimizer and parameter grouping details.
AMUSE improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon in Llama-style pretraining on FineWeb-100B.
The same trend holds across model scales.
AMUSE also performs strongly across standard image classification benchmarks.
@article{kim2026amuse,
title={{AMUSE}: Anytime Muon with Stable Gradient Evaluation},
author={Kim, Jueun and Shin, Baekrok and Yun, Jihun and Baek, Beomhan and Song, Minhak and Yun, Chulhee},
journal={arXiv preprint arXiv:2605.22432},
year={2026}
}

