AMUSE

AMUSE: Anytime Muon with Stable Gradient Evaluation

Jueun Kim* · Baekrok Shin* · Jihun Yun · Beomhan Baek · Minhak Song · Chulhee Yun

Method Overview

AMUSE combines Muon with Schedule-Free updates by maintaining three sequences: the fast base sequence $Z_t$, the averaged sequence $X_t$, and the gradient-evaluation point $Y_t$. At each step, AMUSE evaluates the gradient at

$$ Y_t = (1-\beta_t) Z_t + \beta_t X_t, $$

where the interpolation coefficient $\beta_t$ stays at $\beta_1$ during the first $T_0$ warmup steps and then increases toward 1 as

$$ \beta_t = \begin{cases} \beta_1, & t \le T_0, \\ 1 - \left(\frac{T_0 - 1}{t - 1}\right)^\rho (1-\beta_1), & t > T_0. \end{cases} $$

The parameter $\rho$ controls how quickly the gradient-evaluation point shifts from the fast Muon trajectory $Z_t$ toward the stable averaged trajectory $X_t$.
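The piecewise definition of $\beta_t$ can be sketched as a small helper. This is an illustrative transcription of the formula, not the repository's code: the function name `beta_schedule` and the argument name `warmup_steps` (standing in for $T_0$) are hypothetical, and the sketch assumes 1-indexed steps with $T_0 \ge 1$.

```python
def beta_schedule(t: int, warmup_steps: int, beta1: float, rho: float) -> float:
    """Return the interpolation coefficient beta_t for 1-indexed step t.

    warmup_steps plays the role of T_0 (assumed >= 1); the names are hypothetical.
    """
    if t <= warmup_steps:
        # During warmup the coefficient is held constant at beta_1.
        return beta1
    # After warmup, the gap (1 - beta_t) shrinks like ((T_0 - 1) / (t - 1)) ** rho.
    decay = ((warmup_steps - 1) / (t - 1)) ** rho
    return 1.0 - decay * (1.0 - beta1)
```

The schedule is continuous at $t = T_0$, where the second branch also evaluates to $\beta_1$, and a larger `rho` pushes $\beta_t$ toward 1 sooner, so gradients are evaluated closer to $X_t$ earlier in training.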

For matrix-valued hidden parameters, AMUSE applies Muon at $Y_t$:

$$ M_t = \mu M_{t-1} + \nabla L(Y_t), \qquad O_t = \mathrm{NewtonSchulz}(M_t), $$

$$ Z_{t+1} = Z_t - \eta O_t, \qquad X_{t+1} = \left(1-\frac{1}{t+1}\right) X_t + \frac{1}{t+1} Z_{t+1}. $$
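The update equations above can be sketched in NumPy for a single matrix parameter. This is a minimal sketch under stated assumptions, not the implementation in `src/optim/`: the names `newton_schulz` and `amuse_step` are hypothetical, the defaults for `lr` ($\eta$) and `mu` ($\mu$) are illustrative only, and the classical cubic Newton-Schulz iteration is substituted for whatever polynomial and step count the repository uses.

```python
import numpy as np


def newton_schulz(m: np.ndarray, steps: int = 30, eps: float = 1e-7) -> np.ndarray:
    """Approximately orthogonalize m with the classical cubic Newton-Schulz iteration.

    Stand-in for NewtonSchulz(.) in the equations; the iteration used by the
    actual optimizer may differ.
    """
    # The Frobenius norm upper-bounds the spectral norm, so all singular
    # values start in [0, 1], where the cubic iteration converges.
    x = m / (np.linalg.norm(m) + eps)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


def amuse_step(z, x, m, t, grad_fn, beta_t, lr=0.02, mu=0.95):
    """One AMUSE update at 1-indexed step t; lr and mu are illustrative values."""
    y = (1.0 - beta_t) * z + beta_t * x   # gradient-evaluation point Y_t
    m = mu * m + grad_fn(y)               # momentum M_t
    o = newton_schulz(m)                  # orthogonalized update O_t
    z_next = z - lr * o                   # fast sequence Z_{t+1}
    c = 1.0 / (t + 1)
    x_next = (1.0 - c) * x + c * z_next   # averaged sequence X_{t+1}
    return z_next, x_next, m
```

Here `grad_fn` is any callable that returns $\nabla L$ at the given point, and `beta_t` is the interpolation coefficient for the current step. Only $Z_t$ receives the orthogonalized step directly; $X_t$ moves only through averaging.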

Thus, AMUSE preserves Muon's rapid progress in early training while gradually shifting gradient evaluation toward the Schedule-Free average, which reduces valley-wall oscillations and enables schedule-free, anytime training.
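To make the stabilizing role of $X_t$ concrete, the averaging recursion can be unrolled. Assuming the initialization $X_1 = Z_1$ (an assumption; the initialization is not stated in this overview), multiplying the $X_{t+1}$ update by $t+1$ and applying induction gives

$$ (t+1)\, X_{t+1} = t\, X_t + Z_{t+1} \quad\Longrightarrow\quad X_t = \frac{1}{t} \sum_{s=1}^{t} Z_s. $$

Under that assumption, $X_t$ is the uniform average of all fast iterates so far. Each new iterate $Z_{t+1}$ enters with weight $\frac{1}{t+1}$, so a single oscillating step of $Z_t$ perturbs $X_t$ less and less as training proceeds.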

Full paper abstract:

Modern deep learning commonly relies on AdamW with prescribed learning rate schedules, but recent works challenge both components: Schedule-Free optimization removes explicit schedules via iterate averaging, and Muon improves the update geometry by orthogonalizing momentum for matrix parameters. Despite Muon's strong empirical performance, its underlying mechanism remains partially understood. We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace, while high-curvature dominant directions form steep valley walls that induce oscillations. We empirically show that while Muon's orthogonalization accelerates river progress by increasing the bulk component, it also amplifies dominant-direction noise, causing oscillatory trajectories. Building on this, we propose Anytime MUon with Stable gradient Evaluation (AMUSE), which integrates Muon's rapid bulk progress with the stabilizing effect of Schedule-Free averaging. AMUSE uses a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts toward the stable averaged sequence to suppress valley-wall oscillations. As a result, AMUSE requires no learning rate schedules and supports anytime training. Across vision tasks and large language model pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon.

Repository Structure

amuse/
├── src/lm/       # language model pretraining experiments
├── src/image/    # vision/image experiments
├── src/optim/    # AMUSE and optimizer implementations
├── scripts/      # launch scripts
└── assets/       # figures and result plots

Installation

conda create -n amuse python=3.10
conda activate amuse
pip install -r requirements.txt

Quick Start

For language model pretraining, run AMUSE on a 124M Llama-style model with:

bash scripts/lm/124m/amuse.sh

Set YOUR_DATASET_DIR in the script to the root directory used by the FineWeb-100B loader.

For image classification, run AMUSE on CIFAR-10 with:

bash scripts/image/cifar10/amuse.sh

Other image experiments are available through:

bash scripts/image/cifar100/amuse.sh
bash scripts/image/svhn/amuse.sh
bash scripts/image/imagenet/amuse.sh

For ImageNet, set YOUR_DATASET_DIR in the corresponding script. See src/lm/README.md and src/image/README.md for task-specific optimizer and parameter grouping details.

Results

Language Model Pretraining

AMUSE improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon in Llama-style pretraining on FineWeb-100B.

The same trend holds across model scales.

Image Classification

AMUSE also performs strongly across standard image classification benchmarks.

Citation

@article{kim2026amuse,
  title={{AMUSE}: Anytime Muon with Stable Gradient Evaluation},
  author={Kim, Jueun and Shin, Baekrok and Yun, Jihun and Baek, Beomhan and Song, Minhak and Yun, Chulhee},
  journal={arXiv preprint arXiv:2605.22432},
  year={2026}
}
