kjeiun/amuse

AMUSE

AMUSE: Anytime Muon with Stable Gradient Evaluation

Jueun Kim* · Baekrok Shin* · Jihun Yun · Beomhan Baek · Minhak Song · Chulhee Yun

Method Overview

AMUSE combines Muon with Schedule-Free updates by maintaining three sequences: the fast base sequence $Z_t$, the averaged sequence $X_t$, and the gradient-evaluation point $Y_t$. At each step, AMUSE evaluates the gradient at

$$ Y_t = (1-\beta_t) Z_t + \beta_t X_t, $$

where the interpolation coefficient $\beta_t$ stays at $\beta_1$ for the first $T_0$ (warmup) steps and increases afterwards as

$$ \beta_t = \begin{cases} \beta_1, & t \le T_0, \\ 1 - \left(\frac{T_0 - 1}{t - 1}\right)^\rho (1-\beta_1), & t > T_0. \end{cases} $$

The parameter $\rho$ controls how quickly the gradient-evaluation point shifts from the fast Muon trajectory $Z_t$ toward the stable averaged trajectory $X_t$.
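As an illustration, the schedule and the interpolation above can be written in a few lines of Python. This is a minimal sketch and not the code in `src/optim/`: the function names `beta_schedule` and `gradient_eval_point`, the argument name `warmup_steps` (standing in for $T_0$), and the 1-indexed step convention are assumptions made for this example.

```python
def beta_schedule(t: int, beta1: float, warmup_steps: int, rho: float) -> float:
    """Return beta_t for a 1-indexed step t (hypothetical helper, not the repo API)."""
    if t < 1 or warmup_steps < 1:
        raise ValueError("t and warmup_steps must be >= 1")
    if t <= warmup_steps:
        # During warmup the coefficient is held at beta_1.
        return beta1
    # After warmup: beta_t = 1 - ((T_0 - 1) / (t - 1))^rho * (1 - beta_1).
    ratio = (warmup_steps - 1) / (t - 1)
    return 1.0 - (ratio ** rho) * (1.0 - beta1)


def gradient_eval_point(z, x, beta_t):
    """Y_t = (1 - beta_t) * Z_t + beta_t * X_t; works for floats or arrays."""
    return (1.0 - beta_t) * z + beta_t * x
```

Substituting $t = T_0$ into the second branch gives $1 - (1 - \beta_1) = \beta_1$, so the two branches agree at the end of warmup, and $\beta_t \to 1$ as $t \to \infty$ for any $\rho > 0$. A larger $\rho$ makes the ratio term decay faster, so $Y_t$ moves toward $X_t$ sooner.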

For matrix-valued hidden parameters, AMUSE applies Muon at $Y_t$:

$$ M_t = \mu M_{t-1} + \nabla L(Y_t), \qquad O_t = \mathrm{NewtonSchulz}(M_t), $$

$$ Z_{t+1} = Z_t - \eta O_t, \qquad X_{t+1} = \left(1-\frac{1}{t+1}\right) X_t + \frac{1}{t+1} Z_{t+1}. $$
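The update above can be sketched for a single matrix parameter with NumPy. This is a hedged illustration only: `newton_schulz` here uses the classic cubic iteration $X \leftarrow 1.5X - 0.5XX^\top X$ after Frobenius normalization, which is an assumption for this example (the repository's Newton-Schulz routine may use different coefficients and step counts), and `amuse_step`, `grad_fn`, and `lr` are hypothetical names.

```python
import numpy as np


def newton_schulz(m, steps=15, eps=1e-7):
    """Approximate the orthogonal polar factor U V^T of m = U S V^T.

    Assumption: classic cubic Newton-Schulz iteration, chosen for simplicity.
    """
    # Frobenius normalization puts all singular values in [0, 1].
    x = m / (np.linalg.norm(m) + eps)
    for _ in range(steps):
        # Each step pushes every nonzero singular value toward 1.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


def amuse_step(z, x, m, t, grad_fn, lr, mu, beta_t):
    """One update for a matrix parameter, following the equations above.

    z, x, m: current Z_t, X_t, M_{t-1}; t: 1-indexed step; grad_fn: Y -> gradient.
    """
    y = (1.0 - beta_t) * z + beta_t * x   # gradient-evaluation point Y_t
    m_new = mu * m + grad_fn(y)           # momentum M_t
    o = newton_schulz(m_new)              # orthogonalized update O_t
    z_new = z - lr * o                    # fast sequence Z_{t+1}
    c = 1.0 / (t + 1)
    x_new = (1.0 - c) * x + c * z_new     # averaged sequence X_{t+1}
    return z_new, x_new, m_new
```

If the sequences are initialized with $X_1 = Z_1$ (an assumption about the starting convention), unrolling the last line shows that $X_t$ is the uniform average $\frac{1}{t}\sum_{s=1}^{t} Z_s$. In a training loop, `beta_t` would be recomputed from the schedule at every step before calling `amuse_step`.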

Thus, AMUSE preserves Muon's rapid progress in early training while Schedule-Free averaging gradually stabilizes the trajectory and reduces valley-wall oscillations, enabling schedule-free and anytime training.

Full paper abstract:

Modern deep learning commonly relies on AdamW with prescribed learning rate schedules, but recent works challenge both components: Schedule-Free optimization removes explicit schedules via iterate averaging, and Muon improves the update geometry by orthogonalizing momentum for matrix parameters. Despite Muon's strong empirical performance, its underlying mechanism remains partially understood. We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace, while high-curvature dominant directions form steep valley walls that induce oscillations. We empirically show that while Muon's orthogonalization accelerates river progress by increasing the bulk component, it also amplifies dominant-direction noise, causing oscillatory trajectories. Building on this, we propose Anytime MUon with Stable gradient Evaluation (AMUSE), which integrates Muon's rapid bulk progress with the stabilizing effect of Schedule-Free averaging. AMUSE uses a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts toward the stable averaged sequence to suppress valley-wall oscillations. As a result, AMUSE requires no learning rate schedules and supports anytime training. Across vision tasks and large language model pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon.

Repository Structure

amuse/
├── src/lm/       # language model pretraining experiments
├── src/image/    # vision/image experiments
├── src/optim/    # AMUSE and optimizer implementations
├── scripts/      # launch scripts
└── assets/       # figures and result plots

Installation

conda create -n amuse python=3.10
conda activate amuse
pip install -r requirements.txt

Quick Start

For language model pretraining, run AMUSE on a 124M Llama-style model with:

bash scripts/lm/124m/amuse.sh

Set YOUR_DATASET_DIR in the script to the root directory used by the FineWeb-100B loader.

For image classification, run AMUSE on CIFAR-10 with:

bash scripts/image/cifar10/amuse.sh

Other image experiments are available through:

bash scripts/image/cifar100/amuse.sh
bash scripts/image/svhn/amuse.sh
bash scripts/image/imagenet/amuse.sh

For ImageNet, set YOUR_DATASET_DIR in the corresponding script. See src/lm/README.md and src/image/README.md for task-specific optimizer and parameter grouping details.

Results

Language Model Pretraining

AMUSE improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon in Llama-style pretraining on FineWeb-100B.

FineWeb Llama 124M pretraining results

The same trend holds across model scales.

FineWeb Llama scaling results for 720M and 1.3B models

Image Classification

AMUSE also performs strongly across standard image classification benchmarks.

Image classification results

Citation

@article{kim2026amuse,
  title={{AMUSE}: Anytime Muon with Stable Gradient Evaluation},
  author={Kim, Jueun and Shin, Baekrok and Yun, Jihun and Baek, Beomhan and Song, Minhak and Yun, Chulhee},
  journal={arXiv preprint arXiv:2605.22432},
  year={2026}
}
