merlresearch/ANIRA

ANIRA: Understanding Dynamic Compute Allocation in Recurrent Transformers, ICML 2026

Ibraheem Muhammad Moosa, Suhas Lohit, Ye Wang, Moitreya Chatterjee, Wenpeng Yin

ArXiv preprint | Citation

Summary

Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. However, prior work is primarily evaluated on natural-language benchmarks using task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, making it unclear whether compute allocation truly aligns with underlying complexity. We address this gap through three contributions. First, we introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty, enabling direct testing of token-level compute allocation. Second, we propose ANIRA (Adaptive Neural Iterative Reasoning Architecture), a unified recurrent Transformer framework that supports per-token variable-depth computation while isolating compute allocation decisions from other model factors. Third, we use this framework to conduct a systematic analysis of token-level adaptive computation across alignment with complexity, generalization, and decision timing. Our results show that compute allocation aligned with task complexity can emerge without explicit difficulty supervision, but such alignment does not imply algorithmic generalization: models fail to extrapolate to unseen input sizes despite allocating additional computation. We further find that early compute decisions rely on static structural cues, whereas online halting more closely tracks algorithmic execution state.

This repository provides the code, experiment scripts, and analysis utilities for the paper. The experiments study token-level adaptive computation in recurrent Transformers using ANIRA, with two decision mechanisms:

  • ANIRA-E: early compute allocation from shallow token representations
  • ANIRA-O: online halting during recurrent computation
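The difference between the two mechanisms can be illustrated with a toy sketch in plain Python. This is not the repository's implementation: the scalar per-token "state", the shared `step` function, and the decision rules passed as `early_budget` and `should_halt` are all hypothetical stand-ins for the learned components in the actual models.

```python
from typing import Callable, List

def step(state: float) -> float:
    """Hypothetical shared recurrent block: move halfway toward 1.0."""
    return state + 0.5 * (1.0 - state)

def run_early(states: List[float], early_budget: Callable[[float], int]) -> List[int]:
    """Early allocation: fix each token's step count up front from its initial state."""
    used = []
    for s in states:
        n = early_budget(s)  # decided before any recurrent step runs
        for _ in range(n):
            s = step(s)
        used.append(n)
    return used

def run_online(states: List[float], should_halt: Callable[[float], bool],
               max_steps: int = 8) -> List[int]:
    """Online halting: re-check a halting rule on the current state before every step."""
    used = []
    for s in states:
        n = 0
        while n < max_steps and not should_halt(s):
            s = step(s)
            n += 1
        used.append(n)
    return used

tokens = [0.95, 0.5, 0.0]  # toy initial states, one per token
early = run_early(tokens, early_budget=lambda s: 1 if s > 0.7 else 4)
online = run_online(tokens, should_halt=lambda s: s >= 0.9)
print("steps per token (early): ", early)
print("steps per token (online):", online)
```

In the early variant the budget depends only on the initial state, while in the online variant the decision is revisited as the state evolves, which mirrors the distinction the Summary draws between the two mechanisms.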

Getting Started

Tested in a Linux/CUDA environment with Python 3.13.

From the anira/ directory:

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

Install PyTorch with the CUDA version matching your system (check with nvidia-smi). For example, for CUDA 12.4:

pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124

See pytorch.org for other CUDA versions. Then install the remaining dependencies:

python -m pip install -r requirements.txt
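After installation, a quick sanity check can confirm that the installed PyTorch build matches the GPU setup. The snippet below is a minimal sketch that uses standard PyTorch attributes; the helper name `describe_torch` is illustrative and not part of this repository.

```python
def describe_torch() -> dict:
    """Report the installed PyTorch build, or note that it is missing."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }

print(describe_torch())
```

If `cuda_available` is false on a machine with a GPU, the wheel was likely built for a different CUDA version than the installed driver supports.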

Controlled Complexity Evaluations

Commands below should be run from the anira/ directory unless stated otherwise.

MANO

Prefix modular arithmetic evaluation with complexity knob L (number of operators).
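For intuition about the task, the following is a rough sketch of what a prefix modular arithmetic instance with L operators might look like. The modulus (23), the operator set, and the token format are assumptions made for illustration; the repository's data generator may differ.

```python
import random
from typing import List

# Assumed operator set; the real task definition may include others.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def sample_prefix_expr(num_ops: int, modulus: int, rng: random.Random) -> List[str]:
    """Sample a prefix-notation expression with exactly num_ops binary operators."""
    if num_ops == 0:
        return [str(rng.randrange(modulus))]
    left_ops = rng.randint(0, num_ops - 1)  # split the remaining operators
    right_ops = num_ops - 1 - left_ops
    op = rng.choice(sorted(OPS))
    return ([op]
            + sample_prefix_expr(left_ops, modulus, rng)
            + sample_prefix_expr(right_ops, modulus, rng))

def eval_prefix(tokens: List[str], modulus: int) -> int:
    """Evaluate a prefix expression mod `modulus` by scanning right to left."""
    stack: List[int] = []
    for tok in reversed(tokens):
        if tok in OPS:
            a, b = stack.pop(), stack.pop()  # a is the left operand
            stack.append(OPS[tok](a, b) % modulus)
        else:
            stack.append(int(tok))
    assert len(stack) == 1, "malformed expression"
    return stack[0]

rng = random.Random(0)
expr = sample_prefix_expr(num_ops=3, modulus=23, rng=rng)
print(" ".join(expr), "=", eval_prefix(expr, 23))
```

An expression with L binary operators always has 2L + 1 tokens, so L controls both the input length and the depth of computation needed to evaluate it.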

Train ANIRA-E

bash physics_of_lm/scripts/anira_e/mano.sh

Train ANIRA-O

bash physics_of_lm/scripts/anira_o/mano.sh

Compare compute allocation

python physics_of_lm/scripts/plotting/plot_mano_comparison.py \
    <anira_e_run_dir> <anira_o_run_dir> --out_dir plots/mano

BREVO

Dependency generation on DAGs with complexity knob N (graph size).
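As a rough illustration of the task, the sketch below builds a small random DAG and lists the transitive dependencies of a query node so that each node appears after its own dependencies. The graph construction, the `max_parents` cap, and the output ordering are assumptions for illustration, not the repository's generator.

```python
import random
from typing import Dict, List

def random_dag(n: int, rng: random.Random, max_parents: int = 2) -> Dict[int, List[int]]:
    """Node i may depend only on lower-numbered nodes, so the graph is acyclic."""
    deps: Dict[int, List[int]] = {}
    for i in range(n):
        k = rng.randint(0, min(max_parents, i))
        deps[i] = sorted(rng.sample(range(i), k))
    return deps

def dependencies_in_order(deps: Dict[int, List[int]], query: int) -> List[int]:
    """Return all transitive dependencies of `query`, each after its own dependencies."""
    order: List[int] = []
    seen = set()

    def visit(node: int) -> None:
        for d in deps[node]:
            if d not in seen:
                seen.add(d)
                visit(d)         # emit d's dependencies first
                order.append(d)  # then d itself (post-order)

    visit(query)
    return order

dag = random_dag(8, random.Random(1))
print(dag)
print("dependencies of node 7:", dependencies_in_order(dag, 7))
```

Larger N tends to produce longer dependency chains, which is why graph size serves as the complexity knob.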

Train ANIRA-E

bash physics_of_lm/scripts/anira_e/brevo.sh

Train ANIRA-O

bash physics_of_lm/scripts/anira_o/brevo.sh

Compare compute allocation

python physics_of_lm/scripts/plotting/plot_brevo_comparison.py \
    <anira_e_run_dir> <anira_o_run_dir> --out_dir plots/brevo

CLRS-Text

Multitask algorithmic reasoning experiments across task-specific input sizes.

Train ANIRA-E

bash clrs_text/scripts/anira_e.sh

Train ANIRA-O

bash clrs_text/scripts/anira_o.sh

Compare accuracy and compute allocation

python clrs_text/scripts/plotting/plot_clrs_comparison.py \
    <anira_e_run_dir> <anira_o_run_dir> --out_dir plots/clrs

LANO

Synthetic language modeling from a PCFG.
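To show what sampling from a PCFG involves, here is a minimal sketch with a hypothetical toy grammar. The rules and probabilities below are invented for illustration and are unrelated to the grammar used in the LANO experiments.

```python
import random
from typing import List

# Hypothetical toy grammar: nonterminal -> list of (probability, right-hand side).
GRAMMAR = {
    "S": [(0.6, ["A", "B"]), (0.4, ["B", "A", "B"])],
    "A": [(0.5, ["a"]), (0.5, ["a", "A"])],
    "B": [(0.7, ["b"]), (0.3, ["b", "c"])],
}

def sample_string(symbol: str, rng: random.Random) -> List[str]:
    """Expand `symbol` recursively, choosing each rule by its probability."""
    if symbol not in GRAMMAR:  # terminal symbol
        return [symbol]
    rules = GRAMMAR[symbol]
    weights = [p for p, _ in rules]
    _, rhs = rng.choices(rules, weights=weights, k=1)[0]
    out: List[str] = []
    for child in rhs:
        out.extend(sample_string(child, rng))
    return out

rng = random.Random(0)
for _ in range(3):
    print(" ".join(sample_string("S", rng)))
```

A language model trained on such strings must implicitly recover the hidden tree structure to predict the next token well.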

Train ANIRA-E

bash physics_of_lm/scripts/anira_e/lano.sh

Train ANIRA-O

bash physics_of_lm/scripts/anira_o/lano.sh

DEPO

K-step successor retrieval on directed cycles.
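The task can be sketched as follows: build a random directed cycle, then answer which node is reached after following K edges from a start node. The cycle construction and function names are assumptions for illustration; the repository's data format may differ.

```python
import random
from typing import Dict

def make_cycle(n: int, rng: random.Random) -> Dict[int, int]:
    """Return a successor map forming a single directed cycle over n nodes."""
    nodes = list(range(n))
    rng.shuffle(nodes)
    return {nodes[i]: nodes[(i + 1) % n] for i in range(n)}

def k_step_successor(succ: Dict[int, int], start: int, k: int) -> int:
    """Follow the successor map k times starting from `start`."""
    node = start
    for _ in range(k):
        node = succ[node]
    return node

succ = make_cycle(6, random.Random(0))
print("successor map:", succ)
print("3-step successor of node 0:", k_step_successor(succ, 0, 3))
```

Each additional step requires one more lookup, so K directly controls the number of sequential retrievals needed.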

Train ANIRA-E

bash physics_of_lm/scripts/anira_e/depo.sh

Train ANIRA-O

bash physics_of_lm/scripts/anira_o/depo.sh

Compare compute allocation

python physics_of_lm/scripts/plotting/plot_depo_comparison.py \
    <anira_e_run_dir> <anira_o_run_dir> --out_dir plots/depo

Training Compute

  • Each of the Physics of Language Models experiments -- MANO, BREVO, LANO, DEPO -- requires about 6 hours of compute on one Nvidia A100 GPU for training.
  • The CLRS-Text experiments require about 3 days of compute on one Nvidia A100 GPU for training.

References

  1. MANO, BREVO, DEPO, LANO: Allen-Zhu, Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers, NeurIPS 2025
  2. CLRS-Text: Markeeva et al., The CLRS-Text Algorithmic Reasoning Language Benchmark, arXiv preprint arXiv:2406.04229

Additional Analyses

Training Dynamics

MANO checkpoint dynamics:

First evaluate all saved checkpoints:

bash physics_of_lm/scripts/eval_checkpoints.sh <anira_e_run_dir>
bash physics_of_lm/scripts/eval_checkpoints.sh <anira_o_run_dir>

Then generate checkpoint progression plots for each run:

python physics_of_lm/scripts/plotting/plot_mano_checkpoint_progression.py \
    <anira_e_run_dir> --out_dir plots/mano_anira_e_dynamics

python physics_of_lm/scripts/plotting/plot_mano_checkpoint_progression.py \
    <anira_o_run_dir> --out_dir plots/mano_anira_o_dynamics

BREVO checkpoint dynamics:

First evaluate all saved checkpoints:

bash physics_of_lm/scripts/eval_checkpoints.sh <anira_e_run_dir>
bash physics_of_lm/scripts/eval_checkpoints.sh <anira_o_run_dir>

Then generate checkpoint progression plots for each run:

python physics_of_lm/scripts/plotting/plot_brevo_checkpoint_progression.py \
    <anira_e_run_dir> --out_dir plots/brevo_anira_e_dynamics

python physics_of_lm/scripts/plotting/plot_brevo_checkpoint_progression.py \
    <anira_o_run_dir> --out_dir plots/brevo_anira_o_dynamics

Token-Level Analysis

MANO question-token analysis:

First dump token-level compute allocations:

python -m src.adaptive_compute.eval_adaptive_compute \
    --run_dir <run_dir> \
    --dump_compute_allocations \
    --dump_compute_allocations_only_correct \
    --dump_compute_allocations_skip_metrics

Then run:

python physics_of_lm/scripts/mano_question_token_analysis/analyze_mano_question_traces.py \
    --run_dir <run_dir> --out_dir plots/mano_question_tokens

BREVO answer-token analysis:

BREVO evaluation writes generation JSONL files with per-token compute_probs under <run_dir>/generations/.

python physics_of_lm/scripts/brevo_answer_token_analysis/analyze_brevo_answer_token_compute_allocation.py \
    --run_dir <run_dir> --output_dir plots/brevo_answer_tokens
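To inspect the generation files by hand before running the analysis, a small reader along these lines may help. It is a sketch under two assumptions: that the files use a `.jsonl` extension, and that `compute_probs` is a top-level key in each JSON record. The helper name `iter_compute_probs` is hypothetical, and nothing is assumed about the inner structure of the field.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple

def iter_compute_probs(run_dir: str) -> Iterator[Tuple[str, list]]:
    """Yield (file name, compute_probs) for every record that has the field."""
    for path in sorted(Path(run_dir, "generations").glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                if "compute_probs" in record:  # assumed top-level key
                    yield path.name, record["compute_probs"]

# Example usage (replace the placeholder with a real run directory):
# for name, probs in iter_compute_probs("<run_dir>"):
#     print(name, len(probs))
```

Checking the length and nesting of a few records first makes it easier to interpret the plots produced by the analysis script.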

MANO Extrapolation

Evaluates MANO extrapolation to values of L beyond the training maximum (--train_max_l 16 in the plotting command in this section).

Train baselines first:

bash physics_of_lm/scripts/mano_baselines.sh

Then run:

python physics_of_lm/scripts/plotting/plot_mano_extrapolation.py \
    runs/mano/nonadaptive/L16_lr1e-4 runs/mano/nonrecurrent/L16_lr1e-4 \
    <anira_e_run_dir> <anira_o_run_dir> \
    --train_max_l 16 --out_dir plots/mano_extrapolation

Natural-Language Experiments

Commands below should be run from the natural_language/ directory.

Download and preprocess Nemotron-CC-Math-v1-4plus

huggingface-cli login

python data_preparation/download_nemotron.py \
    --dataset_path adaptive_retrofitted_llama/nemotron_math

python data_preparation/preprocess_data_packing.py \
    --out_path adaptive_retrofitted_llama/nemotron_math_tokenized \
    --dataset_location adaptive_retrofitted_llama/nemotron_math/datasets/Nemotron-CC-Math-v1-4plus/ \
    --cache_path adaptive_retrofitted_llama/cache \
    --save_path adaptive_retrofitted_llama/preprocessed_datasets/ \
    --max_length 4096 \
    --num_proc 16

Train

bash scripts/train_anira_e_4gpu.sh
bash scripts/train_anira_o_4gpu.sh

Before launching, update --dataset_path inside the training scripts to point to the Arrow dataset produced by the preprocessing step above.

Training Compute

  • About 1 day of compute on 4 Nvidia A100 GPUs using the default 5B tokens of training

Evaluate on GSM-Symbolic

Edit the model paths at the top of scripts/eval_gsm_symbolic_all.sh, then run:

bash scripts/eval_gsm_symbolic_all.sh

References

  1. Nemotron-CC-Math-v1-4plus: Mahabadi et al., Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset, arXiv preprint arXiv:2508.15096
  2. GSM-Symbolic: Mirzadeh et al., GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, ICLR 2025
  3. Retrofitted Recurrence: McLeish et al., Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence, arXiv preprint arXiv:2511.07384

Contributing

See CONTRIBUTING.md for our policy on contributions.

License

Released under the AGPL-3.0-or-later license, as found in the LICENSE.md file.

All files, except as noted below:

Copyright (C) 2026 Mitsubishi Electric Research Laboratories (MERL)

SPDX-License-Identifier: AGPL-3.0-or-later

The config and model files for the controlled complexity evaluation experiments

  • src/adaptive_compute/__init__.py
  • src/adaptive_compute/configuration_adaptive_compute.py
  • src/adaptive_compute/modelling_adaptive_compute.py

were adapted from HuggingFace-Transformers (license included in LICENSES/Apache-2.0.txt):

Copyright (c) 2026 Mitsubishi Electric Research Laboratories (MERL)
Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.

SPDX-License-Identifier: AGPL-3.0-or-later
SPDX-License-Identifier: Apache-2.0

The config file for the natural language experiments

  • natural_language/adaptive_retrofitted_llama/configuration_adaptive_retrofitted_llama.py

was adapted from Huginn (license included in LICENSES/Apache-2.0.txt):

Copyright (c) 2026 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) 2025 Jonas Geiping, John Kirchenbauer, Sean McLeish, Khalid Saifullah, Manli Shu, Neel Jain, Siddarth Singh, Abhimanyu Hans, Monte Hoover and Prajwal Singhanaia


SPDX-License-Identifier: AGPL-3.0-or-later
SPDX-License-Identifier: Apache-2.0

The model file for the natural language experiments

  • natural_language/adaptive_retrofitted_llama/modeling_adaptive_retrofitted_llama.py

was adapted from Retrofitting Recurrence (license included in LICENSES/Apache-2.0.txt):

Copyright (c) 2026 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) 2025 Sean McLeish, Jonas Geiping, Ang Li

SPDX-License-Identifier: AGPL-3.0-or-later
SPDX-License-Identifier: Apache-2.0

Citation

@article{moosa2026understanding,
  title={Understanding Dynamic Compute Allocation in Recurrent Transformers},
  author={Moosa, Ibraheem Muhammad and Lohit, Suhas and Wang, Ye and Chatterjee, Moitreya and Yin, Wenpeng},
  journal={International Conference on Machine Learning},
  year={2026}
}
