Ibraheem Muhammad Moosa, Suhas Lohit, Ye Wang, Moitreya Chatterjee, Wenpeng Yin
Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. However, prior work is primarily evaluated on natural-language benchmarks using task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, making it unclear whether compute allocation truly aligns with underlying complexity. We address this gap through three contributions. First, we introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty, enabling direct testing of token-level compute allocation. Second, we propose ANIRA (Adaptive Neural Iterative Reasoning Architecture), a unified recurrent Transformer framework that supports per-token variable-depth computation while isolating compute allocation decisions from other model factors. Third, we use this framework to conduct a systematic analysis of token-level adaptive computation across alignment with complexity, generalization, and decision timing. Our results show that compute allocation aligned with task complexity can emerge without explicit difficulty supervision, but such alignment does not imply algorithmic generalization: models fail to extrapolate to unseen input sizes despite allocating additional computation. We further find that early compute decisions rely on static structural cues, whereas online halting more closely tracks algorithmic execution state.
This repository provides the code, experiment scripts, and analysis utilities for the paper. The experiments study token-level adaptive computation in recurrent Transformers using ANIRA, with two decision mechanisms:
- ANIRA-E: early compute allocation from shallow token representations
- ANIRA-O: online halting during recurrent computation
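The difference between the two mechanisms can be illustrated with a toy loop. The sketch below is not the ANIRA implementation: the scalar `step` update, the per-token starting values, the depth rule passed as `budget_fn`, and the halting threshold are all hypothetical stand-ins chosen so the example runs with the standard library alone. It only shows when the depth decision is made: once, before recurrence (early allocation), or after every recurrent step (online halting).

```python
from typing import Callable, List

MAX_STEPS = 8  # hypothetical cap on recurrent iterations per token


def step(state: float) -> float:
    """Toy recurrent update: halve the remaining 'work' for a token."""
    return state * 0.5


def early_allocation(tokens: List[float], budget_fn: Callable[[float], int]) -> List[int]:
    """Decide each token's depth up front from its initial (shallow) state."""
    depths = []
    for state in tokens:
        n_steps = min(MAX_STEPS, budget_fn(state))  # decision made before recurrence
        for _ in range(n_steps):
            state = step(state)
        depths.append(n_steps)
    return depths


def online_halting(tokens: List[float], threshold: float) -> List[int]:
    """Re-check a halting condition after every recurrent step."""
    depths = []
    for state in tokens:
        n_steps = 0
        while n_steps < MAX_STEPS and state > threshold:
            state = step(state)
            n_steps += 1
        depths.append(n_steps)
    return depths


# Toy per-token values: larger values need more steps to fall below the threshold.
tokens = [0.1, 1.0, 4.0]
print(early_allocation(tokens, budget_fn=lambda s: 1 if s < 0.5 else 3))
print(online_halting(tokens, threshold=0.2))
```

In this toy setting the early rule can only bucket tokens by their initial value, while the online rule keeps iterating until the evolving state satisfies the halting condition.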
Tested in a Linux/CUDA environment with Python 3.13.
From the anira/ directory:
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
Install PyTorch with the CUDA version matching your system (check with nvidia-smi). For example, for CUDA 12.4:
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
See pytorch.org for other CUDA versions. Then install the remaining dependencies:
python -m pip install -r requirements.txt
Commands below should be run from the anira/ directory unless stated otherwise.
Prefix modular arithmetic evaluation with complexity knob L (number of operators).
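To make the task concrete, the sketch below generates a random prefix expression with exactly L binary operators and evaluates it. The modulus (23), the operator set, and the token format are assumptions made for illustration only; the actual data format is defined by the MANO scripts in this repository.

```python
import random
from typing import List

MOD = 23               # assumed modulus, for illustration only
OPS = ["+", "-", "*"]  # assumed operator set


def random_prefix_expr(num_ops: int, rng: random.Random) -> List[str]:
    """Build a prefix expression containing exactly `num_ops` binary operators."""
    if num_ops == 0:
        return [str(rng.randrange(MOD))]
    left_ops = rng.randint(0, num_ops - 1)  # split remaining operators between subtrees
    right_ops = num_ops - 1 - left_ops
    return ([rng.choice(OPS)]
            + random_prefix_expr(left_ops, rng)
            + random_prefix_expr(right_ops, rng))


def eval_prefix(tokens: List[str]) -> int:
    """Evaluate a prefix expression modulo MOD by scanning right to left with a stack."""
    stack: List[int] = []
    for tok in reversed(tokens):
        if tok in OPS:
            a, b = stack.pop(), stack.pop()  # a is the left operand, b the right
            if tok == "+":
                stack.append((a + b) % MOD)
            elif tok == "-":
                stack.append((a - b) % MOD)
            else:
                stack.append((a * b) % MOD)
        else:
            stack.append(int(tok))
    return stack[0]


expr = random_prefix_expr(num_ops=4, rng=random.Random(0))
print(" ".join(expr), "=", eval_prefix(expr))
```

An expression with L operators always has L + 1 operands, so sequence length grows linearly with the knob.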
Train ANIRA-E
bash physics_of_lm/scripts/anira_e/mano.sh
Train ANIRA-O
bash physics_of_lm/scripts/anira_o/mano.sh
Compare compute allocation
python physics_of_lm/scripts/plotting/plot_mano_comparison.py \
<anira_e_run_dir> <anira_o_run_dir> --out_dir plots/mano
Dependency generation on DAGs with complexity knob N (graph size).
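As a rough illustration of what dependency generation on a DAG involves, the sketch below lists a query node together with everything it transitively depends on, prerequisites first. This is one plausible reading of the task, not the repository's data generator: the graph encoding, node names, and output ordering are hypothetical.

```python
from typing import Dict, List


def dependencies_in_order(deps: Dict[str, List[str]], query: str) -> List[str]:
    """Return `query` and every node it transitively depends on, prerequisites first."""
    order: List[str] = []
    visited = set()

    def visit(node: str) -> None:
        if node in visited:
            return
        visited.add(node)
        for parent in deps.get(node, []):
            visit(parent)
        order.append(node)  # appended only after all of its prerequisites

    visit(query)
    return order


# Hypothetical 5-node DAG: each key lists the nodes it directly depends on.
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["d"]}
print(dependencies_in_order(graph, "d"))
```

Larger N means more nodes to track and typically longer dependency chains, which is what makes graph size a natural difficulty knob.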
Train ANIRA-E
bash physics_of_lm/scripts/anira_e/brevo.sh
Train ANIRA-O
bash physics_of_lm/scripts/anira_o/brevo.sh
Compare compute allocation
python physics_of_lm/scripts/plotting/plot_brevo_comparison.py \
<anira_e_run_dir> <anira_o_run_dir> --out_dir plots/brevo
Multitask algorithmic reasoning experiments across task-specific input sizes.
Train ANIRA-E
bash clrs_text/scripts/anira_e.sh
Train ANIRA-O
bash clrs_text/scripts/anira_o.sh
Compare accuracy and compute allocation
python clrs_text/scripts/plotting/plot_clrs_comparison.py \
<anira_e_run_dir> <anira_o_run_dir> --out_dir plots/clrs
Synthetic language modeling from a PCFG.
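For readers unfamiliar with PCFG-generated data, the sketch below samples a sentence from a tiny probabilistic context-free grammar. The grammar, its symbols, and its rule probabilities are invented for illustration and are unrelated to the grammar used in the LANO experiments.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical toy grammar: nonterminal -> list of (probability, right-hand side).
GRAMMAR: Dict[str, List[Tuple[float, List[str]]]] = {
    "S": [(0.6, ["A", "B"]), (0.4, ["B", "A"])],
    "A": [(0.5, ["a"]), (0.5, ["a", "A"])],
    "B": [(0.7, ["b"]), (0.3, ["b", "B"])],
}


def sample(symbol: str, rng: random.Random) -> List[str]:
    """Expand `symbol` recursively, choosing each rule according to its probability."""
    if symbol not in GRAMMAR:  # terminal symbol: emit as-is
        return [symbol]
    probs, rules = zip(*GRAMMAR[symbol])
    rhs = rng.choices(rules, weights=probs, k=1)[0]
    out: List[str] = []
    for child in rhs:
        out.extend(sample(child, rng))
    return out


print(" ".join(sample("S", random.Random(0))))
```

Every sampled sequence is grammatical by construction, so a language model trained on such data can be scored against a known ground-truth distribution.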
Train ANIRA-E
bash physics_of_lm/scripts/anira_e/lano.sh
Train ANIRA-O
bash physics_of_lm/scripts/anira_o/lano.sh
K-step successor retrieval on directed cycles.
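The underlying computation can be sketched in a few lines: given a directed cycle, start at a node and follow the successor pointer k times. The cycle contents and the dictionary encoding below are hypothetical; the actual prompt format is defined by the DEPO scripts in this repository.

```python
from typing import Dict, List


def cycle_to_successors(cycle: List[int]) -> Dict[int, int]:
    """Turn an ordered cycle [v0, v1, ..., v_{n-1}] into a successor map."""
    return {v: cycle[(i + 1) % len(cycle)] for i, v in enumerate(cycle)}


def k_step_successor(succ: Dict[int, int], start: int, k: int) -> int:
    """Follow the successor pointer k times starting from `start`."""
    node = start
    for _ in range(k):
        node = succ[node]
    return node


succ = cycle_to_successors([3, 7, 1, 5])  # hypothetical cycle 3 -> 7 -> 1 -> 5 -> 3
print(k_step_successor(succ, start=7, k=2))
```

The number of sequential lookups grows with k, which is why k serves as a direct measure of how much computation a query requires.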
Train ANIRA-E
bash physics_of_lm/scripts/anira_e/depo.sh
Train ANIRA-O
bash physics_of_lm/scripts/anira_o/depo.sh
Compare compute allocation
python physics_of_lm/scripts/plotting/plot_depo_comparison.py \
<anira_e_run_dir> <anira_o_run_dir> --out_dir plots/depo
- Each of the Physics of Language Models experiments (MANO, BREVO, LANO, DEPO) requires about 6 hours of compute on one Nvidia A100 GPU for training.
- The CLRS-Text experiments require about 3 days of compute on one Nvidia A100 GPU for training.
- MANO, BREVO, DEPO, LANO: Allen-Zhu, Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers, NeurIPS 2025
- CLRS-Text: Markeeva et al., The CLRS-Text Algorithmic Reasoning Language Benchmark, arXiv preprint arXiv:2406.04229
MANO checkpoint dynamics:
First evaluate all saved checkpoints:
bash physics_of_lm/scripts/eval_checkpoints.sh <anira_e_run_dir>
bash physics_of_lm/scripts/eval_checkpoints.sh <anira_o_run_dir>
Then generate checkpoint progression plots for each run:
python physics_of_lm/scripts/plotting/plot_mano_checkpoint_progression.py \
<anira_e_run_dir> --out_dir plots/mano_anira_e_dynamics
python physics_of_lm/scripts/plotting/plot_mano_checkpoint_progression.py \
<anira_o_run_dir> --out_dir plots/mano_anira_o_dynamics
BREVO checkpoint dynamics:
First evaluate all saved checkpoints:
bash physics_of_lm/scripts/eval_checkpoints.sh <anira_e_run_dir>
bash physics_of_lm/scripts/eval_checkpoints.sh <anira_o_run_dir>
Then generate checkpoint progression plots for each run:
python physics_of_lm/scripts/plotting/plot_brevo_checkpoint_progression.py \
<anira_e_run_dir> --out_dir plots/brevo_anira_e_dynamics
python physics_of_lm/scripts/plotting/plot_brevo_checkpoint_progression.py \
<anira_o_run_dir> --out_dir plots/brevo_anira_o_dynamics
MANO question-token analysis:
First dump token-level compute allocations:
python -m src.adaptive_compute.eval_adaptive_compute \
--run_dir <run_dir> \
--dump_compute_allocations \
--dump_compute_allocations_only_correct \
--dump_compute_allocations_skip_metrics
Then run:
python physics_of_lm/scripts/mano_question_token_analysis/analyze_mano_question_traces.py \
--run_dir <run_dir> --out_dir plots/mano_question_tokens
BREVO answer-token analysis:
BREVO evaluation writes generation JSONL files with per-token compute_probs under <run_dir>/generations/.
python physics_of_lm/scripts/brevo_answer_token_analysis/analyze_brevo_answer_token_compute_allocation.py \
--run_dir <run_dir> --output_dir plots/brevo_answer_tokens
MANO extrapolation beyond training length.
Train baselines first:
bash physics_of_lm/scripts/mano_baselines.sh
Then run:
python physics_of_lm/scripts/plotting/plot_mano_extrapolation.py \
runs/mano/nonadaptive/L16_lr1e-4 runs/mano/nonrecurrent/L16_lr1e-4 \
<anira_e_dir> <anira_o_dir> \
--train_max_l 16 --out_dir plots/mano_extrapolation
Commands below should be run from the natural_language/ directory.
huggingface-cli login
python data_preparation/download_nemotron.py \
--dataset_path adaptive_retrofitted_llama/nemotron_math
python data_preparation/preprocess_data_packing.py \
--out_path adaptive_retrofitted_llama/nemotron_math_tokenized \
--dataset_location adaptive_retrofitted_llama/nemotron_math/datasets/Nemotron-CC-Math-v1-4plus/ \
--cache_path adaptive_retrofitted_llama/cache \
--save_path adaptive_retrofitted_llama/preprocessed_datasets/ \
--max_length 4096 \
--num_proc 16
bash scripts/train_anira_e_4gpu.sh
bash scripts/train_anira_o_4gpu.sh
Update --dataset_path inside the training scripts to point to the Arrow dataset produced above.
Training Compute
- About 1 day of compute on 4 Nvidia A100 GPUs with the default training budget of 5B tokens.
Edit the model paths at the top of scripts/eval_gsm_symbolic_all.sh, then run:
bash scripts/eval_gsm_symbolic_all.sh
- Nemotron-CC-Math-v1-4plus: Mahabadi et al., Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset, arXiv preprint arXiv:2508.15096
- GSM-Symbolic: Mirzadeh et al., GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, ICLR 2025
- Retrofitted Recurrence: McLeish et al., Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence, arXiv preprint arXiv:2511.07384
See CONTRIBUTING.md for our policy on contributions.
Released under the AGPL-3.0-or-later license, as found in the LICENSE.md file.
All files, except as noted below:
Copyright (C) 2026 Mitsubishi Electric Research Laboratories (MERL)
SPDX-License-Identifier: AGPL-3.0-or-later
The config and model files for the controlled complexity evaluation experiments
- src/adaptive_compute/__init__.py
- src/adaptive_compute/configuration_adaptive_compute.py
- src/adaptive_compute/modelling_adaptive_compute.py
were adapted from HuggingFace-Transformers (license included in LICENSES/Apache-2.0.txt):
Copyright (c) 2026 Mitsubishi Electric Research Laboratories (MERL)
Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
SPDX-License-Identifier: AGPL-3.0-or-later
SPDX-License-Identifier: Apache-2.0
The config file for the natural language experiments
natural_language/adaptive_retrofitted_llama/configuration_adaptive_retrofitted_llama.py
was adapted from Huginn (license included in LICENSES/Apache-2.0.txt):
Copyright (c) 2026 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) 2025 Jonas Geiping, John Kirchenbauer, Sean McLeish, Khalid Saifullah, Manli Shu, Neel Jain, Siddarth Singh, Abhimanyu Hans, Monte Hoover and Prajwal Singhanaia
SPDX-License-Identifier: AGPL-3.0-or-later
SPDX-License-Identifier: Apache-2.0
The model file for the natural language experiments
natural_language/adaptive_retrofitted_llama/modeling_adaptive_retrofitted_llama.py
was adapted from Retrofitting Recurrence (license included in LICENSES/Apache-2.0.txt):
Copyright (c) 2026 Mitsubishi Electric Research Laboratories (MERL)
Copyright (c) 2025 Sean McLeish, Jonas Geiping, Ang Li
SPDX-License-Identifier: AGPL-3.0-or-later
SPDX-License-Identifier: Apache-2.0
@article{moosa2026understanding,
title={Understanding Dynamic Compute Allocation in Recurrent Transformers},
author={Moosa, Ibraheem Muhammad and Lohit, Suhas and Wang, Ye and Chatterjee, Moitreya and Yin, Wenpeng},
journal={International Conference on Machine Learning},
year={2026}
}