Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh
Parameter Lab · MBZUAI · NAVER AI Lab · University of Tübingen · Tübingen AI Center
🧩 The training, data generation, and in-domain evaluation code for Dr.LLM are not yet released.
These components (MCTS supervision, router training scripts, and lm-eval integration) will be made public in an upcoming update.
Stay tuned for the full release!
- 📢 15 October 2025: Paper released on arXiv!
Large Language Models (LLMs) process every token through all layers of a transformer stack, wasting compute on simple queries and lacking flexibility for harder ones that need deeper reasoning.
Dr.LLM (Dynamic Routing of Layers for LLMs) is a retrofittable framework that adds lightweight per-layer routers to pretrained models.
Each router decides whether to skip, execute, or repeat a layer, enabling adaptive depth without retraining or architectural changes.
Routers are trained with explicit supervision from Monte Carlo Tree Search (MCTS), generating high-quality layer configurations that preserve or improve accuracy under a compute budget.
Stabilized with windowed pooling, focal loss, and bottleneck MLPs, Dr.LLM maintains robustness under class imbalance and long sequences.
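As a rough illustration of that router design, here is a minimal PyTorch sketch of a per-layer router built from windowed mean pooling and a bottleneck MLP. The window size, bottleneck width, and action ordering are illustrative assumptions, not the pending release:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical action ordering (an assumption for illustration).
SKIP, EXECUTE, REPEAT = 0, 1, 2

class LayerRouter(nn.Module):
    """Sketch of a per-layer router: a bottleneck MLP over a windowed
    summary of the hidden states, emitting skip/execute/repeat logits."""
    def __init__(self, d_model: int, bottleneck: int = 64, window: int = 32):
        super().__init__()
        self.window = window
        self.mlp = nn.Sequential(
            nn.Linear(d_model, bottleneck),  # bottleneck projection
            nn.GELU(),
            nn.Linear(bottleneck, 3),        # 3-way routing logits
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model)
        b, t, d = hidden.shape
        pad = (-t) % self.window
        if pad:  # pad so the sequence splits into fixed-size windows
            hidden = F.pad(hidden, (0, 0, 0, pad))
        # mean-pool within each window, then average the window summaries
        pooled = hidden.view(b, -1, self.window, d).mean(dim=2).mean(dim=1)
        return self.mlp(pooled)  # (batch, 3)
```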
📈 Results
- On ARC (logic) and DART (math), Dr.LLM improves accuracy by +3.4%p while saving ~5 layers per input.
- Routers generalize to MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, and AGIEval with only a 0.85% accuracy drop.
- Outperforms prior routing methods (LayerSkip, FlexiDepth, MindSkip) by up to +7.7%p.
💡 Dr.LLM equips frozen LLMs for budget-aware, accuracy-driven inference — no base weight modification required.
Our layer routing is based on hidden states. Dr.LLM augments a frozen decoder-only LLM with per-layer routers that decide whether to skip, execute, or repeat a block once. Routers read windowed summaries of the hidden states and are trained from MCTS-derived targets.
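At inference, those per-layer decisions could be wired into the forward pass roughly as follows. This is a sketch under the same assumptions as above; greedy argmax dispatch and a batch size of 1 are illustrative, and each `layer` is taken to be a callable that returns the updated hidden states:

```python
import torch

SKIP, EXECUTE, REPEAT = 0, 1, 2  # as in the router sketch above

@torch.no_grad()
def routed_forward(hidden, layers, routers):
    """Hypothetical routed forward pass over a frozen decoder stack."""
    for layer, router in zip(layers, routers):
        action = router(hidden).argmax(dim=-1).item()
        if action == SKIP:
            continue                # identity: hidden state passes through unchanged
        hidden = layer(hidden)      # execute the block once
        if action == REPEAT:
            hidden = layer(hidden)  # repeat the block one extra time
    return hidden
```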
A length-aware MCTS is used to collect the supervised training dataset of per-layer routing configurations (skip/execute/repeat). For each input, MCTS explores modified layer paths and retains those that preserve or improve accuracy under a compute budget.
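For intuition only, here is a deliberately simplified UCT-style sketch of searching over per-layer actions under a compute budget. The `evaluate` reward, per-action costs, rollout policy, and iteration count are all placeholder assumptions, and the paper's length-aware refinements are omitted:

```python
import math
import random

ACTIONS = ("skip", "execute", "repeat")
COST = {"skip": 0, "execute": 1, "repeat": 2}  # assumed per-layer compute costs

def mcts_layer_search(num_layers, evaluate, budget, iters=2000, c=1.4):
    """Toy UCT search over per-layer actions. `evaluate(path)` is assumed to
    return 1.0 if the model still answers correctly under that layer
    configuration, else 0.0 (a stand-in for the paper's reward)."""
    stats = {}  # action prefix (tuple) -> [visits, total_reward]

    def rollout(prefix):
        path = list(prefix)
        while len(path) < num_layers:          # random completion of the path
            path.append(random.choice(ACTIONS))
        if sum(COST[a] for a in path) > budget:
            return 0.0                         # over budget: no reward
        return evaluate(tuple(path))

    for _ in range(iters):
        prefix = ()
        while len(prefix) < num_layers:        # selection / expansion
            children = [prefix + (a,) for a in ACTIONS]
            unvisited = [ch for ch in children if ch not in stats]
            if unvisited:
                prefix = random.choice(unvisited)
                break
            total = sum(stats[ch][0] for ch in children)
            prefix = max(children, key=lambda ch: stats[ch][1] / stats[ch][0]
                         + c * math.sqrt(math.log(total) / stats[ch][0]))
        reward = rollout(prefix)
        for i in range(1, len(prefix) + 1):    # backpropagate along the path
            node = stats.setdefault(prefix[:i], [0, 0.0])
            node[0] += 1
            node[1] += reward

    best, prefix = [], ()                      # greedy extraction of best path
    for _ in range(num_layers):
        def mean_reward(a):
            visits, reward = stats.get(prefix + (a,), (0, 0.0))
            return reward / max(visits, 1)
        a = max(ACTIONS, key=mean_reward)
        best.append(a)
        prefix += (a,)
    return best
```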
We evaluate Dr.LLM using lm-eval-harness across in-domain and out-of-domain benchmarks.
Routers are trained and evaluated on ARC-Easy/Challenge (logic reasoning) and DART-Math levels 1–5 (multi-step math reasoning), using 4K MCTS-derived execution paths.
| Dataset | Domain | Metric |
|---|---|---|
| ARC-Easy / Challenge | Logic Reasoning | Accuracy |
| DART (levels 1–5) | Math Reasoning | Accuracy |
We test zero-shot transfer on MMLU, GSM8k, AIME24, TruthfulQA, GPQA Diamond, AGIEval, SQuADv2, and PIQA.
All evaluations follow the default lm-eval-harness settings (2048 max tokens, greedy decoding).
1️⃣ Installation

```bash
git clone https://github.com/parameterlab/dr-llm
cd dr-llm
pip install -r requirements.txt
```
2️⃣ Training the Routers
⚠️ Note: Full code release is pending ⚠️
Training uses AdamW, 25 epochs, a learning rate of 1×10⁻³, and bf16 precision on a single A100 GPU (40 GB), taking under 4 hours.
The model's source code must be modified to insert routers after each transformer block.
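For instance, with a Llama-style Hugging Face model one could wrap each decoder layer roughly as follows. This is a monkey-patching sketch: `LayerRouter` is the hypothetical module sketched above, and the insertion points in the actual release may differ:

```python
import torch.nn as nn

SKIP, EXECUTE, REPEAT = 0, 1, 2  # as in the router sketch above

class RoutedBlock(nn.Module):
    """Hypothetical wrapper pairing a frozen decoder layer with a router."""
    def __init__(self, layer, router):
        super().__init__()
        self.layer, self.router = layer, router
        for p in self.layer.parameters():
            p.requires_grad_(False)  # base weights stay frozen

    def forward(self, hidden_states, *args, **kwargs):
        action = self.router(hidden_states).argmax(dim=-1).item()
        if action == SKIP:
            return (hidden_states,)  # HF decoder layers return a tuple
        out = self.layer(hidden_states, *args, **kwargs)
        if action == REPEAT:
            out = self.layer(out[0], *args, **kwargs)
        return out

# Assumed wiring for a Llama-style HF model (model.model.layers is its layer list):
# for i, layer in enumerate(model.model.layers):
#     model.model.layers[i] = RoutedBlock(layer, LayerRouter(model.config.hidden_size))
```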
Routers are trained separately using MCTS-generated supervision:
```bash
python train.py \
  --model llama-3-8b-instruct \
  --data_dir data/arc_dart \
  --save_dir checkpoints/drllm_router
```
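Until the scripts land, a bare-bones version of the router objective might look like this. The focal loss is the standard multi-class form, assumed to match the one the paper uses against skip/execute/repeat class imbalance; the data-loader interface in the commented loop is likewise an assumption:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Standard multi-class focal loss over 3-way routing logits."""
    log_pt = F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (-(1 - log_pt.exp()) ** gamma * log_pt).mean()

# Hyperparameters below are the ones quoted above; everything else is illustrative.
# optimizer = torch.optim.AdamW(routers.parameters(), lr=1e-3)
# for epoch in range(25):
#     for hidden, actions in loader:  # MCTS-derived (hidden state, action) pairs
#         with torch.autocast("cuda", dtype=torch.bfloat16):
#             loss = sum(focal_loss(r(h), a) for r, h, a in zip(routers, hidden, actions))
#         loss.backward()
#         optimizer.step()
#         optimizer.zero_grad()
```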
3️⃣ Evaluation with lm-eval-harness
🚨 Note: Full code release is pending 🚨
```bash
lm_eval \
  --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
  --tasks arc_challenge,dart,gsm8k,mmlu \
  --device cuda
```
If you find this work useful, please cite:
```bibtex
@article{heakl2025drllm,
  title={Dr.LLM: Dynamic Layer Routing in LLMs},
  author={Ahmed Heakl and Martin Gubri and Salman Khan and Sangdoo Yun and Seong Joon Oh},
  journal={arXiv preprint arXiv:2510.12773},
  year={2025}
}
```