Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh
Parameter Lab · MBZUAI · NAVER AI Lab · University of Tübingen · Tübingen AI Center
🧩 The training, data generation, and in-domain evaluation code for Dr.LLM are not yet released.
These components (MCTS supervision, router training scripts, and lm-eval integration) will be made public in an upcoming update.
Stay tuned for the full release!
- 📢 15 October 2025: Paper released on arXiv!
Large Language Models (LLMs) process every token through all layers of a transformer stack, wasting compute on simple queries and lacking flexibility for harder ones that need deeper reasoning.
Dr.LLM (Dynamic Routing of Layers for LLMs) is a retrofittable framework that adds lightweight per-layer routers to pretrained models.
Each router decides whether to skip, execute, or repeat a layer, enabling adaptive depth without retraining or architectural changes.
Routers are trained with explicit supervision from Monte Carlo Tree Search (MCTS), generating high-quality layer configurations that preserve or improve accuracy under a compute budget.
Stabilized with windowed pooling, focal loss, and bottleneck MLPs, Dr.LLM maintains robustness under class imbalance and long sequences.
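As a rough illustration of that router design, here is a minimal PyTorch sketch of a per-layer router built from windowed mean pooling and a bottleneck MLP. The window size, bottleneck width, and action ordering are illustrative assumptions, not the pending release:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical action ordering (an assumption for illustration).
SKIP, EXECUTE, REPEAT = 0, 1, 2

class LayerRouter(nn.Module):
    """Sketch of a per-layer router: a bottleneck MLP over a windowed
    summary of the hidden states, emitting skip/execute/repeat logits."""
    def __init__(self, d_model: int, bottleneck: int = 64, window: int = 32):
        super().__init__()
        self.window = window
        self.mlp = nn.Sequential(
            nn.Linear(d_model, bottleneck),  # bottleneck projection
            nn.GELU(),
            nn.Linear(bottleneck, 3),        # 3-way routing logits
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model)
        b, t, d = hidden.shape
        pad = (-t) % self.window
        if pad:  # pad so the sequence splits into fixed-size windows
            hidden = F.pad(hidden, (0, 0, 0, pad))
        # mean-pool within each window, then average the window summaries
        pooled = hidden.view(b, -1, self.window, d).mean(dim=2).mean(dim=1)
        return self.mlp(pooled)  # (batch, 3)
```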
📈 Results
- On ARC (logic) and DART (math), Dr.LLM improves accuracy by +3.4%p while saving ~5 layers per input.
- Routers generalize to MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, and AGIEval with only a 0.85% accuracy drop.
- Outperforms prior routing methods (LayerSkip, FlexiDepth, MindSkip) by up to +7.7%p.
💡 Dr.LLM equips frozen LLMs for budget-aware, accuracy-driven inference — no base weight modification required.
Our layer routing is based on hidden states. Dr.LLM augments a frozen decoder-only LLM with per-layer routers that decide whether to skip, execute, or repeat a block once. Routers read windowed summaries of the hidden states and are trained from MCTS-derived targets.
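At inference, those per-layer decisions could be wired into the forward pass roughly as follows. This is a sketch under the same assumptions as above; greedy argmax dispatch and a batch size of 1 are illustrative, and each `layer` is taken to be a callable that returns the updated hidden states:

```python
import torch

SKIP, EXECUTE, REPEAT = 0, 1, 2  # as in the router sketch above

@torch.no_grad()
def routed_forward(hidden, layers, routers):
    """Hypothetical routed forward pass over a frozen decoder stack."""
    for layer, router in zip(layers, routers):
        action = router(hidden).argmax(dim=-1).item()
        if action == SKIP:
            continue                # identity: hidden state passes through unchanged
        hidden = layer(hidden)      # execute the block once
        if action == REPEAT:
            hidden = layer(hidden)  # repeat the block one extra time
    return hidden
```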
A length-aware MCTS is used to collect the supervised training dataset of per-layer routing configurations (skip/execute/repeat). For each input, MCTS explores modified layer paths and retains those that preserve or improve accuracy under a compute budget.
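For intuition only, here is a deliberately simplified UCT-style sketch of searching over per-layer actions under a compute budget. The `evaluate` reward, per-action costs, rollout policy, and iteration count are all placeholder assumptions, and the paper's length-aware refinements are omitted:

```python
import math
import random

ACTIONS = ("skip", "execute", "repeat")
COST = {"skip": 0, "execute": 1, "repeat": 2}  # assumed per-layer compute costs

def mcts_layer_search(num_layers, evaluate, budget, iters=2000, c=1.4):
    """Toy UCT search over per-layer actions. `evaluate(path)` is assumed to
    return 1.0 if the model still answers correctly under that layer
    configuration, else 0.0 (a stand-in for the paper's reward)."""
    stats = {}  # action prefix (tuple) -> [visits, total_reward]

    def rollout(prefix):
        path = list(prefix)
        while len(path) < num_layers:          # random completion of the path
            path.append(random.choice(ACTIONS))
        if sum(COST[a] for a in path) > budget:
            return 0.0                         # over budget: no reward
        return evaluate(tuple(path))

    for _ in range(iters):
        prefix = ()
        while len(prefix) < num_layers:        # selection / expansion
            children = [prefix + (a,) for a in ACTIONS]
            unvisited = [ch for ch in children if ch not in stats]
            if unvisited:
                prefix = random.choice(unvisited)
                break
            total = sum(stats[ch][0] for ch in children)
            prefix = max(children, key=lambda ch: stats[ch][1] / stats[ch][0]
                         + c * math.sqrt(math.log(total) / stats[ch][0]))
        reward = rollout(prefix)
        for i in range(1, len(prefix) + 1):    # backpropagate along the path
            node = stats.setdefault(prefix[:i], [0, 0.0])
            node[0] += 1
            node[1] += reward

    best, prefix = [], ()                      # greedy extraction of best path
    for _ in range(num_layers):
        def mean_reward(a):
            visits, reward = stats.get(prefix + (a,), (0, 0.0))
            return reward / max(visits, 1)
        a = max(ACTIONS, key=mean_reward)
        best.append(a)
        prefix += (a,)
    return best
```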
We evaluate Dr.LLM using lm-eval-harness across in-domain and out-of-domain benchmarks.
Routers are trained and evaluated on ARC-Easy/Challenge (logic reasoning) and DART-Math levels 1–5 (multi-step math reasoning), using 4K MCTS-derived execution paths.
| Dataset | Domain | Metric |
|---|---|---|
| ARC-Easy / Challenge | Logic Reasoning | Accuracy |
| DART (levels 1–5) | Math Reasoning | Accuracy |
We test zero-shot transfer on MMLU, GSM8k, AIME24, TruthfulQA, GPQA Diamond, AGIEval, SQuADv2, and PIQA.
All evaluations follow the default lm-eval-harness settings (2048 max tokens, greedy decoding).
1️⃣ Installation

```bash
git clone https://github.com/parameterlab/dr-llm
cd dr-llm
pip install -r requirements.txt
```
2️⃣ Training the Routers
⚠️ Note: Full code release is pending ⚠️
Training uses AdamW, 25 epochs, a learning rate of 1×10⁻³, and bf16 precision on a single A100 GPU (40 GB), taking under 4 hours.
The model's source code must be modified to insert routers after each transformer block.
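For instance, with a Llama-style Hugging Face model one could wrap each decoder layer roughly as follows. This is a monkey-patching sketch: `LayerRouter` is the hypothetical module sketched above, and the insertion points in the actual release may differ:

```python
import torch.nn as nn

SKIP, EXECUTE, REPEAT = 0, 1, 2  # as in the router sketch above

class RoutedBlock(nn.Module):
    """Hypothetical wrapper pairing a frozen decoder layer with a router."""
    def __init__(self, layer, router):
        super().__init__()
        self.layer, self.router = layer, router
        for p in self.layer.parameters():
            p.requires_grad_(False)  # base weights stay frozen

    def forward(self, hidden_states, *args, **kwargs):
        action = self.router(hidden_states).argmax(dim=-1).item()
        if action == SKIP:
            return (hidden_states,)  # HF decoder layers return a tuple
        out = self.layer(hidden_states, *args, **kwargs)
        if action == REPEAT:
            out = self.layer(out[0], *args, **kwargs)
        return out

# Assumed wiring for a Llama-style HF model (model.model.layers is its layer list):
# for i, layer in enumerate(model.model.layers):
#     model.model.layers[i] = RoutedBlock(layer, LayerRouter(model.config.hidden_size))
```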
Routers are trained separately using MCTS-generated supervision:
```bash
python train.py \
  --model llama-3-8b-instruct \
  --data_dir data/arc_dart \
  --save_dir checkpoints/drllm_router
```
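Until the scripts land, a bare-bones version of the router objective might look like this. The focal loss is the standard multi-class form, assumed to match the one the paper uses against skip/execute/repeat class imbalance; the data-loader interface in the commented loop is likewise an assumption:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Standard multi-class focal loss over 3-way routing logits."""
    log_pt = F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (-(1 - log_pt.exp()) ** gamma * log_pt).mean()

# Hyperparameters below are the ones quoted above; everything else is illustrative.
# optimizer = torch.optim.AdamW(routers.parameters(), lr=1e-3)
# for epoch in range(25):
#     for hidden, actions in loader:  # MCTS-derived (hidden state, action) pairs
#         with torch.autocast("cuda", dtype=torch.bfloat16):
#             loss = sum(focal_loss(r(h), a) for r, h, a in zip(routers, hidden, actions))
#         loss.backward()
#         optimizer.step()
#         optimizer.zero_grad()
```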
3️⃣ Evaluation with lm-eval-harness
🚨 Note: Full code release is pending 🚨
```bash
lm_eval \
  --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
  --tasks arc_challenge,dart,gsm8k,mmlu \
  --device cuda
```
If you find this work useful, please cite:
```bibtex
@article{heakl2025drllm,
  title={Dr.LLM: Dynamic Layer Routing in LLMs},
  author={Ahmed Heakl and Martin Gubri and Salman Khan and Sangdoo Yun and Seong Joon Oh},
  journal={arXiv preprint arXiv:2510.12773},
  year={2025}
}
```