This is the official implementation of our paper THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning.
- 🎉🎉🎉 Our paper has been selected for the 🤗 Hugging Face Daily Papers! Thanks to the community for the recognition and support 🚀
- 🎉🎉🎉 Great news: our paper has been accepted to ICLR 2026!
TODO:
- Update arXiv preprint.
- Update inference code.
- Update TIRGen code.
- Update training code.
- Update the TIRGen dataset.
Large Language Models (LLMs) have advanced in mathematical reasoning but still struggle with precise computation and symbolic manipulation. THOR (Tool-Integrated Hierarchical Optimization via RL) addresses this by:
- TIRGen – an actor–critic pipeline to construct high-quality tool-integrated reasoning data.
- Hierarchical RL – jointly optimizing trajectory-level reasoning and step-level code generation.
- Self-Correction – leveraging tool feedback to fix reasoning errors during inference.
THOR achieves state-of-the-art performance on multiple mathematical benchmarks and shows consistent improvements on code generation tasks, generalizing well across both reasoning and non-reasoning models.
Key Features:
- 🛠 TIRGen Pipeline – Generates policy-aligned tool-integrated reasoning data.
- 🎯 Hierarchical RL – Combines trajectory-level optimization with step-level correction.
- 🔄 Self-Correction Inference – Dynamically fixes reasoning errors during inference.
- 📊 Broad Generalization – Effective across reasoning and non-reasoning models.
Our method, THOR, enhances tool-integrated reasoning with a three-stage pipeline:
1️⃣ TIRGen: Tool-Integrated Data Construction
- Actor generates natural language reasoning steps.
- Critic evaluates whether parts of the reasoning can be executed as code.
- Identified steps are transformed into tool-augmented reasoning paths.
- Multi-stage filtering ensures policy alignment, code quality, and difficulty balance (see the sketch after this list).
2️⃣ Hierarchical Reinforcement Learning
- Trajectory-level RL: Optimizes overall correctness of the final answer using GRPO.
- Step-level RL: Focuses on error-prone code generation steps, using execution results as fine-grained rewards.
- Joint optimization addresses the sparse-reward problem in long reasoning chains (a reward sketch follows this list).
3️⃣ Self-Correction During Inference
- If a tool call fails, the model backtracks to the reasoning step that produced the failing call.
- Guided by the tool's error feedback, it regenerates a new suffix for that step and a revised tool call.
- This enables online error correction with minimal overhead (sketched below).
Installation
Step1. Install SandboxFusion
git clone https://github.com/bytedance/SandboxFusion
cd SandboxFusion
# install sandboxfusion to support code execution
conda create -n sandbox -y python=3.12
conda activate sandbox
poetry install
# to build the real docs, run `cd docs && npm ci && npm run build`
mkdir -p docs/build
make run-online
Step2. Install THOR environment
git clone https://github.com/JingMog/THOR
cd THOR
conda create -n THOR -y python=3.10
conda activate THOR
pip install -r requirements.txt

cd TIRGen
# TIR dataset construction
bash construct_dataset_main.sh
# multi_stage_filter
bash filter.sh

cd inference
bash submit_bon_policy.sh

Our cold start is based on ms-swift; its usage can be found in the ms-swift documentation.
cd swift
bash sft_demo.sh
# TODO

We thank the open-source communities behind Qwen, verl, and SandboxFusion.
If you find our work helpful, please consider giving us a ⭐ and citing our paper:
@article{THOR,
  title={THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning},
  author={Chang, Qikai and Zhang, Zhenrong and Hu, Pengfei and Ma, Jiefeng and Pan, Yicheng and Zhang, Jianshu and Du, Jun and Liu, Quan and Gao, Jianqing},
  journal={arXiv preprint arXiv:2509.13761},
  year={2025}
}





