This is the official implementation of our paper THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning.
- 🎉🎉🎉 Our paper has been selected for the 🤗 Hugging Face Daily Papers! Thanks to the community for the recognition and support 🚀
- 🎉🎉🎉 Great news: our paper has been accepted to ICLR 2026!
TODO:
- Update arXiv preprint.
- Update inference code.
- Update TIRGen code.
- Update training code.
- Update the TIRGen dataset.
Large Language Models (LLMs) have advanced in mathematical reasoning but still struggle with precise computation and symbolic manipulation. THOR (Tool-Integrated Hierarchical Optimization via RL) addresses this by:
- TIRGen – an actor–critic pipeline to construct high-quality tool-integrated reasoning data.
- Hierarchical RL – jointly optimizing trajectory-level reasoning and step-level code generation.
- Self-Correction – leveraging tool feedback to fix reasoning errors during inference.
THOR achieves state-of-the-art performance on multiple mathematical benchmarks and shows consistent improvements on code generation tasks, generalizing well across both reasoning and non-reasoning models.
Key Features:
- 🛠 TIRGen Pipeline – Generates policy-aligned tool-integrated reasoning data.
- 🎯 Hierarchical RL – Combines trajectory-level optimization with step-level correction.
- 🔄 Self-Correction Inference – Dynamically fixes reasoning errors during inference.
- 📊 Broad Generalization – Effective across reasoning and non-reasoning models.
Our method, THOR, enhances tool-integrated reasoning with a three-stage pipeline:
1️⃣ TIRGen: Tool-Integrated Data Construction
- Actor generates natural language reasoning steps.
- Critic evaluates whether parts of the reasoning can be executed as code.
- Identified steps are transformed into tool-augmented reasoning paths.
- Multi-stage filtering ensures policy alignment, code quality, and difficulty balance (see the sketch after this list).
2️⃣ Hierarchical Reinforcement Learning
- Trajectory-level RL: Optimizes overall correctness of the final answer using GRPO.
- Step-level RL: Focuses on error-prone code generation steps, using execution results as fine-grained rewards.
- Joint optimization addresses the sparse-reward problem in long reasoning chains (a reward sketch follows this list).
3️⃣ Self-Correction During Inference
- If a tool call fails, the model backtracks to the reasoning step that produced the failing call.
- Guided by the tool's error feedback, it regenerates a new suffix for that step and a revised tool call.
- This enables online error correction with minimal overhead (sketched below).
Installation
Step1. Install SandboxFusion
git clone https://github.com/bytedance/SandboxFusion
cd SandboxFusion
# install sandboxfusion to support code execution
conda create -n sandbox -y python=3.12
conda activate sandbox
poetry install
# to build the real docs, run `cd docs && npm ci && npm run build`
mkdir -p docs/build
make run-online
Step2. Install THOR environment
git clone https://github.com/JingMog/THOR
cd THOR
conda create -n THOR -y python=3.10
conda activate THOR
pip install -r requirements.txt

cd TIRGen
# TIR dataset construction
bash construct_dataset_main.sh
# multi_stage_filter
bash filter.sh

cd inference
bash submit_bon_policy.sh

Our cold start is based on ms-swift; its usage can be found in the ms-swift documentation.
cd swift
bash sft_demo.sh
# TODO

We thank the open-source communities behind Qwen, verl, and SandboxFusion.
If you find our work helpful, please consider giving us a ⭐ and citing our paper:
@article{THOR,
  title={THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning},
  author={Chang, Qikai and Zhang, Zhenrong and Hu, Pengfei and Ma, Jiefeng and Pan, Yicheng and Zhang, Jianshu and Du, Jun and Liu, Quan and Gao, Jianqing},
  journal={arXiv preprint arXiv:2509.13761},
  year={2025}
}





