
🚀 THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning 🚀

(Figure: THOR pipeline)

This is the official implementation of our paper THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning.

🔥 News:

  • 🎉🎉🎉 Our paper has been selected for the 🤗 Hugging Face Daily Papers! Thanks to the community for the recognition and support 🚀
  • 🎉🎉🎉 Our paper has been accepted to ICLR 2026!

TODO:

  • Update arXiv preprint.
  • Update inference code.
  • Update TIRGen code.
  • Update training code.
  • Update the TIRGen dataset.

🔍 Overview

Large Language Models (LLMs) have advanced in mathematical reasoning but still struggle with precise computation and symbolic manipulation. THOR (Tool-Integrated Hierarchical Optimization via RL) addresses this by:

  1. TIRGen – an actor–critic pipeline to construct high-quality tool-integrated reasoning data.
  2. Hierarchical RL – jointly optimizing trajectory-level reasoning and step-level code generation.
  3. Self-Correction – leveraging tool feedback to fix reasoning errors during inference.

THOR achieves state-of-the-art performance on multiple mathematical benchmarks and shows consistent improvements on code generation tasks, generalizing well across both reasoning and non-reasoning models.

✨ Key Contributions

  1. 🛠 TIRGen Pipeline – Generates policy-aligned tool-integrated reasoning data.
  2. 🎯 Hierarchical RL – Combines trajectory-level optimization with step-level correction.
  3. 🔄 Self-Correction Inference – Dynamically fixes reasoning errors during inference.
  4. 📊 Broad Generalization – Effective across reasoning and non-reasoning models.

⚙️ Method

Our method, THOR, enhances tool-integrated reasoning with a three-stage pipeline:

1️⃣ TIRGen: Tool-Integrated Data Construction

  • Actor generates natural language reasoning steps.
  • Critic evaluates whether parts of the reasoning can be executed as code.
  • Identified steps are transformed into tool-augmented reasoning paths.
  • Multi-stage filtering ensures policy alignment, code quality, and difficulty balance.
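
The actor–critic construction above can be sketched as follows. This is a minimal, hypothetical illustration only: `is_computational` and `to_code` stand in for the critic's judgment and the code-rewriting step, which in the actual pipeline are LLM calls, and the multi-stage filtering is omitted.

```python
def build_tir_trajectory(steps, is_computational, to_code):
    """Sketch of the TIRGen idea: the actor's natural-language steps are
    scanned by a critic (`is_computational`); flagged steps are rewritten
    as executable code, yielding a tool-integrated trajectory."""
    trajectory = []
    for step in steps:
        if is_computational(step):
            # Step can be executed as code: replace it with a tool call.
            trajectory.append({"type": "code", "content": to_code(step)})
        else:
            # Keep purely verbal reasoning as natural language.
            trajectory.append({"type": "text", "content": step})
    return trajectory
```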

(Figure: TIRGen data construction pipeline)

2️⃣ Hierarchical Reinforcement Learning

  • Trajectory-level RL: Optimizes overall correctness of the final answer using GRPO.
  • Step-level RL: Focuses on error-prone code generation steps, using execution results as fine-grained rewards.
  • Joint optimization addresses sparse reward issues in long reasoning chains.
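
As a rough illustration of the trajectory-level signal, GRPO computes advantages by normalizing each sampled trajectory's reward against its own rollout group, with no learned value function. A minimal sketch, assuming the reward is final-answer correctness:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each trajectory's reward
    against the mean/std of its sampled group (no value network)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. 4 rollouts for one prompt; reward 1.0 if the final answer is correct
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive positive advantage and incorrect ones negative, so the policy gradient pushes probability mass toward the better trajectories within each group.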

(Figure: hierarchical RL framework)

3️⃣ Self-Correction During Inference

  • During inference, if a tool call fails, the model backtracks to the reasoning step that produced it.
  • It regenerates a new suffix and revised action, guided by tool feedback.
  • This enables online error correction with minimal overhead.
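
The backtrack-and-regenerate loop might look like the following sketch. Here `generate_step` and `run_tool` are hypothetical stand-ins for the policy model and the code sandbox; this is not the repository's actual inference code.

```python
def solve_with_self_correction(problem, generate_step, run_tool, max_retries=2):
    """Inference-time self-correction sketch: when a tool call fails,
    regenerate the failing step, conditioning on the error message."""
    trajectory = []
    step = generate_step(problem, trajectory, feedback=None)
    while step is not None:
        ok, result = run_tool(step)
        retries = 0
        while not ok and retries < max_retries:
            # Backtrack: regenerate this step with the tool error as feedback.
            step = generate_step(problem, trajectory, feedback=result)
            ok, result = run_tool(step)
            retries += 1
        trajectory.append((step, result))
        step = generate_step(problem, trajectory, feedback=None)
    return trajectory
```

Because only the failing step is regenerated, the prefix of the trajectory is reused, which keeps the correction overhead small.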

📊 Results

Comparison With State-of-the-Art Methods

(Figure: comparison with state-of-the-art methods)

Effectiveness of TIRGen

(Figure: effectiveness of TIRGen)

Ablation Study

(Figure: ablation study)

📥 Installation

Step 1. Install SandboxFusion

git clone https://github.com/bytedance/SandboxFusion
cd SandboxFusion
# install sandboxfusion to support code execution
conda create -n sandbox -y python=3.12
conda activate sandbox
poetry install
# to build the real docs, run `cd docs && npm ci && npm run build`
mkdir -p docs/build
make run-online

Step 2. Install the THOR environment

git clone https://github.com/JingMog/THOR
cd THOR
conda create -n THOR -y python=3.10
conda activate THOR
pip install -r requirements.txt

🚀 Usage

1. TIRGen: TIR data construction pipeline

cd TIRGen
# TIR dataset construction
bash construct_dataset_main.sh

# multi_stage_filter
bash filter.sh

2. TIR Inference

cd inference
bash submit_bon_policy.sh

3. Cold start

Our cold-start SFT is based on ms-swift; see the ms-swift documentation for usage details.

cd swift
bash sft_demo.sh

4. RL training

# TODO

🙌 Acknowledgements

We thank the open-source communities behind Qwen, verl, and SandboxFusion.

🖊️ Citation

If you find our work helpful, please consider giving us a ⭐ and citing our paper:

@article{THOR,
  title={THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning},
  author={Chang, Qikai and Zhang, Zhenrong and Hu, Pengfei and Ma, Jiefeng and Pan, Yicheng and Zhang, Jianshu and Du, Jun and Liu, Quan and Gao, Jianqing},
  journal={arXiv preprint arXiv:2509.13761},
  year={2025}
}

About

[ICLR-2026] Official Implementation of our paper "THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning".
