MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
MathSmith is a framework for enhancing mathematical reasoning capabilities of large language models by generating challenging synthetic problems from scratch. Unlike methods that modify existing problems, MathSmith creates novel problems through a reinforced policy, ensuring diversity and scalability.
Model variants:
- MathSmith-HC-Qwen3-8B: trained with the complexity + consistency reward
- MathSmith-Hard-Qwen3-8B: trained with the complexity-only reward

Evaluation settings:
- ShortCoT (Qwen3 series): 1.7B | 8B | 14B | 32B
- LongCoT: Qwen3-8B | DS-Qwen-7B
The MathSmith framework consists of four main stages:
1. Concept Collection: Randomly sample concept–explanation pairs from PlanetMath to ensure data independence.
2. Supervised Fine-tuning (SFT): Train the model on the collected concept–explanation pairs to establish foundational understanding.
3. Reinforcement Learning (RL): Optimize the model using GRPO with rewards based on:
   - Structural validity
   - Reasoning complexity
   - Answer consistency
4. Weakness-Focused Self-Improvement: Iteratively identify and address model weaknesses by generating targeted problem variants.
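The three RL reward terms above can be illustrated with a minimal sketch. This is a hypothetical composite reward in the spirit of the GRPO stage, not the repository's implementation (the actual reward functions live in `rl-stage/reward_func/`); the proxies and weights here are assumptions for illustration only.

```python
# Hypothetical sketch of a composite GRPO-style reward combining the three
# terms named above. All proxies and weights are illustrative assumptions.
from collections import Counter


def structural_validity(problem: str) -> float:
    # Assumption: a structurally valid problem is non-empty and poses a question.
    return 1.0 if problem.strip().endswith("?") else 0.0


def reasoning_complexity(solution_tokens: int, target: int = 2048) -> float:
    # Proxy: longer reasoning traces (up to a target budget) score higher.
    return min(solution_tokens / target, 1.0)


def answer_consistency(sampled_answers: list) -> float:
    # Fraction of independently sampled answers that agree with the majority.
    if not sampled_answers:
        return 0.0
    _, count = Counter(sampled_answers).most_common(1)[0]
    return count / len(sampled_answers)


def composite_reward(problem, solution_tokens, sampled_answers,
                     weights=(0.2, 0.4, 0.4)):
    # Weighted sum of the three reward terms (weights are illustrative).
    return (weights[0] * structural_validity(problem)
            + weights[1] * reasoning_complexity(solution_tokens)
            + weights[2] * answer_consistency(sampled_answers))
```

A problem that is well-formed, elicits a 1024-token solution, and gets 2-of-3 agreeing sampled answers would score `0.2 + 0.2 + 0.267 ≈ 0.667` under these weights.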
```bash
git clone https://github.com/Jasaxion/MathSmith.git
cd MathSmith
pip install -r requirements.txt
```

Collect concept–explanation pairs from PlanetMath:

```bash
cd data_collect/planetmath_process
# Follow the instructions there to process the PlanetMath data
```

We provide pre-processed concept–explanation pairs from PlanetMath in `./data_collect/sampled_concept/collect_planetmath_grouped_deduplicated.jsonl`.
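The processed pairs are stored as JSON Lines, one record per line. Below is a minimal, self-contained sketch of loading and randomly sampling such records; the field names `concept` and `explanation` are assumptions about the schema, and the demo writes a tiny stand-in file rather than reading the real one.

```python
# Minimal sketch: load concept-explanation pairs from a JSONL file and
# randomly sample one, mirroring the Concept Collection stage.
# Field names ("concept", "explanation") are assumed, not confirmed.
import json
import os
import random
import tempfile


def load_pairs(path):
    # One JSON object per line; skip blank lines.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Self-contained demo with a tiny stand-in file (not the repo's dataset):
demo_records = [
    {"concept": "Cauchy sequence",
     "explanation": "A sequence whose terms become arbitrarily close."},
    {"concept": "Euler's totient function",
     "explanation": "Counts integers up to n coprime to n."},
]
path = os.path.join(tempfile.mkdtemp(), "pairs.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for rec in demo_records:
        f.write(json.dumps(rec) + "\n")

pairs = load_pairs(path)
sampled = random.sample(pairs, k=1)  # random concept sampling
```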
Generate mathematical problems using the trained model:

```bash
python QM_sampler.py
```

Evaluate on the benchmarks (GSM8K, MATH-500, AIME2024, AIME2025, OlympiadBench):

```bash
cd evaluate
bash eval.sh
```

Run the weakness-focused self-improvement pipeline:
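For math benchmarks like those above, grading typically means extracting the final boxed answer from a model response and comparing it to the reference. The sketch below illustrates that common pattern; it is an assumption about the general technique, not the matcher actually used by the `evaluate/` scripts.

```python
# Hypothetical answer grader: pull the last \boxed{...} from a response
# and compare it to the reference string. Deliberately simple; does not
# handle nested braces or symbolic equivalence.
import re


def extract_boxed(response):
    # Return the contents of the last \boxed{...}, or None if absent.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def is_correct(response, reference):
    pred = extract_boxed(response)
    return pred is not None and pred == reference.strip()
```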
```bash
cd self-improvement
bash self_improve.sh
```

Project structure:

```
MathSmith/
├── data_collect/       # Concept collection and data processing
├── sft-stage/          # Supervised fine-tuning scripts
├── rl-stage/           # Reinforcement learning training
│   ├── train_script/   # RL training scripts
│   └── reward_func/    # Reward function implementations
├── answer_sampler/     # Answer generation for problems
├── evaluate/           # Evaluation scripts and benchmarks
├── self-improvement/   # Weakness-focused improvement pipeline
├── utils/              # Utility functions
└── QM_sampler.py       # Problem generation script
```
To train a MathSmith problem-synthesis model from scratch, complete the two training stages below.
SFT stage (produces the MathSmith cold-start model):

```bash
cd sft-stage
# Configure MathSmith_Questioner-Qwen3-8B.yaml
# Run SFT training
```

RL stage (customize the reward and train the HC or Hard variant):

```bash
cd rl-stage/train_script
bash rl_mathsmith.sh
```

MathSmith consistently outperforms baselines across five benchmarks under both short and long chain-of-thought settings:
- Easy & Medium: GSM8K, MATH-500
- Hard: AIME2024, AIME2025, OlympiadBench
If you find this work useful, please cite:
```bibtex
@article{zhan2025mathsmith,
  title={MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy},
  author={Zhan, Shaoxiong and Lai, Yanlin and Lu, Ziyu and Lin, Dahua and Yang, Ziqing and Tan, Fei},
  journal={arXiv preprint arXiv:2508.05592},
  year={2025}
}
```



