MathSmith

MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy


Overview

MathSmith is a framework for enhancing mathematical reasoning capabilities of large language models by generating challenging synthetic problems from scratch. Unlike methods that modify existing problems, MathSmith creates novel problems through a reinforced policy, ensuring diversity and scalability.

Resources

🧠 Problem Synthesizers

📘 Datasets

🔧 SFT Models

ShortCoT (Qwen3 series): 1.7B | 8B | 14B | 32B
LongCoT: Qwen3-8B | DS-Qwen-7B

Pipeline

The MathSmith framework consists of four main stages:

  1. Concept Collection: Randomly sample concept–explanation pairs from PlanetMath to ensure data independence.

  2. Supervised Fine-tuning (SFT): Train the model on collected concept–explanation pairs to establish foundational understanding.

  3. Reinforcement Learning (RL): Optimize the model using GRPO with rewards based on:

    • Structural validity
    • Reasoning complexity
    • Answer consistency
  4. Weakness-Focused Self-Improvement: Iteratively identify and address model weaknesses by generating targeted problem variants.
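
The three RL reward signals above could be combined along these lines. This is a hypothetical sketch for intuition only: the helper names, heuristics, and weights are assumptions, not the repository's actual reward functions (those live in rl-stage/reward_func/).

```python
# Hypothetical sketch of combining MathSmith's three GRPO reward signals.
# All helpers and weights here are illustrative assumptions.

def structural_validity(problem: str) -> float:
    """1.0 if the generated problem is non-empty and poses a question."""
    text = problem.strip()
    return 1.0 if text and "?" in text else 0.0

def reasoning_complexity(problem: str, min_tokens: int = 30) -> float:
    """Crude proxy: longer statements tend to require more reasoning."""
    return min(len(problem.split()) / min_tokens, 1.0)

def answer_consistency(answers: list[str]) -> float:
    """Fraction of sampled solver answers agreeing with the majority."""
    if not answers:
        return 0.0
    top = max(set(answers), key=answers.count)
    return answers.count(top) / len(answers)

def total_reward(problem: str, answers: list[str],
                 weights=(0.2, 0.4, 0.4)) -> float:
    """Weighted sum of the three signals (weights are an assumption)."""
    w_s, w_c, w_a = weights
    return (w_s * structural_validity(problem)
            + w_c * reasoning_complexity(problem)
            + w_a * answer_consistency(answers))
```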

Quick Start

Installation

git clone https://github.com/Jasaxion/MathSmith.git
cd MathSmith
pip install -r requirements.txt

Data Collection

Collect concept–explanation pairs from PlanetMath:

cd data_collect/planetmath_process
# Follow instructions to process PlanetMath data

We have already processed the concept–explanation pairs from PlanetMath; they are stored in ./data_collect/sampled_concept/collect_planetmath_grouped_deduplicated.jsonl.
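
A minimal sketch for reading that JSONL file follows. The field names "concept" and "explanation" are assumptions about the schema; check the actual file before relying on them.

```python
# Sketch: load concept-explanation pairs from a JSONL file, one JSON
# object per line, skipping blank lines. Field names are assumed.
import json

def load_concept_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                pairs.append(json.loads(line))
    return pairs
```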

Problem Generation

Generate mathematical problems using the trained model:

python QM_sampler.py
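
Conceptually, a sampler like this assembles a generation prompt from randomly drawn concept pairs. The sketch below is a hypothetical illustration; the template wording and function name are assumptions, not QM_sampler.py's actual prompt.

```python
# Hypothetical sketch of prompt assembly for problem generation:
# sample k concept-explanation pairs and format them into a prompt.
import random

def build_generation_prompt(concept_pairs, k=2, seed=None):
    rng = random.Random(seed)  # seeded for reproducible sampling
    chosen = rng.sample(concept_pairs, k)
    concept_block = "\n".join(
        f"- {c['concept']}: {c['explanation']}" for c in chosen
    )
    return (
        "Using the mathematical concepts below, forge one extremely hard,\n"
        "self-contained problem with a single verifiable answer.\n\n"
        f"Concepts:\n{concept_block}\n"
    )
```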

Evaluation

Evaluate on benchmarks (GSM8K, MATH-500, AIME2024, AIME2025, OlympiadBench):

cd evaluate
bash eval.sh

Self-Improvement

Run the weakness-focused self-improvement pipeline:

cd self-improvement
bash self_improve.sh
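
The core selection step of weakness-focused self-improvement can be pictured as ranking concepts by measured solver accuracy and targeting the weakest for new problem variants. This is an illustrative sketch, not the pipeline's actual implementation.

```python
# Sketch: pick the n concepts where the model performs worst, so the
# next round of synthetic problems can target those weaknesses.
def weakest_concepts(accuracy_by_concept, n=3):
    """Return the n concepts with the lowest measured accuracy."""
    return sorted(accuracy_by_concept, key=accuracy_by_concept.get)[:n]
```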

Repository Structure

MathSmith/
├── data_collect/          # Concept collection and data processing
├── sft-stage/             # Supervised fine-tuning scripts
├── rl-stage/              # Reinforcement learning training
│   ├── train_script/      # RL training scripts
│   └── reward_func/       # Reward function implementations
├── answer_sampler/        # Answer generation for problems
├── evaluate/              # Evaluation scripts and benchmarks
├── self-improvement/      # Weakness-focused improvement pipeline
├── utils/                 # Utility functions
└── QM_sampler.py          # Problem generation script

Training

To train the MathSmith problem-synthesis model from scratch, complete the two training stages below.

SFT stage: obtain the MathSmith cold-start model

cd sft-stage
# Configure MathSmith_Questioner-Qwen3-8B.yaml
# Run SFT training

RL stage: customize the reward and train the HC/Hard version

cd rl-stage/train_script
bash rl_mathsmith.sh

Results

MathSmith consistently outperforms baselines across five benchmarks under both short and long chain-of-thought settings:

  • Easy & Medium: GSM8K, MATH-500
  • Hard: AIME2024, AIME2025, OlympiadBench

Citation

If you find this work useful, please cite:

@article{zhan2025mathsmith,
  title={MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy},
  author={Zhan, Shaoxiong and Lai, Yanlin and Lu, Ziyu and Lin, Dahua and Yang, Ziqing and Tan, Fei},
  journal={arXiv preprint arXiv:2508.05592},
  year={2025}
}
