GOLF is an RL framework that exploits group-level natural language feedback to guide targeted exploration through actionable refinements. It aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. The aggregated feedback is distilled into high-quality refinements, which are adaptively injected into training as off-policy scaffolds that provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities.
This repository supports fuzzy tasks (e.g. chat) and verifiable tasks (math, code, and instruction following) with task-specific reward and critique pipelines built on verl and AMPO-style hybrid GRPO.
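To make the adaptive-injection idea concrete, here is a toy sketch of how a refinement could be spliced into a sparse-reward rollout group. This is an illustration under assumed conventions, not the repository's actual API: the function name, the `(response, reward)` pair format, and the choice to grant an injected refinement full reward are all hypothetical.

```python
def build_training_group(rollouts, refinement, inject_threshold=0.0):
    """Toy sketch of GOLF-style adaptive injection (hypothetical API).

    rollouts: list of (response, reward) pairs sampled on-policy for one prompt.
    refinement: an off-policy response distilled from group-level feedback.
    """
    rewards = [reward for _, reward in rollouts]
    # Inject only in sparse-reward groups, i.e. when no on-policy sample
    # earns a reward above the threshold; otherwise train on-policy as usual.
    if refinement is not None and max(rewards) <= inject_threshold:
        # Swap the worst rollout for the refinement scaffold (this sketch
        # simply assumes the refinement is verified and gets full reward).
        worst = min(range(len(rollouts)), key=lambda i: rollouts[i][1])
        rollouts = list(rollouts)
        rollouts[worst] = (refinement, 1.0)
    return rollouts
```

Under this sketch, a group where every sample fails receives one high-reward scaffold to learn from, while groups that already contain a success are left untouched.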
- Installation
- Repository Structure
- Data Preparation
- Training
- Evaluation
- Inference
- Acknowledgement
- Citation
## Installation

Requirements: Python 3.10, PyTorch, CUDA, and vLLM (for rollout). We recommend a dedicated conda environment.
```bash
conda create -n golf python=3.10
conda activate golf
cd golf
cd verl
# For FSDP (no Megatron):
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
# For Megatron-backed training, use: bash scripts/install_vllm_sglang_mcore.sh
cd ..
pip install -r requirements.txt
```

## Repository Structure

```
GOLF/
├── golf/                                # Core training code (built on verl)
│   └── verl/
│       └── verl/adaptive_mix_src/       # GOLF trainer, critique refiner, reward
├── data/                                # Data preparation scripts
├── exp_scripts/                         # Training launch scripts
│   ├── critique_grpo_hybrid_math.sh     # Verifiable: math
│   ├── critique_grpo_hybrid_if.sh       # Verifiable: instruction following (IF)
│   ├── critique_grpo_hybrid_code.sh     # Verifiable: code
│   └── critique_grpo_hybrid_wildchat.sh # Fuzzy: WildChat / chat
├── eval_scripts/                        # Evaluation and generation
│   ├── eval_math.sh                     # Verifiable math eval
│   ├── eval_fuzzy.sh                    # Fuzzy benchmark eval (RLMT-style)
│   ├── generate_vllm.py
│   └── ...
└── README.md
```
## Data Preparation

Data format and preprocessing per task:
| Task | Reference |
|---|---|
| Fuzzy (chat / instruction following) | RLMT |
| Math | critique-GRPO |
| Code | SDPO |
| IF | allenai/IF_multi_constraints_upto5 |
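As a rough illustration of what a single training record in these parquet files typically looks like: the field names below follow verl's common dataset convention and are assumptions here, not a guarantee of GOLF's exact schema.

```python
def make_record(question, answer):
    # Hypothetical single training record in a verl-style format; field
    # names follow verl's common convention and may differ in this repo.
    return {
        "data_source": "openr1_math",  # illustrative dataset tag
        "prompt": [{"role": "user", "content": question}],
        "ability": "math",
        "reward_model": {"style": "rule", "ground_truth": answer},
    }
```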
## Training

Scripts assume the repo root is at `$PROJECT_ROOT/GOLF`. Set `PROJECT_ROOT`, `MODEL_PATH`, `TRAIN_FILE`, and `TEST_FILE` (and, for IF, `IFEVAL_VAL_FILE` and `IFBENCH_VAL_FILE`) as needed. Optional: `export WANDB_API_KEY=your_key` or `export WANDB_MODE=disabled`.
Prepare data per task using the Data Preparation references above, then run one of the following.

**Math:**

```bash
export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Qwen3-8B
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/openr1_math_4k_train.parquet
export TEST_FILE=$PROJECT_ROOT/GOLF/data/openr1_math_4k_test.parquet
bash exp_scripts/critique_grpo_hybrid_math.sh
```

**Instruction following (IF):**

```bash
export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Qwen3-4B
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/if_train.parquet
export IFEVAL_VAL_FILE=$PROJECT_ROOT/GOLF/data/ifeval_test.parquet
export IFBENCH_VAL_FILE=$PROJECT_ROOT/GOLF/data/ifbench_test.parquet
bash exp_scripts/critique_grpo_hybrid_if.sh
```

**Code:**

```bash
export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Qwen3-8B
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/lcb_v6_train.parquet
export TEST_FILE=$PROJECT_ROOT/GOLF/data/lcb_v6_test.parquet
bash exp_scripts/critique_grpo_hybrid_code.sh
```

**Fuzzy (WildChat):**

```bash
export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Llama-3.1-8B-Instruct
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/wildchat-if_train.parquet
export TEST_FILE=$PROJECT_ROOT/GOLF/data/wildchat-if_val.parquet
bash exp_scripts/critique_grpo_hybrid_wildchat.sh
```

Checkpoints are saved under `$PROJECT_ROOT/GOLF/checkpoints/<model_name>/golf/<exp_name>/`. Merge FSDP shards via `eval_scripts/model_merge.sh` when needed.
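For orientation, the group-relative advantage at the heart of GRPO-style trainers can be sketched as below. This is the standard formulation, not GOLF-specific code.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    # Group-normalized advantages: each sampled response's reward is
    # standardized against the mean and std of its own rollout group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because every advantage is measured relative to its group, the values sum to zero within each group, and a lone success in an otherwise failing group receives a strongly positive signal.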
## Evaluation

Run inference, then score (e.g. with Math-Verify or your own validator). Set `PROJECT_ROOT`, `EVAL_DATA`, `EVAL_OUTPUT_DIR`, and the `MODEL_PATHS` array to your merged checkpoints.

```bash
export PROJECT_ROOT=/path/to/your/projects
# Edit eval_scripts/eval_math.sh: set MODEL_PATHS and MODEL_NAMES to your checkpoints
bash eval_scripts/eval_math.sh
```

Then run your preferred math metric on the generated *.jsonl files under `EVAL_OUTPUT_DIR`.
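A minimal scoring loop over the generated files might look like the sketch below, with `is_correct` standing in for whatever validator you prefer (e.g. Math-Verify); the `prediction`/`answer` field names are assumptions about the output schema, not the scripts' guaranteed format.

```python
import json

def score_jsonl(path, is_correct):
    # Accuracy over a *.jsonl file of generations; is_correct(pred, gold)
    # is your validator. Field names are illustrative assumptions.
    total = hits = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            hits += bool(is_correct(record["prediction"], record["answer"]))
    return hits / max(total, 1)
```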
For the other verifiable tasks, use the same pattern: point the eval scripts at your checkpoint directories and data, run generate_vllm.py (or equivalent), then apply task-specific scoring (e.g. pass@k for code; IFEval/IFBench for IF).
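For code, the standard unbiased pass@k estimator (Chen et al., 2021) can be computed as:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k: probability that at least one of k samples, drawn
    # without replacement from n generations of which c are correct,
    # passes the tests.
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```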
For fuzzy benchmarks (e.g. creative writing, WildBench, arena-style comparisons), use:

```bash
export PROJECT_ROOT=/path/to/your/projects
export OPENAI_BASE_URL=http://your-vllm-server:80/v1  # or a local vLLM server
# Edit eval_scripts/eval_fuzzy.sh: set MODELS, MODEL_NAMES, BENCHMARKS
bash eval_scripts/eval_fuzzy.sh
```

The benchmark list and scoring follow the same spirit as RLMT; adjust `BENCHMARKS` and paths as in the script.
## Acknowledgement

GOLF builds on the following projects:
- AMPO — “More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration”; we adopt and extend the adaptive multi-guidance and hybrid training ideas.
- verl — Volcano Engine Reinforcement Learning for LLMs; our training stack is built on verl’s GRPO/PPO and infrastructure.
We also thank RLMT, critique-GRPO, SDPO, and Math-Verify for data, benchmarks, and tooling.
## Citation

If you use GOLF or this code, please cite:

```bibtex
@misc{huang2026bootstrappingexplorationgrouplevelnatural,
      title         = {Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning},
      author        = {Lei Huang and Xiang Cheng and Chenxiao Zhao and Guobin Shen and Junjie Yang and Xiaocheng Feng and Yuxuan Gu and Xing Yu and Bing Qin},
      year          = {2026},
      eprint        = {2603.04597},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CL},
      url           = {https://arxiv.org/abs/2603.04597},
}
```