GOLF: Guidance-Optimized Learning with Feedback

Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning


GOLF is an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. It aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level signals are distilled into high-quality refinements, which are adaptively injected into training as off-policy scaffolds that provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities.

This repository supports fuzzy tasks (e.g., chat) and verifiable tasks (math, code, and instruction following, IF), with task-specific reward and critique pipelines built on verl and AMPO-style hybrid GRPO.


Table of Contents

  • Installation
  • Repository Structure
  • Data Preparation
  • Training
  • Evaluation
  • Acknowledgement
  • Citation

Installation

Requirements: Python 3.10, PyTorch, CUDA, vLLM (for rollout). We recommend a dedicated conda environment.

conda create -n golf python=3.10
conda activate golf
cd golf
cd verl
# For FSDP (no Megatron):
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
# For Megatron-backed training, use: bash scripts/install_vllm_sglang_mcore.sh
cd ..
pip install -r requirements.txt

Repository Structure

GOLF/
├── golf/                    # Core training code (built on verl)
│   └── verl/
│       └── verl/adaptive_mix_src/   # GOLF trainer, critique refiner, reward
├── data/                    # Data preparation scripts
├── exp_scripts/             # Training launch scripts
│   ├── critique_grpo_hybrid_math.sh      # Verifiable: math
│   ├── critique_grpo_hybrid_if.sh        # Verifiable: instruction following (IF)
│   ├── critique_grpo_hybrid_code.sh      # Verifiable: code
│   └── critique_grpo_hybrid_wildchat.sh  # Fuzzy: wildchat / chat
├── eval_scripts/            # Evaluation and generation
│   ├── eval_math.sh         # Verifiable math eval
│   ├── eval_fuzzy.sh        # Fuzzy benchmark eval (RLMT-style)
│   ├── generate_vllm.py
│   └── ...
└── README.md

Data Preparation

Data format and preprocessing per task:

  • Fuzzy (chat / instruction following): RLMT
  • Math: critique-GRPO
  • Code: SDPO
  • IF: allenai/IF_multi_constraints_upto5

Training

Scripts assume repo root at $PROJECT_ROOT/GOLF. Set PROJECT_ROOT, MODEL_PATH, TRAIN_FILE, TEST_FILE (and for IF: IFEVAL_VAL_FILE, IFBENCH_VAL_FILE) as needed. Optional: export WANDB_API_KEY=your_key or WANDB_MODE=disabled.

Prepare data per task using the Data Preparation references above, then run:

Verifiable: Math

export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Qwen3-8B
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/openr1_math_4k_train.parquet
export TEST_FILE=$PROJECT_ROOT/GOLF/data/openr1_math_4k_test.parquet
bash exp_scripts/critique_grpo_hybrid_math.sh

Verifiable: Instruction Following (IF)

export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Qwen3-4B
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/if_train.parquet
export IFEVAL_VAL_FILE=$PROJECT_ROOT/GOLF/data/ifeval_test.parquet
export IFBENCH_VAL_FILE=$PROJECT_ROOT/GOLF/data/ifbench_test.parquet
bash exp_scripts/critique_grpo_hybrid_if.sh

Verifiable: Code

export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Qwen3-8B
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/lcb_v6_train.parquet
export TEST_FILE=$PROJECT_ROOT/GOLF/data/lcb_v6_test.parquet
bash exp_scripts/critique_grpo_hybrid_code.sh

Fuzzy: Wildchat / chat

export PROJECT_ROOT=/path/to/your/projects
export MODEL_PATH=/path/to/pretrained_models/Llama-3.1-8B-Instruct
export TRAIN_FILE=$PROJECT_ROOT/GOLF/data/wildchat-if_train.parquet
export TEST_FILE=$PROJECT_ROOT/GOLF/data/wildchat-if_val.parquet
bash exp_scripts/critique_grpo_hybrid_wildchat.sh

Checkpoints: $PROJECT_ROOT/GOLF/checkpoints/<model_name>/golf/<exp_name>/. Merge FSDP shards via eval_scripts/model_merge.sh when needed.
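The critique_grpo_hybrid_* scripts build on GRPO-style group-relative advantages. As a toy illustration of that baseline (a sketch, not the repo's implementation), each rollout's advantage is its reward minus the mean reward of its group, normalized by the group's standard deviation:

```shell
# Illustrative only: GRPO-style group-relative advantages for one
# group of four rollout rewards (mean-centered, std-normalized).
echo "0.0 1.0 0.0 1.0" | tr ' ' '\n' | awk '
  { r[NR] = $1; sum += $1 }
  END {
    n = NR; mean = sum / n
    for (i = 1; i <= n; i++) var += (r[i] - mean) ^ 2
    std = sqrt(var / n) + 1e-8   # epsilon guards all-equal rewards
    for (i = 1; i <= n; i++) printf "%.4f\n", (r[i] - mean) / std
  }'
# Prints -1.0000, 1.0000, -1.0000, 1.0000: failed rollouts are pushed
# down and successful ones up, relative to their own group.
```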


Evaluation

Verifiable: Math

Run inference, then score the outputs (e.g., with Math-Verify or your own validator). Set PROJECT_ROOT, EVAL_DATA, EVAL_OUTPUT_DIR, and the MODEL_PATHS array to your merged checkpoints.

export PROJECT_ROOT=/path/to/your/projects
# Edit eval_scripts/eval_math.sh: set MODEL_PATHS and MODEL_NAMES to your checkpoints

bash eval_scripts/eval_math.sh

Then run your preferred math metric on the generated *.jsonl under EVAL_OUTPUT_DIR.
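For example, if your validator writes a boolean correctness field into each record (the field name "correct" below is hypothetical; match it to your validator's actual output schema), a quick accuracy tally might look like:

```shell
# Hypothetical scoring sketch: assumes each line of the eval .jsonl
# carries a boolean "correct" field written by your validator.
printf '%s\n' '{"correct": true}' '{"correct": false}' '{"correct": true}' \
  > /tmp/sample_eval.jsonl
python3 - <<'EOF'
import json

total = hits = 0
with open("/tmp/sample_eval.jsonl") as f:
    for line in f:
        total += 1
        hits += bool(json.loads(line).get("correct"))
print(f"accuracy: {hits / total:.3f}")  # accuracy: 0.667
EOF
```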

Verifiable: Code / IF

Use the same pattern: point the eval scripts to your checkpoint dirs and data, run generate_vllm.py (or equivalent), then run task-specific scoring (e.g. pass@k for code, IFEval/IFBench for IF).
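For code, the standard metric is the unbiased pass@k estimator from the Codex evaluation methodology: given n samples per problem with c correct, pass@k = 1 - C(n-c, k)/C(n, k). A minimal python3 sketch (illustrative, not the repo's scorer):

```shell
python3 - <<'EOF'
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n (c of them correct) passes."""
    if n - c < k:
        return 1.0  # every k-subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=5), 4))  # 0.9167
EOF
```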

Fuzzy (RLMT-style)

For fuzzy benchmarks (e.g. creative writing, WildBench, arena), use:

export PROJECT_ROOT=/path/to/your/projects
export OPENAI_BASE_URL=http://your-vllm-server:80/v1   # or local vLLM
# Edit eval_scripts/eval_fuzzy.sh: set MODELS, MODEL_NAMES, BENCHMARKS

bash eval_scripts/eval_fuzzy.sh

Benchmark list and scoring follow the same spirit as RLMT; adjust BENCHMARKS and paths as in the script.


Acknowledgement

GOLF builds on the following projects:

  • AMPO — “More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration”; we adopt and extend the adaptive multi-guidance and hybrid training ideas.
  • verl — Volcano Engine Reinforcement Learning for LLMs; our training stack is built on verl’s GRPO/PPO and infrastructure.

We also thank RLMT, critique-GRPO, SDPO, and Math-Verify for data, benchmarks, and tooling.


Citation

If you use GOLF or this code, please cite:

@misc{huang2026bootstrappingexplorationgrouplevelnatural,
  title         = {Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning},
  author        = {Lei Huang and Xiang Cheng and Chenxiao Zhao and Guobin Shen and Junjie Yang and Xiaocheng Feng and Yuxuan Gu and Xing Yu and Bing Qin},
  year          = {2026},
  eprint        = {2603.04597},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2603.04597},
}
