huzhexin/ZipRL


ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

This repository contains the implementation of ZipRL, an adaptive multi-turn context compression framework for agentic search, as described in the paper:

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay
Context compression plays a pivotal role in enhancing the application of Large Language Models (LLMs) in multi-turn scenarios such as agentic search. ZipRL introduces: (1) a multi-granularity mechanism that adaptively compresses context based on document–query relevance, and (2) Hindsight Response Replay (HRR), which densifies sparse reward signals via advantage re-shaping in Group Relative Policy Optimization (GRPO).

ZipRL is built on verl and extends it with context-zip agent loops, compression quality scoring, and HRR-based reward reshaping for long-horizon search agent training.


Overview

  • Multi-granularity mechanism: The agent selects coarse-to-fine compression levels (e.g., Level 1 “Ultra-coarse” to Level 5 “Ultra-fine”) per document via in-context prompts, preserving more relevant mutual information than uniform compression (with theoretical support in the paper).
  • Compression quality score ($Q_{\text{com}}$): A heuristic metric over four dimensions (compression ratio, level-strategy consistency, information retention, and semantic completeness), used to evaluate each compression step without external reward models.
  • Hindsight Response Replay (HRR): Inspired by Hindsight Experience Replay (HER). The average compression quality over a trajectory is used as a substitute goal; turn-level advantages are re-shaped by the difference $Q_{\text{com}}^{(i,j)} - \bar{Q}_{\text{com}}^{(i)}$, densifying training signals and improving credit assignment.
  • GRPO training: ZipRL uses GRPO with HRR-integrated advantages for stable, sample-efficient policy optimization.
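
The scoring and re-shaping steps above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the equal weights in WEIGHTS, the function names, the scaling factor beta, and the additive way the hindsight term is combined with the group-relative advantage are all assumptions made for the example. Only the four score dimensions and the difference $Q_{\text{com}}^{(i,j)} - \bar{Q}_{\text{com}}^{(i)}$ come from the description above.

```python
from statistics import mean, pstdev

# Hypothetical equal weights over the four dimensions named in the overview.
WEIGHTS = {
    "compression_ratio": 0.25,
    "level_consistency": 0.25,
    "information_retention": 0.25,
    "semantic_completeness": 0.25,
}


def q_com(scores: dict[str, float]) -> float:
    """Weighted mean of the four sub-scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage: standardize each trajectory reward within its group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def hrr_advantages(
    rewards: list[float],
    q_per_turn: list[list[float]],
    beta: float = 1.0,
) -> list[list[float]]:
    """Turn-level advantages re-shaped by the hindsight term.

    rewards[i]       outcome reward of trajectory i in the group
    q_per_turn[i][j] Q_com of compression turn j in trajectory i
    Returns A[i][j] = A_i + beta * (Q_com[i][j] - mean_j Q_com[i][j]).
    The additive form and beta are assumptions of this sketch.
    """
    base = grpo_advantages(rewards)
    shaped = []
    for a_i, q_i in zip(base, q_per_turn):
        q_bar = mean(q_i)  # substitute goal: average quality over the trajectory
        shaped.append([a_i + beta * (q - q_bar) for q in q_i])
    return shaped
```

Under this additive form, the hindsight term sums to zero within each trajectory, so the mean turn-level advantage equals the trajectory's group-relative advantage; turns that compress better than the trajectory average are pushed up and the rest are pushed down.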

The agent uses three tools: Search (query → snippets), Open-Page (docid/URL → full content), and Finish (submit answer or stop).
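
As a rough picture of how a three-tool loop like this can be wired, the sketch below dispatches one tool call and reports whether the episode is finished. The ToolCall type, the tool names as strings, and the stub handlers are hypothetical and only illustrate the control flow; the repository's agent loop and tool schema may differ.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    name: str      # hypothetical identifiers: "search", "open_page", "finish"
    argument: str  # query, docid/URL, or final answer


def search(query: str) -> str:
    # Stub: a real handler would call the search service and return snippets.
    return f"[snippets for: {query}]"


def open_page(doc: str) -> str:
    # Stub: a real handler would fetch the full page for a docid or URL.
    return f"[full content of: {doc}]"


HANDLERS: dict[str, Callable[[str], str]] = {
    "search": search,
    "open_page": open_page,
}


def step(call: ToolCall) -> tuple[str, bool]:
    """Execute one tool call and return (observation, done)."""
    if call.name == "finish":
        return call.argument, True  # the argument is the submitted answer
    if call.name not in HANDLERS:
        raise ValueError(f"unknown tool: {call.name}")
    return HANDLERS[call.name](call.argument), False
```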

Quick Start

Train ZipRL (GRPO + context-zip agent): set the environment variables below, run the training script, then evaluate the trained model.

1. Set environment variables

Edit or export these before running:

| Variable | Meaning | Example |
| --- | --- | --- |
| CONDA_ENV | Conda environment name (or path) | ziprl |
| PROJECT_DIR | Repo root (optional; default: auto from script path) | $(pwd) |
| MODEL_PATH | SFT checkpoint to start GRPO from | /path/to/sft/checkpoint |
| TRAIN_DATA_DIR | Training data (parquet) | $PROJECT_DIR/data/train.parquet |
| TEST_DATA_DIR | Validation data (parquet) | $PROJECT_DIR/data/val.parquet |
| SEARCH_URL | Search service URL for the agent | http://localhost:8002 |
| LOG_DIR | Training logs (optional) | $PROJECT_DIR/logs |
| OPENAI_JUDGE_BASE_URL | Judge API for validation (optional) | http://localhost:8003/v1 |
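
Before launching a long run it can help to confirm that the variables are actually set. The helper below is a small pre-flight sketch and is not part of the repository. It assumes that every variable not marked optional above is required, and the function name missing_vars is hypothetical.

```python
import os

# Assumption: variables not marked "(optional)" in the table are required.
REQUIRED = ["CONDA_ENV", "MODEL_PATH", "TRAIN_DATA_DIR", "TEST_DATA_DIR", "SEARCH_URL"]
OPTIONAL = ["PROJECT_DIR", "LOG_DIR", "OPENAI_JUDGE_BASE_URL"]


def missing_vars(env=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example = {
    "CONDA_ENV": "ziprl",
    "MODEL_PATH": "/path/to/sft/checkpoint",
    "SEARCH_URL": "http://localhost:8002",
}
print(missing_vars(example))  # ['TRAIN_DATA_DIR', 'TEST_DATA_DIR']
```

Calling missing_vars() with no argument checks the current process environment, so it can be run in the same shell session right before the training script.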

2. Run the training script

From the repository root:

chmod +x examples/context_zip_agent/run_grpo_context_zip.sh
./examples/context_zip_agent/run_grpo_context_zip.sh

The script activates the conda env, sets config paths, and launches verl.trainer.main_ppo with the context-zip GRPO config. Logs are written to $LOG_DIR/<experiment_name>.log.

3. Evaluation

  • ReAct-style evaluation (multi-turn tool use):
    eval/evaluate_react.py
    Use the scripts in eval/evalscript/ for batch runs (e.g. batch_eval_react_all.sh). Set EVAL_WORKSPACE, EVAL_DATA_DIR, SEARCH_URL, MODEL_PATH_*, and API_KEYS as needed (see comments in each script).
  • Summary-style evaluation:
    eval/evaluate.py
    Batch scripts: batch_eval_summary_api_qa.sh, batch_eval_summary_api_bc.sh, etc.

Datasets used in the paper: MuSiQue, SQuAD, Frames, Bamboogle (multi-hop QA), and BrowseComp-plus (web browsing).
