ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

This repository contains the implementation of ZipRL, an adaptive multi-turn context compression framework for agentic search, as described in the paper:

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay
Context compression plays a pivotal role in enhancing the application of Large Language Models (LLMs) in multi-turn scenarios such as agentic search. ZipRL introduces: (1) a multi-granularity mechanism that adaptively compresses context based on document–query relevance, and (2) Hindsight Response Replay (HRR), which densifies sparse reward signals via advantage re-shaping in Group Relative Policy Optimization (GRPO).

ZipRL is built on verl and extends it with context-zip agent loops, compression quality scoring, and HRR-based reward reshaping for long-horizon search agent training.

Overview

Multi-granularity mechanism: The agent selects coarse-to-fine compression levels (e.g., Level 1 “Ultra-coarse” to Level 5 “Ultra-fine”) per document via in-context prompts, preserving more relevant mutual information than uniform compression (with theoretical support in the paper).
Compression quality score ($Q_{\text{com}}$): A heuristic metric over four dimensions (compression ratio, level-strategy consistency, information retention, and semantic completeness), used to evaluate each compression step without external reward models.
Hindsight Response Replay (HRR): Inspired by Hindsight Experience Replay (HER). The average compression quality over a trajectory, $\bar{Q}_{\text{com}}^{(i)}$, serves as a substitute goal; the advantage of turn $j$ in trajectory $i$ is re-shaped by the difference $Q_{\text{com}}^{(i,j)} - \bar{Q}_{\text{com}}^{(i)}$, densifying training signals and improving credit assignment.
GRPO training: ZipRL uses GRPO with HRR-integrated advantages for stable, sample-efficient policy optimization.
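
The scoring and re-shaping ideas above can be illustrated with a small sketch. This is not the repository's implementation: the unweighted mean used for the quality score, the additive combination with the GRPO advantage, the `beta` coefficient, and all function names are assumptions made for illustration. See the paper for the exact formulation.

```python
from statistics import mean, pstdev


def q_com(ratio, consistency, retention, completeness):
    """Toy quality score: unweighted mean of four sub-scores in [0, 1].

    The equal weighting is an assumption, not the paper's definition.
    """
    return mean([ratio, consistency, retention, completeness])


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize each trajectory's outcome
    reward against the other trajectories sampled for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def hrr_advantages(rewards, q_scores, beta=1.0):
    """Turn-level advantages (hypothetical additive form).

    rewards:  one outcome reward per trajectory in the group
    q_scores: q_scores[i][j] is the quality score of turn j in trajectory i
    """
    base = grpo_advantages(rewards)
    shaped = []
    for a_i, q_i in zip(base, q_scores):
        q_bar = mean(q_i)  # trajectory average, used as the substitute goal
        shaped.append([a_i + beta * (q - q_bar) for q in q_i])
    return shaped


# Two trajectories in one group, with three and two compression turns.
rewards = [1.0, 0.0]
q_scores = [[0.9, 0.5, 0.7], [0.4, 0.6]]
adv = hrr_advantages(rewards, q_scores, beta=0.5)
```

In this form the per-turn offsets sum to zero within a trajectory, so the trajectory's mean advantage is unchanged while turns with above-average compression quality receive more credit than turns below it.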

The agent uses three tools: Search (query → snippets), Open-Page (docid/URL → full content), and Finish (submit answer or stop).
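
To make the tool interface concrete, here is a minimal sketch of the three-tool action space. The function names, argument types, stub return values, and the `step` dispatcher are all hypothetical; the actual agent loop in this repository may be structured differently.

```python
from typing import Callable, Dict, List, Tuple, Union


# Stub backends for illustration; a real setup would query the search service.
def search(query: str) -> List[str]:
    """Search: query -> list of snippets."""
    return [f"snippet for: {query}"]


def open_page(doc_ref: str) -> str:
    """Open-Page: docid or URL -> full page content."""
    return f"full content of {doc_ref}"


def finish(answer: str) -> str:
    """Finish: submit the final answer (or stop)."""
    return answer


TOOLS: Dict[str, Callable] = {
    "search": search,
    "open_page": open_page,
    "finish": finish,
}


def step(tool: str, argument: str) -> Tuple[Union[str, List[str]], bool]:
    """Dispatch one tool call and report whether the episode is done."""
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    observation = TOOLS[tool](argument)
    return observation, tool == "finish"
```

A multi-turn rollout would call `step` repeatedly, compressing each observation before appending it to the context, until `finish` sets the done flag.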

Quick Start

Train ZipRL (GRPO + context-zip agent): set a few environment variables, run the script, then submit the job.

1. Set environment variables

Edit or export these before running:

| Variable | Meaning | Example |
| --- | --- | --- |
| `CONDA_ENV` | Conda environment name (or path) | `ziprl` |
| `PROJECT_DIR` | Repo root (optional; default: auto from script path) | `$(pwd)` |
| `MODEL_PATH` | SFT checkpoint to start GRPO from | `/path/to/sft/checkpoint` |
| `TRAIN_DATA_DIR` | Training data (parquet) | `$PROJECT_DIR/data/train.parquet` |
| `TEST_DATA_DIR` | Validation data (parquet) | `$PROJECT_DIR/data/val.parquet` |
| `SEARCH_URL` | Search service URL for the agent | `http://localhost:8002` |
| `LOG_DIR` | Training logs (optional) | `$PROJECT_DIR/logs` |
| `OPENAI_JUDGE_BASE_URL` | Judge API for validation (optional) | `http://localhost:8003/v1` |
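
Before launching a long training job, it can help to confirm that the variables are set. The helper below is a hypothetical pre-flight check and is not part of this repository; it assumes that every variable not marked optional in the table is required.

```python
import os

# Assumption: variables not marked "optional" in the table are required.
REQUIRED = ["CONDA_ENV", "MODEL_PATH", "TRAIN_DATA_DIR", "TEST_DATA_DIR", "SEARCH_URL"]
OPTIONAL = ["PROJECT_DIR", "LOG_DIR", "OPENAI_JUDGE_BASE_URL"]


def missing_variables(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


# Demo with an explicit mapping; call missing_variables() to check the real environment.
example_env = {"CONDA_ENV": "ziprl", "SEARCH_URL": "http://localhost:8002"}
print(missing_variables(example_env))
# prints ['MODEL_PATH', 'TRAIN_DATA_DIR', 'TEST_DATA_DIR']
```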

2. Run the training script

From the repository root:

chmod +x examples/context_zip_agent/run_grpo_context_zip.sh
./examples/context_zip_agent/run_grpo_context_zip.sh

The script activates the conda env, sets config paths, and launches `verl.trainer.main_ppo` with the context-zip GRPO config. Logs are written to `$LOG_DIR/<experiment_name>.log`.

3. Evaluation

ReAct-style evaluation (multi-turn tool use): `eval/evaluate_react.py`. Use the scripts in `eval/evalscript/` for batch runs (e.g. `batch_eval_react_all.sh`). Set `EVAL_WORKSPACE`, `EVAL_DATA_DIR`, `SEARCH_URL`, `MODEL_PATH_*`, and `API_KEYS` as needed (see comments in each script).

Summary-style evaluation: `eval/evaluate.py`. Batch scripts: `batch_eval_summary_api_qa.sh`, `batch_eval_summary_api_bc.sh`, etc.

Datasets used in the paper: MuSiQue, SQuAD, Frames, Bamboogle (multi-hop QA), and BrowseComp-plus (web browsing).
