This repository contains the implementation of ZipRL, an adaptive multi-turn context compression framework for agentic search, as described in the paper:
ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay
Context compression is central to applying Large Language Models (LLMs) in multi-turn scenarios such as agentic search, where retrieved documents accumulate across turns. ZipRL introduces: (1) a multi-granularity mechanism that adaptively compresses context based on document–query relevance, and (2) Hindsight Response Replay (HRR), which densifies sparse reward signals via advantage re-shaping in Group Relative Policy Optimization (GRPO).
ZipRL is built on verl and extends it with context-zip agent loops, compression quality scoring, and HRR-based reward reshaping for long-horizon search agent training.
- Multi-granularity mechanism: The agent selects coarse-to-fine compression levels (e.g., Level 1 “Ultra-coarse” to Level 5 “Ultra-fine”) per document via in-context prompts, preserving more relevant mutual information than uniform compression (with theoretical support in the paper).
- Compression quality score ($Q_{\text{com}}$): A heuristic metric over four dimensions (compression ratio, level-strategy consistency, information retention, and semantic completeness) used to evaluate each compression step without external reward models.
- Hindsight Response Replay (HRR): Inspired by Hindsight Experience Replay (HER). The average compression quality over a trajectory is used as a substitute goal; turn-level advantages are re-shaped by the difference $Q_{\text{com}}^{(i,j)} - \bar{Q}_{\text{com}}^{(i)}$, densifying training signals and improving credit assignment.
- GRPO training: ZipRL uses GRPO with HRR-integrated advantages for stable, sample-efficient policy optimization.
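The HRR re-shaping described in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names, the additive combination, and the scaling factor `hrr_weight` are hypothetical. Only the difference term $Q_{\text{com}}^{(i,j)} - \bar{Q}_{\text{com}}^{(i)}$ and the use of group-relative advantages come from the description.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize each trajectory's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def hrr_reshape(rewards, q_com, hrr_weight=0.5):
    """Return per-turn advantages for every trajectory in one group.

    rewards:    one outcome reward per trajectory i.
    q_com:      q_com[i][j] is the compression quality score of turn j in trajectory i.
    hrr_weight: hypothetical scaling factor (an assumption, not taken from the source).
    """
    base = grpo_advantages(rewards)
    shaped = []
    for adv_i, q_i in zip(base, q_com):
        q_bar = mean(q_i)  # trajectory-average quality acts as the substitute goal
        # Turns that compressed better than the trajectory average get a boost,
        # turns that compressed worse get a penalty.
        shaped.append([adv_i + hrr_weight * (q - q_bar) for q in q_i])
    return shaped


if __name__ == "__main__":
    rewards = [1.0, 0.0]
    q_com = [[0.8, 0.6], [0.5, 0.5, 0.2]]
    for i, turns in enumerate(hrr_reshape(rewards, q_com)):
        print(i, [round(a, 3) for a in turns])
```

Because the per-turn offsets are centered on the trajectory mean, they sum to zero within each trajectory: in this sketch the shaping redistributes credit across turns without changing the trajectory's total advantage.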
The agent uses three tools: Search (query → snippets), Open-Page (docid/URL → full content), and Finish (submit answer or stop).
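As a rough illustration of the three-tool interface, the sketch below dispatches tool calls against an in-memory corpus. Everything here is hypothetical: the class `ToyToolbox`, the `run_tool` helper, and the call format `{"tool": ..., "arg": ...}` are assumptions made for the example and do not reflect the repository's actual tool schema or search service.

```python
from dataclasses import dataclass, field


@dataclass
class ToyToolbox:
    """In-memory stand-in for the search service (hypothetical, for illustration)."""
    pages: dict = field(default_factory=dict)  # docid -> full page text

    def search(self, query, top_k=3):
        # Search: query -> snippets (here: the first 40 characters of matching pages)
        terms = query.lower().split()
        hits = [(docid, text[:40]) for docid, text in self.pages.items()
                if any(t in text.lower() for t in terms)]
        return hits[:top_k]

    def open_page(self, docid):
        # Open-Page: docid/URL -> full content (empty string if unknown)
        return self.pages.get(docid, "")


def run_tool(toolbox, call):
    """Dispatch one tool call and return (observation, done)."""
    name, arg = call["tool"], call.get("arg", "")
    if name == "search":
        return toolbox.search(arg), False
    if name == "open_page":
        return toolbox.open_page(arg), False
    if name == "finish":
        return arg, True  # Finish: submit the answer (or stop) and end the episode
    raise ValueError(f"unknown tool: {name}")


if __name__ == "__main__":
    box = ToyToolbox(pages={"d1": "Paris is the capital of France."})
    print(run_tool(box, {"tool": "search", "arg": "capital"}))
    print(run_tool(box, {"tool": "open_page", "arg": "d1"}))
    print(run_tool(box, {"tool": "finish", "arg": "Paris"}))
```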
Train ZipRL (GRPO + context-zip agent): set a few environment variables, run the script, then submit the job.
Edit or export these before running:
| Variable | Meaning | Example |
|---|---|---|
| `CONDA_ENV` | Conda environment name (or path) | `ziprl` |
| `PROJECT_DIR` | Repo root (optional; default: auto from script path) | `$(pwd)` |
| `MODEL_PATH` | SFT checkpoint to start GRPO from | `/path/to/sft/checkpoint` |
| `TRAIN_DATA_DIR` | Training data (parquet) | `$PROJECT_DIR/data/train.parquet` |
| `TEST_DATA_DIR` | Validation data (parquet) | `$PROJECT_DIR/data/val.parquet` |
| `SEARCH_URL` | Search service URL for the agent | `http://localhost:8002` |
| `LOG_DIR` | Training logs (optional) | `$PROJECT_DIR/logs` |
| `OPENAI_JUDGE_BASE_URL` | Judge API for validation (optional) | `http://localhost:8003/v1` |
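Before launching, it can help to confirm that the variables are actually set. The following pre-flight check is a small sketch, not part of the repository: the helper name `missing_variables` is hypothetical, and the required/optional split is inferred from the "(optional)" notes in the table, so it may not match what the training script enforces.

```python
import os

# Assumed split, inferred from the "(optional)" notes in the table above.
REQUIRED = ["CONDA_ENV", "MODEL_PATH", "TRAIN_DATA_DIR", "TEST_DATA_DIR", "SEARCH_URL"]
OPTIONAL = ["PROJECT_DIR", "LOG_DIR", "OPENAI_JUDGE_BASE_URL"]


def missing_variables(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        raise SystemExit("Missing required variables: " + ", ".join(missing))
    for name in OPTIONAL:
        print(f"{name}={os.environ.get(name, '(script default)')}")
```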
From the repository root:
chmod +x examples/context_zip_agent/run_grpo_context_zip.sh
./examples/context_zip_agent/run_grpo_context_zip.sh

The script activates the conda env, sets config paths, and launches verl.trainer.main_ppo with the context-zip GRPO config. Logs are written to $LOG_DIR/<experiment_name>.log.
- ReAct-style evaluation (multi-turn tool use): `eval/evaluate_react.py`
  Use the scripts in `eval/evalscript/` for batch runs (e.g. `batch_eval_react_all.sh`). Set `EVAL_WORKSPACE`, `EVAL_DATA_DIR`, `SEARCH_URL`, `MODEL_PATH_*`, and `API_KEYS` as needed (see comments in each script).
- Summary-style evaluation: `eval/evaluate.py`
  Batch scripts: `batch_eval_summary_api_qa.sh`, `batch_eval_summary_api_bc.sh`, etc.
Datasets used in the paper: MusiQue, SQuAD, Frames, Bamboogle (multi-hop QA), and BrowseComp-plus (web browsing).