ClaimDiff-RL

Claim-level Differential Reward for Vision-Language Caption RL.

ClaimDiff-RL improves image caption quality through reinforcement learning with a claim-level differential reward. Instead of scoring captions holistically, it decomposes the comparison between a model's caption and a reference caption into atomic claim-level differences, judges each against the image using a multimodal LLM (Gemini), and aggregates per-claim verdicts into a fine-grained reward signal for GRPO/PPO training.

TODO

Release benchmark evaluation suite
Release model weights

Architecture

┌───────────────────────────────────────────────────────────────────┐
│                       Training Loop (verl)                        │
│                                                                   │
│  ┌────────────┐    ┌──────────┐    ┌──────────────────────────┐   │
│  │   Actor    │───>│ Rollout  │───>│  Remote Reward Manager   │   │
│  │ (Qwen3-VL) │    │ (SGLang) │    │ (sends to reward server) │   │
│  └────────────┘    └──────────┘    └────────────┬─────────────┘   │
│                                                 │                 │
└─────────────────────────────────────────────────┼─────────────────┘
                                                  │ HTTP
                                     ┌────────────▼───────────┐
                                     │   Reward Server        │
                                     │   (FastAPI)            │
                                     │                        │
                                     │  ┌──────────────────┐  │
                                     │  │ ClaimDiff        │  │
                                     │  │ Verifier         │  │
                                     │  │                  │  │
                                     │  │ 1. Extract claims│  │
                                     │  │ 2. Compare A↔B   │  │
                                     │  │ 3. Judge via     │  │
                                     │  │    Gemini+Image  │  │
                                     │  │ 4. Aggregate     │  │
                                     │  │    reward        │  │
                                     │  └──────────────────┘  │
                                     └────────────────────────┘

Reward flow:

The actor model generates a caption (Caption A) for an image.
The caption, along with a reference caption (Caption B, e.g., from Gemini), is sent to the reward server.
The ClaimDiff verifier prompts Gemini to identify atomic differences between the two captions, each with an ASPECT, per-caption CLAIM, JUDGMENT, and ERROR_TYPE.
Error counts (optionally severity-weighted) are aggregated into a scalar reward using configurable modes (normalized_diff, normalized_a_only, judgment_based, etc.).
The reward drives GRPO/PPO updates.

Repository Structure

ClaimDiff-RL/
├── README.md
├── requirements.txt
├── reward_server/                    # Standalone reward server
│   ├── reward_serving_fastapi.py     # FastAPI app with /judge endpoint
│   ├── docker/
│   │   └── Dockerfile
│   └── verifier/
│       ├── __init__.py
│       ├── main.py                   # Verifier registry & BaseVerifier ABC
│       ├── helper.py                 # Format reward (think/answer tags)
│       ├── utils_caption_diff.py     # Core ClaimDiff verifier (gemini_caption_diff)
│       ├── utils_caption_diff_judge.py  # F1-based judge verifier (caption_judge)
│       └── utils_caption_holistic.py    # Holistic baseline verifier (B1)
├── recipe/
│   ├── caption_dataset.py            # Custom dataset class for caption RL
│   ├── config/
│   │   └── caption.yaml              # Hydra config
│   ├── scripts/
│   │   └── run_qwen3vl_caption_rl.sh # Training launch script
│   └── data/
│       └── convert_caption_train_data.py  # Data conversion example
└── verl_patch/
    └── workers/
        └── reward_manager/
            ├── remote.py             # Remote reward manager (register as "remote")
            └── remote_proxy.py       # HTTP proxy for reward server communication

Prerequisites

Python 3.10+
PyTorch 2.1+
verl framework (v0.6+)
A Gemini API key (for the reward server's claim-diff judge)
8+ GPUs (tested on 8xA100/H100)

Setup

1. Install verl

git clone https://github.com/volcengine/verl.git
cd verl
pip install -e .

2. Install dependencies

pip install -r requirements.txt

3. Apply verl patches

Copy the remote reward manager into your verl installation:

cp verl_patch/workers/reward_manager/remote.py \
   <verl_install_path>/verl/workers/reward_manager/remote.py
cp verl_patch/workers/reward_manager/remote_proxy.py \
   <verl_install_path>/verl/workers/reward_manager/remote_proxy.py

Register the remote reward manager in <verl_install_path>/verl/workers/reward_manager/__init__.py:

from .remote import RemoteRewardManager

4. Prepare data

Training data should be in Parquet format with these columns:

Column	Type	Description
`images`	`str` (JSON)	`'["/path/to/image.jpg"]'`
`data_source`	`str`	Dataset identifier
`prompt`	`str` (JSON)	Chat-format messages, e.g., `'[{"role":"user","content":"<image> Describe this image."}]'`
`reward_model`	`str` (JSON)	`{"answer":"", "ground_truth":"<reference caption>", "verifier":"gemini_caption_diff", "verifier_parm":{"image_path":"/path/to/image.jpg"}, "format_ratio":0.0}`
`extra_info`	`str` (JSON)	`{"id":"...", "image_path":"/path/to/image.jpg", "question":"..."}`
`max_image_token`	`int`	Max image tokens (e.g., `8192`)

See recipe/data/convert_caption_train_data.py for a conversion example.

Running

Step 1: Start the reward server

export GEMINI_API_KEY=<your-gemini-api-key>

# Optional: configure model and reward behavior
export GEMINI_MODEL_NAME=gemini-2.5-flash       # Default: gemini-2.5-flash
export ERROR_DIFF_MODE=normalized_a_only         # See "Reward Modes" below
export DIFF_PROMPT_VERSION=v5                    # v1, v2, v4, v5, v6

cd reward_server
python reward_serving_fastapi.py

The server starts on port 8000 with 8 workers. Deploy multiple instances for higher throughput.

Step 2: Launch training

export ACTOR_LOAD_PATH=/path/to/Qwen3-VL-8B
export DATA_TRAIN_FILE=/path/to/train.parquet
export DATA_VAL_FILE=/path/to/val.parquet
export TRAIN_SAVE_PATH=/path/to/checkpoints
export _REMOTE_REWARD_JOB_ID=<reward-server-host>
export _REMOTE_REWARD_WORKER_NUM=1
export _REMOTE_REWARD_SERVER_PORT=8000

bash recipe/scripts/run_qwen3vl_caption_rl.sh

Reward Modes

The ERROR_DIFF_MODE environment variable controls how claim-level error counts are converted to reward scores:

Mode	Formula	Description
`normalized_diff`	`(A_errors - B_errors) / total_diffs`	Relative error difference, normalized. Default.
`normalized_a_only`	`f(A_errors, total_diffs)`	Smooth reward based only on actor errors. Uses exponential decay with difficulty bonus.
`a_only`	Linear map of raw A error count	Actor errors only, linear reward mapping.
`judgment_based`	`-(A_wins - B_wins) / total_diffs`	Based on per-claim JUDGMENT field winners.
`raw_diff`	`A_errors - B_errors`	Raw difference without normalization.
`hall_vs_miss`	Hallucination + missing fact penalty	Image-verified hallucination detection.

Verifier Types

Verifier	Registry Name	Description
`GeminiCaptionDiffVerifier`	`gemini_caption_diff`	Core. Decomposes captions into claim-level diffs and scores via Gemini.
`CaptionJudgeVerifier`	`caption_judge`	Model predicts diffs, Gemini judges pred vs GT with image → F1 reward.
`GeminiCaptionHolisticVerifier`	`gemini_caption_holistic`	Baseline (B1). Holistic 0-10 score without decomposition.

Configuration Reference

Reward Server Environment Variables

Variable	Default	Description
`GEMINI_API_KEY`	(required)	Gemini API key
`GEMINI_API_BASE`	Google's API	Gemini API base URL
`GEMINI_MODEL_NAME`	`gemini-2.5-flash`	Gemini model to use
`ERROR_DIFF_MODE`	`normalized_diff`	Reward computation mode
`DIFF_PROMPT_VERSION`	`v1`	Diff prompt version (v1/v2/v4/v5/v6)
`USE_SEVERITY_WEIGHTED_REWARD`	`false`	Weight errors by severity (1/2/3)
`USE_AMBIGUITY_PENALTY`	`false`	Penalize hedging language
`REQUIRE_THINK_BLOCK`	`false`	Require `<think>` and `<answer>` tags
`MAX_RETRIES`	`8`	Max API retries per request
`LOG_SAMPLE_RATE`	`0.1`	Fraction of requests to log
`SAVE_TRAINING_DATA`	`false`	Save reward data to JSONL
`SAVE_TRAINING_DATA_PATH`	-	JSONL output path

Training Script Variables

Variable	Default	Description
`ACTOR_LOAD_PATH`	(required)	HuggingFace model path
`DATA_TRAIN_FILE`	(required)	Training parquet file
`ROLLOUT_N`	`4`	Number of rollout samples per prompt
`ACTOR_LR`	`1e-6`	Actor learning rate
`ALGO_ADV_ESTIMATOR`	`grpo`	Advantage estimator (`grpo` or `gae`)
`FREEZE_VISION_TOWER`	`True`	Freeze the vision encoder
`ACC_SCALE_RANGE`	`[0, 1.0]`	Scale range for accuracy reward

License

Apache License 2.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ClaimDiff-RL

TODO

Architecture

Repository Structure

Prerequisites

Setup

1. Install verl

2. Install dependencies

3. Apply verl patches

4. Prepare data

Running

Step 1: Start the reward server

Step 2: Launch training

Reward Modes

Verifier Types

Configuration Reference

Reward Server Environment Variables

Training Script Variables

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
assets		assets
recipe		recipe
reward_server		reward_server
verl_patch/workers/reward_manager		verl_patch/workers/reward_manager
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

ClaimDiff-RL

TODO

Architecture

Repository Structure

Prerequisites

Setup

1. Install verl

2. Install dependencies

3. Apply verl patches

4. Prepare data

Running

Step 1: Start the reward server

Step 2: Launch training

Reward Modes

Verifier Types

Configuration Reference

Reward Server Environment Variables

Training Script Variables

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages