Skip to content

kepengxu/PRISM-VL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

PRSIMVL logo

PRSIMVL

Allegory of the Cave: Measurement-Grounded Vision-Language Learning

Kepeng Xu · Li Xu · Gang He · Wenxin Yu

Project Page · arXiv:2605.11727 · Synthetic RAW precursor · MeasL-Bench-V1 · MeasL-150K-V1 · Weights · 中文

Method Input Benchmark Training Backbone Adapters

PRSIMVL is a research release for asking a simple but under-tested question: when the RGB image has already lost sensor evidence, can a vision-language model reason better from measurement-domain observations?

PRSIMVL keeps the familiar Qwen3-VL training and inference workflow, but changes the visual interface from post-ISP RGB to RAW-derived Meas.-XYZ plus camera metadata. The release includes the benchmark, training corpus, evaluation pipeline, service demo, and LoRA checkpoints needed to reproduce the core findings.

PRSIMVL framework

The 30-Second Version

What Release
Core idea Use RAW-derived Meas.-XYZ and capture metadata when RGB rendering clips, denoises, tone maps, or quantizes away evidence.
Benchmark MeasL-Bench-V1, 2,183 held-out matched examples over 14 measurement-sensitive capability slices.
Training data MeasL-150K-V1, 152,517 instruction-tuning examples with 48,000 release images.
Model family Qwen3-VL 2B, 4B, and 8B with released PRSIMVL LoRA adapters hosted on Hugging Face.
Headline result PRSIMVL-8B improves over RGB Qwen3-VL-8B by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 LLM-Judge points on MeasL-Bench.

Start Here

Goal Entry point What you get
Ask one image question inference/README.md Start swift deploy, send a local image with ask_service.py, inspect the answer.
Run the benchmark eval/README.md Evaluate Meas.-XYZ or matched RGB with the packaged wrapper.
Inspect benchmark data eval_data/README.md Dataset card, taxonomy, schema, HF loading snippet, path rules.
Inspect training data training_data/README.md Dataset card, release contents, quality checks, registry aliases.
Understand release scope RELEASE_MANIFEST.md What is included, pruned, and expected as large artifacts.

Quick Start

Install the release snapshot:

git clone <repo-url> PRSIMVL
cd PRSIMVL
bash install_editable.sh

Run a dry-run check without launching model inference:

MODEL_SIZE=2b CUDA_VISIBLE_DEVICES=0 bash eval/run_infer_and_eval.sh --dry-run

Run one PRSIMVL adapter on the default Meas.-XYZ benchmark split:

MODEL_SIZE=2b CUDA_VISIBLE_DEVICES=0 bash eval/run_infer_and_eval.sh

Large artifacts are expected at these release-local paths:

eval_data/       # MeasL-Bench-V1 JSONL + image/
training_data/   # MeasL-150K-V1 JSONL + image/
exps/            # released LoRA adapters

Download the public data from Hugging Face: MeasL-Bench-V1 for evaluation and MeasL-150K-V1 for training. Released LoRA weights are hosted at kepeng/PRSIMVL-LoRA-V1; restore them under exps/ before running adapter inference or full evaluation.

Why Measurement Grounding Matters

RGB is a display-oriented product of an image signal processor. It is useful, compact, and familiar, but it may remove the evidence that a downstream model needs. PRSIMVL treats the camera measurement as a first-class observation: Meas.-XYZ preserves a linear, three-channel view derived from RAW measurements, and metadata supplies capture context such as ISO, exposure time, and aperture.

The examples below show low-illumination text cases where RGB rendering exposes misleading evidence while Meas.-XYZ keeps the answer region recoverable.

Case RGB Observation Meas.-XYZ Observation
Illuminated shop name
RGB answer: Hua Tian Hua (wrong)

PRSIMVL answer: 正美口腔
Yellow sign text
RGB answer: diamond (wrong)

PRSIMVL answer: BLACK

Zoomed evidence crops:

RGB crop Meas.-XYZ crop RGB crop Meas.-XYZ crop

Earlier Synthetic-RAW Version

This release builds on our earlier synthetic-RAW prototype, End-to-End RAW Synergy for Elevated Vision-Language Reasoning, which introduced Raw-VLM with a learnable ISP frontend and RAW-tokenization for VLM reasoning. That first version used synthetic RAW data to study whether RAW sensor information can improve captioning, VQA, and hallucination behavior.

PRSIMVL extends that direction toward a release centered on measurement-grounded inputs: RAW-derived Meas.-XYZ, camera metadata grounding, MeasL-Bench-V1, MeasL-150K-V1, and released Qwen3-VL LoRA adapters.

Main Results

The table reports the held-out MeasL-Bench protocol. BLEU and ROUGE-L are lexical metrics; LLM-Judge is reported as accuracy percentage.

Model Visual Input BLEU ROUGE-L LLM-Judge
Qwen3-VL-2B RGB 0.3407 0.3171 69.54
Qwen3-VL-4B RGB 0.4442 0.3453 77.37
Qwen3-VL-8B RGB 0.5046 0.3500 78.20
PRSIMVL-2B Meas.-XYZ + metadata 0.5865 0.4244 77.99
PRSIMVL-4B Meas.-XYZ + metadata 0.6021 0.4465 80.83
PRSIMVL-8B Meas.-XYZ + metadata 0.6120 0.4571 82.66

Capability radar

Where It Helps Most

Capability RGB Qwen3-VL-8B BLEU / ROUGE-L PRSIMVL-2B BLEU / ROUGE-L
HDR Evidence Recovery (HER) 0.5343 / 0.3614 0.6066 / 0.4533
Low-Illumination Evidence Recovery (LER) 0.3470 / 0.2851 0.5174 / 0.4249
Scene Text Recognition (STR) 0.3719 / 0.3604 0.5084 / 0.4669
General Visual Grounding (GVG) 0.5109 / 0.3644 0.6117 / 0.4505
Agent and Entity Identification (AEI) 0.5304 / 0.4332 0.6210 / 0.5307
Binary Visual Verification (BVV) 0.5367 / 0.3580 0.6186 / 0.3732

What Is In This Release

Artifact Location Notes
Benchmark eval_data/ and HF Matched Meas.-XYZ/RGB JSONL files and 3,812 images.
Training data training_data/ and HF 152,517 instruction-tuning examples and 48,000 images.
Demo inference inference/ OpenAI-compatible swift deploy service demo for local images.
Evaluation wrapper eval/ Reproducible MeasL-Bench inference and offline evaluation entrypoint.
Training configs configs/qwen3_vl_150k_llmmeta_vit_proxy/ Launch scripts and SFT configs for 2B, 4B, and 8B.
Released adapters exps/ and HF LoRA checkpoints for Qwen3-VL 2B, 4B, and 8B.

Benchmark And Data

MeasL-Bench-V1

MeasL-Bench-V1 is the held-out benchmark for measurement-grounded language-vision evaluation.

File Rows / Files Purpose
eval_data/test-raw-measl-bench.jsonl 2,183 rows Main Meas.-XYZ benchmark.
eval_data/test-rgb-measl-bench.jsonl 2,183 rows Matched RGB benchmark.
eval_data/image/ 3,812 files Local image assets referenced by both JSONL files.

All JSONL image paths use the release-local form eval_data/image/.... When reading directly from the Hugging Face dataset repository root, remove the leading eval_data/ prefix.

Capability taxonomy
Label Capability Count
CAG Chromatic Attribute Grounding 150
NG Numerosity Grounding 150
DSG Descriptive Scene Grounding 150
HER HDR Evidence Recovery 150
LER Low-Illumination Evidence Recovery 233
STR Scene Text Recognition 150
GVG General Visual Grounding 150
CVR Compositional Visual Reasoning 150
SRU Spatial Relation Understanding 150
MSQ Manner and State Queries 150
EAQ Entity and Attribute Queries 150
DS Discriminative Selection 150
AEI Agent and Entity Identification 150
BVV Binary Visual Verification 150

MeasL-150K-V1

MeasL-150K-V1 is the released instruction-tuning corpus.

File Rows / Files Purpose
training_data/train-measl-150k-v1.jsonl 152,517 rows Final instruction-tuning set.
training_data/image/ 48,000 files Release image subset referenced by the JSONL file.

The corpus was built from approximately 700K auto-annotated candidates, filtered to 518,433 post-scoring records, balanced by source and question structure, and decontaminated against MeasL-Bench before release.

Demo Inference

Start an OpenAI-compatible service with a released adapter:

conda activate msswiftv1_service
CUDA_VISIBLE_DEVICES=0 swift deploy \
  --model Qwen/Qwen3-VL-2B-Instruct \
  --adapters exps/BANALCED_150K_META_VIT_PROXY/output-Qwen3-VL-2B-Instruct/v8-20260421-133546/checkpoint-95000

Ask a question from another terminal:

conda activate msswiftv1_service
python inference/ask_service.py \
  --image inference/demo_data/images/demo1_pole_color.png \
  --question "This is a linear Image with Metadata: ISO: 250, Exposure Time: 1/640, Aperture: f/9. What is the color of the vertical pole visible through the windshield?"

See inference/README.md for demo images, request options, and troubleshooting.

Evaluation

Run the default Meas.-XYZ benchmark:

MODEL_SIZE=4b CUDA_VISIBLE_DEVICES=0 bash eval/run_infer_and_eval.sh

Run the matched RGB benchmark:

DATASET=rgb MODEL_SIZE=4b CUDA_VISIBLE_DEVICES=0 bash eval/run_infer_and_eval.sh

Enable LLM-as-judge through an OpenAI-compatible endpoint:

export JUDGE_API_KEY=YOUR_KEY
JUDGE_URL=https://openrouter.ai/api/v1 \
JUDGE_MODEL=openai/gpt-5 \
MODEL_SIZE=2b CUDA_VISIBLE_DEVICES=0 \
bash eval/run_infer_and_eval.sh

The evaluation entrypoint defaults to eval_data/test-raw-measl-bench.jsonl or eval_data/test-rgb-measl-bench.jsonl. Use DATASET_FILE and IMAGE_ROOT only for external datasets or non-standard image locations. Full options are documented in eval/README.md.

Training

Final training configs are under configs/qwen3_vl_150k_llmmeta_vit_proxy/.

bash configs/qwen3_vl_150k_llmmeta_vit_proxy/train_prsimvl_2b.sh
bash configs/qwen3_vl_150k_llmmeta_vit_proxy/train_prsimvl_4b.sh
bash configs/qwen3_vl_150k_llmmeta_vit_proxy/train_prsimvl_8b.sh

The corresponding config files are:

  • sft_qwen3_vl_2b_prsimvl_v1.yaml
  • sft_qwen3_vl_4b_prsimvl_v1.yaml
  • sft_qwen3_vl_8b_prsimvl_v1.yaml

Released Weights

Released PRSIMVL LoRA weights are hosted on Hugging Face: kepeng/PRSIMVL-LoRA-V1. The local release expects the same checkpoint layout under exps/BANALCED_150K_META_VIT_PROXY/.

Size Base Model Local LoRA Checkpoint
2B Qwen/Qwen3-VL-2B-Instruct exps/BANALCED_150K_META_VIT_PROXY/output-Qwen3-VL-2B-Instruct/v8-20260421-133546/checkpoint-95000
4B Qwen/Qwen3-VL-4B-Instruct exps/BANALCED_150K_META_VIT_PROXY/output-Qwen3-VL-4B-Instruct/v12-20260425-113029/checkpoint-85000
8B Qwen/Qwen3-VL-8B-Instruct exps/BANALCED_150K_META_VIT_PROXY/output-Qwen3-VL-8B-Instruct/v2-20260423-205317/checkpoint-95000

Repository Layout

PRSIMVL/
├── assets/                       README figures and qualitative examples
├── eval/                         Benchmark inference and evaluation entrypoint
├── inference/                    Service-based VQA demo
├── eval_data/                    MeasL-Bench-V1 artifact folder
├── training_data/                MeasL-150K-V1 artifact folder
├── exps/                         Released LoRA adapters
├── configs/qwen3_vl_150k_llmmeta_vit_proxy/
│   └── PRSIMVL v1 training configs and launch scripts
├── scripts/test_raw_eval_pipeline_opt/
│   └── shared inference/evaluation implementation
└── swift/, libs/                 Training and inference code snapshot

Release Notes

  • Dataset license: CC BY-NC 4.0 for non-commercial research and education; citation is required.
  • Evaluation outputs are written under eval/output_benchmark/ by default.
  • Generated scratch outputs, conversion utilities, and local environment checks are pruned from this public release snapshot.
  • Contribution and release hygiene notes are available in CONTRIBUTING.md, CODE_OF_CONDUCT.md, and RELEASE_CHECKLIST.md.

Citation

Main paper: Allegory of the Cave: Measurement-Grounded Vision-Language Learning
Earlier synthetic-RAW version: End-to-End RAW Synergy for Elevated Vision-Language Reasoning
Project page: https://kepengxu.github.io/projects/prism-vl/
Author homepage: https://kepengxu.github.io/

@misc{xu2026allegory,
  title         = {Allegory of the Cave: Measurement-Grounded Vision-Language Learning},
  author        = {Xu, Kepeng and Xu, Li and He, Gang and Yu, Wenxin},
  year          = {2026},
  eprint        = {2605.11727},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2605.11727}
}

@inproceedings{xu2025rawvlm,
  title     = {End-to-End RAW Synergy for Elevated Vision-Language Reasoning},
  author    = {Xu, Kepeng and Qiao, Tong and Liu, Zhenyang and Xu, Li and He, Gang},
  booktitle = {IJCAI 2025 Workshop on Multimodal Knowledge and Language Modeling (MKLM)},
  year      = {2025},
  url       = {https://openreview.net/forum?id=fsCtGojL2R}
}

About

PRISM-VL studies measurement-grounded VLM learning with RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and exposure-bracketed supervision transfer.

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors