Evaluation scripts for the QCalEval benchmark — a dataset for assessing vision-language model capabilities on quantum calibration experiment analysis. Data is loaded directly from HuggingFace. Compatible with any OpenAI-compatible API endpoint.
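Since the data is pulled from HuggingFace, entries can also be inspected directly with the `datasets` library. A minimal sketch, assuming a hub id of `nvidia/QCalEval` and a `test` split (both the id and the field names are assumptions for illustration; see the scripts for the actual loader):

```python
# Minimal sketch: browse the benchmark data from the HuggingFace Hub.
# The dataset id "nvidia/QCalEval" and the split name are assumptions;
# check the dataset card / loader code in the scripts for the real values.
from datasets import load_dataset

ds = load_dataset("nvidia/QCalEval", split="test")  # assumed id and split
print(len(ds))          # number of benchmark entries
print(ds[0].keys())     # e.g. plot image plus per-question ground truth
```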
Based on the QCalEval benchmark findings, we release NVIDIA Ising Calibration 1, an open-weight 35B MoE model fine-tuned for zero-shot quantum calibration plot understanding.
| Label | Question | Task |
|---|---|---|
| Tech. Desc. | Q1 | Structured JSON description of plot type, axes, and salient visual features |
| Exp. Status | Q2 | 4-way outcome classification: expected behavior, suboptimal parameters, anomalous behavior, or apparatus issue |
| Reasoning | Q3 | Experiment-specific scientific analysis: what the pattern implies, whether the sweep is sufficient, and what calibration step follows |
| Fit Rel. | Q4 | Assess whether a visible fit is trustworthy for downstream use: reliable, unreliable, or no fit |
| Param. Ext. | Q5 | Extract family-specific physical parameters into structured JSON |
| Cal. Diag. | Q6 | Assign a family-specific status code (e.g., SUCCESS, NO_SIGNAL) with corrective action |
Scores are per-question averages (0–100), judged by GPT-5.4.
| Type | Model | Mean | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 |
|---|---|---|---|---|---|---|---|---|
| NVIDIA | Ising-Cal-1-35B | 74.7 | 87.8 | 67.1 | 64.7 | 90.5 | 62.5 | 75.3 |
| Closed | Gemini-3.1-Pro | 72.3 | 88.5 | 57.2 | 61.1 | 84.4 | 71.5 | 71.2 |
| Open | Gemma-4-31B-IT | 68.8 | 85.6 | 54.3 | 59.8 | 82.7 | 68.3 | 62.1 |
| Closed | Gemini-3.1-Flash-Lite | 68.2 | 89.2 | 53.5 | 59.4 | 82.7 | 63.8 | 60.9 |
| Closed | Claude Opus 4.6 | 67.8 | 90.8 | 49.0 | 65.5 | 76.1 | 64.7 | 60.5 |
| Closed | Claude Sonnet 4.6 | 66.5 | 89.7 | 48.6 | 63.4 | 76.5 | 60.4 | 60.1 |
| Closed | GPT-5.4 | 64.6 | 90.9 | 52.7 | 63.7 | 54.7 | 64.3 | 61.3 |
| Open | Qwen3.5-397B-A17B | 58.6 | 88.1 | 42.8 | 52.0 | 50.6 | 62.5 | 55.6 |
| Open | Qwen3.5-27B | 58.5 | 87.0 | 45.7 | 48.3 | 56.4 | 58.7 | 55.1 |
| Open | Qwen3.5-122B-A10B | 57.1 | 86.6 | 44.0 | 49.0 | 50.2 | 61.2 | 51.9 |
| Closed | GPT-5.4-Mini | 55.7 | 90.3 | 39.5 | 48.3 | 42.0 | 62.6 | 51.4 |
| Open | Qwen3.5-35B-A3B | 55.5 | 86.8 | 39.9 | 45.7 | 52.7 | 57.8 | 50.6 |
| Open | Qwen3.5-9B | 53.0 | 81.5 | 37.9 | 39.5 | 49.8 | 57.1 | 52.3 |
| Closed | Claude Haiku 4.5 | 50.5 | 83.4 | 36.6 | 40.8 | 48.6 | 51.0 | 42.8 |
| Open | InternVL3-78B | 48.2 | 76.3 | 37.0 | 34.1 | 42.8 | 52.9 | 45.7 |
| Open | MiniCPM-o-4.5 | 44.5 | 76.7 | 31.7 | 29.8 | 32.5 | 47.9 | 48.1 |
| Open | InternVL3-38B | 44.1 | 79.2 | 34.6 | 27.6 | 33.7 | 49.2 | 40.3 |
| Open | Kimi-VL-A3B | 38.9 | 65.0 | 34.6 | 22.1 | 35.0 | 38.9 | 37.4 |
In ICL mode, the model receives labeled demonstration examples from the same experiment family before the query plot, showing it what a correct answer looks like for similar data. Q3 and Q6 use N-way demonstrations (multiple examples from the family), while Q5 uses a single 1-shot demonstration with the extraction schema. Delta shows the change from zero-shot. A sketch of how such a prompt can be assembled follows the table.
| Type | Model | Mean | Q3 | Delta | Q5 | Delta | Q6 | Delta |
|---|---|---|---|---|---|---|---|---|
| Closed | Gemini-3.1-Pro | 85.2 | 81.3 | +20.2 | 84.5 | +13.0 | 89.8 | +18.6 |
| Closed | Claude Opus 4.6 | 85.1 | 84.7 | +19.2 | 81.3 | +16.6 | 89.4 | +28.9 |
| Open | Gemma-4-31B-IT | 81.2 | 80.6 | +20.8 | 76.9 | +8.6 | 86.0 | +23.9 |
| Closed | GPT-5.4 | 78.4 | 81.0 | +17.3 | 72.9 | +8.6 | 81.4 | +20.1 |
| Closed | Gemini-3.1-Flash-Lite | 78.1 | 78.5 | +19.1 | 73.6 | +9.8 | 82.2 | +21.3 |
| Closed | Claude Sonnet 4.6 | 75.9 | 77.8 | +14.4 | 71.9 | +11.5 | 78.0 | +17.9 |
| Closed | GPT-5.4-Mini | 66.1 | 58.8 | +10.5 | 72.7 | +10.1 | 66.9 | +15.5 |
| Closed | Claude Haiku 4.5 | 66.0 | 66.1 | +25.3 | 58.7 | +7.7 | 73.1 | +30.3 |
| Open | InternVL3-38B | 56.9 | 56.2 | +28.6 | 59.5 | +10.3 | 55.1 | +14.8 |
| Open | Qwen3.5-27B | 53.0 | 41.8 | -6.5 | 71.5 | +12.8 | 45.8 | -9.3 |
| Open | Qwen3.5-397B-A17B | 48.0 | 37.4 | -14.6 | 64.3 | +1.8 | 42.4 | -13.2 |
| Open | InternVL3-78B | 47.0 | 50.5 | +16.4 | 46.2 | -6.7 | 44.3 | -1.4 |
| Open | Qwen3.5-122B-A10B | 44.6 | 36.1 | -12.9 | 62.5 | +1.3 | 35.2 | -16.7 |
| Open | Qwen3.5-35B-A3B | 43.9 | 33.4 | -12.3 | 64.4 | +6.6 | 33.9 | -16.7 |
| Open | Qwen3.5-9B | 43.2 | 32.8 | -6.7 | 63.0 | +5.9 | 33.9 | -18.4 |
| Open | Kimi-VL-A3B | 40.6 | 34.9 | +12.8 | 54.3 | +15.4 | 32.6 | -4.8 |
| Open | MiniCPM-o-4.5 | 33.0 | 19.3 | -10.5 | 50.5 | +2.6 | 29.2 | -18.9 |
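The exact prompt template lives in `benchmark_icl.py`; the sketch below only illustrates the general shape of an ICL request against an OpenAI-compatible endpoint. The helper names (`encode_image`, `build_icl_messages`) and the single-user-turn layout are assumptions for illustration, not the script's actual implementation.

```python
import base64

def encode_image(path: str) -> str:
    """Base64-encode a plot image for an OpenAI-compatible image_url part."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_icl_messages(question: str, demos: list[tuple[str, str]], query_image: str) -> list[dict]:
    """Interleave (demo image, reference answer) pairs before the query plot.

    `demos` holds (image_path, gold_answer) pairs from the same experiment
    family; the query plot comes last, followed by the actual question.
    """
    content = []
    for i, (img_path, gold_answer) in enumerate(demos, start=1):
        content.append({"type": "text", "text": f"Example {i}:"})
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{encode_image(img_path)}"}})
        content.append({"type": "text", "text": f"Reference answer {i}: {gold_answer}"})
    content.append({"type": "text", "text": "Now answer for this plot:"})
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_image(query_image)}"}})
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]
```

The resulting message list would then be posted to whatever endpoint is given via `--api-base`.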
```bash
pip install -r requirements.txt
```

`benchmark_zeroshot.py` sends each image + question independently (6 requests per entry).
```bash
# OpenAI API
python benchmark_zeroshot.py \
    --api-base https://api.openai.com/v1/chat/completions \
    --model-id gpt-5.4 \
    --api-key-env OPENAI_API_KEY \
    --output results_zeroshot.json

# Local vLLM / NIM endpoint
python benchmark_zeroshot.py \
    --api-base http://localhost:8000/v1/chat/completions \
    --model-id my-model \
    --api-key dummy \
    --output results_zeroshot.json

# With options
python benchmark_zeroshot.py \
    --api-base https://api.openai.com/v1/chat/completions \
    --model-id gpt-5.4 \
    --concurrency 128 \
    --limit 10 \
    --output results.json
```

`benchmark_icl.py` runs 3 questions per entry (Q3, Q5, Q6) with in-context demonstration examples.
```bash
python benchmark_icl.py \
    --api-base https://api.openai.com/v1/chat/completions \
    --model-id gpt-5.4 \
    --api-key-env OPENAI_API_KEY \
    --output results_icl.json

# For models that use thinking tokens (e.g., Qwen3.5)
python benchmark_icl.py \
    --api-base http://localhost:8000/v1/chat/completions \
    --model-id my-model \
    --api-key dummy \
    --no-think \
    --output results_icl.json
```

`benchmark_judge.py` scores model responses against ground truth, using a combination of programmatic scoring and LLM-based key point evaluation. It works with both zero-shot and ICL results.
```bash
# Score using OpenAI as judge
python benchmark_judge.py results_zeroshot.json \
    --judge-api-base https://api.openai.com/v1/chat/completions \
    --judge-model-id gpt-5.4 \
    --judge-api-key-env OPENAI_API_KEY \
    --output judged.json

# Score ICL results
python benchmark_judge.py results_icl.json \
    --judge-api-base https://api.openai.com/v1/chat/completions \
    --judge-model-id gpt-5.4 \
    --concurrency 16 \
    --output judged_icl.json
```

The overall pipeline:

```
benchmark_zeroshot.py / benchmark_icl.py
        |
        v
results.json (model responses)
        |
        v
benchmark_judge.py
        |
        v
judged.json (per-question scores + aggregates)
```
| Argument | Description |
|---|---|
| `--api-base` | API endpoint URL (OpenAI-compatible `/v1/chat/completions`) |
| `--model-id` | Model identifier to send in API requests |
| `--api-key-env` | Environment variable containing the API key (default: `OPENAI_API_KEY`) |
| `--api-key` | API key passed directly (prefer `--api-key-env`) |
| `--concurrency` | Max concurrent API requests (default: 8) |
| `--max-tokens` | Max tokens per response (default: 16384) |
| `--limit` | Max entries to evaluate (for testing) |
| `--no-think` | Disable thinking for reasoning models (e.g., Qwen3.5) |
| `--output` | Output JSON path (required) |
| Question | Task | Scoring Method |
|---|---|---|
| Q1 | Technical Description (JSON) | 50% programmatic + 50% LLM key points |
| Q2 | Outcome Classification | Exact match (4-way) |
| Q3 | Scientific Reasoning | LLM key points (3-item checklist) |
| Q4 | Fit Reliability Assessment | Exact match (3-way) |
| Q5 | Parameter Extraction (JSON) | Per-field tolerance scoring |
| Q6 | Calibration Diagnosis | Exact match (status code) |
ICL mode scores Q3, Q5, Q6 only.
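The actual scoring logic lives in `benchmark_judge.py`; the sketch below only illustrates the two programmatic mechanisms named above (exact match for the classification-style questions and per-field numeric tolerance for Q5). The function names, the 5% tolerance, and the flat field layout are assumptions for illustration, not the benchmark's real thresholds.

```python
def exact_match_score(predicted: str, expected: str) -> float:
    """Exact-match scoring on a 0-100 scale (Q2 / Q4 / Q6 style)."""
    return 100.0 if predicted.strip().lower() == expected.strip().lower() else 0.0

def tolerance_score(predicted: dict, expected: dict, rel_tol: float = 0.05) -> float:
    """Per-field tolerance scoring sketch for Q5: each numeric field earns
    credit if it falls within a relative tolerance of the ground-truth value,
    and the score is the average over expected fields. The tolerance value
    here is an illustrative assumption."""
    if not expected:
        return 0.0
    credit = 0.0
    for key, true_val in expected.items():
        pred_val = predicted.get(key)
        if isinstance(pred_val, (int, float)) and isinstance(true_val, (int, float)):
            denom = abs(true_val) if true_val != 0 else 1.0
            if abs(pred_val - true_val) / denom <= rel_tol:
                credit += 1.0
    return 100.0 * credit / len(expected)
```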
Inference scripts produce JSON with:

```json
{
  "mode": "zeroshot",
  "model": "gpt-5.4",
  "results": [
    {
      "id": "drag_failure_no_signal_a",
      "experiment_type": "drag_failure_no_signal",
      "responses": {
        "technical_description": {"answer": "...", "error": null},
        "experimental_conclusion": {"answer": "...", "error": null},
        ...
      }
    }
  ]
}
```

The judge produces JSON with per-question scores (0–100) and aggregates.
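For a quick look at judged output, a minimal aggregation sketch, assuming the judged file mirrors the structure above with a numeric `score` field added per question (the exact field names may differ from what `benchmark_judge.py` writes):

```python
import json
from collections import defaultdict

# Average the 0-100 judge scores per question key across all entries.
# Assumes each entry's "responses" dict gained a numeric "score" field;
# check the actual judged.json for the real layout.
with open("judged.json") as f:
    judged = json.load(f)

totals, counts = defaultdict(float), defaultdict(int)
for entry in judged["results"]:
    for question, result in entry["responses"].items():
        score = result.get("score")
        if isinstance(score, (int, float)):
            totals[question] += score
            counts[question] += 1

for question in sorted(totals):
    print(f"{question}: {totals[question] / counts[question]:.1f}")
```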
```bibtex
@misc{cao2026qcaleval,
  title  = {QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding},
  author = {Cao, Shuxiang and Zhang, Zijian and others},
  year   = {2026},
  url    = {https://research.nvidia.com/publication/2026-04_qcaleval-benchmarking-vision-language-models-quantum-calibration-plot},
}
```

The evaluation scripts are licensed under Apache 2.0. The QCalEval dataset is licensed under CC BY 4.0.