Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) by jiafatom · Pull Request #2474 · microsoft/Olive

jiafatom · 2026-05-27T18:14:05Z

Summary

Add vision evaluation metrics to the Olive evaluator framework, enabling VQA, ChartQA, and OCR model evaluation following the same pattern as #2444 (speech metrics).

exact_match: Case-insensitive string equality for VQA tasks (ScienceQA, TextVQA, MMMU, ai2d, MathVista)
relaxed_accuracy: Numeric match within ±5% tolerance for ChartQA (the tolerance is configurable)
word_sort_ratio: Word-level overlap ratio for OCR evaluation
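For readers unfamiliar with these metrics, the following is a minimal per-sample sketch of what each one could compute. It is not the code in olive/evaluator/accuracy.py. The whitespace trimming, the percent-sign handling, the fallback to exact match for non-numeric answers, and the use of difflib for word_sort_ratio are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def exact_match(prediction: str, target: str) -> bool:
    """Case-insensitive equality after trimming surrounding whitespace (assumed)."""
    return prediction.strip().lower() == target.strip().lower()


def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers match within a relative tolerance of the target.

    Non-numeric answers fall back to exact_match (an assumption of this sketch).
    """
    try:
        pred_value = float(prediction.strip().rstrip("%"))
        target_value = float(target.strip().rstrip("%"))
    except ValueError:
        return exact_match(prediction, target)
    if target_value == 0.0:
        # A relative tolerance is undefined at zero, so require equality.
        return pred_value == 0.0
    return abs(pred_value - target_value) <= tolerance * abs(target_value)


def word_sort_ratio(prediction: str, target: str) -> float:
    """Similarity in [0, 1] after lowercasing and sorting the words of each text.

    Sorting makes the score insensitive to word order; SequenceMatcher is a
    stand-in for whatever overlap measure the real implementation uses.
    """
    pred_words = " ".join(sorted(prediction.lower().split()))
    target_words = " ".join(sorted(target.lower().split()))
    if not pred_words and not target_words:
        return 1.0
    return SequenceMatcher(None, pred_words, target_words).ratio()
```

A dataset-level score would then be the mean of these per-sample values over all predictions.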

Changes

olive/evaluator/metric.py — Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType enum
olive/evaluator/accuracy.py — Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
olive/evaluator/olive_evaluator.py — Add _inference_vision() path in OnnxEvaluator and PyTorchEvaluator; add task-metric validation that throws if metric is incompatible with task type
olive/data/component/pre_process_data.py — Add vision_vqa_pre_process data component with Olive-style sampling (--limit/--seed)
olive/data/container/huggingface_container.py — Add vision-vqa, vision-chart-qa, vision-ocr task types
olive/olive_config.json — Add vision extra dependencies (Pillow)
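The sampling mentioned for vision_vqa_pre_process (a limit plus a seed) can be pictured with the sketch below. The function name sample_examples, its defaults, and the choice to preserve dataset order are hypothetical; the actual component may sample differently.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_examples(examples: Sequence[T], limit: Optional[int] = None, seed: int = 42) -> list[T]:
    """Return at most `limit` examples, chosen reproducibly and kept in dataset order."""
    if limit is None or limit >= len(examples):
        return list(examples)
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    chosen = sorted(rng.sample(range(len(examples)), limit))
    return [examples[i] for i in chosen]
```

Using a seeded local generator means two runs with the same limit and seed evaluate the same subset, which keeps scores comparable across model variants.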

Task-Metric Validation

Each vision task restricts which metrics are valid:

Task Type	Allowed Metric
`vision-vqa`	`exact_match`
`vision-chart-qa`	`relaxed_accuracy`
`vision-ocr`	`word_sort_ratio`

If a user selects an incompatible metric, a clear exception is raised.
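The table above amounts to a lookup from task type to permitted metrics. A minimal sketch of such a check follows; the names ALLOWED_VISION_METRICS and validate_vision_task_metric, the choice of ValueError, and the message wording are illustrative assumptions rather than the code in olive_evaluator.py.

```python
from typing import Iterable, Optional

# Mapping taken from the task/metric table in this PR description.
ALLOWED_VISION_METRICS = {
    "vision-vqa": {"exact_match"},
    "vision-chart-qa": {"relaxed_accuracy"},
    "vision-ocr": {"word_sort_ratio"},
}


def validate_vision_task_metric(task_type: Optional[str], metric_names: Iterable[str]) -> None:
    """Raise ValueError if a metric is not allowed for the given vision task type."""
    allowed = ALLOWED_VISION_METRICS.get(task_type)
    if allowed is None:
        return  # not a vision task: nothing to check in this sketch
    invalid = sorted(set(metric_names) - allowed)
    if invalid:
        raise ValueError(
            f"Metric(s) {invalid} are not valid for task '{task_type}'; "
            f"allowed: {sorted(allowed)}"
        )
```

For example, pairing `vision-ocr` with `exact_match` would raise, while `vision-vqa` with `exact_match` passes silently.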

Testing

20 unit tests added covering all three metrics with edge cases (all passing).

Usage

{
  "metrics": [{
    "name": "vision_eval",
    "type": "accuracy",
    "sub_types": [
      {"name": "exact_match", "higher_is_better": true}
    ],
    "data_config": {
      "type": "HuggingfaceContainer",
      "task_type": "vision-vqa",
      "params": {
        "data_name": "HuggingFaceH4/ScienceQA",
        "split": "test"
      }
    }
  }]
}

…rt_ratio)

Add vision evaluation metrics to the Olive evaluator framework, enabling VQA, ChartQA, and OCR model evaluation.

- exact_match: case-insensitive string equality for VQA tasks
- relaxed_accuracy: ±5% numeric tolerance for ChartQA
- word_sort_ratio: word-level overlap ratio for OCR

Changes:

- olive/evaluator/metric.py: Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType
- olive/evaluator/accuracy.py: Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
- olive/evaluator/olive_evaluator.py: Add _inference_vision() path and task-metric validation
- olive/data/component/pre_process_data.py: Add vision_vqa_pre_process data component
- olive/data/container/huggingface_container.py: Add vision-vqa, vision-chart-qa, vision-ocr tasks
- olive/olive_config.json: Add vision extra dependencies (Pillow)
- test/evaluator/test_accuracy.py: Add 20 unit tests for vision metrics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR extends Olive’s evaluator framework with three vision-oriented “accuracy” sub-metrics intended for VQA/ChartQA/OCR evaluation, following the existing pattern used for speech metrics.

Changes:

Adds new AccuracySubType enum values for exact_match, relaxed_accuracy, and word_sort_ratio.
Implements ExactMatch, RelaxedAccuracy, and WordSortRatio metric classes and adds unit tests for their core behavior.
Introduces a vision inference path in the evaluator and adds HuggingFace container task mappings plus a new vision_vqa_pre_process component and a vision extra dependency.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`olive/evaluator/metric.py`	Adds new accuracy sub-type enum values for vision metrics.
`olive/evaluator/accuracy.py`	Implements the three new vision metric computations.
`olive/evaluator/olive_evaluator.py`	Adds vision inference path and task↔metric validation logic for vision tasks.
`olive/data/component/pre_process_data.py`	Adds a new `vision_vqa_pre_process` pre-processing component with sampling support.
`olive/data/container/huggingface_container.py`	Registers new HuggingFace task types mapping to the vision pre-process component.
`olive/olive_config.json`	Adds a `vision` extra dependency entry.
`test/evaluator/test_accuracy.py`	Adds unit tests covering the new vision metrics.

+    if metric.data_config and hasattr(metric.data_config, "task_type"):
+        task_type = metric.data_config.task_type
+    elif metric.data_config and hasattr(metric.data_config, "params_config"):
+        task_type = getattr(metric.data_config.params_config, "task_type", None)
+


+            image = item[self.image_column]
+            question = item[self.question_column]
+            answer = item[self.answer_column]
+            # Handle list answers (some datasets have multiple valid answers)
+            if isinstance(answer, list):
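The hunk above is cut off at the branch for list-valued answers. One way such answers could be treated at scoring time is to accept a prediction that matches any of the references. The helpers normalize_answers and matches_any below are hypothetical and do not reflect what the truncated code actually does.

```python
from typing import List, Sequence, Union

Answer = Union[str, Sequence[str]]


def normalize_answers(answer: Answer) -> List[str]:
    """Return the reference answers as a list of strings, whether one or many were given."""
    if isinstance(answer, (list, tuple)):
        return [str(a) for a in answer]
    return [str(answer)]


def matches_any(prediction: str, answer: Answer) -> bool:
    """True if the prediction equals any reference, ignoring case and outer whitespace."""
    pred = prediction.strip().lower()
    return any(pred == ref.strip().lower() for ref in normalize_answers(answer))
```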



 # Text-based accuracy sub-types that work with string predictions/targets
 _TEXT_BASED_ACCURACY_SUBTYPES = {AccuracySubType.WER, AccuracySubType.RTFX}
+_VISION_ACCURACY_SUBTYPES = {AccuracySubType.EXACT_MATCH, AccuracySubType.RELAXED_ACCURACY, AccuracySubType.WORD_SORT_RATIO}


        "torch-tensorrt": [ "torch-tensorrt" ],
-        "tune-session-params": [ "psutil" ]
+        "tune-session-params": [ "psutil" ],
+        "vision": [ "Pillow" ]


Document which datasets each vision metric is suitable for:

- exact_match: AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS
- relaxed_accuracy: ChartQA
- word_sort_ratio: OCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jiafatom · 2026-05-27T18:46:14Z

Supported Vision Benchmarks

These metrics map to standard public vision benchmarks:

Metric	Benchmarks	Description
`exact_match`	AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS	Case-insensitive string equality
`relaxed_accuracy`	ChartQA	±5% numeric tolerance for chart/math answers
`word_sort_ratio`	OCR	Word-level overlap ratio

Recipe PR

Evaluation recipe using these metrics with Qwen3-VL-2B-Instruct on AI2D: microsoft/olive-recipes#434

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add evaluation scripts for Qwen3-VL-2B-Instruct using Olive's new vision metrics (exact_match) on the AI2D science diagram QA benchmark.

New files:

- eval/evaluate.py: Standalone evaluation script with ONNX + optional PyTorch
- eval/eval_user_script.py: Olive-compatible post-processing function
- eval/qwen3vl_eval_ai2d.json: Olive config for running evaluation via olive run
- eval/requirements.txt: Dependencies
- eval/README.md: Usage instructions

Related Olive PR: microsoft/Olive#2474

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix _validate_vision_task_metric to extract task from pre_process_data_config.params['task'] instead of non-existent DataConfig attributes
- Wrap _VISION_ACCURACY_SUBTYPES across multiple lines for lint compliance
- Use lowercase 'pillow' in olive_config.json for consistency
- Add docstring note about ONNX vs PyTorch path for vision_vqa_pre_process

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
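The first item in the commit message above describes reading the task from pre_process_data_config.params['task']. A defensive lookup of that shape might look like the sketch below. It uses plain dictionaries as a stand-in for Olive's DataConfig objects, and get_task_type is a hypothetical helper name.

```python
from typing import Any, Mapping, Optional


def get_task_type(data_config: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Return params['task'] from the pre-process config, or None if any level is missing."""
    if not data_config:
        return None
    pre_process = data_config.get("pre_process_data_config") or {}
    params = pre_process.get("params") or {}
    return params.get("task")
```

Returning None for a missing level lets the caller skip vision-specific validation for metrics that have no vision task configured.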

jambayk · 2026-05-27T20:21:17Z

Could you create a branch directly on the repo and open the PR from there? CI cannot run on forked PRs because of credentials.

jiafatom · 2026-05-27T20:25:19Z

Closing in favor of a new PR from the upstream repo branch (CI requires non-fork PRs).

jiafatom · 2026-05-27T20:26:27Z

See #2476

Copilot AI review requested due to automatic review settings May 27, 2026 18:14

Copilot started reviewing on behalf of jiafatom May 27, 2026 18:14

Copilot AI reviewed May 27, 2026

devang-ml requested a review from shaahji May 27, 2026 18:27

jiafatom mentioned this pull request May 27, 2026

Add vision evaluation for Qwen3-VL-2B-Instruct on AI2D microsoft/olive-recipes#434

Open

Remove internal project references from comments

28d8110

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jiafatom closed this May 27, 2026

jiafatom deleted the jiafa/add-vision-eval-metrics branch May 27, 2026 20:26

jiafatom wants to merge 4 commits into microsoft:main from jiafatom:jiafa/add-vision-eval-metrics

3 participants