Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio)#2474

Closed
jiafatom wants to merge 4 commits into
microsoft:mainfrom
jiafatom:jiafa/add-vision-eval-metrics

@jiafatom
Contributor

Summary

Add vision evaluation metrics to the Olive evaluator framework, enabling VQA, ChartQA, and OCR model evaluation following the same pattern as #2444 (speech metrics).

  • exact_match: Case-insensitive string equality for VQA tasks (ScienceQA, TextVQA, MMMU, ai2d, MathVista)
  • relaxed_accuracy: ±5% numeric tolerance for ChartQA (configurable tolerance)
  • word_sort_ratio: Word-level overlap ratio for OCR evaluation
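The first two metrics can be sketched in a few lines. This is a minimal illustration, not the code from olive/evaluator/accuracy.py: the function names, the whitespace trimming, the fallback to string comparison for non-numeric answers, and the zero-target handling are all assumptions.

```python
def exact_match(prediction: str, target: str) -> bool:
    """Case-insensitive equality; trimming whitespace is an assumption."""
    return prediction.strip().lower() == target.strip().lower()


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Accept numeric answers within a relative tolerance of the target."""
    try:
        pred_value = float(prediction)
        target_value = float(target)
    except ValueError:
        # Assumption: non-numeric answers fall back to exact matching.
        return exact_match(prediction, target)
    if target_value == 0.0:
        # A relative tolerance is undefined for zero, so require equality.
        return pred_value == 0.0
    return abs(pred_value - target_value) <= tolerance * abs(target_value)


def accuracy(predictions, targets, match=exact_match) -> float:
    """Fraction of prediction/target pairs accepted by `match`."""
    pairs = list(zip(predictions, targets))
    if not pairs:
        return 0.0
    return sum(match(p, t) for p, t in pairs) / len(pairs)
```

Passing `match=relaxed_match` to `accuracy` gives the ChartQA-style score, and the `tolerance` argument corresponds to the configurable tolerance mentioned above.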

Changes

  • olive/evaluator/metric.py — Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType enum
  • olive/evaluator/accuracy.py — Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
  • olive/evaluator/olive_evaluator.py — Add _inference_vision() path in OnnxEvaluator and PyTorchEvaluator; add task-metric validation that throws if metric is incompatible with task type
  • olive/data/component/pre_process_data.py — Add vision_vqa_pre_process data component with Olive-style sampling (--limit/--seed)
  • olive/data/container/huggingface_container.py — Add vision-vqa, vision-chart-qa, vision-ocr task types
  • olive/olive_config.json — Add vision extra dependencies (Pillow)
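The limit/seed sampling mentioned for vision_vqa_pre_process can be pictured as a seeded draw of dataset indices. The sketch below is an assumption about the general idea only: `sample_indices` is a hypothetical helper, and the actual component may select samples differently.

```python
import random
from typing import List, Optional


def sample_indices(dataset_size: int, limit: Optional[int] = None, seed: int = 0) -> List[int]:
    """Pick at most `limit` indices reproducibly (hypothetical helper)."""
    if limit is None or limit >= dataset_size:
        # No limit, or a limit larger than the dataset: keep everything.
        return list(range(dataset_size))
    rng = random.Random(seed)  # a local RNG avoids touching global random state
    return sorted(rng.sample(range(dataset_size), limit))
```

Using a dedicated `random.Random(seed)` instance keeps repeated evaluation runs on the same subset, which makes scores comparable across model variants.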

Task-Metric Validation

Each vision task restricts which metrics are valid:

| Task Type | Allowed Metric |
| --- | --- |
| vision-vqa | exact_match |
| vision-chart-qa | relaxed_accuracy |
| vision-ocr | word_sort_ratio |

If a user selects an incompatible metric, a clear exception is raised.
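A lookup table is one straightforward way to enforce this mapping. The sketch below only mirrors the table above; the function name, the use of plain strings instead of AccuracySubType members, and the choice of ValueError are assumptions rather than the PR's implementation.

```python
# Mapping taken from the task/metric table in this PR description.
ALLOWED_VISION_METRICS = {
    "vision-vqa": {"exact_match"},
    "vision-chart-qa": {"relaxed_accuracy"},
    "vision-ocr": {"word_sort_ratio"},
}


def validate_vision_task_metric(task_type, metric_names):
    """Raise if any metric is not allowed for a vision task (hypothetical helper)."""
    allowed = ALLOWED_VISION_METRICS.get(task_type)
    if allowed is None:
        return  # not a vision task, so nothing to check
    invalid = sorted(set(metric_names) - allowed)
    if invalid:
        raise ValueError(
            f"Metric(s) {invalid} are not valid for task '{task_type}'; "
            f"allowed: {sorted(allowed)}"
        )
```

For example, `validate_vision_task_metric("vision-vqa", ["relaxed_accuracy"])` raises, while the same call with `["exact_match"]` returns silently.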

Testing

20 unit tests added covering all three metrics with edge cases (all passing).

Usage

{
  "metrics": [{
    "name": "vision_eval",
    "type": "accuracy",
    "sub_types": [
      {"name": "exact_match", "higher_is_better": true}
    ],
    "data_config": {
      "type": "HuggingfaceContainer",
      "task_type": "vision-vqa",
      "params": {
        "data_name": "HuggingFaceH4/ScienceQA",
        "split": "test"
      }
    }
  }]
}

…rt_ratio)

Add vision evaluation metrics to the Olive evaluator framework, enabling
VQA, ChartQA, and OCR model evaluation.

- exact_match: case-insensitive string equality for VQA tasks
- relaxed_accuracy: ±5% numeric tolerance for ChartQA
- word_sort_ratio: word-level overlap ratio for OCR

Changes:
- olive/evaluator/metric.py: Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType
- olive/evaluator/accuracy.py: Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
- olive/evaluator/olive_evaluator.py: Add _inference_vision() path and task-metric validation
- olive/data/component/pre_process_data.py: Add vision_vqa_pre_process data component
- olive/data/container/huggingface_container.py: Add vision-vqa, vision-chart-qa, vision-ocr tasks
- olive/olive_config.json: Add vision extra dependencies (Pillow)
- test/evaluator/test_accuracy.py: Add 20 unit tests for vision metrics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 27, 2026 18:14
Contributor

Copilot AI left a comment

Pull request overview

This PR extends Olive’s evaluator framework with three vision-oriented “accuracy” sub-metrics intended for VQA/ChartQA/OCR evaluation, following the existing pattern used for speech metrics.

Changes:

  • Adds new AccuracySubType enum values for exact_match, relaxed_accuracy, and word_sort_ratio.
  • Implements ExactMatch, RelaxedAccuracy, and WordSortRatio metric classes and adds unit tests for their core behavior.
  • Introduces a vision inference path in the evaluator and adds HuggingFace container task mappings plus a new vision_vqa_pre_process component and a vision extra dependency.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| olive/evaluator/metric.py | Adds new accuracy sub-type enum values for vision metrics. |
| olive/evaluator/accuracy.py | Implements the three new vision metric computations. |
| olive/evaluator/olive_evaluator.py | Adds vision inference path and task↔metric validation logic for vision tasks. |
| olive/data/component/pre_process_data.py | Adds a new vision_vqa_pre_process pre-processing component with sampling support. |
| olive/data/container/huggingface_container.py | Registers new HuggingFace task types mapping to the vision pre-process component. |
| olive/olive_config.json | Adds a vision extra dependency entry. |
| test/evaluator/test_accuracy.py | Adds unit tests covering the new vision metrics. |

Comment thread olive/evaluator/olive_evaluator.py Outdated
Comment on lines +117 to +121
if metric.data_config and hasattr(metric.data_config, "task_type"):
    task_type = metric.data_config.task_type
elif metric.data_config and hasattr(metric.data_config, "params_config"):
    task_type = getattr(metric.data_config.params_config, "task_type", None)

Comment on lines +446 to +450
image = item[self.image_column]
question = item[self.question_column]
answer = item[self.answer_column]
# Handle list answers (some datasets have multiple valid answers)
if isinstance(answer, list):
Comment thread olive/evaluator/olive_evaluator.py Outdated

# Text-based accuracy sub-types that work with string predictions/targets
_TEXT_BASED_ACCURACY_SUBTYPES = {AccuracySubType.WER, AccuracySubType.RTFX}
_VISION_ACCURACY_SUBTYPES = {AccuracySubType.EXACT_MATCH, AccuracySubType.RELAXED_ACCURACY, AccuracySubType.WORD_SORT_RATIO}
Comment thread olive/olive_config.json Outdated
  "torch-tensorrt": [ "torch-tensorrt" ],
- "tune-session-params": [ "psutil" ]
+ "tune-session-params": [ "psutil" ],
+ "vision": [ "Pillow" ]
Document which datasets each vision metric is suitable for:
- exact_match: AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS
- relaxed_accuracy: ChartQA
- word_sort_ratio: OCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jiafatom
Contributor Author

Supported Vision Benchmarks

These metrics map to standard public vision benchmarks:

| Metric | Benchmarks | Description |
| --- | --- | --- |
| exact_match | AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS | Case-insensitive string equality |
| relaxed_accuracy | ChartQA | ±5% numeric tolerance for chart/math answers |
| word_sort_ratio | OCR | Word-level overlap ratio |
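The PR text does not spell out how word_sort_ratio is computed, so the following is only one plausible reading of "word-level overlap ratio": lowercase both strings, split them into words, sort the words so ordering differences are ignored, and compare the two word lists with the standard library's `difflib.SequenceMatcher`. The function name and every normalization step here are assumptions.

```python
from difflib import SequenceMatcher


def word_sort_ratio(prediction: str, target: str) -> float:
    """Order-insensitive word similarity in [0, 1] (illustrative sketch)."""
    pred_words = sorted(prediction.lower().split())
    target_words = sorted(target.lower().split())
    if not pred_words and not target_words:
        return 1.0  # two empty strings are treated as identical
    # ratio() returns 2 * matches / (len(pred_words) + len(target_words)).
    return SequenceMatcher(None, pred_words, target_words).ratio()
```

Under this reading, OCR output that contains the right words in a different order still scores 1.0, while each missing or wrong word lowers the score proportionally.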

Recipe PR

Evaluation recipe using these metrics with Qwen3-VL-2B-Instruct on AI2D: microsoft/olive-recipes#434

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jiafatom added a commit to jiafatom/olive-recipes that referenced this pull request May 27, 2026
Add evaluation scripts for Qwen3-VL-2B-Instruct using Olive's new vision
metrics (exact_match) on the AI2D science diagram QA benchmark.

New files:
- eval/evaluate.py: Standalone evaluation script with ONNX + optional PyTorch
- eval/eval_user_script.py: Olive-compatible post-processing function
- eval/qwen3vl_eval_ai2d.json: Olive config for running evaluation via olive run
- eval/requirements.txt: Dependencies
- eval/README.md: Usage instructions

Related Olive PR: microsoft/Olive#2474

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix _validate_vision_task_metric to extract task from
  pre_process_data_config.params['task'] instead of non-existent
  DataConfig attributes
- Wrap _VISION_ACCURACY_SUBTYPES across multiple lines for lint compliance
- Use lowercase 'pillow' in olive_config.json for consistency
- Add docstring note about ONNX vs PyTorch path for vision_vqa_pre_process

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
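
The first fix in the commit above reads the task from pre_process_data_config.params['task']. A defensive lookup along those lines might look like the sketch below. The attribute and key names come from the commit message; the helper name and the SimpleNamespace stand-ins are hypothetical and do not reflect Olive's real DataConfig class.

```python
from types import SimpleNamespace


def extract_task(data_config):
    """Return the configured task, or None if any level is missing (hypothetical helper)."""
    pre_process = getattr(data_config, "pre_process_data_config", None)
    params = getattr(pre_process, "params", None) or {}
    return params.get("task")


# Stand-in object for illustration only, not Olive's DataConfig.
config = SimpleNamespace(
    pre_process_data_config=SimpleNamespace(params={"task": "vision-vqa"})
)
print(extract_task(config))
```

Returning None instead of raising lets the caller skip vision validation for metrics that have no data config or no pre-process component.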
@jambayk
Contributor

jambayk commented May 27, 2026

Could you create a branch directly on the repo and open the PR from there? CI cannot run on PRs from forks because of credentials.

@jiafatom
Contributor Author

Closing in favor of a new PR from the upstream repo branch (CI requires non-fork PRs).

@jiafatom jiafatom closed this May 27, 2026
@jiafatom jiafatom deleted the jiafa/add-vision-eval-metrics branch May 27, 2026 20:26
@jiafatom
Contributor Author

See #2476
