feature: Integrate accuracy into run benchmarks (part of #4) #93
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Summary of Changes (Gemini Code Assist): This pull request enhances the benchmarking framework by integrating accuracy evaluation. Previously, accuracy runs were separate; now they can be defined and executed alongside performance benchmarks using a unified configuration. Combining throughput, latency, and accuracy metrics in a single benchmark run gives a more complete and efficient model assessment.
Pull request overview
This PR integrates accuracy evaluation into the main benchmark execution by reading accuracy configurations from the config file. It enables running accuracy benchmarks alongside performance benchmarks by introducing dataset-specific accuracy configurations and a registration system for extractors.
Changes:
- Added registration system for Extractor classes to allow lookup by name
- Modified dataset loading to support both performance and accuracy datasets from config
- Updated scoring to handle optional ground truth columns with default fallback
- Added example configuration file demonstrating accuracy dataset setup
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/inference_endpoint/evaluation/scoring.py | Added default ground truth column and assertion for validation |
| src/inference_endpoint/evaluation/extractor.py | Implemented registration system for Extractor subclasses |
| src/inference_endpoint/evaluation/__init__.py | Exported Extractor class |
| src/inference_endpoint/dataset_manager/predefined/gpqa/__init__.py | Added dataset_id registration and fixed return type |
| src/inference_endpoint/dataset_manager/predefined/aime25/__init__.py | Added dataset_id registration and fixed return type |
| src/inference_endpoint/dataset_manager/factory.py | Refactored to accept Dataset config object instead of individual parameters |
| src/inference_endpoint/dataset_manager/dataset.py | Fixed parquet loading and numpy array serialization issues |
| src/inference_endpoint/dataset_manager/__init__.py | Exported predefined dataset classes |
| src/inference_endpoint/config/schema.py | Added AccuracyConfig schema and made dataset path optional |
| src/inference_endpoint/commands/benchmark.py | Integrated accuracy dataset loading and evaluation into benchmark execution |
| examples/04_GPTOSS120B_Example/sglang_gptoss_120b_example.yaml | Added example configuration with accuracy datasets |
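The AccuracyConfig schema and optional dataset path noted in the table above could look roughly like the following; field names are guesses for illustration, and plain dataclasses are used here for brevity even if the actual schema.py uses another validation library:

```python
# Hypothetical sketch of the schema change: an optional accuracy
# section per dataset, with the dataset path made optional.
# Field names are assumptions, not the PR's actual schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccuracyConfig:
    dataset_id: str                       # e.g. "gpqa" or "aime25"
    extractor: str                        # looked up in the extractor registry
    ground_truth_column: str = "answer"   # default fallback per the scoring change


@dataclass
class DatasetConfig:
    # Now optional: predefined accuracy datasets are resolved by
    # dataset_id rather than a file path.
    path: Optional[str] = None
    accuracy: Optional[AccuracyConfig] = None
```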
Code Review
This pull request integrates accuracy evaluation into the benchmark runs, which is a great feature addition. The changes introduce new configuration options for accuracy datasets and the logic to process and score them. My review focuses on improving the robustness and maintainability of the new implementation. I've identified a few critical issues related to potential UnboundLocalError and IndexError that could occur depending on the dataset configuration, which should be addressed. I've also included suggestions to make the code more dynamic and less reliant on hardcoded values, and to improve overall code quality.
Force-pushed from 2d05e0e to 6e4bad7.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Force-pushed from 6e4bad7 to d67ae1c.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
@@ -196,8 +196,7 @@ def create_transforms(cls) -> list:
    ]
In the AddStaticColumns transform, if streaming is not enabled, this key should be removed. Can you add a stream: bool = True kwarg to both get_dataloader and create_transforms?
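One way to implement the suggestion above, sketched with plain dicts; the transform-entry shapes and helper bodies are assumptions, not the PR's actual types:

```python
# Sketch of the suggested change: thread a `stream` flag through
# create_transforms so the static "stream" column is only added
# when streaming is enabled. Entry shapes are illustrative.

def create_transforms(stream: bool = True) -> list:
    static_columns = {"max_completion_tokens": 1024}
    if stream:
        static_columns["stream"] = True  # key omitted entirely when not streaming
    return [
        {"transform": "AddStaticColumns", "columns": static_columns},
    ]


def get_dataloader(stream: bool = True) -> list:
    # Placeholder: real code would wrap these transforms in a DataLoader.
    return create_transforms(stream=stream)
```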
api_key: null
api_type: "sglang"

report_dir: "results/sglang_gptoss_120b_benchmark_mlperf_13_JAN_26/"
nit - remove the date from the report_dir
metadata = {
    "model": model_name,
    "stream": enable_streaming,
    "max_completion_tokens": max_tokens,
Note: this metadata shape only works for the OpenAI adapter.
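One way to address this would be to build the metadata per adapter rather than hardcoding the OpenAI shape; a hedged sketch, where the adapter names and fallback fields are illustrative assumptions:

```python
# Sketch: adapter-aware metadata builder instead of a hardcoded
# OpenAI-shaped dict. Adapter names and fields are assumptions.

def build_metadata(api_type: str, model_name: str,
                   enable_streaming: bool, max_tokens: int) -> dict:
    if api_type in ("openai", "sglang"):  # OpenAI-compatible APIs
        return {
            "model": model_name,
            "stream": enable_streaming,
            "max_completion_tokens": max_tokens,
        }
    # Generic fallback for adapters without OpenAI-style parameters.
    return {"model": model_name, "max_tokens": max_tokens}
```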
What does this PR do?
Integrates accuracy runs into the main benchmark execution by reading accuracy configurations from the config file. A sample sglang_gptoss config file is included.
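The accuracy section of such a config presumably looks something like the fragment below; the key names are guesses for illustration and are not copied from the included sglang_gptoss_120b_example.yaml:

```yaml
# Hypothetical shape of accuracy-dataset entries; keys are assumptions.
datasets:
  accuracy:
    - dataset_id: gpqa
      extractor: regex
      ground_truth_column: answer
    - dataset_id: aime25
      extractor: regex
```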
Type of change
Related issues
Testing
Checklist