feat: gpt-oss-120b end-to-end run with accuracy and perf. #107
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✅
Pull request overview
This PR enables end-to-end benchmarking of the gpt-oss-120b model with MLPerf datasets, incorporating both accuracy and performance testing capabilities. The changes add LiveCodeBench evaluation support and improve reporting functionality.
Changes:
- Enhanced report output to support file writing with proper newline handling
- Added LiveCodeBench scorer with ground truth validation
- Made ground_truth configuration optional to support different evaluation methods
- Updated development Docker configuration for GPU access and improved permissions
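
A minimal sketch of what the reporter change described above might look like, assuming a function-style API; `render_report`, its parameter defaults, and the metric format are illustrative, not the actual code in `reporter.py`:

```python
# Illustrative sketch only: the real reporter.py API may differ.
def render_report(metrics: dict, newline: str = "\n", summary_only: bool = False) -> str:
    """Render metrics as text suitable for console display or file output.

    newline      -- line separator, so file output gets proper line endings
    summary_only -- emit only the headline line instead of the full report
    """
    lines = [f"{key}: {value}" for key, value in metrics.items()]
    if summary_only:
        lines = lines[:1]
    return newline.join(lines) + newline
```

Exposing `newline` as a parameter lets the same renderer serve both console display and the text-file export added in `session.py`.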
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/inference_endpoint/metrics/reporter.py | Added newline and summary_only parameters to enable proper file-based report output |
| src/inference_endpoint/load_generator/session.py | Implemented report export to text file alongside console display |
| src/inference_endpoint/evaluation/scoring.py | Added ground_truth_column validation for LiveCodeBenchScorer |
| src/inference_endpoint/config/schema.py | Changed ground_truth field to optional to support evaluators that don't require it |
| scripts/Dockerfile.dev | Enhanced Docker dev environment with GPU support, git, sudo access, and examples directory |
| examples/04_GPTOSS120B_Example/sglang_gptoss_120b_example.yaml | Enabled LiveCodeBench evaluation and added comprehensive launch instructions |
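
The `scoring.py` and `schema.py` changes interact: the schema now allows `ground_truth` to be omitted, so the scorer must enforce its own requirement. A hedged sketch of that validation (class and argument names are assumptions, not the real implementation):

```python
# Illustrative sketch of the ground-truth validation described for scoring.py;
# names here are assumptions about the real implementation.
class LiveCodeBenchScorer:
    def __init__(self, ground_truth_column=None):
        # LiveCodeBench scores generated code against reference test cases, so
        # it must fail fast if the dataset config omits the ground-truth column
        # (which the schema now permits for evaluators that don't need one).
        if not ground_truth_column:
            raise ValueError(
                "LiveCodeBenchScorer requires a ground_truth_column in the config"
            )
        self.ground_truth_column = ground_truth_column
```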
Code Review
This pull request enables end-to-end runs for "gpt-oss-120b" with accuracy and performance evaluation, introducing new configurations and reporting mechanisms. While the core functionality looks solid, a security audit identified critical issues related to insecure data handling and access control:
- Unvalidated paths for report generation could lead to arbitrary file writes or application crashes.
- Passwordless sudo access in the development Dockerfile poses a significant privilege-escalation risk, particularly given the project's focus on executing untrusted LLM-generated code.
- The development Dockerfile could also be improved for security, maintainability, and clarity, including safer sudo permission grants and clearer Hugging Face token handling.
- A minor duplication in the new example YAML file needs cleanup.
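
One way to address the report-path concern raised in this review is to resolve the target path and refuse to write outside an allowed output directory. This is a hedged sketch, not the project's code; `safe_report_path` and the `reports` default are illustrative:

```python
# Illustrative mitigation for the unvalidated-report-path finding above.
from pathlib import Path

def safe_report_path(user_path: str, output_root: str = "reports") -> Path:
    """Resolve user_path under output_root, rejecting traversal outside it."""
    root = Path(output_root).resolve()
    target = (root / user_path).resolve()
    # After resolution, a legitimate target must sit beneath the output root;
    # anything else (e.g. "../../etc/passwd") is rejected before any write.
    if root not in target.parents and target != root:
        raise ValueError(f"report path escapes {root}: {target}")
    return target
```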
What does this PR do?
To run `gpt-oss-120b` with the MLPerf datasets, follow these steps:

Launch the LiveCodeBench evaluator service
The LiveCodeBench service isolates the code-testing environment from the host and exposes a service that executes the generated code on the provided inputs; see the LCB README for details. To launch the service, log in to Docker to pull the hardened image, build the service image, and start the service:
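
The original command block was not captured above. As a dry-run sketch of the three steps just described (the registry, image name, Dockerfile path, and port are placeholders, not the project's actual values):

```shell
# Dry-run sketch: 'run' echoes each command instead of executing it, since the
# registry, image name, and port here are placeholders. Replace 'echo' with
# direct execution when adapting this to the real project.
run() { echo "+ $*"; }

run docker login nvcr.io                                            # 1. log in to pull the hardened base image
run docker build -t lcb-evaluator -f Dockerfile.lcb .               # 2. build the evaluator service image
run docker run -d --name lcb-evaluator -p 8000:8000 lcb-evaluator   # 3. launch the isolated service
```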
Next, launch the LLM inference service of your choice, for instance SGLang with `gpt-oss-120b`.

Once the server is up, launch the inference benchmark:
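
The launch commands themselves were not captured above. As a dry-run sketch (the SGLang flag values are examples, and the benchmark entry point is an assumption about this project, not its actual CLI):

```shell
# Dry-run sketch: 'run' echoes the commands; the benchmark entry point and
# its flags below are assumptions about this project, not its actual CLI.
run() { echo "+ $*"; }

# Serve the model with SGLang (flag names from SGLang; values are examples)
run python -m sglang.launch_server --model-path openai/gpt-oss-120b --port 30000

# Point the benchmark at the example config once the server is up
run python -m inference_endpoint --config examples/04_GPTOSS120B_Example/sglang_gptoss_120b_example.yaml
```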
Type of change
Related issues
Testing
Checklist