Feat/llama2 70b accuracy by tianmu-li · Pull Request #97 · mlcommons/endpoints

tianmu-li · 2026-01-14T22:14:02Z

What does this PR do?

Add a ROUGE scorer. Enable combined accuracy and performance runs for llama2-70b. Fix the dataset mapping for open-orca.

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

github-actions · 2026-01-14T22:14:11Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist · 2026-01-14T22:14:28Z

Summary of Changes

Hello @tianmu-li, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds ROUGE scoring for accuracy evaluation of the Llama2-70B model on the OpenOrca dataset. It refactors the evaluation system to select scorers dynamically and improves dataset handling, so that accuracy and performance runs can be configured together and new evaluation metrics are easier to integrate.

Highlights

New ROUGE Scorer Implementation: A new RougeScorer class has been added to enable ROUGE-L F1 score calculation for text generation evaluation, including necessary text preprocessing and dependency management for nltk and evaluate libraries.
Llama2-70B Benchmarking with OpenOrca: The Llama2-70B example configuration (online_llama2_70b_orca.yaml) has been updated to include both accuracy and performance benchmarking runs using the OpenOrca dataset, with accuracy evaluation specifically configured to use the new ROUGE scorer.
Dynamic Scorer Selection: The benchmark execution logic has been refactored to dynamically select evaluation scorers based on the configuration's eval_method, replacing the previous hardcoded PassAt1Scorer and allowing for more flexible evaluation strategies.
OpenOrca Dataset Integration: A new OpenOrca predefined dataset loader has been integrated, providing a structured way to load and process the OpenOrca dataset for accuracy evaluations, along with fixes for its parser mappings.
Flexible Dataset Loader Factory: The dataset loader factory now intelligently checks if predefined dataset loaders accept specific arguments (dataset_path, metadata, remap), enhancing its flexibility for various dataset types.
NoneExtractor for Optional Extraction: A NoneExtractor has been introduced to serve as a no-operation extractor when no specific text extraction is required, improving the robustness of the extraction mechanism.
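
The factory behaviour described in the highlights (checking whether a predefined loader accepts `dataset_path`, `metadata`, or `remap` before passing them) could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from this PR: `filter_supported_kwargs`, `build_loader`, `OpenOrcaLike`, and `LegacyLoader` are hypothetical names, and only `inspect.signature` is a real standard-library call.

```python
import inspect
from typing import Any, Callable, Dict, Optional


def filter_supported_kwargs(factory: Callable[..., Any], **candidates: Any) -> Dict[str, Any]:
    """Keep only the keyword arguments that `factory` declares (hypothetical helper)."""
    params = inspect.signature(factory).parameters
    # A **kwargs parameter means the callable accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(candidates)
    return {name: value for name, value in candidates.items() if name in params}


class OpenOrcaLike:
    """Hypothetical file-based loader that accepts a path and a column remap."""

    def __init__(self, dataset_path: str, remap: Optional[Dict[str, str]] = None):
        self.dataset_path = dataset_path
        self.remap = remap or {}


class LegacyLoader:
    """Hypothetical loader that takes no constructor arguments."""

    def __init__(self):
        pass


def build_loader(loader_cls: Callable[..., Any], **options: Any) -> Any:
    """Instantiate a loader, silently dropping options it does not declare."""
    return loader_cls(**filter_supported_kwargs(loader_cls, **options))
```

With this shape, `build_loader(OpenOrcaLike, dataset_path="orca.pkl", metadata={}, remap={})` passes only `dataset_path` and `remap`, while `build_loader(LegacyLoader, ...)` passes nothing. Whether dropped options should be ignored or reported is a design choice the sketch does not settle.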

Copilot

Pull request overview

This PR adds ROUGE scoring capability for evaluating text generation tasks, enables accuracy and performance benchmarking for Llama2-70B, and fixes the dataset mapping configuration for the Open-Orca dataset.

Changes:

Added a new RougeScorer class with ROUGE-L F1 scoring for text generation evaluation
Introduced a registry pattern for scorer lookup via Scorer.get() and __init_subclass__
Added NoneExtractor to handle cases where no output extraction is needed
Updated Open-Orca dataset configuration to support both accuracy and performance evaluation modes
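
The registry pattern mentioned in the changes list (`Scorer.get()` combined with `__init_subclass__`) can be sketched as follows. This is a minimal illustration rather than the PR's implementation: the `scorer_id` class keyword, the `_registry` attribute, and the string keys `"pass_at_1"` and `"rouge"` are assumptions made for the example.

```python
from typing import ClassVar, Dict, Optional, Type


class Scorer:
    """Base class that records every named subclass in a shared registry."""

    _registry: ClassVar[Dict[str, Type["Scorer"]]] = {}

    def __init_subclass__(cls, scorer_id: Optional[str] = None, **kwargs):
        # Runs once for each subclass definition; `scorer_id` is a
        # hypothetical class keyword used only in this sketch.
        super().__init_subclass__(**kwargs)
        if scorer_id is not None:
            Scorer._registry[scorer_id] = cls

    @classmethod
    def get(cls, scorer_id: str) -> Type["Scorer"]:
        """Look up a scorer class by the name it registered under."""
        try:
            return cls._registry[scorer_id]
        except KeyError:
            known = ", ".join(sorted(cls._registry)) or "<none>"
            raise KeyError(f"Unknown scorer {scorer_id!r}; registered: {known}") from None


class PassAt1Scorer(Scorer, scorer_id="pass_at_1"):
    pass


class RougeScorer(Scorer, scorer_id="rouge"):
    pass
```

A benchmark command could then resolve the class from a config value, for example `Scorer.get(eval_method)`, instead of importing one scorer directly. Subclasses register themselves at import time, so the module defining them must be imported before the lookup.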

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

File	Description
src/inference_endpoint/evaluation/scoring.py	Added scorer registry pattern and new `RougeScorer` class for ROUGE-based evaluation
src/inference_endpoint/evaluation/extractor.py	Added `NoneExtractor` for pass-through extraction and null handling in `Extractor.get()`
src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py	Created `OpenOrca` dataset class with configurable column remapping
src/inference_endpoint/dataset_manager/factory.py	Updated factory to support predefined datasets with custom dataloaders
src/inference_endpoint/dataset_manager/__init__.py	Exported new `OpenOrca` dataset class
src/inference_endpoint/commands/benchmark.py	Changed to use configurable scorer lookup instead of hardcoded `PassAt1Scorer`
examples/06_Llama2-70B_Example/online_llama2_70b_orca.yaml	Updated configuration with corrected parser mappings and separate accuracy/performance datasets
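
The pass-through `NoneExtractor` and the null handling in `Extractor.get()` from the table above could be sketched like this. The sketch assumes a name-keyed registry and a classmethod `extract`; the `register` decorator and `LastLineExtractor` are hypothetical and exist only to give the example something to contrast with.

```python
from typing import Dict, Optional, Type


class Extractor:
    """Base extractor with a name-keyed registry (structure assumed for this sketch)."""

    _registry: Dict[str, Type["Extractor"]] = {}

    @classmethod
    def register(cls, name: str):
        """Hypothetical decorator that stores a subclass under `name`."""
        def decorator(sub: Type["Extractor"]) -> Type["Extractor"]:
            cls._registry[name] = sub
            return sub
        return decorator

    @classmethod
    def get(cls, name: Optional[str]) -> Type["Extractor"]:
        # A missing extractor name falls back to the no-op extractor.
        if name is None:
            return NoneExtractor
        return cls._registry[name]

    @classmethod
    def extract(cls, text: str) -> str:
        raise NotImplementedError


class NoneExtractor(Extractor):
    """Returns the model output unchanged."""

    @classmethod
    def extract(cls, text: str) -> str:
        return text


@Extractor.register("last_line")
class LastLineExtractor(Extractor):
    """Hypothetical extractor that keeps only the final non-empty line."""

    @classmethod
    def extract(cls, text: str) -> str:
        lines = text.strip().splitlines()
        return lines[-1] if lines else ""
```

Returning a no-op class instead of `None` lets callers invoke `extract` unconditionally, which suits a metric such as ROUGE that scores the full generated text.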

src/inference_endpoint/evaluation/extractor.py

src/inference_endpoint/evaluation/scoring.py

examples/06_Llama2-70B_Example/online_llama2_70b_orca.yaml

src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py

gemini-code-assist

Code Review

This pull request introduces support for ROUGE scoring for accuracy evaluation, specifically for Llama2-70b on the OpenOrca dataset. The changes include a new RougeScorer, dynamic scorer loading, a new predefined OpenOrca dataset, and updates to the dataset factory to handle file-based predefined datasets. The overall approach is sound, but there are several areas for improvement regarding code quality, correctness, and best practices. My review includes suggestions to fix type hint inconsistencies, improve the dataset factory's introspection logic, remove runtime side-effects like nltk.download(), and refactor the RougeScorer to align with the base class contract and improve its implementation.
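
For readers unfamiliar with the ROUGE-L F1 metric named in the summaries of this PR, the sketch below computes it in pure Python from the longest common subsequence (LCS) of tokens. It is not the `RougeScorer` from this PR, which relies on the `evaluate` and `nltk` libraries. The sketch assumes lowercase whitespace tokenization with no stemming, so its scores may differ from the library's, and the function names are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 with naive tokenization (an assumption of this sketch)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)  # share of predicted tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

For example, "the cat sat on the mat" against "the cat is on the mat" shares a five-token subsequence out of six tokens on each side, so precision, recall, and F1 are all 5/6.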

src/inference_endpoint/evaluation/extractor.py

src/inference_endpoint/evaluation/scoring.py

src/inference_endpoint/dataset_manager/factory.py

src/inference_endpoint/evaluation/scoring.py

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

src/inference_endpoint/evaluation/scoring.py

src/inference_endpoint/dataset_manager/factory.py

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

arekay-nv

Left some initial comments. Please rebase, as I think some changes are missing, and feel free to ping when ready for review.
Thanks

src/inference_endpoint/commands/benchmark.py

src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py

src/inference_endpoint/evaluation/extractor.py

src/inference_endpoint/evaluation/scoring.py

…cy and performance; add evaluation download instructions to 70b orca README Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

src/inference_endpoint/evaluation/scoring.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

src/inference_endpoint/evaluation/scoring.py

tianmu-li · 2026-01-15T17:36:24Z

@arekay-nv Thanks for the review. I've addressed the comments, and this PR is ready for review.

arekay-nv

Can you confirm the ROUGE scores needed and also the factory loader?
@nv-alicheng - Please take a look as well.

examples/06_Llama2-70B_Example/README.md

src/inference_endpoint/dataset_manager/factory.py

src/inference_endpoint/evaluation/scoring.py

…ca benchmark run Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

…eat/llama2_70b_accuracy

arekay-nv

Approving to unblock llama3. Please follow up with the refactor in a follow-up.

src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py

src/inference_endpoint/dataset_manager/factory.py

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py

…init__.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

src/inference_endpoint/evaluation/scoring.py

tianmu-li added 5 commits January 13, 2026 17:00

WIP

5db1b1e

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Update yaml

3eed127

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Fix ordering

26aba9a

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Merge remote-tracking branch 'origin/main' into feat/llama2_70b_accuracy

0864d0c

Remove unused code

18440f5

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot AI review requested due to automatic review settings January 14, 2026 22:14

github-actions bot requested review from arekay-nv and nvzhihanj January 14, 2026 22:14

Copilot AI reviewed Jan 14, 2026

github-code-quality bot found potential problems Jan 14, 2026

src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py (fixed)

gemini-code-assist bot reviewed Jan 14, 2026

Add dummy score_single_sample

4895303

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

nvzhihanj requested a review from nv-alicheng January 14, 2026 22:25

Minor cleanup

ebb0f60

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot AI review requested due to automatic review settings January 14, 2026 22:48

Copilot AI reviewed Jan 14, 2026

Fix typo. Return only rouge1 for now

68b75dd

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

arekay-nv reviewed Jan 15, 2026

tianmu-li added 2 commits January 15, 2026 08:30

Merge remote-tracking branch 'origin/main' into feat/llama2_70b_accuracy

9f64330

Resolve merge conflicts; fix issue when using same dataset for accura…

1ea9814

…cy and performance; add evaluation download instructions to 70b orca README Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Copilot AI review requested due to automatic review settings January 15, 2026 17:27

Copilot AI reviewed Jan 15, 2026

src/inference_endpoint/evaluation/scoring.py (resolved)

src/inference_endpoint/evaluation/scoring.py (resolved)

src/inference_endpoint/evaluation/scoring.py (outdated, resolved)

Update src/inference_endpoint/evaluation/scoring.py

b560b80

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings January 15, 2026 17:30

Copilot AI reviewed Jan 15, 2026

src/inference_endpoint/evaluation/scoring.py (resolved)

tianmu-li marked this pull request as ready for review January 15, 2026 17:35

tianmu-li requested a review from a team as a code owner January 15, 2026 17:35

github-actions bot requested a review from arekay-nv January 15, 2026 17:35

arekay-nv reviewed Jan 20, 2026

attafosu mentioned this pull request Jan 20, 2026

feat:llama3.1-8b accuracy #100

Merged

10 tasks

tianmu-li added 3 commits January 20, 2026 10:27

Add dataset download instructions; embed dataset download into openor…

916620a

…ca benchmark run Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Avoid re-importing nltk

8a7c596

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

Merge remote-tracking branch 'origin/feat/llama2_70b_accuracy' into f…

113879f

…eat/llama2_70b_accuracy

tianmu-li requested a review from arekay-nv January 20, 2026 18:39

arekay-nv approved these changes Jan 21, 2026

Merge branch 'main' into feat/llama2_70b_accuracy

173b22c

Copilot AI review requested due to automatic review settings January 21, 2026 05:55

Copilot AI reviewed Jan 21, 2026

src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py (outdated, resolved)

Update src/inference_endpoint/dataset_manager/predefined/open_orca/__…

14531df

…init__.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings January 21, 2026 05:56

Copilot AI reviewed Jan 21, 2026

src/inference_endpoint/evaluation/scoring.py (resolved)

tianmu-li merged commit 1316c80 into main Jan 21, 2026
4 checks passed

github-actions bot locked and limited conversation to collaborators Jan 21, 2026

arekay-nv deleted the feat/llama2_70b_accuracy branch April 2, 2026 03:05
