Feat/llama2 70b accuracy #97

Merged
tianmu-li merged 16 commits into main from feat/llama2_70b_accuracy
Jan 21, 2026

@tianmu-li
Collaborator

What does this PR do?

Add rouge score scorer. Enable accuracy + performance run for llama2-70b. Fix mapping for open-orca.

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Copilot AI review requested due to automatic review settings January 14, 2026 22:14
@github-actions

github-actions bot commented Jan 14, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist

Summary of Changes

Hello @tianmu-li, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the benchmarking capabilities by introducing ROUGE scoring for accuracy evaluation, specifically tailored for the Llama2-70B model on the OpenOrca dataset. It refactors the evaluation system to dynamically select scorers and improves dataset handling, allowing for more comprehensive and configurable accuracy and performance assessments. These changes provide greater flexibility in evaluating large language models and streamline the process of integrating new evaluation metrics.

Highlights

  • New ROUGE Scorer Implementation: A new RougeScorer class has been added to enable ROUGE-L F1 score calculation for text generation evaluation, including necessary text preprocessing and dependency management for nltk and evaluate libraries.
  • Llama2-70B Benchmarking with OpenOrca: The Llama2-70B example configuration (online_llama2_70b_orca.yaml) has been updated to include both accuracy and performance benchmarking runs using the OpenOrca dataset, with accuracy evaluation specifically configured to use the new ROUGE scorer.
  • Dynamic Scorer Selection: The benchmark execution logic has been refactored to dynamically select evaluation scorers based on the configuration's eval_method, replacing the previous hardcoded PassAt1Scorer and allowing for more flexible evaluation strategies.
  • OpenOrca Dataset Integration: A new OpenOrca predefined dataset loader has been integrated, providing a structured way to load and process the OpenOrca dataset for accuracy evaluations, along with fixes for its parser mappings.
  • Flexible Dataset Loader Factory: The dataset loader factory now intelligently checks if predefined dataset loaders accept specific arguments (dataset_path, metadata, remap), enhancing its flexibility for various dataset types.
  • NoneExtractor for Optional Extraction: A NoneExtractor has been introduced to serve as a no-operation extractor when no specific text extraction is required, improving the robustness of the extraction mechanism.
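
For readers unfamiliar with the ROUGE-L F1 metric named in the highlights, the following is a minimal, dependency-free sketch of how such a score can be computed. It is not the PR's RougeScorer, which according to the summary relies on the nltk and evaluate libraries; the function names `lcs_length` and `rouge_l_f1` and the lowercase whitespace tokenization are assumptions made for illustration only.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the best subsequence that ended before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping either token.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (hypothetical, simplified tokenizer)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)  # share of predicted tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

Production scorers typically add stemming and sentence splitting before computing the subsequence, so scores from this sketch may not match those reported by a library implementation.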

Copilot AI left a comment

Pull request overview

This PR adds ROUGE scoring capability for evaluating text generation tasks, enables accuracy and performance benchmarking for Llama2-70B, and fixes the dataset mapping configuration for the Open-Orca dataset.

Changes:

  • Added a new RougeScorer class with ROUGE-L F1 scoring for text generation evaluation
  • Introduced a registry pattern for scorer lookup via Scorer.get() and __init_subclass__
  • Added NoneExtractor to handle cases where no output extraction is needed
  • Updated Open-Orca dataset configuration to support both accuracy and performance evaluation modes
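
The registry pattern mentioned in the list above can be sketched roughly as follows. This is an illustrative assumption about the shape of such a registry, not the code merged in this PR: the `scorer_id` class keyword, the `_registry` attribute, and `ExactMatchScorer` are hypothetical names; only `Scorer.get()` and `__init_subclass__` come from the review summary.

```python
from typing import ClassVar, Optional


class Scorer:
    """Base class; subclasses register themselves under a short name."""

    # Hypothetical storage: maps a scorer name to its class.
    _registry: ClassVar[dict[str, type["Scorer"]]] = {}

    def __init_subclass__(cls, scorer_id: Optional[str] = None, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        if scorer_id is not None:
            Scorer._registry[scorer_id] = cls

    @classmethod
    def get(cls, name: str) -> "type[Scorer]":
        """Look up a scorer class by name, e.g. a value read from a config."""
        try:
            return cls._registry[name]
        except KeyError:
            known = ", ".join(sorted(cls._registry)) or "<none>"
            raise KeyError(f"Unknown scorer {name!r}; registered: {known}") from None

    def score(self, prediction: str, reference: str) -> float:
        raise NotImplementedError


class ExactMatchScorer(Scorer, scorer_id="exact_match"):
    """Toy scorer used only to show registration."""

    def score(self, prediction: str, reference: str) -> float:
        return float(prediction.strip() == reference.strip())
```

With a registry like this, benchmark code can resolve a scorer from a configuration string (for example `Scorer.get("exact_match")`) instead of importing a specific class.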

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

File | Description
src/inference_endpoint/evaluation/scoring.py | Added scorer registry pattern and new RougeScorer class for ROUGE-based evaluation
src/inference_endpoint/evaluation/extractor.py | Added NoneExtractor for pass-through extraction and null handling in Extractor.get()
src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py | Created OpenOrca dataset class with configurable column remapping
src/inference_endpoint/dataset_manager/factory.py | Updated factory to support predefined datasets with custom dataloaders
src/inference_endpoint/dataset_manager/__init__.py | Exported new OpenOrca dataset class
src/inference_endpoint/commands/benchmark.py | Changed to use configurable scorer lookup instead of hardcoded PassAt1Scorer
examples/06_Llama2-70B_Example/online_llama2_70b_orca.yaml | Updated configuration with corrected parser mappings and separate accuracy/performance datasets

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for ROUGE scoring for accuracy evaluation, specifically for Llama2-70b on the OpenOrca dataset. The changes include a new RougeScorer, dynamic scorer loading, a new predefined OpenOrca dataset, and updates to the dataset factory to handle file-based predefined datasets. The overall approach is sound, but there are several areas for improvement regarding code quality, correctness, and best practices. My review includes suggestions to fix type hint inconsistencies, improve the dataset factory's introspection logic, remove runtime side-effects like nltk.download(), and refactor the RougeScorer to align with the base class contract and improve its implementation.
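
As a rough illustration of the factory introspection this review refers to, the sketch below uses the standard-library `inspect.signature` to pass a loader only the keyword arguments it declares. The helper name `filter_supported_kwargs` and the two example loaders are hypothetical and are not taken from the PR; the argument names dataset_path, metadata, and remap come from the earlier summary.

```python
import inspect
from typing import Any, Callable


def filter_supported_kwargs(
    loader: Callable[..., Any], candidates: dict[str, Any]
) -> dict[str, Any]:
    """Return only the candidate kwargs that `loader` can accept by name."""
    params = inspect.signature(loader).parameters
    # A loader that declares **kwargs accepts every candidate.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(candidates)
    by_name = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return {
        name: value
        for name, value in candidates.items()
        if name in params and params[name].kind in by_name
    }


# Hypothetical loaders with different signatures.
def load_from_file(dataset_path, remap=None):
    return ("file", dataset_path, remap)


def load_builtin():
    return ("builtin",)


options = {"dataset_path": "data.parquet", "metadata": {}, "remap": None}
file_result = load_from_file(**filter_supported_kwargs(load_from_file, options))
builtin_result = load_builtin(**filter_supported_kwargs(load_builtin, options))
```

Checking parameter kinds, rather than only parameter names, avoids passing a keyword to a positional-only parameter.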

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
@nvzhihanj nvzhihanj requested a review from nv-alicheng January 14, 2026 22:25
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Copilot AI review requested due to automatic review settings January 14, 2026 22:48

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Collaborator

@arekay-nv arekay-nv left a comment

Left some initial comments. Please rebase, as I think some changes are missing, and feel free to ping me when it's ready for review.
Thanks

…cy and performance; add evaluation download instructions to 70b orca README

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Copilot AI review requested due to automatic review settings January 15, 2026 17:27

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings January 15, 2026 17:30

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


@tianmu-li tianmu-li marked this pull request as ready for review January 15, 2026 17:35
@tianmu-li tianmu-li requested a review from a team as a code owner January 15, 2026 17:35
@github-actions github-actions bot requested a review from arekay-nv January 15, 2026 17:35
@tianmu-li
Collaborator Author

@arekay-nv Thanks for the review. I've addressed the comments, and this PR is ready for review.

Collaborator

@arekay-nv arekay-nv left a comment

Can you confirm the ROUGE scores needed, and also the factory loader?
@nv-alicheng - Please take a look as well.

@attafosu attafosu mentioned this pull request Jan 20, 2026
10 tasks
…ca benchmark run

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
@tianmu-li tianmu-li requested a review from arekay-nv January 20, 2026 18:39
Collaborator

@arekay-nv arekay-nv left a comment

Approving to unblock llama3. Please follow up with the refactor in a follow-up.

Copilot AI review requested due to automatic review settings January 21, 2026 05:55

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


…init__.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings January 21, 2026 05:56

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


@tianmu-li tianmu-li merged commit 1316c80 into main Jan 21, 2026
4 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 21, 2026
@arekay-nv arekay-nv deleted the feat/llama2_70b_accuracy branch April 2, 2026 03:05