Conversation
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✅
Summary of changes (Gemini Code Assist): This pull request enhances the benchmarking capabilities by introducing ROUGE scoring for accuracy evaluation, specifically tailored for the Llama2-70B model on the OpenOrca dataset. It refactors the evaluation system to dynamically select scorers and improves dataset handling, allowing for more comprehensive and configurable accuracy and performance assessments. These changes provide greater flexibility in evaluating large language models and streamline the process of integrating new evaluation metrics.
Pull request overview
This PR adds ROUGE scoring capability for evaluating text generation tasks, enables accuracy and performance benchmarking for Llama2-70B, and fixes the dataset mapping configuration for the Open-Orca dataset.
Changes:
- Added a new `RougeScorer` class with ROUGE-L F1 scoring for text generation evaluation
- Introduced a registry pattern for scorer lookup via `Scorer.get()` and `__init_subclass__`
- Added `NoneExtractor` to handle cases where no output extraction is needed
- Updated Open-Orca dataset configuration to support both accuracy and performance evaluation modes
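The registry pattern described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the class names and `score()` signature are assumptions, and the real `Scorer` base class may differ.

```python
# Sketch of a scorer registry: subclasses self-register via
# __init_subclass__, and Scorer.get() looks them up by name.

class Scorer:
    _registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register every concrete subclass under its class name.
        Scorer._registry[cls.__name__] = cls

    @classmethod
    def get(cls, name):
        try:
            return cls._registry[name]
        except KeyError:
            raise ValueError(f"Unknown scorer: {name!r}") from None

    def score(self, predictions, references):
        raise NotImplementedError


class PassAt1Scorer(Scorer):
    def score(self, predictions, references):
        # Fraction of predictions that exactly match their reference.
        hits = sum(p == r for p, r in zip(predictions, references))
        return hits / max(len(references), 1)


scorer_cls = Scorer.get("PassAt1Scorer")
print(scorer_cls().score(["a", "b"], ["a", "c"]))  # prints 0.5
```

The benefit, as the PR's `benchmark.py` change suggests, is that the command layer can resolve a scorer from a config string instead of hardcoding a class.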
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/inference_endpoint/evaluation/scoring.py | Added scorer registry pattern and new RougeScorer class for ROUGE-based evaluation |
| src/inference_endpoint/evaluation/extractor.py | Added NoneExtractor for pass-through extraction and null handling in Extractor.get() |
| src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py | Created OpenOrca dataset class with configurable column remapping |
| src/inference_endpoint/dataset_manager/factory.py | Updated factory to support predefined datasets with custom dataloaders |
| src/inference_endpoint/dataset_manager/__init__.py | Exported new OpenOrca dataset class |
| src/inference_endpoint/commands/benchmark.py | Changed to use configurable scorer lookup instead of hardcoded PassAt1Scorer |
| examples/06_Llama2-70B_Example/online_llama2_70b_orca.yaml | Updated configuration with corrected parser mappings and separate accuracy/performance datasets |
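For context on what the new `RougeScorer` computes: ROUGE-L F1 is based on the longest common subsequence (LCS) between prediction and reference tokens. The sketch below is a minimal pure-Python illustration of the metric; the PR presumably relies on an established implementation (such as the `rouge-score` package) that additionally handles stemming and tokenization.

```python
# Minimal illustration of ROUGE-L F1 on whitespace tokens.

def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)   # LCS coverage of the prediction
    recall = lcs / len(ref)       # LCS coverage of the reference
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the cat sat", "the cat sat on the mat"))  # prints 0.6666666666666666
```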
Code Review
This pull request introduces support for ROUGE scoring for accuracy evaluation, specifically for Llama2-70b on the OpenOrca dataset. The changes include a new RougeScorer, dynamic scorer loading, a new predefined OpenOrca dataset, and updates to the dataset factory to handle file-based predefined datasets. The overall approach is sound, but there are several areas for improvement regarding code quality, correctness, and best practices. My review includes suggestions to fix type hint inconsistencies, improve the dataset factory's introspection logic, remove runtime side-effects like nltk.download(), and refactor the RougeScorer to align with the base class contract and improve its implementation.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
arekay-nv
left a comment
Left some initial comments. Please rebase, as I think some changes are missing, and feel free to ping when ready for review.
Thanks
src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py
…cy and performance; add evaluation download instructions to 70b orca README Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
@arekay-nv Thanks for the review. I've addressed the comments, and this PR is ready for review.
arekay-nv
left a comment
Can you confirm the ROUGE scores needed, and also the factory loader?
@nv-alicheng - Please take a look as well.
…ca benchmark run Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
…eat/llama2_70b_accuracy
arekay-nv
left a comment
Approving to unblock llama3; please follow up with the refactor in a follow-up PR.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
src/inference_endpoint/dataset_manager/predefined/open_orca/__init__.py
…init__.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
What does this PR do?
Add a ROUGE score scorer. Enable accuracy + performance runs for Llama2-70B. Fix the column mapping for Open-Orca.
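The Open-Orca mapping fix refers to the configurable column remapping in the new OpenOrca dataset class. A minimal sketch of the idea is below; the column names and mapping shape are assumptions for illustration, and the actual class in the PR may differ.

```python
# Hypothetical column remapping for an Open-Orca-style dataset row:
# rename known columns to the names the benchmark expects, and pass
# unknown columns through unchanged.

DEFAULT_COLUMN_MAP = {
    "system_prompt": "system",
    "question": "input",
    "response": "output",
}

def remap_columns(row: dict, column_map: dict = DEFAULT_COLUMN_MAP) -> dict:
    return {column_map.get(k, k): v for k, v in row.items()}

row = {"question": "2+2?", "response": "4", "id": "abc"}
print(remap_columns(row))  # prints {'input': '2+2?', 'output': '4', 'id': 'abc'}
```

Making the map a constructor argument (as the "configurable column remapping" in the file summary suggests) lets the same dataset class serve both the accuracy and the performance configurations in the example YAML.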
Type of change
Related issues
Testing
Checklist