Initial skeleton for Evaluator classes and exceptions #6
Merged
11 commits
All by nathan-weinberg:

50adbe7 Initial skeleton for Evaluator classes and exceptions
3558524 Add additional child classes for PR-Bench and PR-MMLU
fb2c51b Add attribute descriptors to class docstrings
f62821a Change 'server' to 'server_url' for clarity
0023da3 Seperated out individual `run` commands for each child class
0db0be4 Change 'model' to 'model_path' for clarity
4e72d07 Change 'fewshots' and 'batchsize' to snake_case
11ad758 Change ret value from single dict to multiple ret values
3eadb4d Add docstrings to `run` methods
b422048 Update attributes of ModelNotFoundError class
20d2fc4 Add badges to README.md
README.md (badges added below the title):

```markdown
# eval

![Lint](https://github.com/instructlab/eval/actions/workflows/lint.yml/badge.svg?branch=main)
![Build](https://github.com/instructlab/eval/actions/workflows/pypi.yaml/badge.svg?branch=main)
![Release](https://img.shields.io/github/v/release/instructlab/eval)
![License](https://img.shields.io/github/license/instructlab/eval)

Python library for Evaluation
```
evaluator.py (new file):

```python
# SPDX-License-Identifier: Apache-2.0


class Evaluator:
    """
    Parent class for Evaluators

    Attributes:
        model_path   Path to the model to be evaluated
    """

    def __init__(self, model_path: str) -> None:
        self.model_path = model_path
```
Exceptions (new file):

```python
# SPDX-License-Identifier: Apache-2.0


class EvalError(Exception):
    """
    Parent class for all of instructlab-eval exceptions
    """


class ModelNotFoundError(EvalError):
    """
    Exception raised when model is not able to be found

    Attributes
        message   error message to be printed on raise
        model     model that is being operated on
        path      filepath of model location
    """

    def __init__(self, path) -> None:
        super().__init__()
        self.path = path
        self.model = path.rsplit("/")[-1]
        self.message = f"Model {self.model} could not be found at {self.path}"
```
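A minimal usage sketch, assuming the exceptions land in an importable `exceptions` module of the `instructlab.eval` package; the import path and the `check_model_path` helper are illustrative, not part of this PR:

```python
# Illustrative only: the import path and helper below are assumptions, not code
# from this PR. The sketch shows how ModelNotFoundError's attributes are used.
import os

from instructlab.eval.exceptions import ModelNotFoundError  # assumed module path


def check_model_path(model_path: str) -> str:
    """Return model_path unchanged if it exists, else raise ModelNotFoundError."""
    if not os.path.exists(model_path):
        raise ModelNotFoundError(model_path)
    return model_path


try:
    check_model_path("models/example-model.gguf")  # placeholder path
except ModelNotFoundError as err:
    # err.model holds the trailing path component, err.path the full path
    print(err.message)
```

Since `ModelNotFoundError.__init__` does not pass the message to `super().__init__()`, callers read `err.message` rather than relying on `str(err)`.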
MMLU evaluators (new file):

```python
# SPDX-License-Identifier: Apache-2.0

# Local
from .evaluator import Evaluator


class MMLU_Evaluator(Evaluator):
    """
    Child class of an Evaluator for Massive Multitask Language Understanding (MMLU)

    Attributes:
        tasks        list of tasks for MMLU to test the model with
        few_shots    number of examples
        batch_size   number of GPUs
    """

    def __init__(
        self, model_path, tasks: list[str], few_shots: int = 2, batch_size: int = 5
    ) -> None:
        super().__init__(model_path)
        self.tasks = tasks
        self.few_shots = few_shots
        self.batch_size = batch_size

    def run(self) -> tuple:
        """
        Runs MMLU evaluation

        Returns:
            overall_score      MMLU score for the overall model evaluation
            individual_scores  Individual MMLU score for each task
        """
        individual_scores: dict[str, float] = {}
        overall_score: float = 0.0
        return overall_score, individual_scores


class PR_MMLU_Evaluator(Evaluator):
    """
    Child class of an Evaluator for PR Massive Multitask Language Understanding (PR MMLU)

    Attributes:
        sdg_path     path where all the PR MMLU tasks are stored
        task         group name that is shared by all the PR MMLU tasks
        few_shots    number of examples
        batch_size   number of GPUs
    """

    def __init__(
        self,
        model_path,
        sdg_path: str,
        task: str = "mmlu_pr",
        few_shots: int = 2,
        batch_size: int = 5,
    ) -> None:
        super().__init__(model_path)
        self.sdg_path = sdg_path
        self.task = task
        self.few_shots = few_shots
        self.batch_size = batch_size

    def run(self) -> tuple:
        """
        Runs PR MMLU evaluation

        Returns:
            overall_score      PR MMLU score for the overall model evaluation
            individual_scores  Individual PR MMLU scores for each task
            qa_pairs           Question and answer pairs from the evaluation
        """
        individual_scores: dict[str, float] = {}
        overall_score: float = 0.0
        qa_pairs: list[tuple] = []
        return overall_score, individual_scores, qa_pairs
```
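A minimal sketch of how a caller might drive these evaluators; the import path, file paths, and task names are placeholders rather than values from this PR, and in this skeleton `run` still returns zeroed scores:

```python
# Illustrative only: module path, file paths, and task names are placeholders.
# In this skeleton, run() returns zeroed scores and empty collections.
from instructlab.eval.mmlu import MMLU_Evaluator, PR_MMLU_Evaluator  # assumed module path

mmlu = MMLU_Evaluator(
    model_path="models/example-model",                # placeholder path
    tasks=["mmlu_abstract_algebra", "mmlu_anatomy"],  # example MMLU task names
    few_shots=2,
    batch_size=5,
)
overall_score, individual_scores = mmlu.run()

pr_mmlu = PR_MMLU_Evaluator(
    model_path="models/example-model",  # placeholder path
    sdg_path="generated/sdg",           # placeholder path to generated PR MMLU tasks
)
pr_overall, pr_individual, qa_pairs = pr_mmlu.run()
```

Returning multiple values instead of a single dict (commit 11ad758) lets callers unpack only the scores they need.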
MT-Bench evaluators (new file):

```python
# SPDX-License-Identifier: Apache-2.0

# Local
from .evaluator import Evaluator


class MT_Bench_Evaluator(Evaluator):
    """
    Child class of an Evaluator for Multi-turn Benchmark (MT-Bench)

    Attributes
        server_url   vLLM server endpoint
    """

    def __init__(self, model_path, server_url: str) -> None:
        super().__init__(model_path)
        self.server_url = server_url

    def run(self) -> tuple:
        """
        Runs MT-Bench evaluation

        Returns:
            overall_score  MT-Bench score for the overall model evaluation
            qa_pairs       Question and answer pairs from the evaluation
        """
        overall_score: float = 0.0
        qa_pairs: list[tuple] = []
        return overall_score, qa_pairs


class PR_Bench_Evaluator(Evaluator):
    """
    Child class of an Evaluator for PR-Bench Benchmark (PR-Bench)

    Attributes
        server_url   vLLM server endpoint
        questions    questions to be asked
    """

    def __init__(self, model_path, server_url: str, questions: str) -> None:
        super().__init__(model_path)
        self.server_url = server_url
        self.questions = questions

    def run(self) -> tuple:
        """
        Runs PR-Bench evaluation

        Returns:
            overall_score  PR-Bench score for the overall model evaluation
            qa_pairs       Question and answer pairs from the evaluation
        """
        overall_score = 0.0
        qa_pairs: list[tuple] = []
        return overall_score, qa_pairs
```
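A companion sketch for the MT-Bench classes, again with an assumed import path and placeholder endpoint, model path, and questions file; the `run` methods are still stubs in this PR:

```python
# Illustrative only: module path, endpoint, and file names are placeholders.
from instructlab.eval.mt_bench import MT_Bench_Evaluator, PR_Bench_Evaluator  # assumed module path

mt_bench = MT_Bench_Evaluator(
    model_path="models/example-model",      # placeholder path
    server_url="http://localhost:8000/v1",  # example vLLM endpoint
)
mt_score, mt_qa_pairs = mt_bench.run()

pr_bench = PR_Bench_Evaluator(
    model_path="models/example-model",
    server_url="http://localhost:8000/v1",
    questions="pr_bench_questions.jsonl",   # placeholder questions file
)
pr_score, pr_qa_pairs = pr_bench.run()
```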