Feat: Add accuracy evaluation for LLMs (GPQA, AIME, HLE etc.) #4

@nvzhihanj

Description

Add datasets, post-processing scripts, and an environment (likely Docker) to evaluate accuracy based on the outputs collected from the inference endpoints.
List of accuracy evals to add:

  • GPQA
  • MMLU, MMLU-Pro
  • AIME
  • MATH500
  • HLE
  • Health Bench
  • TBD
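As a starting point for the post-processing side of this, here is a minimal sketch of an exact-match scorer. All names here (`extract_answer`, `score`, the `response`/`reference` record fields) are hypothetical, not from an existing script in this repo; real benchmarks like MATH500 or HLE would need benchmark-specific answer normalization on top of this.

```python
import re

def extract_answer(text):
    """Pull the final answer out of a model response.

    Looks for an "Answer: X" pattern first, then falls back to the
    last standalone integer in the text (useful for AIME-style
    integer answers). Returns None if nothing is found.
    """
    m = re.search(r"[Aa]nswer\s*[:=]\s*([A-Za-z0-9.\-/]+)", text)
    if m:
        return m.group(1).strip().rstrip(".")
    nums = re.findall(r"-?\d+", text)
    return nums[-1] if nums else None

def score(records):
    """Compute exact-match accuracy.

    records: iterable of dicts with 'response' (model output text)
    and 'reference' (gold answer) keys. Responses with no
    extractable answer count as incorrect.
    """
    correct = 0
    total = 0
    for rec in records:
        total += 1
        pred = extract_answer(rec["response"])
        if pred is not None and pred.lower() == str(rec["reference"]).lower():
            correct += 1
    return correct / total if total else 0.0
```

For example, a record set where two of three responses match their references would score 2/3. Packaging this inside the Docker environment keeps the scoring step reproducible and separate from inference.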

Metadata

Labels

  • ShowStopper
  • Priority !!!: Something that is critically important to implement or fix
  • accuracy: Accuracy evaluation and scoring
  • feature: A new feature
