BenchmarkAggregator 🚀

MIT License Contributions Welcome

Rigorous, unbiased, and scalable LLM evaluations across diverse AI benchmarks, from GPQA Diamond to Chatbot Arena, testing all major models equally.

BenchmarkAggregator Dashboard

View Leaderboard | Features | Benchmarks | FAQ

🎯 Introduction

The BenchmarkAggregator framework serves as a central hub for consistent model evaluation in the AI community. By comparing Large Language Models (LLMs) across challenging, well-respected benchmarks in one unified location, it offers a holistic, fair, and scalable view of model performance. The approach balances depth of evaluation against resource constraints, keeping comparisons fair while remaining practical and accessible from a single source.

📊 Model Performance Overview

| Model | Average Score |
| --- | --- |
| gpt-4o-2024-08-06 | 69.0 |
| claude-3.5-sonnet | 66.2 |
| gpt-4o-mini-2024-07-18 | 62.1 |
| mistral-large | 61.4 |
| llama-3.1-405b-instruct | 59.8 |
| llama-3.1-70b-instruct | 58.4 |
| claude-3-sonnet | 53.2 |
| gpt-3.5-turbo-0125 | 34.8 |

For detailed scores across all benchmarks, visit our leaderboard.

🌟 Features

  1. 🏆 Incorporates the most respected benchmarks in the AI community
  2. 📊 Balanced evaluation using 100 randomly drawn samples per benchmark (adjustable)
  3. 🔌 Quick and easy integration of new benchmarks and models (uses OpenRouter, which makes adding new models trivial; see the sketch after this list)
  4. 📈 Holistic performance view through score averaging across diverse tasks
  5. ⚖️ Efficient approach balancing evaluation depth with resource constraints
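
As an illustration of point 3, here is a minimal sketch of what querying a model through OpenRouter's OpenAI-compatible API can look like. The model ID, environment variable name, and prompt are placeholders for illustration, not the framework's actual code.

```python
# Illustrative sketch: calling a model via OpenRouter's OpenAI-compatible endpoint.
# Any model listed on openrouter.ai can be swapped in by changing the model ID string.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed environment variable name
)

response = client.chat.completions.create(
    model="openai/gpt-4o-2024-08-06",  # model ID as shown on the OpenRouter website
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)
print(response.choices[0].message.content)
```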

🏆 Current Benchmarks

  1. MMLU-Pro
  2. GPQA-Diamond
  3. ChatbotArena
  4. MATH-Hard
  5. MuSR
  6. ARC-Challenge
  7. HellaSwag
  8. LiveBench
  9. MGSM

📖 Learn more about each benchmark on our website

🤔 FAQ

Why not run all questions for each benchmark? Running all questions for each benchmark would be cost-prohibitive. Our approach balances comprehensive evaluation with practical resource constraints.
How are benchmark samples chosen? The samples are randomly drawn from the larger benchmark dataset. The same sample set is used for each model to ensure consistency and fair comparison across all evaluations.
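
For intuition, the idea can be sketched as drawing a fixed-size subset with a fixed random seed so that every model is evaluated on exactly the same questions. The dataset, sample size, and seed below are placeholders, not the framework's actual configuration.

```python
# Illustrative sketch of seeded sampling: the seed is fixed, so the same
# subset of questions is drawn every time and shared across all models.
import random

def sample_questions(questions, n=100, seed=42):
    rng = random.Random(seed)  # fixed seed -> identical sample on every run
    return rng.sample(questions, min(n, len(questions)))

benchmark_questions = [f"question-{i}" for i in range(1, 1001)]  # stand-in for a real benchmark
shared_subset = sample_questions(benchmark_questions)
# Each model is then evaluated on exactly this shared_subset.
```
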
Why are certain models like Claude 3 Opus and GPT-4 Turbo absent? These models are significantly more expensive to query than many others, so they are omitted for cost reasons.
How easy is it to add new benchmarks or models? Very. An existing benchmark can typically be integrated in a few minutes. For models, we use OpenRouter, which covers essentially all closed- and open-source options: simply find the model's ID on the OpenRouter website and include it in the framework.
How are the scores from Chatbot Arena calculated? The scores for Chatbot Arena are fetched directly from their website. These scores are then normalized against the values of other models in this benchmark.
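
As a rough illustration of what such normalization could look like (the exact scheme is not described here and may differ), min-max scaling of Arena ratings onto a 0-100 range would be:

```python
# Hypothetical min-max normalization of Chatbot Arena ratings onto 0-100.
# The ratings below are made-up placeholders, and BenchmarkAggregator's
# actual normalization scheme may differ.
arena_ratings = {"model-a": 1280, "model-b": 1250, "model-c": 1120}

lo, hi = min(arena_ratings.values()), max(arena_ratings.values())
normalized = {m: 100 * (r - lo) / (hi - lo) for m, r in arena_ratings.items()}
print(normalized)  # {'model-a': 100.0, 'model-b': 81.25, 'model-c': 0.0}
```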

👉 View more FAQs on our website

🤝 Contributing

We welcome contributions from the community! If you have any questions, suggestions, or requests, please don't hesitate to create an issue. Your input is valuable in helping us improve and expand the BenchmarkAggregator.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

We're grateful to the creators and maintainers of the benchmark datasets used in this project, as well as to OpenRouter for making model integration seamless.


Made with ❤️ by the AI community
