BenchmarkAggregator 🚀

MIT License Contributions Welcome

Rigorous, unbiased, and scalable LLM evaluations across diverse AI benchmarks, from GPQA Diamond to Chatbot Arena, testing all major models equally.

BenchmarkAggregator Dashboard

View Leaderboard | Features | Benchmarks | FAQ

🎯 Introduction

The BenchmarkAggregator framework serves as a central hub for consistent model evaluation in the AI community. By comparing Large Language Models (LLMs) across challenging, well-respected benchmarks in one unified location, it offers a holistic, fair, and scalable view of model performance. The approach balances depth of evaluation against resource constraints, keeping comparisons fair while remaining practical and accessible from a single source.

📊 Model Performance Overview

| Model | Average Score |
| --- | --- |
| gpt-4o-2024-08-06 | 69.0 |
| claude-3.5-sonnet | 66.2 |
| gpt-4o-mini-2024-07-18 | 62.1 |
| mistral-large | 61.4 |
| llama-3.1-405b-instruct | 59.8 |
| llama-3.1-70b-instruct | 58.4 |
| claude-3-sonnet | 53.2 |
| gpt-3.5-turbo-0125 | 34.8 |

For detailed scores across all benchmarks, visit our leaderboard.

🌟 Features

  1. 🏆 Incorporates the most respected benchmarks in the AI community
  2. 📊 Balanced evaluation using 100 randomly drawn samples per benchmark (adjustable)
  3. 🔌 Quick and easy integration of new benchmarks and models (uses OpenRouter, which makes adding new models trivial; see the sketch after this list)
  4. 📈 Holistic performance view through score averaging across diverse tasks
  5. ⚖️ Efficient approach balancing evaluation depth with resource constraints
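
As an illustration of point 3, here is a minimal sketch of what querying a model through OpenRouter's OpenAI-compatible API can look like. The model ID, environment variable name, and prompt are placeholders for illustration, not the framework's actual code.

```python
# Illustrative sketch: calling a model via OpenRouter's OpenAI-compatible endpoint.
# Any model listed on openrouter.ai can be swapped in by changing the model ID string.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed environment variable name
)

response = client.chat.completions.create(
    model="openai/gpt-4o-2024-08-06",  # model ID as shown on the OpenRouter website
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)
print(response.choices[0].message.content)
```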

🏆 Current Benchmarks

  1. MMLU-Pro
  2. GPQA-Diamond
  3. ChatbotArena
  4. MATH-Hard
  5. MuSR
  6. ARC-Challenge
  7. HellaSwag
  8. LiveBench
  9. MGSM

📖 Learn more about each benchmark on our website

🤔 FAQ

Why not run all questions for each benchmark? Running all questions for each benchmark would be cost-prohibitive. Our approach balances comprehensive evaluation with practical resource constraints.
How are benchmark samples chosen? The samples are randomly drawn from the larger benchmark dataset. The same sample set is used for each model to ensure consistency and fair comparison across all evaluations.
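
For intuition, the idea can be sketched as drawing a fixed-size subset with a fixed random seed so that every model is evaluated on exactly the same questions. The dataset, sample size, and seed below are placeholders, not the framework's actual configuration.

```python
# Illustrative sketch of seeded sampling: the seed is fixed, so the same
# subset of questions is drawn every time and shared across all models.
import random

def sample_questions(questions, n=100, seed=42):
    rng = random.Random(seed)  # fixed seed -> identical sample on every run
    return rng.sample(questions, min(n, len(questions)))

benchmark_questions = [f"question-{i}" for i in range(1, 1001)]  # stand-in for a real benchmark
shared_subset = sample_questions(benchmark_questions)
# Each model is then evaluated on exactly this shared_subset.
```
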
Why are certain models like Claude 3 Opus and GPT-4 Turbo absent? These models are significantly more expensive to query than many others, so they are omitted for cost reasons.
How easy is it to add new benchmarks or models? Very. An existing benchmark can typically be integrated in a few minutes. For models, we use OpenRouter, which covers essentially all closed- and open-source options: simply find the model's ID on the OpenRouter website and include it in the framework.
How are the scores from Chatbot Arena calculated? The scores for Chatbot Arena are fetched directly from their website. These scores are then normalized against the values of other models in this benchmark.
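
As a rough illustration of what such normalization could look like (the exact scheme is not described here and may differ), min-max scaling of Arena ratings onto a 0-100 range would be:

```python
# Hypothetical min-max normalization of Chatbot Arena ratings onto 0-100.
# The ratings below are made-up placeholders, and BenchmarkAggregator's
# actual normalization scheme may differ.
arena_ratings = {"model-a": 1280, "model-b": 1250, "model-c": 1120}

lo, hi = min(arena_ratings.values()), max(arena_ratings.values())
normalized = {m: 100 * (r - lo) / (hi - lo) for m, r in arena_ratings.items()}
print(normalized)  # {'model-a': 100.0, 'model-b': 81.25, 'model-c': 0.0}
```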

👉 View more FAQs on our website

🤝 Contributing

We welcome contributions from the community! If you have any questions, suggestions, or requests, please don't hesitate to create an issue. Your input is valuable in helping us improve and expand the BenchmarkAggregator.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

We're grateful to the creators and maintainers of the benchmark datasets used in this project, as well as to OpenRouter for making model integration seamless.


Made with ❤️ by the AI community
