diff --git a/README.md b/README.md
index 47f4c07fd..330be8300 100644
--- a/README.md
+++ b/README.md
@@ -21,48 +21,76 @@
 ---
 
-**Documentation**: HF's doc
+<p align="center">
+
+  <a href="https://huggingface.co/docs/lighteval/main/en/index">Documentation</a>
+
+</p>
 
 ---
 
-### Unlock the Power of LLM Evaluation with Lighteval 🚀
+**Lighteval** is your *all-in-one toolkit* for evaluating LLMs across multiple
+backends—whether your model is **served behind an API** or **already loaded in memory**.
+Dive deep into your model's performance by saving and exploring *detailed,
+sample-by-sample results* to debug and see how your models stack up.
+
+*Customization at your fingertips*: browse all our existing tasks and [metrics](https://huggingface.co/docs/lighteval/metric-list), or effortlessly create your own [custom task](https://huggingface.co/docs/lighteval/adding-a-custom-task) and [custom metric](https://huggingface.co/docs/lighteval/adding-a-new-metric) tailored to your needs (a sketch follows the task list below).
+
+## Available Tasks
 
-**Lighteval** is your all-in-one toolkit for evaluating LLMs across multiple
-backends—whether it's
-[transformers](https://github.com/huggingface/transformers),
-[tgi](https://github.com/huggingface/text-generation-inference),
-[vllm](https://github.com/vllm-project/vllm), or
-[nanotron](https://github.com/huggingface/nanotron)—with
-ease. Dive deep into your model’s performance by saving and exploring detailed,
-sample-by-sample results to debug and see how your models stack-up.
+Lighteval supports **7,000+ evaluation tasks** across multiple domains and languages. Here's an overview of some *popular benchmarks*:
 
-Customization at your fingertips: letting you either browse all our existing [tasks](https://huggingface.co/docs/lighteval/available-tasks) and [metrics](https://huggingface.co/docs/lighteval/metric-list) or effortlessly create your own [custom task](https://huggingface.co/docs/lighteval/adding-a-custom-task) and [custom metric](https://huggingface.co/docs/lighteval/adding-a-new-metric), tailored to your needs.
-Seamlessly experiment, benchmark, and store your results on the Hugging Face
-Hub, S3, or locally.
+### 📚 **Knowledge**
+- **General Knowledge**: MMLU, MMLU-Pro, MMMU, BIG-Bench
+- **Question Answering**: TriviaQA, Natural Questions, SimpleQA, Humanity's Last Exam (HLE)
+- **Specialized**: GPQA, AGIEval
+
+### 🧮 **Math and Code**
+- **Math Problems**: GSM8K, GSM-Plus, MATH, MATH500
+- **Competition Math**: AIME24, AIME25
+- **Multilingual Math**: MGSM (Grade School Math in 10+ languages)
+- **Coding Benchmarks**: LCB (LiveCodeBench)
 
-## 🔑 Key Features
+### 🎯 **Chat Model Evaluation**
+- **Instruction Following**: IFEval, IFEval-fr
+- **Reasoning**: MUSR, DROP (discrete reasoning)
+- **Long Context**: RULER
+- **Dialogue**: MT-Bench
+- **Holistic Evaluation**: HELM, BIG-Bench
 
-- **Speed**: [Use vllm as backend for fast evals](https://huggingface.co/docs/lighteval/use-vllm-as-backend).
-- **Completeness**: [Use the accelerate backend to launch any models hosted on Hugging Face](https://huggingface.co/docs/lighteval/quicktour#accelerate).
-- **Seamless Storage**: [Save results in S3 or Hugging Face Datasets](https://huggingface.co/docs/lighteval/saving-and-reading-results).
-- **Python API**: [Simple integration with the Python API](https://huggingface.co/docs/lighteval/using-the-python-api).
-- **Custom Tasks**: [Easily add custom tasks](https://huggingface.co/docs/lighteval/adding-a-custom-task).
-- **Versatility**: Tons of [metrics](https://huggingface.co/docs/lighteval/metric-list) and [tasks](https://huggingface.co/docs/lighteval/available-tasks) ready to go.
+### 🌍 **Multilingual Evaluation**
+- **Cross-lingual**: XTREME, Flores200 (200 languages), XCOPA, XQuAD
+- **Language-specific**:
+  - **Arabic**: ArabicMMLU
+  - **Filipino**: FilBench
+  - **French**: IFEval-fr, GPQA-fr, BAC-fr
+  - **German**: German RAG Eval
+  - **Serbian**: Serbian LLM Benchmark, OZ Eval
+  - **Turkic**: TUMLU (9 Turkic languages)
+  - **Chinese**: CMMLU, CEval, AGIEval
+  - **Russian**: RUMMLU, Russian SQuAD
+  - **And many more...**
+
+### 🧠 **Core Language Understanding**
+- **NLU**: GLUE, SuperGLUE, TriviaQA, Natural Questions
+- **Commonsense**: HellaSwag, WinoGrande, ProtoQA
+- **Natural Language Inference**: XNLI
+- **Reading Comprehension**: SQuAD, XQuAD, MLQA, Belebele
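+
+Don't see the task you need? A custom task is a small Python file. Below is a
+*rough sketch only*, based on the [custom task guide](https://huggingface.co/docs/lighteval/adding-a-custom-task):
+the dataset, its columns, and some field names are illustrative assumptions and
+may differ between Lighteval versions.
+
+```python
+# custom_task.py: illustrative sketch; see the custom-task guide for the exact API.
+from lighteval.metrics.metrics import Metrics
+from lighteval.tasks.lighteval_task import LightevalTaskConfig
+from lighteval.tasks.requests import Doc
+
+
+def prompt_fn(line: dict, task_name: str = ""):
+    # Map one dataset row to a Doc: the prompt shown to the model, the
+    # candidate answers, and the index of the gold answer.
+    return Doc(
+        task_name=task_name,
+        query=f"Question: {line['question']}\nAnswer:",
+        choices=[f" {c}" for c in line["choices"]],  # hypothetical dataset columns
+        gold_index=line["label"],
+    )
+
+
+my_task = LightevalTaskConfig(
+    name="my_task",
+    prompt_function=prompt_fn,
+    suite=["community"],
+    hf_repo="my-org/my-dataset",  # hypothetical dataset repo
+    hf_subset="default",
+    evaluation_splits=["test"],
+    metrics=[Metrics.loglikelihood_acc],
+)
+
+# Lighteval discovers custom tasks through this module-level table.
+TASKS_TABLE = [my_task]
+```
+
+You would then pass the file to the CLI (per the guide, via its `--custom-tasks`
+option) and select the task as `community|my_task|0`.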
 
 ## ⚡️ Installation
 
-Note that lighteval is currently completely untested on Windows, and we don't support it yet. (Should be fully functional on Mac/Linux)
+> **Note**: Lighteval is currently untested on Windows, and we don't support it there yet; it should be fully functional on macOS and Linux.
 
 ```bash
 pip install lighteval
 ```
 
-Lighteval allows for many extras when installing, see [here](https://huggingface.co/docs/lighteval/installation) for a complete list.
+Lighteval allows for *many extras* when installing; see [here](https://huggingface.co/docs/lighteval/installation) for the **complete list**.
 
-If you want to push results to the Hugging Face Hub, add your access token as
+If you want to push results to the **Hugging Face Hub**, add your access token as
 an environment variable:
 
 ```shell
@@ -73,19 +101,23 @@ huggingface-cli login
 
 Lighteval offers the following entry points for model evaluation:
 
-- `lighteval accelerate` : evaluate models on CPU or one or more GPUs using [🤗
+- `lighteval accelerate`: Evaluate models on CPU or one or more GPUs using [🤗
   Accelerate](https://github.com/huggingface/accelerate)
-- `lighteval nanotron`: evaluate models in distributed settings using [⚡️
+- `lighteval nanotron`: Evaluate models in distributed settings using [⚡️
   Nanotron](https://github.com/huggingface/nanotron)
-- `lighteval vllm`: evaluate models on one or more GPUs using [🚀
+- `lighteval vllm`: Evaluate models on one or more GPUs using [🚀
   VLLM](https://github.com/vllm-project/vllm)
-- `lighteval endpoint`
-  - `inference-endpoint`: evaluate models on one or more GPUs using [🔗
-    Inference Endpoint](https://huggingface.co/inference-endpoints/dedicated)
-  - `tgi`: evaluate models on one or more GPUs using [🔗 Text Generation Inference](https://huggingface.co/docs/text-generation-inference/en/index)
-  - `openai`: evaluate models on one or more GPUs using [🔗 OpenAI API](https://platform.openai.com/)
+- `lighteval sglang`: Evaluate models using [SGLang](https://github.com/sgl-project/sglang) as backend
+- `lighteval endpoint`: Evaluate models using various endpoints as backend
+  - `lighteval endpoint inference-endpoint`: Evaluate models using Hugging Face's [Inference Endpoints API](https://huggingface.co/inference-endpoints/dedicated)
+  - `lighteval endpoint tgi`: Evaluate models using [🔗 Text Generation Inference](https://huggingface.co/docs/text-generation-inference/en/index) running locally
+  - `lighteval endpoint litellm`: Evaluate models on any compatible API using [LiteLLM](https://www.litellm.ai/)
+  - `lighteval endpoint inference-providers`: Evaluate models using [Hugging Face's inference providers](https://huggingface.co/docs/inference-providers/en/index) as backend
+- `lighteval custom`: Evaluate custom models (anything you can wrap yourself; see the guide linked below)
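+
+All backends share the same invocation shape: a model-argument string and one or
+more task specifications of the form `suite|task|num_fewshot`. As an illustration
+(the model name is an arbitrary example, and the `model_name=...` argument format
+is assumed from the Lighteval docs), a vLLM run looks roughly like:
+
+```shell
+# example model id; any Hugging Face model servable by vLLM works here
+lighteval vllm \
+    "model_name=HuggingFaceH4/zephyr-7b-beta" \
+    "lighteval|gsm8k|0"
+```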
+
+Didn't find what you need? You can always evaluate a custom model of your own by following [this guide](https://huggingface.co/docs/lighteval/main/en/evaluating-a-custom-model).
 
-Here’s a quick command to evaluate using the Accelerate backend:
+Here's a **quick command** to evaluate using the *Accelerate backend*:
 
 ```shell
 lighteval accelerate \
@@ -93,28 +125,65 @@ lighteval accelerate \
     "leaderboard|truthfulqa:mc|0"
 ```
 
+Or use the **Python API** to run a model *already loaded in memory*!
+
+```python
+from transformers import AutoModelForCausalLM
+
+from lighteval.logging.evaluation_tracker import EvaluationTracker
+from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
+from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters
+
+
+MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
+BENCHMARKS = "lighteval|gsm8k|0"
+
+# Track and store results locally.
+evaluation_tracker = EvaluationTracker(output_dir="./results")
+pipeline_params = PipelineParameters(
+    launcher_type=ParallelismManager.NONE,
+    max_samples=2,  # cap samples per task for a quick smoke test
+)
+
+# Load the model yourself, then wrap it for Lighteval.
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_NAME, device_map="auto"
+)
+config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
+model = TransformersModel.from_model(model, config)
+
+pipeline = Pipeline(
+    model=model,
+    pipeline_parameters=pipeline_params,
+    evaluation_tracker=evaluation_tracker,
+    tasks=BENCHMARKS,
+)
+
+pipeline.evaluate()
+pipeline.show_results()
+results = pipeline.get_results()
+```
+
 ## 🙏 Acknowledgements
 
-Lighteval started as an extension of the fantastic [Eleuther AI
+Lighteval started as an extension of the *fantastic* [Eleuther AI
 Harness](https://github.com/EleutherAI/lm-evaluation-harness) (which powers the
 [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard))
-and draws inspiration from the amazing
+and draws inspiration from the *amazing*
 [HELM](https://crfm.stanford.edu/helm/latest/) framework.
 
-While evolving Lighteval into its own standalone tool, we are grateful to the
-Harness and HELM teams for their pioneering work on LLM evaluations.
+While evolving Lighteval into its own *standalone tool*, we are grateful to the
+Harness and HELM teams for their **pioneering work** on LLM evaluations.
 
 ## 🌟 Contributions Welcome 💙💚💛💜🧡
 
-Got ideas? Found a bug? Want to add a
+**Got ideas?** Found a bug? Want to add a
 [task](https://huggingface.co/docs/lighteval/adding-a-custom-task) or
 [metric](https://huggingface.co/docs/lighteval/adding-a-new-metric)?
-Contributions are warmly welcomed!
+Contributions are *warmly welcomed*!
 
-If you're adding a new feature, please open an issue first.
+If you're adding a **new feature**, please *open an issue first*.
 
-If you open a PR, don't forget to run the styling!
+If you open a PR, don't forget to **run the styling**!
 
 ```bash
 pip install -e .[dev]
@@ -128,7 +197,7 @@ pre-commit run --all-files
   author = {Habib, Nathan and Fourrier, Clémentine and Kydlíček, Hynek and Wolf, Thomas and Tunstall, Lewis},
   title = {LightEval: A lightweight framework for LLM evaluation},
   year = {2023},
-  version = {0.8.0},
+  version = {0.10.0},
   url = {https://github.com/huggingface/lighteval}
 }
 ```