
---

<p align="center">
  <a href="https://huggingface.co/docs/lighteval/main/en/index" target="_blank">
    <img alt="Documentation" src="https://img.shields.io/badge/Documentation-4F4F4F?style=for-the-badge&logo=readthedocs&logoColor=white" />
  </a>
</p>

---

### Unlock the Power of LLM Evaluation with Lighteval 🚀
**Lighteval** is your *all-in-one toolkit* for evaluating LLMs across multiple
backends—whether your model is being **served somewhere** or **already loaded in memory**.
Dive deep into your model's performance by saving and exploring *detailed,
sample-by-sample results* to debug and see how your models stack up.

*Customization at your fingertips*: browse all our existing [tasks](https://huggingface.co/docs/lighteval/available-tasks) and [metrics](https://huggingface.co/docs/lighteval/metric-list), or effortlessly create your own [custom task](https://huggingface.co/docs/lighteval/adding-a-custom-task) and [custom metric](https://huggingface.co/docs/lighteval/adding-a-new-metric), tailored to your needs.


## Available Tasks

Lighteval supports **7,000+ evaluation tasks** across multiple domains and languages. Here's an overview of some *popular benchmarks*:

### 📚 **Knowledge**
- **General Knowledge**: MMLU, MMLU-Pro, MMMU, BIG-Bench
- **Question Answering**: TriviaQA, Natural Questions, SimpleQA, Humanity's Last Exam (HLE)
- **Specialized**: GPQA, AGIEval

### 🧮 **Math and Code**
- **Math Problems**: GSM8K, GSM-Plus, MATH, MATH500
- **Competition Math**: AIME24, AIME25
- **Multilingual Math**: MGSM (Grade School Math in 10+ languages)
- **Coding Benchmarks**: LCB (LiveCodeBench)

### 🎯 **Chat Model Evaluation**
- **Instruction Following**: IFEval, IFEval-fr
- **Reasoning**: MUSR, DROP (discrete reasoning)
- **Long Context**: RULER
- **Dialogue**: MT-Bench
- **Holistic Evaluation**: HELM, BIG-Bench

### 🌍 **Multilingual Evaluation**
- **Cross-lingual**: XTREME, Flores200 (200 languages), XCOPA, XQuAD
- **Language-specific**:
- **Arabic**: ArabicMMLU
- **Filipino**: FilBench
- **French**: IFEval-fr, GPQA-fr, BAC-fr
- **German**: German RAG Eval
- **Serbian**: Serbian LLM Benchmark, OZ Eval
- **Turkic**: TUMLU (9 Turkic languages)
- **Chinese**: CMMLU, CEval, AGIEval
- **Russian**: RUMMLU, Russian SQuAD
- **And many more...**

### 🧠 **Core Language Understanding**
- **NLU**: GLUE, SuperGLUE, TriviaQA, Natural Questions
- **Commonsense**: HellaSwag, WinoGrande, ProtoQA
- **Natural Language Inference**: XNLI
- **Reading Comprehension**: SQuAD, XQuAD, MLQA, Belebele
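
Tasks are referenced by a spec string of the form `suite|task|num_few_shot`, as in the Quickstart examples further down this README. The sketch below is illustrative only — the comma-separated multi-task syntax and the exact task identifiers should be double-checked against the [list of available tasks](https://huggingface.co/docs/lighteval/available-tasks):

```bash
# Evaluate one model on two benchmarks in a single run:
# 0-shot GSM8K and 0-shot TruthfulQA (multiple choice).
lighteval accelerate \
    "model_name=gpt2" \
    "lighteval|gsm8k|0,leaderboard|truthfulqa:mc|0"
```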


## ⚡️ Installation

> **Note**: Lighteval is currently *untested on Windows* and we don't support it there yet. (*It should be fully functional on Mac/Linux.*)

```bash
pip install lighteval
```

Lighteval supports *many optional extras* at install time; see [here](https://huggingface.co/docs/lighteval/installation) for the **complete list**.
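
Backend-specific dependencies ship as optional extras. The extra names below are examples rather than a guaranteed list — the linked installation page is authoritative:

```bash
pip install lighteval[vllm]   # vLLM backend dependencies
pip install lighteval[dev]    # development tools (used for styling further below)
```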

If you want to push results to the **Hugging Face Hub**, add your access token as
an environment variable:

```shell
huggingface-cli login
```
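
Alternatively, set the token non-interactively. This is a minimal sketch assuming the standard `HF_TOKEN` environment variable read by `huggingface_hub`; the token value itself is a placeholder:

```bash
# Create a token at https://huggingface.co/settings/tokens
export HF_TOKEN=<your-access-token>
```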
## 🚀 Quickstart

Lighteval offers the following entry points for model evaluation:

- `lighteval accelerate`: Evaluate models on CPU or one or more GPUs using [🤗
Accelerate](https://github.com/huggingface/accelerate)
- `lighteval nanotron`: Evaluate models in distributed settings using [⚡️
Nanotron](https://github.com/huggingface/nanotron)
- `lighteval vllm`: Evaluate models on one or more GPUs using [🚀
VLLM](https://github.com/vllm-project/vllm)
- `lighteval sglang`: Evaluate models using [SGLang](https://github.com/sgl-project/sglang) as the backend
- `lighteval endpoint`: Evaluate models served via one of the following endpoint backends:
  - `lighteval endpoint inference-endpoint`: Evaluate models using Hugging Face's [Inference Endpoints API](https://huggingface.co/inference-endpoints/dedicated)
  - `lighteval endpoint tgi`: Evaluate models using [🔗 Text Generation Inference](https://huggingface.co/docs/text-generation-inference/en/index) running locally
  - `lighteval endpoint litellm`: Evaluate models on any compatible API using [LiteLLM](https://www.litellm.ai/)
  - `lighteval endpoint inference-providers`: Evaluate models using [Hugging Face's inference providers](https://huggingface.co/docs/inference-providers/en/index) as the backend

- `lighteval custom`: Evaluate your own custom model implementation (can be anything) by following [this guide](https://huggingface.co/docs/lighteval/main/en/evaluating-a-custom-model)

Here's a **quick command** to evaluate using the *Accelerate backend*:

```shell
lighteval accelerate \
"model_name=gpt2" \
"leaderboard|truthfulqa:mc|0"
```
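
The same pattern works for the other backends. For example, a vLLM run might look like the sketch below — this assumes `lighteval vllm` accepts the same `model_name=...` argument string and task spec; see the [vLLM backend docs](https://huggingface.co/docs/lighteval/use-vllm-as-backend) for the exact options:

```bash
lighteval vllm \
    "model_name=gpt2" \
    "leaderboard|truthfulqa:mc|0"
```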

Or use the **Python API** to run a model *already loaded in memory*!

```python
from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters


MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "lighteval|gsm8k|0"

# Where to write results; the tracker can also push them to the Hub or S3
# (see the saving-and-reading-results docs).
evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    max_samples=2,  # cap the number of samples for a quick smoke test
)

# Load the model yourself, then wrap it so Lighteval can drive it.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto"
)
config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
model = TransformersModel.from_model(model, config)

pipeline = Pipeline(
    model=model,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    tasks=BENCHMARKS,
)

pipeline.evaluate()
pipeline.show_results()           # print the aggregated scores
results = pipeline.get_results()  # detailed results as a dictionary
```

## 🙏 Acknowledgements

Lighteval started as an extension of the *fantastic* [Eleuther AI
Harness](https://github.com/EleutherAI/lm-evaluation-harness) (which powers the
[Open LLM
Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard))
and draws inspiration from the *amazing*
[HELM](https://crfm.stanford.edu/helm/latest/) framework.

While evolving Lighteval into its own *standalone tool*, we are grateful to the
Harness and HELM teams for their **pioneering work** on LLM evaluations.

## 🌟 Contributions Welcome 💙💚💛💜🧡

**Got ideas?** Found a bug? Want to add a
[task](https://huggingface.co/docs/lighteval/adding-a-custom-task) or
[metric](https://huggingface.co/docs/lighteval/adding-a-new-metric)?
Contributions are *warmly welcomed*!

If you're adding a **new feature**, please *open an issue first*.

If you open a PR, don't forget to **run the styling**!

```bash
pip install -e .[dev]
pre-commit run --all-files
```

## 📜 Citation

```bibtex
@misc{lighteval,
author = {Habib, Nathan and Fourrier, Clémentine and Kydlíček, Hynek and Wolf, Thomas and Tunstall, Lewis},
title = {LightEval: A lightweight framework for LLM evaluation},
year = {2023},
version = {0.10.0},
url = {https://github.com/huggingface/lighteval}
}
```