Fast & more realistic evaluation of chat language models. Includes leaderboard.


Project status: I will not add significant new features and will mostly fix bugs.

FastEval

This project allows you to quickly evaluate instruction-following and chat language models on a number of benchmarks. See the comparison to lm-evaluation-harness for more information. There is also a leaderboard.

Features

  • Evaluation on various benchmarks with a single command. Supported benchmarks are MT‑Bench for conversational capabilities, HumanEval+ and DS-1000 for Python coding performance, and Chain of Thought (GSM8K + MATH + BBH + MMLU) for reasoning capabilities, as well as custom test data.
  • High performance. FastEval uses vLLM for fast inference by default and can optionally use text-generation-inference instead. Both backends are ~20x faster than Hugging Face transformers.
  • Detailed information about model performance. FastEval saves the outputs of the language model and other intermediate results to disk, which makes it possible to gain deeper insight into model performance. You can look at the performance on different categories and even inspect individual model outputs.
  • Model-specific prompt templates: FastEval uses the right prompt template for the model being evaluated. Many prompt templates are supported, and using FastChat extends the selection even further.

Installation

# Install `python3.10`, `python3.10-venv` and `python3.10-dev`.
# The following command assumes an ubuntu >= 22.04 system.
apt install python3.10 python3.10-venv python3.10-dev

# Clone this repository, make it the current working directory
git clone --depth 1 https://github.com/FastEval/FastEval.git
cd FastEval

# Set up the virtual environment
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
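As an optional sanity check after activating the virtual environment, you can confirm the interpreter version and whether the vLLM dependency resolved (this is a sketch, not part of the official setup):

```shell
# With the venv activated, the interpreter should report Python 3.10.x.
python3 --version

# vLLM is the key dependency installed from requirements.txt; check it is importable.
python3 -c "import importlib.util; print('vllm found' if importlib.util.find_spec('vllm') else 'vllm missing')"
```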

This installs vLLM for fast inference, which is sufficient for most models. If you encounter problems with vLLM or your model is not supported, FastEval also supports text-generation-inference as an alternative backend. Please see here if you would like to use text-generation-inference.

OpenAI API Key

MT-Bench uses GPT-4 as a judge for evaluating model outputs. For this benchmark, you need to configure an OpenAI API key by setting the OPENAI_API_KEY environment variable. Note that methods other than setting this environment variable won't work. The cost of evaluating a new model on MT-Bench is approximately $5.
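Exporting the variable in the shell before invoking fasteval is enough; the key value below is a placeholder, and the check afterwards only verifies that the variable is visible to child processes:

```shell
# Export the key so fasteval inherits it (placeholder value shown).
export OPENAI_API_KEY="sk-placeholder"

# Confirm the variable is visible to child processes before starting an evaluation.
python3 -c 'import os; print("OPENAI_API_KEY is", "set" if os.environ.get("OPENAI_API_KEY") else "unset")'
```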

Evaluation

⚠️ Running fasteval currently executes untrusted code: remote code shipped with some models, as well as LLM-generated code when using HumanEval+ and DS-1000. There is currently no integrated sandbox.

To evaluate a new model, call fasteval in the following way:

./fasteval [-b <benchmark_name> ...] -t <model_type> -m <model_name>

The -b flag specifies the benchmarks that you want to evaluate your model on. The default is all, but you can also specify one or multiple individual benchmarks. Possible values are mt-bench, human-eval-plus, ds1000, cot, cot/gsm8k, cot/math, cot/bbh, cot/mmlu and custom-test-data.

The -t flag specifies the type of the model which is either the prompt template or the API client that will be used. Please see here for information on which model type to select for your model.

The -m flag specifies the name of the model, which can be a Hugging Face model name, a path to a local folder, or an OpenAI model name.

For example, this command will evaluate OpenAssistant/pythia-12b-sft-v8-2.5k-steps on HumanEval+:

./fasteval -b human-eval-plus -t open-assistant -m OpenAssistant/pythia-12b-sft-v8-2.5k-steps

There are also flags for enabling and configuring data-parallel evaluation, setting model arguments, and changing the inference backend. Use ./fasteval -h for more information.

Viewing the results

A very short summary of the final scores will be written to stdout after the evaluation has finished.

More details are available through the web UI, where you can view performance on different subtasks or inspect individual model inputs & outputs. To access the web UI, run python3 -m http.server in the root folder of this repository. This starts a simple webserver for static files, usually on port 8000, in which case you can view the detailed results at http://localhost:8000.
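The serving step can be scripted; this sketch serves the repository root on an explicit port and checks that it responds (it assumes curl is available, and uses a background job that is cleaned up afterwards):

```shell
# From the root of the FastEval checkout, serve the static files in the background.
python3 -m http.server 8000 &
SERVER_PID=$!
sleep 1

# The web UI should now answer at http://localhost:8000 with HTTP 200.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/

# Stop the background server again.
kill "$SERVER_PID"
```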

Help & Contributing

For questions, problems, and contributions, join the Alignment Lab AI Discord server or create a GitHub issue. Contributions are very welcome; please read the contributing guide for more information.
