DSEval is a series of benchmarks for evaluating LLM-powered data science agents.
This repository provides both the toolkit that supports the benchmarking and the data used in the benchmarks.
DSEval (as a Python package) provides the infrastructure needed for reliable evaluation of data science agents, and ships with integrations for several popular data science agents.
The package itself DOES NOT contain any benchmark data or benchmarking results.
Recommended. Note that the version installed this way might not be the latest.

```
pip install dseval
```
Use `pip install dseval[agent]` to automatically install the dependencies required by the integrated agents.
Install from source code.
```
git clone https://github.com/MetaCopilot/dseval
cd dseval
pip install -e .
```
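After either installation route, a quick import check can confirm the setup. This is only a minimal sketch: it verifies that the package is importable and nothing more.

```python
# Minimal installation sanity check: confirm the dseval package can be
# imported and report where it was installed from.
import dseval

print("dseval imported from:", dseval.__file__)
```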
The table below summarizes the benchmarks currently provided in this repository.
| Benchmark | Latest Version | # Sets | # Problems | Difficulty |
|---|---|---|---|---|
| Exercise | v1 | 21 | 187 | 17.3 |
| SO | v1 | 202 | 202 | 16.2 |
| LeetCode | v1 | 40 | 40 | 56.0 |
| Kaggle | v1 | 31 | 396 | 35.9 |
- Install the toolkit following the guide above.
- Clone the `dseval` repository if you haven't done so.
- Use `python scripts/test.py <path_to_benchmark>` to test a benchmark. You can select agent frameworks, LLMs, endpoints, and evaluation configurations.
For example:
```
python scripts/test.py benchmarks/leetcode --model gpt-35-turbo --endpoint aoai
```
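If you want to run several benchmarks in one go, a small driver script can loop over the same command. This is only a sketch built around the CLI invocation shown above; the benchmark paths other than `benchmarks/leetcode` are assumptions, and any additional flags should be checked with the script's own help output.

```python
# Sketch: sweep scripts/test.py over multiple benchmark directories.
# Only the flags shown above (--model, --endpoint) are used here; the extra
# benchmark path is an assumption and may differ in your checkout.
import subprocess

benchmarks = [
    "benchmarks/leetcode",
    "benchmarks/exercise",  # assumed path, adjust to your local layout
]

for path in benchmarks:
    subprocess.run(
        [
            "python", "scripts/test.py", path,
            "--model", "gpt-35-turbo",
            "--endpoint", "aoai",
        ],
        check=True,  # stop the sweep if a run fails
    )
```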
TODO: Guide for properly setting up a reproducible environment for evaluation.
Use the DSEval browser to diagnose agents' performance on benchmarks. For example:

```
python -m dseval.browser results
```
You will see a webpage like this:
Use `python -m dseval.browser --help` to see more options.
We are collecting problems via this Google form. If you have any ideas to challenge LLMs, LLM-powered data science agents, or any agents, you are welcome to submit them here.
Multi-lingual version coming soon.
We will provide a tutorial soon. Currently, you can refer to the examples provided in `benchmarks/examples`.
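As a rough, conceptual illustration of what such an example contains (the files under `benchmarks/examples` are the authoritative reference; the field names below are hypothetical and not the actual schema), each problem pairs a natural-language question with reference code and a validator configuration:

```python
# Purely illustrative sketch of the ingredients of a DSEval problem.
# The real on-disk format is defined by benchmarks/examples; the keys and
# validator settings below are hypothetical placeholders.
problem = {
    "question": "Compute the mean of column 'price' in the dataframe `df`.",
    "reference_code": "df['price'].mean()",
    "validator": {"result": "exact_match"},  # hypothetical validator spec
}
```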
TODO
This repository is built upon the ideas proposed in this paper. If you find it useful in your research, please consider citing it:
```bibtex
@misc{zhang2024benchmarking,
    title={Benchmarking Data Science Agents},
    author={Yuge Zhang and Qiyang Jiang and Xingyu Han and Nan Chen and Yuqing Yang and Kan Ren},
    year={2024},
    eprint={2402.17168},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}
```