
DSEval

DSEval is a series of benchmarks for evaluating LLM-powered data science agents.

This repository provides both the toolkit that supports benchmarking and the data used in the benchmarks.

Benchmark Toolkit

DSEval (as a Python package) provides the infrastructure needed for reliable evaluation of data science agents, and ships with built-in integrations for several popular data science agents.

It DOES NOT contain any benchmark data or benchmarking results.

Installation Option 1

Recommended. Note that the version installed this way might not be the latest.

pip install dseval

Use pip install dseval[agent] to also install the dependencies required by the built-in agent integrations.

Installation Option 2

Install from source code.

git clone https://github.com/MetaCopilot/dseval
cd dseval
pip install -e .
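
With either installation option, a quick import check confirms that the package is available. The snippet below is just a sanity check and assumes nothing beyond the package being importable as dseval:

# Minimal sanity check: succeeds only if the dseval package is importable.
# Printing __file__ also shows whether the PyPI release or an editable
# source checkout is being picked up.
import dseval
print(dseval.__file__)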

Benchmark Data

This table summarizes the benchmarks currently provided in this repository.

Benchmark   Latest   # Sets   # Problems   Difficulty
Exercise    v1           21          187         17.3
SO          v1          202          202         16.2
LeetCode    v1           40           40         56.0
Kaggle      v1           31          396         35.9

Evaluating an Existing Agent on Existing Benchmarks

  1. Install the toolkit following the guide above.
  2. Clone the dseval repository if you haven't done so.
  3. Use python scripts/test.py <path_to_benchmark> to run a benchmark. You can select the agent framework, LLM, endpoint, and evaluation configuration.

For example:

python scripts/test.py benchmarks/leetcode --model gpt-35-turbo --endpoint aoai
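
To sweep several benchmarks with the same configuration, a small wrapper around the command above is enough. The sketch below assumes it is run from the repository root; every folder name other than benchmarks/leetcode is a guess based on the table above and should be checked against the benchmarks/ directory:

# Run scripts/test.py once per benchmark with a shared agent configuration.
import subprocess

benchmarks = [
    "benchmarks/leetcode",  # confirmed by the example above
    # The names below are guesses; check the benchmarks/ directory.
    "benchmarks/exercise",
    "benchmarks/so",
    "benchmarks/kaggle",
]

for benchmark in benchmarks:
    subprocess.run(
        ["python", "scripts/test.py", benchmark,
         "--model", "gpt-35-turbo", "--endpoint", "aoai"],
        check=True,  # stop immediately if one run fails
    )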

TODO: Guide for properly setting up a reproducible environment for evaluation.

Diagnosis

Use the DSEval browser to diagnose agents' performance on benchmarks. For example:

python -m dseval.browser results

This opens a web page where you can browse and inspect the results.

Use python -m dseval.browser --help to see more options.

Contributing New Problems

We are collecting problems via this Google form. If you have ideas that could challenge LLMs, LLM-powered data science agents, or any other agents, you are welcome to submit them there.

Multi-lingual version coming soon.

Developing New Benchmarks

We will provide a tutorial soon. For now, refer to the examples provided in benchmarks/examples.
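
In the meantime, the gist is that each problem pairs a natural-language question with reference code and a validation configuration. The snippet below is only a conceptual illustration of those ingredients; the names and the validator dictionary are hypothetical, and real problems are authored in DSEval's own file format as shown in benchmarks/examples:

# Conceptual view of what a benchmark problem carries. This is NOT the
# DSEval file format; it only names the ingredients a problem needs.
from dataclasses import dataclass, field

@dataclass
class Problem:
    question: str        # natural-language task given to the agent
    reference_code: str  # ground-truth solution used for comparison
    validator: dict = field(default_factory=dict)  # how outputs are compared

example = Problem(
    question="Compute the mean of column 'age' in the dataframe df.",
    reference_code="result = df['age'].mean()",
    validator={"compare": "result"},  # hypothetical validator setting
)
print(example.question)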

Integrating New Agents

TODO

Citation

This repository is built upon the ideas proposed in this paper. If you find it useful in your research, please consider citing it:

@misc{zhang2024benchmarking,
    title={Benchmarking Data Science Agents}, 
    author={Yuge Zhang and Qiyang Jiang and Xingyu Han and Nan Chen and Yuqing Yang and Kan Ren},
    year={2024},
    eprint={2402.17168},
    archivePrefix={arXiv},
    primaryClass={cs.AI}
}