
"Uncheatable" LLMs Evaluation - LatestEval

Humans receive new test questions every exam, but LLMs? They've been evaluated on the same benchmarks for too long. Why not assess LLMs with fresh tests, just as we test our students? In this project, we introduce LatestEval, which automatically constructs language model benchmarks using the latest materials (e.g., arXiv, BBC, Wikipedia) to prevent "cheating" and data contamination.

News!!

Key Features

  1. We maintain a QA benchmark that is updated every half month using the latest online resources (created within the past half month). This aims to avoid 1) LLMs being trained on the test set (cheating); and 2) the unintentional inclusion of test questions in the training data (data contamination).
  2. We analyzed real human-AI conversations to ensure the automated benchmark aligns well with real-life applications (see the paper for more details).

The Benchmark

Access the latest benchmark directly at the Huggingface Hub!

  • Latest benchmark from GitHub: HF Hub
  • Latest benchmark from arXiv: HF Hub
  • Latest benchmark from BBC: HF Hub
  • The full benchmark with all sources: HF Hub

The benchmarks are created from the latest materials; you can find the raw materials/documents at the Huggingface Hub as well.
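
If you prefer to pull the data programmatically, below is a minimal sketch using the Hugging Face datasets library. The dataset ID and split name are placeholders, not the official identifiers; substitute the exact IDs behind the HF Hub links above.

    # Minimal sketch: load a LatestEval benchmark with the `datasets` library.
    # NOTE: the dataset ID and split below are placeholders -- replace them
    # with the IDs behind the HF Hub links above.
    from datasets import load_dataset

    dataset = load_dataset("<latesteval-dataset-id>", split="test")  # placeholder ID/split
    print(len(dataset))   # number of QA examples
    print(dataset[0])     # inspect one example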

Evaluate your LLM on LatestEval

We will add LatestEval to lm-evaluation-harness and OpenCompass. Stay tuned.
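
Until those integrations are available, a quick manual evaluation loop might look like the sketch below. The dataset ID, the field names ("context", "question", "answer"), and the generate_answer wrapper are assumptions for illustration only, not part of LatestEval's API.

    # Rough sketch of a manual QA evaluation loop (not the official harness).
    # Assumptions: placeholder dataset ID, placeholder field names
    # ("context", "question", "answer"), and a generate_answer() stub that you
    # replace with a call to your own model.
    from datasets import load_dataset

    def generate_answer(context: str, question: str) -> str:
        # Call your LLM here and return its answer as a string.
        raise NotImplementedError

    dataset = load_dataset("<latesteval-dataset-id>", split="test")  # placeholder

    correct = 0
    for example in dataset:
        prediction = generate_answer(example["context"], example["question"])
        # Naive exact match; swap in your preferred QA metric (F1, LLM-as-judge, ...).
        correct += int(prediction.strip().lower() == example["answer"].strip().lower())

    print(f"Exact match: {correct / len(dataset):.2%}")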

Create benchmarks with your own data

  1. Put your documents as .txt files under ./<your_doc_path> (see the layout sketch after this list).
  2. Set your OpenAI key:

     export OPENAI_API_KEY=<Your OpenAI key>

  3. Simply run:

     python data_processor.py --source customized --file_path <your_doc_path> --num_docs 100
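
For step 1, any plain .txt files in a single directory will do. The snippet below only illustrates the expected layout; the directory name and file contents are examples, not requirements.

    # Illustrative layout for step 1: one directory of plain-text documents.
    from pathlib import Path

    doc_dir = Path("my_docs")  # this is the <your_doc_path> passed via --file_path
    doc_dir.mkdir(exist_ok=True)
    for i, text in enumerate(["First source document ...", "Second source document ..."]):
        (doc_dir / f"doc_{i}.txt").write_text(text, encoding="utf-8")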

If you want to reproduce LatestEval on arXiv, BBC, or GitHub (shown here for arXiv):

python data_processor.py --source arxiv --num_docs 100

Issue

Open an issue if you run into any problems or want to discuss the project.

Citation

If you find this project useful, please consider citing it:

@misc{li2023avoiding,
      title={Avoiding Data Contamination in Language Model Evaluation: Dynamic Test Construction with Latest Materials}, 
      author={Yucheng Li and Frank Guerin and Chenghua Lin},
      year={2023},
      eprint={2312.12343},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

About

Latest Evaluation Toolkit (LatestEval): assessing language models with the latest, uncontaminated materials.
