
"Uncheatable" LLMs Evaluation - LatestEval

Humans receive new test questions every exam, but LLMs? They've been evaluated on the same benchmarks for too long. Why not assess LLMs with fresh tests, just as we test our students? In this project, we introduce LatestEval, which automatically constructs language model benchmarks using the latest materials (e.g., arXiv, BBC, Wikipedia) to prevent "cheating" and data contamination.

News!!

Key Features

  1. We maintain a QA benchmark that is updated every half month using the latest online resources (created within the past half month). This aims to avoid 1) LLMs being trained on the test set (cheating); and 2) the unintentional inclusion of test questions in the training data (data contamination).
  2. We analyzed real human-AI conversations to ensure the automated benchmark aligns well with real-life applications (see the paper for more details).

The Benchmark

Access the latest benchmark directly at the Huggingface Hub!

  • Latest benchmark from GitHub: HF Hub
  • Latest benchmark from arXiv: HF Hub
  • Latest benchmark from BBC: HF Hub
  • The full benchmark with all sources: HF Hub

The benchmarks are created from the latest materials; you can find the raw materials/documents at the Huggingface Hub as well.
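
If you prefer to pull the data programmatically, below is a minimal sketch using the Hugging Face datasets library. The dataset ID and split name are placeholders, not the official identifiers; substitute the exact IDs behind the HF Hub links above.

    # Minimal sketch: load a LatestEval benchmark with the `datasets` library.
    # NOTE: the dataset ID and split below are placeholders -- replace them
    # with the IDs behind the HF Hub links above.
    from datasets import load_dataset

    dataset = load_dataset("<latesteval-dataset-id>", split="test")  # placeholder ID/split
    print(len(dataset))   # number of QA examples
    print(dataset[0])     # inspect one example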

Evaluate your LLM on LatestEval

We will add LatestEval to lm-evaluation-harness and OpenCompass. Stay tuned.
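
Until those integrations are available, a quick manual evaluation loop might look like the sketch below. The dataset ID, the field names ("context", "question", "answer"), and the generate_answer wrapper are assumptions for illustration only, not part of LatestEval's API.

    # Rough sketch of a manual QA evaluation loop (not the official harness).
    # Assumptions: placeholder dataset ID, placeholder field names
    # ("context", "question", "answer"), and a generate_answer() stub that you
    # replace with a call to your own model.
    from datasets import load_dataset

    def generate_answer(context: str, question: str) -> str:
        # Call your LLM here and return its answer as a string.
        raise NotImplementedError

    dataset = load_dataset("<latesteval-dataset-id>", split="test")  # placeholder

    correct = 0
    for example in dataset:
        prediction = generate_answer(example["context"], example["question"])
        # Naive exact match; swap in your preferred QA metric (F1, LLM-as-judge, ...).
        correct += int(prediction.strip().lower() == example["answer"].strip().lower())

    print(f"Exact match: {correct / len(dataset):.2%}")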

Create benchmarks with your own data

  1. Put your documents as .txt files under ./<your_doc_path> (see the layout sketch after this list).
  2. Set your OpenAI key:

     export OPENAI_API_KEY=<Your OpenAI key>

  3. Simply run:

     python data_processor.py --source customized --file_path <your_doc_path> --num_docs 100
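
For step 1, any plain .txt files in a single directory will do. The snippet below only illustrates the expected layout; the directory name and file contents are examples, not requirements.

    # Illustrative layout for step 1: one directory of plain-text documents.
    from pathlib import Path

    doc_dir = Path("my_docs")  # this is the <your_doc_path> passed via --file_path
    doc_dir.mkdir(exist_ok=True)
    for i, text in enumerate(["First source document ...", "Second source document ..."]):
        (doc_dir / f"doc_{i}.txt").write_text(text, encoding="utf-8")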

If you want to reproduce LatestEval on arXiv, BBC, or GitHub (shown here for arXiv):

python data_processor.py --source arxiv --num_docs 100

Issue

Open an issue if you run into any problems or want to discuss the project.

Citation

If you find this project useful, please consider citing it:

@misc{li2023avoiding,
      title={Avoiding Data Contamination in Language Model Evaluation: Dynamic Test Construction with Latest Materials}, 
      author={Yucheng Li and Frank Guerin and Chenghua Lin},
      year={2023},
      eprint={2312.12343},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

About

Latest Evaluation Toolkit (LatestEval): assessing language models with the latest, uncontaminated materials.
