Contamination Detector for LLMs Evaluation

Data contamination is a pervasive and critical issue in the evaluation of Large Language Models (LLMs). Our Contamination Detector is designed to identify and analyze potential contamination without needing access to the LLMs' training data, enabling the community to audit LLM evaluation results and conduct robust evaluations.

Our Methods: check potential contamination via search engines

Contamination Detector checks whether test examples appear on the internet via Bing search and the Common Crawl index. We categorize test samples into three subsets:

  1. Clean set: the question and reference answer do not appear online.
  2. Input-only contaminated set: the question appears online, but not its answer.
  3. Input-and-label contaminated set: both question and answer appear online.

If either the question or the answer of a test example is found online, the sample may have been included in the LLM's training data. As a result, LLMs might gain an unfair advantage by 'remembering' these samples rather than genuinely understanding or solving them.
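As a minimal sketch, the three-way categorization above can be expressed as follows. The boolean flags stand in for the outcome of the search step; none of these names come from the repository.

# Minimal sketch of the categorization described above.
def categorize(question_found: bool, answer_found: bool) -> str:
    if question_found and answer_found:
        return "input-and-label contaminated"
    if question_found:
        return "input-only contaminated"
    return "clean"

print(categorize(question_found=True, answer_found=False))
# -> input-only contaminated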

We now support the following popular LLMs benchmarks:

  • MMLU
  • CEval
  • Winogrande
  • ARC
  • Hellaswag
  • CommonsenseQA

Get started: test LLMs' degree of contamination

  1. Clone the repository and install the required packages:
git clone https://github.com/liyucheng09/Contamination_Detector.git
cd Contamination_Detector/
pip install -r requirements.txt
  2. We need model predictions to analyze data contamination. We have prepared predictions for the following LLMs:
  • LLaMA 7,13,30,65B
  • Llama-2 7,13,70B
  • Qwen-7b
  • Baichuan2-7B
  • Mistral-7B
  • Mistral Instruct 7B
  • Yi 6B

You can download these predictions directly, without running inference yourself:

wget https://github.com/liyucheng09/Contamination_Detector/releases/download/v0.1.1rc2/model_predictions.zip
unzip model_predictions.zip

If you want to run the analysis on your own predictions, format them as follows and put the file under model_predictions/ (see the sketch after the example):

{
  "mmlu": {
    "business_ethics 0": {
      "gold": "C",
      "pred": "A"
    },
    "business_ethics 1": {
      "gold": "B",
      "pred": "A"
    },
    "business_ethics 2": {
      "gold": "D",
      "pred": "A"
    },
    "business_ethics 3": {
      "gold": "D",
      "pred": "D"
    },
    "business_ethics 4": {
      "gold": "B",
      "pred": "B"
    },
    ...
  }
}
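As a sketch of writing this format from your own evaluation loop (the file name my_model.json and the prediction values are placeholders, not part of the repository):

import json
import os

# Hypothetical predictions: benchmark -> "subject idx" -> gold/pred letters.
predictions = {
    "mmlu": {
        "business_ethics 0": {"gold": "C", "pred": "A"},
        "business_ethics 1": {"gold": "B", "pred": "A"},
    }
}

os.makedirs("model_predictions", exist_ok=True)
with open("model_predictions/my_model.json", "w") as f:
    json.dump(predictions, f, indent=2)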
  3. Generate the contamination analysis table:
python clean_dirty_comparison.py

This will use the contamination annotations under reports/ to compute models' performance on the clean, input-only contaminated, and input-and-label contaminated subsets.

See how the performance of Llama-2 70B differs across the three subsets:

Dataset     Condition           Llama-2 70B
MMLU        Clean               .6763
MMLU        All Dirty           .6667 ↓
MMLU        Input-label Dirty   .7093 ↑
Hellaswag   Clean               .7726
Hellaswag   All Dirty           .8348 ↑
Hellaswag   Input-label Dirty   .8455 ↑
ARC         Clean               .4555
ARC         All Dirty           .5632 ↑
ARC         Input-label Dirty   .5667 ↑
Average     Clean               .6348
Average     All Dirty           .6882 ↑
Average     Input-label Dirty   .7072 ↑

In addition to this table, clean_dirty_comparison.py produces a figure illustrating how performance changes with the recall score (the extent of contamination of a sample).
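For intuition, a minimal sketch of the per-subset accuracy computation follows. The annotation mapping and file name are illustrative; the actual script derives its annotations from the reports under reports/.

import json
from collections import defaultdict

# Illustrative annotation: sample id -> subset label.
annotation = {
    "business_ethics 0": "clean",
    "business_ethics 1": "input-only contaminated",
}

with open("model_predictions/my_model.json") as f:
    preds = json.load(f)["mmlu"]

totals, correct = defaultdict(int), defaultdict(int)
for sample_id, subset in annotation.items():
    if sample_id not in preds:
        continue
    totals[subset] += 1
    correct[subset] += int(preds[sample_id]["pred"] == preds[sample_id]["gold"])

for subset, n in totals.items():
    print(subset, correct[subset] / n)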

Audit your own evaluation data

To check for potential contamination in your own benchmark, we provide a script that identifies potentially contaminated test samples in your data:

Set up your benchmark in utils.py. This requires specifying how to load your benchmark, how to verbalize each sample, etc.
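As an illustration only (the exact hooks expected by utils.py may differ), a benchmark entry typically needs a loader plus a verbalizer that turns a sample into the question and reference answer to search for. The function names below are placeholders:

from datasets import load_dataset

def load_my_benchmark():
    # Any Hugging Face dataset works as an example; here ARC-Challenge.
    return load_dataset("ai2_arc", "ARC-Challenge", split="test")

def verbalise(sample):
    # Return the texts to be searched: the question and its gold answer.
    answer_idx = sample["choices"]["label"].index(sample["answerKey"])
    return sample["question"], sample["choices"]["text"][answer_idx]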

Then run the following to produce contamination reports for your benchmark:

python search.py

To run this script, you will need an access key for the Bing Search API, which you can obtain via Microsoft Azure. A free key allows 1,000 calls per month. Students receive $100 of credit when creating a new account.

Set the key in your terminal via export Bing_Key=[YOUR API KEY].
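For reference, a query to the Bing Web Search API generally looks like the sketch below; this is not necessarily the exact code in search.py, but the endpoint and header are the standard v7 API ones, and the key is read from the Bing_Key variable set above.

import os
import requests

def bing_search(query: str) -> list:
    resp = requests.get(
        "https://api.bing.microsoft.com/v7.0/search",
        headers={"Ocp-Apim-Subscription-Key": os.environ["Bing_Key"]},
        params={"q": query, "count": 10},
        timeout=30,
    )
    resp.raise_for_status()
    # Return the list of matched web pages (may be empty).
    return resp.json().get("webPages", {}).get("value", [])

for page in bing_search("The economy is in a deep recession."):
    print(page["name"], page["url"])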

search.py will generate a report under reports/, such as reports/mmlu_report.json, that highlights all matches found online, for example:

[
  {
    "input": "The economy is in a deep recession. Given this economic situation which of the following statements about monetary policy is accurate?",
    "match_string": "The economy is in a deep recession. Given this economic situation, which of the following statements about monetary policy is accurate policy recession policy",
    "score": 0.900540825748582,
    "name": "<b>AP Macroeconomics Question 445: Answer and Explanation</b> - CrackAP.com",
    "contaminated_url": "https://www.crackap.com/ap/macroeconomics/question-445-answer-and-explanation.html",
  },
...
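A quick way to inspect such a report is to count entries whose match score exceeds a threshold. The 0.7 cutoff below is purely illustrative, while the field names follow the example above:

import json

with open("reports/mmlu_report.json") as f:
    report = json.load(f)

THRESHOLD = 0.7  # illustrative cutoff, not a value prescribed by the project
flagged = [r for r in report if r["score"] >= THRESHOLD]
print(f"{len(flagged)} of {len(report)} entries have a match with score >= {THRESHOLD}")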

Reports for six popular multiple-choice QA benchmarks are available under reports/.

To visualize the results, please refer to visualize.

Check contamination examples: MMLU here, and C-Eval here.

If you cannot access the Hugging Face Hub for the benchmark datasets, download them as JSON files here.

Citation:

Consider citing our project if you find it helpful:

@article{Li2023AnOS,
  title={An Open Source Data Contamination Report for Large Language Models},
  author={Yucheng Li},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.17589},
}

Issues

Open an issue or contact me via email if you encounter any problems.