An open-source library for contamination detection in NLP datasets and Large Language Models (LLMs).
The library has been designed and tested with Python 3.9 and CUDA 11.8.
First make sure you have CUDA 11.8 installed, and create a conda environment with Python 3.9:
conda create -n llmsanitize python=3.9
Next activate the environment:
conda activate llmsanitize
Then install all the dependencies for LLMSanitize:
pip install -r requirements.txt
Alternatively, you can combine the three steps above by just running:
sh scripts/install.sh
Notably, we use the following important libraries:
- datasets 2.17.1
- einops 0.7.0
- huggingface-hub 0.20.3
- openai 0.27.8
- torch 2.1.2
- transformers 4.38.0
- vllm 0.3.3
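If you need to pin these versions yourself (for example, to reproduce the environment outside of the provided requirements.txt), the corresponding pins would look like the following; note that the repository's actual requirements.txt may include additional dependencies:

```
datasets==2.17.1
einops==0.7.0
huggingface-hub==0.20.3
openai==0.27.8
torch==2.1.2
transformers==4.38.0
vllm==0.3.3
```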
So far we support the following contamination detection methods:
Method | Use Case | Short description | White-box access? | Reference |
---|---|---|---|---|
gpt-2 | data contamination | String matching | n/a | paper |
gpt-3 | data contamination | String matching | n/a | paper |
exact | data contamination | String matching | n/a | paper |
palm | data contamination | String matching | n/a | paper |
gpt-4 | data contamination | String matching | n/a | paper |
platypus | data contamination | Embeddings similarity | n/a | paper |
guided-prompting | model contamination | LLM-based method | no | paper |
sharded-likelihood | model contamination | Likelihood | yes | paper |
min-prob | model contamination | Likelihood | no | paper |
cdd | model contamination | Likelihood | no | paper |
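For intuition on how a likelihood-based model-contamination test works, below is a minimal, self-contained sketch of the Min-K% Prob idea (Shi et al., 2023) on which min-prob is based. It is purely illustrative and does not reflect LLMSanitize's own API; the model (gpt2), the test sentence, and k=20% are arbitrary choices for the example:

```python
# Illustrative sketch of Min-K% Prob (Shi et al., 2023) -- NOT LLMSanitize's API.
# Assumptions: gpt2 and k=0.2 are arbitrary choices for this example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Average log-probability of the k% least-likely tokens in `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # (1, seq_len, vocab_size)
    # Log-probability the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    bottom_k = max(1, int(k * token_lp.numel()))
    # Mean over the bottom-k% tokens.
    return token_lp.topk(bottom_k, largest=False).values.mean().item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
print(min_k_prob("The quick brown fox jumps over the lazy dog.", model, tokenizer))
```

The intuition: a model that saw a text during training assigns unusually high probability even to that text's least-likely tokens, so a higher (less negative) score is evidence of contamination.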
The following methods require launching a vLLM instance, which will handle model inference:

- guided-prompting
- min-prob
- cdd
To launch the instance, first run the following command in a terminal:
sh scripts/vllm_hosting.sh
You are required to specify a port number and model name in this shell script.
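Once the script is running, you can sanity-check the endpoint before starting any detection run. The snippet below is a sketch, assuming the vLLM server exposes its OpenAI-compatible API on localhost at the port you configured (8000 here); it uses the pinned openai 0.27.8 client, and the model name placeholder must match the one you set in scripts/vllm_hosting.sh:

```python
# Sanity check for the vLLM instance (a sketch; port and model name are assumptions).
import openai

openai.api_key = "EMPTY"                      # vLLM does not validate the key
openai.api_base = "http://localhost:8000/v1"  # replace 8000 with your port number

response = openai.Completion.create(
    model="<model_name_from_vllm_hosting.sh>",  # must match the hosted model
    prompt="Hello, my name is",
    max_tokens=16,
)
print(response["choices"][0]["text"])
```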
To run contamination detection, use the test scripts in the scripts/tests/ folder.
For instance, to run sharded-likelihood on Hellaswag with Llama-2-7B:
sh scripts/tests/model/sharded-likelihood/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder>
To run a vLLM-based method such as guided-prompting, the only difference is that you also pass the port number of your vLLM instance as an argument:
sh scripts/tests/model/guided-prompting/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder> -p <port_number_from_your_vllm_instance>
If you find our paper or this project helpful to your research, please consider citing our paper in your publication:
@article{ravaut2024much,
  title={How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library},
  author={Ravaut, Mathieu and Ding, Bosheng and Jiao, Fangkai and Chen, Hailin and Li, Xingxuan and Zhao, Ruochen and Qin, Chengwei and Xiong, Caiming and Joty, Shafiq},
  journal={arXiv preprint arXiv:2404.00699},
  year={2024}
}