An open-source library for contamination detection in NLP datasets and Large Language Models (LLMs).
The library has been designed and tested with Python 3.9 and CUDA 11.8.
First make sure you have CUDA 11.8 installed, and create a conda environment with Python 3.9:
conda create -n llmsanitize python=3.9
Next activate the environment:
conda activate llmsanitize
Then install all the dependencies for LLMSanitize:
pip install -r requirements.txt
Alternatively, you can combine the three steps above by just running:
sh scripts/install.sh
Notably, we use the following important libraries:
- datasets 2.17.1
- einops 0.7.0
- huggingface-hub 0.20.3
- openai 0.27.8
- torch 2.1.2
- transformers 4.38.0
- vllm 0.3.3
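If you need to pin these versions yourself (for example, to reproduce the environment outside of the provided requirements.txt), the corresponding pins would look like the following; note that the repository's actual requirements.txt may include additional dependencies:

```
datasets==2.17.1
einops==0.7.0
huggingface-hub==0.20.3
openai==0.27.8
torch==2.1.2
transformers==4.38.0
vllm==0.3.3
```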
So far we support the following contamination detection methods:
Method | Use Case | Short description | White-box access? | Reference |
---|---|---|---|---|
gpt-2 | data contamination | String matching | n/a | paper |
gpt-3 | data contamination | String matching | n/a | paper |
exact | data contamination | String matching | n/a | paper |
palm | data contamination | String matching | n/a | paper |
gpt-4 | data contamination | String matching | n/a | paper |
platypus | data contamination | Embeddings similarity | n/a | paper |
guided-prompting | model contamination | LLM-based method | no | paper |
sharded-likelihood | model contamination | Likelihood | yes | paper |
min-prob | model contamination | Likelihood | no | paper |
cdd | model contamination | Likelihood | no | paper |
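For intuition on how a likelihood-based model-contamination test works, below is a minimal, self-contained sketch of the Min-K% Prob idea (Shi et al., 2023) on which min-prob is based. It is purely illustrative and does not reflect LLMSanitize's own API; the model (gpt2), the test sentence, and k=20% are arbitrary choices for the example:

```python
# Illustrative sketch of Min-K% Prob (Shi et al., 2023) -- NOT LLMSanitize's API.
# Assumptions: gpt2 and k=0.2 are arbitrary choices for this example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Average log-probability of the k% least-likely tokens in `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # (1, seq_len, vocab_size)
    # Log-probability the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    bottom_k = max(1, int(k * token_lp.numel()))
    # Mean over the bottom-k% tokens.
    return token_lp.topk(bottom_k, largest=False).values.mean().item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
print(min_k_prob("The quick brown fox jumps over the lazy dog.", model, tokenizer))
```

The intuition: a model that saw a text during training assigns unusually high probability even to that text's least-likely tokens, so a higher (less negative) score is evidence of contamination.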
The following methods require launching a vLLM instance, which will handle model inference:

- guided-prompting
- min-prob
- cdd
To launch the instance, first run the following command in a terminal:
sh scripts/vllm_hosting.sh
You are required to specify a port number and model name in this shell script.
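Once the script is running, you can sanity-check the endpoint before starting any detection run. The snippet below is a sketch, assuming the vLLM server exposes its OpenAI-compatible API on localhost at the port you configured (8000 here); it uses the pinned openai 0.27.8 client, and the model name placeholder must match the one you set in scripts/vllm_hosting.sh:

```python
# Sanity check for the vLLM instance (a sketch; port and model name are assumptions).
import openai

openai.api_key = "EMPTY"                      # vLLM does not validate the key
openai.api_base = "http://localhost:8000/v1"  # replace 8000 with your port number

response = openai.Completion.create(
    model="<model_name_from_vllm_hosting.sh>",  # must match the hosted model
    prompt="Hello, my name is",
    max_tokens=16,
)
print(response["choices"][0]["text"])
```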
To run contamination detection, use the test scripts in the scripts/tests/ folder.
For instance, to run sharded-likelihood on Hellaswag with Llama-2-7B:
sh scripts/tests/model/sharded-likelihood/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder>
To run a vLLM-based method such as guided-prompting, the only difference is that you also pass the port number of your vLLM instance as an argument:
sh scripts/tests/model/guided-prompting/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder> -p <port_number_from_your_vllm_instance>
If you find our paper or this project helpful to your research, please consider citing our paper in your publication:
@article{ravaut2024much,
  title={How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library},
  author={Ravaut, Mathieu and Ding, Bosheng and Jiao, Fangkai and Chen, Hailin and Li, Xingxuan and Zhao, Ruochen and Qin, Chengwei and Xiong, Caiming and Joty, Shafiq},
  journal={arXiv preprint arXiv:2404.00699},
  year={2024}
}