LLMSanitize

An open-source library for contamination detection in NLP datasets and Large Language Models (LLMs).

Installation

The library has been designed and tested with Python 3.9 and CUDA 11.8.

First, make sure you have CUDA 11.8 installed, then create a conda environment with Python 3.9:

conda create -n llmsanitize python=3.9

Next, activate the environment:

conda activate llmsanitize

Then install all the dependencies for LLMSanitize:

pip install -r requirements.txt

Alternatively, you can combine the three steps above by just running:

sh scripts/install.sh

Notably, we use the following important libraries:

  • datasets 2.17.1
  • einops 0.7.0
  • huggingface-hub 0.20.3
  • openai 0.27.8
  • torch 2.1.2
  • transformers 4.38.0
  • vllm 0.3.3

Supported Methods

So far we support the following contamination detection methods:

Method              Use Case             Short description      White-box access?  Reference
gpt-2               data contamination   String matching        _                  paper
gpt-3               data contamination   String matching        _                  paper
exact               data contamination   String matching        _                  paper
palm                data contamination   String matching        _                  paper
gpt-4               data contamination   String matching        _                  paper
platypus            data contamination   Embeddings similarity  _                  paper
guided-prompting    model contamination  Likelihood             yes                paper
sharded-likelihood  model contamination  Likelihood             yes                paper
min-prob            model contamination  LLM-based method       no                 paper
cdd                 model contamination  Likelihood             no                 paper
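
To give an intuition for the string-matching family above (the approach behind the gpt-2, gpt-3, exact, palm and gpt-4 methods), here is a minimal, hypothetical Python sketch of the underlying idea: an evaluation sample is flagged as potentially contaminated if it shares long token n-grams with a training document. This is only an illustration, not the library's actual implementation, and every name in it is a placeholder.

# Hypothetical sketch of n-gram string matching (not LLMSanitize code).
def ngrams(tokens, n):
    # All contiguous token n-grams of a sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_sample, training_docs, n=8):
    # Flag the sample if any of its word n-grams appears verbatim in a training document.
    sample_ngrams = ngrams(eval_sample.lower().split(), n)
    return any(sample_ngrams & ngrams(doc.lower().split(), n) for doc in training_docs)

print(is_contaminated(
    "the quick brown fox jumps over the lazy dog today",
    ["a training document containing the quick brown fox jumps over the lazy dog verbatim"],
))  # True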

vLLM

The following methods require launching a vLLM instance, which handles model inference:

  • guided-prompting
  • min-prob
  • cdd

To launch the instance, first run the following command in a terminal:

sh scripts/vllm_hosting.sh

You are required to specify a port number and model name in this shell script.
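
Once the instance is running, it exposes an OpenAI-compatible HTTP endpoint on the port you chose, which is the same port you later pass to the test scripts. As a quick sanity check, you can query it directly. The snippet below is only a hedged sketch: it assumes the server listens on port 8000 and serves meta-llama/Llama-2-7b-hf (replace both with the values set in scripts/vllm_hosting.sh), and it uses the openai 0.27-style client from the requirements.

import openai

openai.api_key = "EMPTY"                      # vLLM does not check the key unless one is configured
openai.api_base = "http://localhost:8000/v1"  # port chosen in scripts/vllm_hosting.sh

response = openai.Completion.create(
    model="meta-llama/Llama-2-7b-hf",         # model name passed to vLLM
    prompt="The capital of France is",
    max_tokens=16,
    temperature=0.0,
)
print(response["choices"][0]["text"])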

Run Contamination Detection

To run contamination detection, follow the test scripts in the scripts/tests/ folder.

For instance, to run sharded-likelihood on Hellaswag with Llama-2-7B:

sh scripts/tests/model/sharded-likelihood/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder> 

To run a method that uses vLLM, such as guided-prompting, the only difference is that you also pass the port number of your vLLM instance as an argument:

sh scripts/tests/model/guided-prompting/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder> -p <port_number_from_your_vllm_instance>

Citation

If you find our paper or this project helpful for your research, please consider citing our paper in your publication.

@article{ravaut2024much,
  title={How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library},
  author={Ravaut, Mathieu and Ding, Bosheng and Jiao, Fangkai and Chen, Hailin and Li, Xingxuan and Zhao, Ruochen and Qin, Chengwei and Xiong, Caiming and Joty, Shafiq},
  journal={arXiv preprint arXiv:2404.00699},
  year={2024}
}
