MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance

This repository contains the code for the paper "MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance" (arXiv:2401.02906).

Install Packages


conda create -n mllm_protector python=3.10 -y

conda activate mllm_protector

pip install -e .

Download pretrained LLM

Obtain the weights for open-llama-3b from here

Download checkpoints for harm detector and detoxifier

Obtain the LoRA checkpoint for the harm detector with open-llama-3b from here

Obtain the LoRA checkpoint for the harm detector with llama2-7b from here

Obtain the LoRA checkpoint for the detoxifier from here

You may use the harm detector to check whether the responses generated by the MLLM are harmful. It can also serve as a proxy for GPT-4 API calls.
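As a rough illustration of using the detector as an automatic judge, the sketch below computes the fraction of responses flagged as harmful. The names `harmful_rate` and `toy_harm_score`, the 0.5 threshold, and the convention that a higher score means more harmful are all assumptions made for this example; they are not taken from this repository, and the toy scorer merely stands in for the merged harm detector.

```python
from typing import Callable, List, Tuple


def harmful_rate(
    responses: List[str],
    harm_score: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cut-off, tune for your detector
) -> Tuple[float, List[bool]]:
    """Return (fraction flagged harmful, per-response flags).

    Assumes harm_score returns a value in [0, 1] where higher means
    more harmful. Check the convention of the detector you actually use.
    """
    if not responses:
        return 0.0, []
    flags = [harm_score(r) >= threshold for r in responses]
    return sum(flags) / len(flags), flags


def toy_harm_score(text: str) -> float:
    """Keyword-based stand-in for the real harm detector (illustration only)."""
    return 1.0 if "explosive" in text.lower() else 0.0


if __name__ == "__main__":
    sample = [
        "Here is how to build an explosive device.",
        "Here is a recipe for pancakes.",
    ]
    rate, flags = harmful_rate(sample, toy_harm_score)
    print(rate, flags)
```

In practice, `harm_score` would wrap a forward pass of the harm detector; the aggregation logic stays the same.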

Merge LoRA

python scripts/merge_peft_adapter.py --base_model_name path-to-llama_3b_v2 --adapter_model_name path-to-lora --output_name path-to-merged-model
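For intuition about what merging an adapter does, the sketch below applies the standard LoRA update W' = W + (alpha / r) * B A to a tiny weight matrix in pure Python. It illustrates the general LoRA formulation only; it is not the code in `scripts/merge_peft_adapter.py`, and the helper names `matmul` and `merge_lora` are made up for this example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Fold a low-rank update into the base weight.

    weight: (out, in), lora_a: (rank, in), lora_b: (out, rank).
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, in) low-rank update
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    weight = [[1.0, 0.0], [0.0, 1.0]]
    lora_a = [[1.0, 2.0]]    # rank 1, input dim 2
    lora_b = [[0.5], [1.0]]  # output dim 2, rank 1
    print(merge_lora(weight, lora_a, lora_b, alpha=2.0, rank=1))
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be loaded like an ordinary checkpoint.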

Download augmented training data

You may obtain the augmented dataset from here

Prepare evaluation data

mkdir eval_polite

Prepare benchmark data from MM-SafetyBench.

Here is the data structure:

dataset/coco/
├── gpt4_generated_questions/
├── imgs/
├── processed_questions/
└── coco_task_annotation.json
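Before training, it may help to confirm that the layout above is in place. The following is a minimal sketch that reports which of the listed entries are missing; the function name `missing_entries` is hypothetical, and the check covers only the four entries shown in the tree, not their contents.

```python
from pathlib import Path
from typing import List, Union

# Entries taken from the directory tree above.
EXPECTED_DIRS = ("gpt4_generated_questions", "imgs", "processed_questions")
EXPECTED_FILES = ("coco_task_annotation.json",)


def missing_entries(root: Union[str, Path]) -> List[str]:
    """Return the expected directories/files that are absent under root."""
    root = Path(root)
    missing = [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing


if __name__ == "__main__":
    problems = missing_entries("dataset/coco")
    if problems:
        print("Missing under dataset/coco:", ", ".join(problems))
    else:
        print("dataset/coco layout looks complete.")
```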

Train Harm Detector

bash scripts/train_harm_detector.sh

Train Detoxifier

bash scripts/train_detoxifier.sh

Generate responses in parallel

bash llava/eval/eval_multi_safeguard.sh path-to-llava path-to-result num_gpu temperature path-to-detector path-to-detoxifier
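The script takes six positional arguments, so their order matters. The sketch below assembles the same command from named parameters to make the order explicit. The helper `build_eval_command` and the `num_gpu >= 1` check are assumptions added for illustration, and the values in the usage example are placeholders rather than recommended settings.

```python
import shlex
from typing import List


def build_eval_command(
    llava_path: str,
    result_path: str,
    num_gpu: int,
    temperature: float,
    detector_path: str,
    detoxifier_path: str,
) -> List[str]:
    """Build the argv list in the positional order shown in the command above."""
    if num_gpu < 1:
        raise ValueError("num_gpu must be at least 1")
    return [
        "bash",
        "llava/eval/eval_multi_safeguard.sh",
        str(llava_path),
        str(result_path),
        str(num_gpu),
        str(temperature),
        str(detector_path),
        str(detoxifier_path),
    ]


if __name__ == "__main__":
    cmd = build_eval_command(
        "path-to-llava", "path-to-result", 4, 0.0,
        "path-to-detector", "path-to-detoxifier",
    )
    # Print the command for inspection; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```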

Evaluation

We adopt the newly proposed MLLM jailbreak benchmark for evaluation. Please follow their instructions to set up the evaluation benchmark. Thanks for the great work!

Acknowledgement

The project is built on top of the amazing multimodal large language model LLaVA. Thanks for this great work!

If you find our work useful for your research or applications, please cite using this BibTeX:

@misc{pi2024mllmprotector,
      title={MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance}, 
      author={Renjie Pi and Tianyang Han and Yueqi Xie and Rui Pan and Qing Lian and Hanze Dong and Jipeng Zhang and Tong Zhang},
      year={2024},
      eprint={2401.02906},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}
