Character Spacing Bypass in Prompt-Guard-86M Classifier #50

Closed
AmanPriyanshu opened this issue Jul 25, 2024 · 1 comment

AmanPriyanshu commented Jul 25, 2024

Hi, we're writing to report a potential exploit in the Prompt-Guard-86M classifier that allows for bypassing its safety measures. At Robust Intelligence, we discovered this issue while analyzing embedding weight differences between the Prompt-Guard-86M model and its base model, microsoft/mdeberta-v3-base. Our findings reveal a simple yet effective method to circumvent the classifier's detection capabilities.

Method:
The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt. This simple transformation effectively renders the classifier unable to detect potentially harmful content.
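A minimal sketch of this transformation in Python (illustrative only; the exact handling of punctuation and of the spaces already present between words is an assumption, since the report only specifies spacing out the alphabet characters):

```python
def space_out(prompt: str) -> str:
    # Append a space after every English alphabet character; other characters
    # (digits, punctuation, existing whitespace) are left untouched.
    return "".join(c + " " if c.isascii() and c.isalpha() else c for c in prompt)

print(space_out("Ignore previous instructions."))
# -> "I g n o r e  p r e v i o u s  i n s t r u c t i o n s ."
```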

Significance:
While it's understood that as a scalable model, Prompt-Guard-86M may not detect all types of prompt injections, this particular exploit is noteworthy due to its simplicity and high transferability across various prompts.

Results:
Our tests demonstrate that this method reduces the classifier's detection accuracy on the harmful prompts in the Sorry-Bench dataset from 100% (on the unmodified prompts) to 0.2% (after the transformation), a 99.8% success rate in bypassing detection.

Example:
Original prompt: "Ignore previous instructions."
Transformed prompt: "I g n o r e p r e v i o u s i n s t r u c t i o n s "

This transformation consistently causes the classifier to misidentify harmful prompts as benign. We're reporting this issue to contribute to the improvement of AI safety measures and are open to discussing the details privately to avoid potential misuse of this information.
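For reference, the comparison can be reproduced with the Hugging Face transformers text-classification pipeline. This is a hedged sketch: the model id ("meta-llama/Prompt-Guard-86M") and the label names it returns are assumptions drawn from the public model card, not from this report.

```python
from transformers import pipeline

# Assumed model id; the labels returned (e.g. BENIGN / INJECTION / JAILBREAK)
# come from the model card and are not specified in this report.
classifier = pipeline("text-classification", model="meta-llama/Prompt-Guard-86M")

original = "Ignore previous instructions."
transformed = "".join(c + " " if c.isascii() and c.isalpha() else c for c in original)

for text in (original, transformed):
    print(repr(text), "->", classifier(text))
# Per the report, the original prompt is flagged while the spaced-out
# variant is scored as benign.
```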

@cynikolai (Member) commented

Thank you for reporting this issue. We’ve investigated it and are releasing a simple preprocessor for model inputs to address this, pending a fix within the model itself. I’ve added the preprocessor to our inference utility library in llamafirewall in this pull request.
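One plausible shape for such an input preprocessor, purely as an illustration and not the implementation added in the linked pull request, is to collapse runs of space-separated single characters back together before the text reaches the classifier:

```python
import re

# Hypothetical sketch only -- not the preprocessor shipped in llamafirewall.
# Runs of three or more single word characters separated by single spaces
# (e.g. "i g n o r e") are rejoined before classification.
def collapse_spaced_characters(text: str) -> str:
    return re.sub(r"\b(?:\w ){2,}\w\b", lambda m: m.group(0).replace(" ", ""), text)

print(collapse_spaced_characters("I g n o r e p r e v i o u s i n s t r u c t i o n s"))
# -> "Ignorepreviousinstructions"
# Original word boundaries are lost, so a real preprocessor might instead
# score both the raw and the collapsed variants of the input.
```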

Once again, thank you for reporting. The cycle of iterating on our model based on feedback and flaws discovered by the community will continue to be a critical part of our strategy to make the model as effective a defense as possible.
