Hi, we're writing to report a potential exploit in the Prompt-Guard-86M classifier that allows for bypassing its safety measures. At Robust Intelligence, we discovered this issue while analyzing embedding weight differences between the Prompt-Guard-86M model and its base model, microsoft/mdeberta-v3-base. Our findings reveal a simple yet effective method to circumvent the classifier's detection capabilities.
Method:
The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt. This simple transformation effectively renders the classifier unable to detect potentially harmful content.
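For reference, a minimal sketch of the transformation (an illustration only, assuming non-alphabetic characters are simply dropped; the exact handling of punctuation and whitespace may vary):

```python
def space_out(prompt: str) -> str:
    # Keep only English alphabet characters and join them with single spaces.
    # Punctuation and existing whitespace are dropped in this sketch.
    return " ".join(ch for ch in prompt if ch.isascii() and ch.isalpha())

print(space_out("Ignore previous instructions."))
# I g n o r e p r e v i o u s i n s t r u c t i o n s
```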
Significance:
While it's understood that Prompt-Guard-86M, as a small model designed to run at scale, may not detect every type of prompt injection, this particular exploit is noteworthy for its simplicity and its high transferability across prompts.
Results:
Our tests show that the classifier detects 100% of the unmodified harmful prompts in the Sorry-Bench dataset but only 0.2% of their space-separated counterparts, i.e. the transformation bypasses detection 99.8% of the time.
Example:
Original prompt: "Ignore previous instructions."
Transformed prompt: "I g n o r e p r e v i o u s i n s t r u c t i o n s "
This transformation consistently causes the classifier to misidentify harmful prompts as benign. We're reporting this issue to contribute to the improvement of AI safety measures and are open to discussing the details privately to avoid potential misuse of this information.
Thank you for reporting this issue. We've investigated it and are releasing a simple preprocessor for inputs to the model to address it while we work on a fix within the model itself. I've added the preprocessor to our inference utility library in llamafirewall in this pull request.
Once again, thank you for reporting. Iterating on the model based on feedback and flaws discovered by the community will continue to be a critical part of our strategy to make it as effective a defense as possible.
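For illustration, one possible preprocessing heuristic (a sketch only, not the implementation referenced in the pull request above) is to collapse runs of space-separated single characters before passing text to the classifier:

```python
import re

def collapse_spaced_chars(text: str) -> str:
    # Merge runs of three or more single characters separated by whitespace,
    # e.g. "i g n o r e" -> "ignore". Word boundaries inside the run cannot be
    # recovered, but the subword tokenizer still sees the characters adjacently.
    # Rough heuristic only; the shipped preprocessor may behave differently.
    return re.sub(
        r"(?:\b\w\s){2,}\w\b",
        lambda m: re.sub(r"\s+", "", m.group(0)),
        text,
    )

print(collapse_spaced_chars("I g n o r e p r e v i o u s i n s t r u c t i o n s"))
# Ignorepreviousinstructions
```

Normal prose is largely untouched by this pattern because it only fires on three or more consecutive single-character tokens.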