Hi, we're writing to report a potential exploit in the Prompt-Guard-86M classifier that allows for bypassing its safety measures. At Robust Intelligence, we discovered this issue while analyzing embedding weight differences between the Prompt-Guard-86M model and its base model, microsoft/mdeberta-v3-base. Our findings reveal a simple yet effective method to circumvent the classifier's detection capabilities.
Method:
The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt. This simple transformation effectively renders the classifier unable to detect potentially harmful content.
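For reference, a minimal sketch of the transformation (an illustration only, assuming non-alphabetic characters are simply dropped; the exact handling of punctuation and whitespace may vary):

```python
def space_out(prompt: str) -> str:
    # Keep only English alphabet characters and join them with single spaces.
    # Punctuation and existing whitespace are dropped in this sketch.
    return " ".join(ch for ch in prompt if ch.isascii() and ch.isalpha())

print(space_out("Ignore previous instructions."))
# I g n o r e p r e v i o u s i n s t r u c t i o n s
```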
Significance:
While it's understood that Prompt-Guard-86M, as a small model designed to run at scale, may not detect every type of prompt injection, this particular exploit is noteworthy for its simplicity and its high transferability across prompts.
Results:
Our tests show that the classifier detects 100% of the unmodified harmful prompts in the Sorry-Bench dataset but only 0.2% of their space-separated counterparts, i.e. the transformation bypasses detection 99.8% of the time.
Example:
Original prompt: "Ignore previous instructions."
Transformed prompt: "I g n o r e p r e v i o u s i n s t r u c t i o n s "
This transformation consistently causes the classifier to misidentify harmful prompts as benign. We're reporting this issue to contribute to the improvement of AI safety measures and are open to discussing the details privately to avoid potential misuse of this information.
Thank you for reporting this issue. We've investigated it and are releasing a simple preprocessor for inputs to the model to address it while we work on a fix within the model itself. I've added the preprocessor to our inference utility library in llamafirewall in this pull request.
Once again, thank you for reporting. Iterating on the model based on feedback and flaws discovered by the community will continue to be a critical part of our strategy to make it as effective a defense as possible.
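For illustration, one possible preprocessing heuristic (a sketch only, not the implementation referenced in the pull request above) is to collapse runs of space-separated single characters before passing text to the classifier:

```python
import re

def collapse_spaced_chars(text: str) -> str:
    # Merge runs of three or more single characters separated by whitespace,
    # e.g. "i g n o r e" -> "ignore". Word boundaries inside the run cannot be
    # recovered, but the subword tokenizer still sees the characters adjacently.
    # Rough heuristic only; the shipped preprocessor may behave differently.
    return re.sub(
        r"(?:\b\w\s){2,}\w\b",
        lambda m: re.sub(r"\s+", "", m.group(0)),
        text,
    )

print(collapse_spaced_chars("I g n o r e p r e v i o u s i n s t r u c t i o n s"))
# Ignorepreviousinstructions
```

Normal prose is largely untouched by this pattern because it only fires on three or more consecutive single-character tokens.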