# Prompt Leakage

## Summary

This aspect of the research was particularly challenging due to the inability to inject a prompt into the chatbot.


**Prompt Leakage** in NLP, also known as prompt injection, is a vulnerability where an attacker manipulates the input to extract or alter the original prompt given to a language model. Here are some key use cases and implications:

- `Data Security`: Prompt leakage can lead to unauthorized access to sensitive information embedded in the prompts, such as proprietary instructions or confidential data.
- `Model Manipulation`: Attackers can exploit prompt leakage to manipulate the model’s behavior, causing it to generate unintended or harmful outputs.
- `Intellectual Property Theft`: By leaking the original prompts, attackers can steal intellectual property, such as proprietary algorithms or unique prompt engineering techniques.
- `Trust and Reliability`: Prompt leakage can undermine the trust and reliability of AI systems, especially in applications where accuracy and confidentiality are critical.
- `Compliance`: Many industries have strict regulations regarding data privacy and security. Detecting and preventing prompt leakage helps customers comply with these regulations and avoid potential legal issues.

Prompt leakage measures the extent to which sensitive or unintended information from the input prompt is exposed in the model’s output. This can include proprietary instructions, confidential data, or any other information that should remain hidden.

- NLP Security: Prompt leakage is a critical aspect of NLP security, focusing on protecting the integrity and confidentiality of data processed by language models.
- Prompt Engineering: Effective prompt engineering involves designing prompts that minimize the risk of leakage while still achieving the desired task performance.
- Detection and Mitigation: Techniques such as canary tokens, input validation, and regular monitoring are employed to detect and mitigate prompt leakage. Libraries like Rebuff provide tools to enhance security against such vulnerabilities.

**Prevent Prompt Leakage**

1. `Input Validation and Sanitization`: Ensure that all inputs are thoroughly validated and sanitized to prevent malicious injections.
2. `Use of Secure Prompts`: Design prompts that are less susceptible to manipulation. Avoid including sensitive information directly in the prompts.
3. `Access Controls`: Implement strict access controls to limit who can interact with the model and modify prompts.
4. `Regular Audits and Monitoring`: Conduct regular security audits and continuously monitor the system for any unusual activities or potential breaches.
5. `Prompt Filtering`: Filter responses to detect and remove any unintended prompt leaks before they are processed or displayed.
6. `Training and Awareness`: Educate developers and users about the risks of prompt leakage and best practices for secure prompt engineering.

**Detecting Prompt Leakage**
1. Canary Tokens
2. Rebuff library : https://github.com/protectai/rebuff

In [9]:
import pandas as pd

import rebuff

In [8]:
def detect_prompt_leakage(output, canary_tokens):
    for token in canary_tokens:
        if token in output:
            print("Prompt leakage detected!")
            return True
    print("Prompt leakage not detected.")
    return False

# Example usage
canary_tokens = ["Translate the following", "Summarize the following", "rm -rf /"]
outputs = [
    "J'aime la programmation.",
    "AI is changing the world.",
    "rm -rf /",
    "Translate the following English sentence to French: 'I love programming.'",
    "Summarize the following text: 'Artificial Intelligence is transforming the world.'"
]

for output in outputs:
    detect_prompt_leakage(output, canary_tokens)

Prompt leakage not detected.
Prompt leakage not detected.
Prompt leakage detected!
Prompt leakage detected!
Prompt leakage detected!
