#### Prompt Leakage

**Prompt Leakage** in NLP, also known as prompt injection, is a vulnerability where an attacker manipulates the input to extract or alter the original prompt given to a language model. Here are some key use cases and implications:

- `Data Security`: Prompt leakage can lead to unauthorized access to sensitive information embedded in the prompts, such as proprietary instructions or confidential data.
- `Model Manipulation`: Attackers can exploit prompt leakage to manipulate the model’s behavior, causing it to generate unintended or harmful outputs.
- `Intellectual Property Theft`: By leaking the original prompts, attackers can steal intellectual property, such as proprietary algorithms or unique prompt engineering techniques.
- `Trust and Reliability`: Prompt leakage can undermine the trust and reliability of AI systems, especially in applications where accuracy and confidentiality are critical.
- `Compliance`: Many industries have strict regulations regarding data privacy and security. Detecting and preventing prompt leakage helps customers comply with these regulations and avoid potential legal issues.

Prompt leakage measures the extent to which sensitive or unintended information from the input prompt is exposed in the model’s output. This can include proprietary instructions, confidential data, or any other information that should remain hidden.

- NLP Security: Prompt leakage is a critical aspect of NLP security, focusing on protecting the integrity and confidentiality of data processed by language models.
- Prompt Engineering: Effective prompt engineering involves designing prompts that minimize the risk of leakage while still achieving the desired task performance.
- Detection and Mitigation: Techniques such as canary tokens, input validation, and regular monitoring are employed to detect and mitigate prompt leakage. Libraries like Rebuff provide tools to enhance security against such vulnerabilities.

**Prevent Prompt Leakage**

1. `Input Validation and Sanitization`: Ensure that all inputs are thoroughly validated and sanitized to prevent malicious injections.
2. `Use of Secure Prompts`: Design prompts that are less susceptible to manipulation. Avoid including sensitive information directly in the prompts.
3. `Access Controls`: Implement strict access controls to limit who can interact with the model and modify prompts.
4. `Regular Audits and Monitoring`: Conduct regular security audits and continuously monitor the system for any unusual activities or potential breaches.
5. `Prompt Filtering`: Filter responses to detect and remove any unintended prompt leaks before they are processed or displayed.
6. `Training and Awareness`: Educate developers and users about the risks of prompt leakage and best practices for secure prompt engineering.

**Detecting Prompt Leakage**
1. Canary Tokens
2. Rebuff library

In [13]:
import pandas as pd

In [2]:
df = pd.read_csv(r"C:\Users\nene0\Desktop\Projects\greenflash\chat_data.csv", encoding_errors='ignore')

df.head()

Unnamed: 0,Chat_ID,Message_ID,Sender,Message
0,data_science_trend,0,user,What is the latest trend in data science?
1,data_science_trend,1,copilot,"Data science is evolving rapidly, and several ..."
2,data_science_trend,2,user,Can you tell me more about generative AI?
3,data_science_trend,3,copilot,Generative AI is a fascinating and rapidly evo...
4,data_science_trend,4,user,can you explain more about how the generative ...


In [3]:
df['Chat_ID'].unique()

array(['data_science_trend', 'food_history_companies', 'gaming',
       'greek_myth', 'job_market', 'jokes', 'music_kpop', 'pets',
       'philoshophy', 'rich_poor_countries',
       'tech_product_recommendation', 'travel', 'largest_adj_product',
       'jarritos_flavors', 'reason_for_sleepiness'], dtype=object)

In [5]:
kpop = df[df['Chat_ID']=='music_kpop']

kpop.head()

Unnamed: 0,Chat_ID,Message_ID,Sender,Message
210,music_kpop,0,user,What is your favorite music genre?
211,music_kpop,1,copilot,"I don't have personal preferences, but I can t..."
212,music_kpop,2,user,I like k-pop. What is the most popular kpop mu...
213,music_kpop,3,copilot,"Right now, some of the most popular K-pop song..."
214,music_kpop,4,user,I like aespa! I think there style very trendy ...


In [7]:
kpop.iloc[2]['Message']

'I like k-pop. What is the most popular kpop music right now?'

In [12]:
def check_prompt_leakage(prompt, response):
    prompt_words = set(prompt.split())
    response_words = set(response.split())
    leakage = prompt_words.intersection(response_words)
    return leakage

prompt = kpop.iloc[10]['Message']
response = kpop.iloc[11]['Message']

leakage = check_prompt_leakage(prompt, response)
print("Prompt Leakage:", leakage)

Prompt Leakage: {'127', 'NCT', 'Ten', 'of', 'member', 'Doyoung', 'favorite', 'all', 'the', 'are', 'and'}


In [18]:
def detect_prompt_leakage(output, canary_token="kpop"):
    if canary_token in output:
        print("Prompt leakage detected!")
    else:
        print("No prompt leakage detected.")

In [21]:
detect_prompt_leakage(kpop.iloc[2]['Message'])

Prompt leakage detected!


In [None]:
from rebuff import Rebuff

rebuff = Rebuff()
output = "This is a test output containing CANARY_TOKEN."
if rebuff.detect_leakage(output):
    print("Prompt leakage detected!")
else:
    print("No prompt leakage detected.")