In [None]:
# download presidio
!pip install presidio_analyzer presidio_anonymizer
!pip install openai pandas python-dotenv
!python -m spacy download en_core_web_lg

###### Path to notebook: [https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/synth_data_with_openai.ipynb](https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/synth_data_with_openai.ipynb)

# Use Presidio + OpenAI to turn real text into fake text

This notebook uses Presidio to turn text with PII into text where PII entities are replaced with placeholders, e.g. "`My name is David`" turns into "`My name is {{PERSON}}`". Then, it calls the OpenAI API to create a fake record which is based on the original one.


Flow:
1. `My friend David lives in Paris. He likes it.`
1. `My friend {{PERSON}} lives in {{CITY}}. He likes it.`
1. `My friend Lucy lives in Beirut. She likes it.`
    
Note that OpenAI models may be able to detect PII values and replace them in a single call, but it is recommended to validate that all PII entities are indeed detected.

## Imports and set up OpenAI Key

In [3]:
import pprint
from dotenv import load_dotenv
import os
import pandas as pd
from openai import OpenAI

load_dotenv()

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)
# Alternatively, set the key explicitly in the notebook. Find out more here: https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key

## Define request for the OpenAI service

In [4]:
def call_completion_model(prompt: str, model: str = "gpt-3.5-turbo", max_tokens: int = 512) -> str:
    """Creates a request for the OpenAI chat completions service and returns the response.

    :param prompt: The prompt for the completion model
    :param model: OpenAI model name
    :param max_tokens: Model's max tokens parameter
    """

    completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model,
        max_tokens=max_tokens,  # upper bound on the number of generated tokens
    )

    return completion.choices[0].message.content

## De-identify data using Presidio Analyzer and Anonymizer

In [5]:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

sample = """
Hello, my name is David Johnson and I live in Maine.
My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.

On September 18 I visited microsoft.com and sent an email to test@presidio.site,  from the IP 192.168.0.1.

My passport: 191280342 and my phone number: (212) 555-1234.

This is a valid International Bank Account Number: IL150120690000003111111 . Can you please check the status on bank account 954567876544?

Kate's social security number is 078-05-1126.  Her driver license? it is 1234567A.
"""

results = analyzer.analyze(sample, language="en")
anonymized = anonymizer.anonymize(text=sample, analyzer_results=results)
anonymized_text = anonymized.text
print(anonymized_text)



Hello, my name is <PERSON> and I live in <LOCATION>.
My credit card number is <CREDIT_CARD> and my crypto wallet id is <CRYPTO>.

On <DATE_TIME> I visited <URL> and sent an email to <EMAIL_ADDRESS>,  from the IP <IP_ADDRESS>.

My passport: <US_PASSPORT> and my phone number: <PHONE_NUMBER>.

This is a valid International Bank Account Number: <IBAN_CODE> . Can you please check the status on bank account <US_BANK_NUMBER>?

<PERSON>'s social security number is <US_SSN>.  Her driver license? it is <US_DRIVER_LICENSE>.



## Create prompt (instructions + text to manipulate)

In [6]:
def create_prompt(anonymized_text: str) -> str:
    """
    Create the prompt with instructions to the OpenAI model.
    
    :param anonymized_text: Text with placeholders instead of PII values, e.g. My name is <PERSON>.
    """

    prompt = f"""
    Your role is to create synthetic text based on de-identified text with placeholders instead of Personally Identifiable Information (PII).
    Replace the placeholders (e.g. ,<PERSON>, {{DATE}}, {{ip_address}}) with fake values.
    Instructions:
    a. Use completely random numbers, so every digit is drawn between 0 and 9.
    b. Use realistic names that come from diverse genders, ethnicities and countries.
    c. If there are no placeholders, return the text as is.
    d. Keep the formatting as close to the original as possible.
    e. If PII exists in the input, replace it with fake values in the output.
    f. Remove whitespace before and after the generated text
    
    input: [[TEXT STARTS]] How do I change the limit on my credit card {{credit_card_number}}?[[TEXT ENDS]]
    output: How do I change the limit on my credit card 2539 3519 2345 1555?
    input: [[TEXT STARTS]]<PERSON> was the chief science officer at <ORGANIZATION>.[[TEXT ENDS]]
    output: Katherine Buckjov was the chief science officer at NASA.
    input: [[TEXT STARTS]]Cameroon lives in <LOCATION>.[[TEXT ENDS]]
    output: Vladimir lives in Moscow.
    
    input: [[TEXT STARTS]]{anonymized_text}[[TEXT ENDS]]
    output:"""
    return prompt

In [7]:
print("This is the prompt with de-identified values:")
print(create_prompt(anonymized_text))

This is the prompt with de-identified values:

    Your role is to create synthetic text based on de-identified text with placeholders instead of Personally Identifiable Information (PII).
    Replace the placeholders (e.g. ,<PERSON>, {DATE}, {ip_address}) with fake values.
    Instructions:
    a. Use completely random numbers, so every digit is drawn between 0 and 9.
    b. Use realistic names that come from diverse genders, ethnicities and countries.
    c. If there are no placeholders, return the text as is.
    d. Keep the formatting as close to the original as possible.
    e. If PII exists in the input, replace it with fake values in the output.
    f. Remove whitespace before and after the generated text
    
    input: [[TEXT STARTS]] How do I change the limit on my credit card {credit_card_number}?[[TEXT ENDS]]
    output: How do I change the limit on my credit card 2539 3519 2345 1555?
    input: [[TEXT STARTS]]<PERSON> was the chief science officer at <ORGANIZATION>.[[TEXT 
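
Note that the printed prompt shows the example placeholders with single braces (e.g. `{DATE}`), while the source of `create_prompt` writes them with double braces. This is standard f-string behavior: `{{` and `}}` are escapes for a literal `{` and `}`. If the few-shot examples are meant to match a double-brace template format, one option is to double the escapes. The snippet below is a minimal, standalone sketch of that behavior and is not part of the original notebook:

```python
# In an f-string, "{{" and "}}" are escapes that produce single literal braces.
single = f"card {{credit_card_number}}"

# Four braces on each side produce a literal double-brace placeholder.
double = f"card {{{{credit_card_number}}}}"

print(single)  # card {credit_card_number}
print(double)  # card {{credit_card_number}}
```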

## Call the LLM

In [8]:
gpt_res = call_completion_model(create_prompt(anonymized_text))

In [9]:
print(gpt_res)

Hello, my name is Aaliyah and I live in Tokyo.
My credit card number is 4928 7562 1034 8907 and my crypto wallet id is 0x3B 7a 5f 1C.

On 02/07/2023 15:45 I visited www.example.com and sent an email to example@email.com,  from the IP 127.0.0.1.

My passport: L921483B and my phone number: +1 (555) 123-4567.

This is a valid International Bank Account Number: FR76 1234 5789 1256 3321 7564 901. Can you please check the status on bank account 987654321?

Eliana's social security number is 123-45-6789.  Her driver license? it is DL12345678.
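
One simple sanity check on the generated text is to confirm that none of the original PII values survive in it. The sketch below is an illustration under stated assumptions, not part of the original notebook: `find_leaked_values` is a hypothetical helper, and the spans are hard-coded character offsets. With Presidio, they could plausibly be built from the `start` and `end` attributes of each analyzer result.

```python
from typing import List, Tuple


def find_leaked_values(original: str, spans: List[Tuple[int, int]], synthetic: str) -> List[str]:
    """Return original PII values (given as character spans) that still appear in the synthetic text."""
    leaked = []
    for start, end in spans:
        value = original[start:end]
        # Plain, case-sensitive substring check; skip empty spans and duplicates.
        if value and value in synthetic and value not in leaked:
            leaked.append(value)
    return leaked


original = "My friend David lives in Paris."
spans = [(10, 15), (25, 30)]  # hypothetical spans covering "David" and "Paris"

print(find_leaked_values(original, spans, "My friend Lucy lives in Beirut."))  # []
print(find_leaked_values(original, spans, "My friend Lucy lives in Paris."))   # ['Paris']
```

A non-empty result does not always indicate a leak, since the model could generate a value that happens to match the original, but it is a cheap signal that a record deserves a second look.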


### Alternatively, run on a list of template sentences:

In [10]:
import urllib.request

templates = []

url = "https://raw.githubusercontent.com/microsoft/presidio-research/master/presidio_evaluator/data_generator/raw_data/templates.txt"
for line in urllib.request.urlopen(url):
    templates.append(line.decode('utf-8')) 

In [11]:
print("Example templates:")
templates[:5]

Example templates:


['I want to increase limit on my card # {{credit_card_number}} for certain duration of time. is it possible?\n',
 'My credit card {{credit_card_number}} has been lost, Can I request you to block it.\n',
 'Need to change billing date of my card {{credit_card_number}}\n',
 'I want to update my primary and secondary address to the same: {{address}}\n',
 "In case of my child's account, we need to add {{person}} as guardian\n"]

In [12]:
templates_to_use = templates[:5]


import time
pp = pprint.PrettyPrinter(indent=2, width=110)
sentences = []
for template in templates_to_use:
    synth_sentence = call_completion_model(create_prompt(template))
    sentence_dict = {"original": template, "synthetic":synth_sentence.strip()}
    sentences.append(sentence_dict)
    pp.pprint(sentence_dict)
    time.sleep(3) # wait to not get blocked by service (only applicable for the free tier)
    print("--------------")


{ 'original': 'I want to increase limit on my card # {{credit_card_number}} for certain duration of time. is '
              'it possible?\n',
  'synthetic': 'I want to increase limit on my card # 4701 2895 7462 8306 for certain duration of time. is '
               'it possible?'}
--------------
{ 'original': 'My credit card {{credit_card_number}} has been lost, Can I request you to block it.\n',
  'synthetic': 'My credit card 4892 7634 1023 8756 has been lost, Can I request you to block it.'}
--------------
{ 'original': 'Need to change billing date of my card {{credit_card_number}}\n',
  'synthetic': 'Need to change billing date of my card 4876 2035 6981 7423'}
--------------
{ 'original': 'I want to update my primary and secondary address to the same: {{address}}\n',
  'synthetic': 'I want to update my primary and secondary address to the same: 123 Main Street, Apt 4.'}
--------------
{ 'original': "In case of my child's account, we need to add {{person}} as guardian\n",
  'synthet

--------------
{ 'original': '{{name}} lives at {{building_number}} {{street_name}}, {{city}}\n',
  'synthetic': 'John Smith lives at 635 Poplar Street, Houston'}
--------------
{ 'original': '{{first_name_male}} had given {{first_name}} his address: {{building_number}} '
              '{{street_name}}\n',
  'synthetic': 'Adam had given Sarah his address: 44 Apple Street'}
--------------
{ 'original': '{{first_name_male}} had given {{first_name}} his address: {{building_number}} '
              '{{street_name}}, {{city}}\n',
  'synthetic': 'David had given Emma his address: 515 Elm Street, Camden.'}
--------------
{ 'original': 'What is your address? it is {{address}}\n',
  'synthetic': 'What is your address? it is 3498 Allensby Street, Los Angeles, CA 90011.'}
--------------
{'original': 'We moved here from {{city}}\n', 'synthetic': 'We moved here from Paris.'}
--------------
{'original': 'We moved here from {{country}}\n', 'synthetic': 'We moved here from Venezuela.'}
--------------


--------------
{ 'original': "It was a done thing between him and {{first_name}}'s kid; and everybody thought so.\n",
  'synthetic': "It was a done thing between him and Jeffery's kid; and everybody thought so."}
--------------
{ 'original': 'Capitalized words like Wisdom and Discipline are often mistaken with names.\n',
  'synthetic': 'Capitalized words like Wisdom and Discipline are often mistaken with names.'}
--------------
{ 'original': 'The letter arrived at {{address}} last night.\n',
  'synthetic': 'The letter arrived at 1143 Orange Street last night.'}
--------------
{ 'original': 'The Princess Royal arrived at {{city}} this morning from {{country}}.\n',
  'synthetic': 'The Princess Royal arrived at London this morning from France.'}
--------------
{'original': "I'm in {{city}}, at the conference\n", 'synthetic': "I'm in Toronto, at the conference."}
--------------
{ 'original': '{{name}}, the {{job}}, said: "I\'m glad to hear that this has been withdrawn – quite why they '
  

--------------
{ 'original': '{{prefix}} {{last_name}} flew to {{city}} on {{day_of_week}} morning.',
  'synthetic': 'Dr. Nguyen flew to Los Angeles on Tuesday morning.'}
--------------
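
The loop above waits a fixed three seconds between calls. An alternative is to retry a failed call with exponentially growing delays. The following is a minimal sketch under stated assumptions: `call_with_backoff` is a hypothetical helper that is not part of the notebook, and it catches every exception for brevity, whereas real code would more likely catch only the SDK's rate-limit error.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on failure with exponentially growing delays plus a little jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:  # assumption: narrow this to rate-limit errors in real use
            attempt += 1
            if attempt >= max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as 1x, 2x, 4x, ... of base_delay, with up to 0.1s of jitter.
            sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))


# Possible use inside the loop (commented out because it calls the API):
# synth_sentence = call_with_backoff(lambda: call_completion_model(create_prompt(template)))
```

The `sleep` parameter exists only so the waiting can be swapped out, for example in tests.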


This notebook demonstrates how to leverage OpenAI models for fake/surrogate data generation. It uses Presidio to first de-identify data (as de-identification might be required prior to passing the data to OpenAI), and then uses OpenAI models to create synthetic/fake/surrogate data based on real data. The OpenAI models may also replace additional PII entities that Presidio did not detect.

Some impressions:
1. LLMs sometimes give additional output, especially if the text is a question or concerns a human/bot interaction. Engineering the prompt can mitigate some of these issues; post-processing might still be required.
2. LLMs sometimes create fake values even in the absence of placeholders.
3. LLMs re-use context from other sentences, so phone numbers are sometimes generated using a credit card pattern, or other similar mistakes occur.
4. Co-references are sometimes missed (e.g. two name placeholders that should be filled with the same name, or he/she pronouns that do not match a male/female name).
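
As one example of the post-processing mentioned above, the generated text can be scanned for placeholders that the model left unfilled. This is a minimal sketch, not part of the original notebook: `find_unfilled_placeholders` is a hypothetical helper, and the pattern assumes the two placeholder styles seen in this notebook (`<ENTITY>` from Presidio's default anonymization and `{{entity}}` from the templates).

```python
import re
from typing import List

# Matches <ENTITY> (upper-case with underscores) and {{entity}} (template style).
PLACEHOLDER_PATTERN = re.compile(r"<[A-Z_]+>|\{\{\w+\}\}")


def find_unfilled_placeholders(text: str) -> List[str]:
    """Return every placeholder-looking token that remains in the text, in order of appearance."""
    return PLACEHOLDER_PATTERN.findall(text)


print(find_unfilled_placeholders("Dr. Nguyen flew to Los Angeles on Tuesday morning."))  # []
print(find_unfilled_placeholders("<PERSON> moved here from {{city}}."))  # ['<PERSON>', '{{city}}']
```

Records with a non-empty result could be regenerated or flagged for manual review.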