In [None]:
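# Install Presidio, the OpenAI client, and supporting packages.
# openai<1 is pinned because this notebook uses the legacy openai.Completion interface.
!pip install presidio_analyzer presidio_anonymizer
!pip install "openai<1" pandas python-dotenv
!python -m spacy download en_core_web_lg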

Path to notebook: [https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/GPT3_synth_data.ipynb](https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/GPT3_synth_data.ipynb)

# Use Presidio + GPT-3 to turn real text into fake text

This notebook uses Presidio to turn text with PII into text where PII entities are replaced with placeholders, e.g. "`My name is David`" turns into "`My name is {{PERSON}}`". Then, it calls the OpenAI GPT-3 API to create a fake record which is based on the original one.


Flow:
1. `My friend David lives in Paris. He likes it.`
1. `My friend {{PERSON}} lives in {{CITY}}. He likes it.`
1. `My friend Lucy lives in Beirut. She likes it.`
    
Note that an OpenAI completion model could potentially detect PII values and replace them in a single call. Even so, it is advisable to validate that all PII entities were in fact detected.
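
The three steps of the flow can be illustrated without calling either service. The sketch below is a toy stand-in, not Presidio or the OpenAI API: `toy_deidentify`, `toy_synthesize`, and both lookup tables are hypothetical and exist only for illustration.

```python
# Toy illustration of the flow above. The lookup tables are hypothetical
# stand-ins for Presidio (detection) and GPT-3 (generation).
KNOWN_PII = {"David": "PERSON", "Paris": "CITY"}    # PII value -> entity type
FAKE_VALUES = {"PERSON": "Lucy", "CITY": "Beirut"}  # entity type -> surrogate


def toy_deidentify(text: str) -> str:
    """Replace each known PII value with a {{ENTITY}} placeholder."""
    for value, entity in KNOWN_PII.items():
        text = text.replace(value, "{{" + entity + "}}")
    return text


def toy_synthesize(template: str) -> str:
    """Fill each {{ENTITY}} placeholder with a fake value."""
    for entity, fake in FAKE_VALUES.items():
        template = template.replace("{{" + entity + "}}", fake)
    return template


original = "My friend David lives in Paris. He likes it."
template = toy_deidentify(original)
synthetic = toy_synthesize(template)
print(template)   # My friend {{PERSON}} lives in {{CITY}}. He likes it.
print(synthetic)  # My friend Lucy lives in Beirut. He likes it.
```

Unlike the GPT-3 result in step 3, the toy version leaves "He" unchanged. Adjusting the surrounding text to fit the new values is something a language model can do and plain substitution cannot.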

## Imports and set up OpenAI Key

In [2]:
import pprint
from dotenv import load_dotenv
import os
import pandas as pd
import openai

load_dotenv()

# Read the key from the environment, or set it explicitly in the notebook.
# Find out more here: https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key
openai.api_key = os.getenv("OPENAI_KEY")

## Define request for the OpenAI Completion service

In [3]:
def call_completion_model(prompt: str, model: str = "text-davinci-003", max_tokens: int = 512) -> str:
    """Creates a request for the OpenAI Completion service and returns the response.

    :param prompt: The prompt for the completion model
    :param model: OpenAI model name
    :param max_tokens: Model's max tokens parameter
    :return: The text of the first completion choice
    """

    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens
    )

    # Only one completion is requested, so return the text of the first choice
    return response['choices'][0].text
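
Calls to a hosted completion service can fail transiently, for example when a rate limit is hit. The following is a minimal sketch of a retry wrapper with exponential backoff. `call_with_retries` and its parameters are hypothetical names, not part of the OpenAI SDK, and catching every `Exception` is an intentionally broad simplification.

```python
import time
from typing import Callable


def call_with_retries(
    complete: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call complete(prompt), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return complete(prompt)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

With the function defined in this notebook it would be used as `call_with_retries(call_completion_model, prompt)`. In practice it is probably better to catch only the rate-limit error class of the installed SDK version rather than every exception.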

## De-identify data using Presidio Analyzer and Anonymizer

In [4]:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

sample = """
Hello, my name is David Johnson and I live in Maine.
My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.

On September 18 I visited microsoft.com and sent an email to test@presidio.site,  from the IP 192.168.0.1.

My passport: 191280342 and my phone number: (212) 555-1234.

This is a valid International Bank Account Number: IL150120690000003111111 . Can you please check the status on bank account 954567876544?

Kate's social security number is 078-05-1126.  Her driver license? it is 1234567A.
"""

results = analyzer.analyze(sample, language="en")
anonymized = anonymizer.anonymize(text=sample, analyzer_results=results)
anonymized_text = anonymized.text
print(anonymized_text)



Hello, my name is <PERSON> and I live in <LOCATION>.
My credit card number is <CREDIT_CARD> and my crypto wallet id is <CRYPTO>.

On <DATE_TIME> I visited <URL> and sent an email to <EMAIL_ADDRESS>,  from the IP <IP_ADDRESS>.

My passport: <US_PASSPORT> and my phone number: <PHONE_NUMBER>.

This is a valid International Bank Account Number: <IBAN_CODE> . Can you please check the status on bank account <US_BANK_NUMBER>?

<PERSON>'s social security number is <US_SSN>.  Her driver license? it is <US_DRIVER_LICENSE>.



## Create prompt (instructions + text to manipulate)

In [29]:
def create_prompt(anonymized_text: str) -> str:
    """
    Create the prompt with instructions to GPT-3.
    
    :param anonymized_text: Text with placeholders instead of PII values, e.g. My name is <PERSON>.
    """

    prompt = f"""
    Your role is to create synthetic text based on de-identified text with placeholders instead of personally identifiable information.
    Replace the placeholders (e.g. <PERSON>, {{DATE}}, {{ip_address}}) with fake values.

    Instructions:

    Use completely random numbers, so every digit is drawn between 0 and 9.
    Use realistic names that come from diverse genders, ethnicities and countries.
    If there are no placeholders, return the text as is and provide an answer.
    If there is any additional PII/PHI in the sentence, replace it too.

    input: How do I change the limit on my credit card {{credit_card_number}}?
    output: How do I change the limit on my credit card 2539 3519 2345 1555?
    input: {anonymized_text}
    output:
    """
    return prompt

In [30]:
print("This is the prompt with de-identified values:")
print(create_prompt(anonymized_text))

This is the prompt with de-identified values:

    Your role is to create synthetic text based on de-identified text with placeholders instead of personally identifiable information.
    Replace the placeholders (e.g. <PERSON>, {DATE}, {ip_address}) with fake values.

    Instructions:

    Use completely random numbers, so every digit is drawn between 0 and 9.
    Use realistic names that come from diverse genders, ethnicities and countries.
    If there are no placeholders, return the text as is and provide an answer.
    If there is any additional PII/PHI in the sentence, replace it too.
    
    input: How do I change the limit on my credit card {credit_card_number}?
    output: How do I change the limit on my credit card 2539 3519 2345 1555?
    input: 
Hello, my name is <PERSON> and I live in <LOCATION>.
My credit card number is <CREDIT_CARD> and my crypto wallet id is <CRYPTO>.

On <DATE_TIME> I visited <URL> and sent an email to <EMAIL_ADDRESS>,  from the IP <IP_ADDRESS>.

My passpor

## Call GPT-3

In [26]:
gpt_res = call_completion_model(create_prompt(anonymized_text))

In [8]:
print(gpt_res)

 Hello, my name is Kalyn Carranza and I live in Kenya.
My credit card number is 9871 8047 5709 9872 and my crypto wallet id is FV7POW12.

On 12th August 2021, 12:31pm I visited http://www.example.com and sent an email to vrystrkbnb@example.com,  from the IP 109.98.184.102.

My passport: 975-56-3481 and my phone number: 501-678-2198.

This is a valid International Bank Account Number: GB89 CPFX 4256 1763 8810 31 . Can you please check the status on bank account 855-31-6381?

Kalyn Carranza's social security number is 270-10-8543. Her driver license? it is 1183 2066 152 66.
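
One inexpensive check on generated text is to confirm that no placeholder survived generation. The sketch below is an assumption about how such a check could look: `find_placeholders` is a hypothetical helper, and its regular expression only covers the two placeholder styles that appear in this notebook (`<ENTITY_TYPE>` and `{{entity_type}}`).

```python
import re
from typing import List

# Matches Presidio-style <ENTITY_TYPE> and template-style {{entity_type}} placeholders.
PLACEHOLDER_RE = re.compile(r"<[A-Z_]+>|\{\{\w+\}\}")


def find_placeholders(text: str) -> List[str]:
    """Return every placeholder still present in the text, in order of appearance."""
    return PLACEHOLDER_RE.findall(text)


anonymized_line = "My passport: <US_PASSPORT> and my phone number: <PHONE_NUMBER>."
# Hypothetical model output in which one placeholder was left unfilled
generated_line = "My passport: 975-56-3481 and my phone number: <PHONE_NUMBER>."

print(find_placeholders(anonymized_line))  # ['<US_PASSPORT>', '<PHONE_NUMBER>']
print(find_placeholders(generated_line))   # ['<PHONE_NUMBER>']
```

An empty list means every placeholder was replaced. It does not say whether the replacement values are plausible for their entity type.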


### Alternatively, run on a list of template sentences:

In [9]:
import urllib.request  # importing urllib alone does not guarantee that urllib.request is loaded

templates = []

url = "https://raw.githubusercontent.com/microsoft/presidio-research/master/presidio_evaluator/data_generator/raw_data/templates.txt"
for line in urllib.request.urlopen(url):
    templates.append(line.decode('utf-8')) 

In [10]:
print("Example templates:")
templates[:5]

Example templates:


['I want to increase limit on my card # {{credit_card_number}} for certain duration of time. is it possible?\n',
 'My credit card {{credit_card_number}} has been lost, Can I request you to block it.\n',
 'Need to change billing date of my card {{credit_card_number}}\n',
 'I want to update my primary and secondary address to the same: {{address}}\n',
 "In case of my child's account, we need to add {{person}} as guardian\n"]

In [42]:
import time
pp = pprint.PrettyPrinter(indent=2, width=110)
sentences = []
for template in templates:
    synth_sentence = call_completion_model(create_prompt(template))
    sentence_dict = {"original": template, "synthetic": synth_sentence.strip()}
    sentences.append(sentence_dict)
    pp.pprint(sentence_dict)
    time.sleep(3)  # pause to avoid being rate-limited by the service (only applicable for the free tier)
    print("--------------")


{ 'original': 'I want to increase limit on my card # {{credit_card_number}} for certain duration of time. is '
              'it possible?\n',
  'synthetic': 'I want to increase limit on my card # 2539 3519 2345 1555 for certain duration of time. Is '
               'it possible?'}
--------------
{ 'original': 'My credit card {{credit_card_number}} has been lost, Can I request you to block it.\n',
  'synthetic': 'My credit card 2539 3519 2345 1555 has been lost, Can I request you to block it.'}
--------------
{ 'original': 'Need to change billing date of my card {{credit_card_number}}\n',
  'synthetic': 'Need to change billing date of my card 2539 3519 2345 1555?'}
--------------
{ 'original': 'I want to update my primary and secondary address to the same: {{address}}\n',
  'synthetic': 'I want to update my primary and secondary address to the same: 3241 Tulip Lane, London, NW9 '
               '5RX.'}
--------------
{ 'original': "In case of my child's account, we need to add {{person

--------------
{ 'original': 'Inject SELECT * FROM Users WHERE client_ip = ?%//!%20\\|{{ip_address}}|%20/\n',
  'synthetic': 'Inject SELECT * FROM Users WHERE client_ip = 192.162.94.21'}
--------------
{ 'original': '{{first_name}}, can I please speak to your boss?\n',
  'synthetic': 'Abigail, can I please speak to your boss?'}
--------------
{ 'original': 'May I request to have the statement sent to {{address}}?\n',
  'synthetic': 'May I request to have the statement sent to 14 Maple Street, Chicago, IL 60647?'}
--------------
{ 'original': "Will my account stay active? It's under my partner's name {{person}}\n",
  'synthetic': "Will my account stay active? It's under my partner's name John Smith."}
--------------
{'original': 'What are my options?\n', 'synthetic': 'What are my options?'}
--------------
{ 'original': 'Bot: Where would you like this to be sent to? User: {{address}}\n',
  'synthetic': 'Bot: Where would you like this to be sent to? User: 369 Center Street, New York, NY 1

--------------
{ 'original': 'The address of {{organization}} is {{address}}\n',
  'synthetic': 'The address of Acme Corporation is 1020 Grand Avenue, Los Angeles, CA 90015.'}
--------------
{ 'original': 'His social security number is {{ssn}}\n',
  'synthetic': 'His social security number is 271-97-6469.'}
--------------
{'original': "Here's my SSN: {{ssn}}\n", 'synthetic': "Here's my SSN: 697-18-3566"}
--------------
{ 'original': '{{first_name_nonbinary}} is a very sympathetic person. They are also good listeners.\n',
  'synthetic': 'Piper is a very sympathetic person. They are also good listeners.'}
--------------
{ 'original': '{{first_name}} is very reliable. You can always depend on him.\n',
  'synthetic': 'Fred is very reliable. You can always depend on him.'}
--------------
{'original': 'Why is {{first_name}} so impulsive?\n', 'synthetic': 'Why is Daniella so impulsive?'}
--------------
{ 'original': '{{person}} will be talking in the conference\n',
  'synthetic': 'Sally Smith

--------------
{ 'original': 'In {{country}} they have company songs, musical expressions of employee loyalty sung by '
              'salarymen. Unfortunately, as regular RR commenter {{person}} points out, "most are '
              'horrible".\n',
  'synthetic': 'In Japan they have company songs, musical expressions of employee loyalty sung by salarymen. '
               'Unfortunately, as regular RR commenter Hannah Jackson points out, "most are horrible".'}
--------------
{ 'original': '"The big three" of The Big Three Killed My Baby are the car manufacturers that dominate the '
              "economy of the White Stripes' home city {{city}}: {{organization}}, {{organization}} and "
              '{{organization}}. "Don\'t feed me planned obsolescence," says {{person}} in an '
              'uncharacteristically political song, lamenting the demise of the unions in the 60s.\n',
  'synthetic': '"The big three" of The Big Three Killed My Baby are the car manufacturers that dominate t

--------------
{ 'original': '{{name}} lives at {{building_number}} {{street_name}}, {{city}}\n',
  'synthetic': 'John Smith lives at 635 Poplar Street, Houston'}
--------------
{ 'original': '{{first_name_male}} had given {{first_name}} his address: {{building_number}} '
              '{{street_name}}\n',
  'synthetic': 'Adam had given Sarah his address: 44 Apple Street'}
--------------
{ 'original': '{{first_name_male}} had given {{first_name}} his address: {{building_number}} '
              '{{street_name}}, {{city}}\n',
  'synthetic': 'David had given Emma his address: 515 Elm Street, Camden.'}
--------------
{ 'original': 'What is your address? it is {{address}}\n',
  'synthetic': 'What is your address? it is 3498 Allensby Street, Los Angeles, CA 90011.'}
--------------
{'original': 'We moved here from {{city}}\n', 'synthetic': 'We moved here from Paris.'}
--------------
{'original': 'We moved here from {{country}}\n', 'synthetic': 'We moved here from Venezuela.'}
--------------


--------------
{ 'original': "It was a done thing between him and {{first_name}}'s kid; and everybody thought so.\n",
  'synthetic': "It was a done thing between him and Jeffery's kid; and everybody thought so."}
--------------
{ 'original': 'Capitalized words like Wisdom and Discipline are often mistaken with names.\n',
  'synthetic': 'Capitalized words like Wisdom and Discipline are often mistaken with names.'}
--------------
{ 'original': 'The letter arrived at {{address}} last night.\n',
  'synthetic': 'The letter arrived at 1143 Orange Street last night.'}
--------------
{ 'original': 'The Princess Royal arrived at {{city}} this morning from {{country}}.\n',
  'synthetic': 'The Princess Royal arrived at London this morning from France.'}
--------------
{'original': "I'm in {{city}}, at the conference\n", 'synthetic': "I'm in Toronto, at the conference."}
--------------
{ 'original': '{{name}}, the {{job}}, said: "I\'m glad to hear that this has been withdrawn – quite why they '
  

--------------
{ 'original': '{{prefix}} {{last_name}} flew to {{city}} on {{day_of_week}} morning.',
  'synthetic': 'Dr. Nguyen flew to Los Angeles on Tuesday morning.'}
--------------


This notebook demonstrates how to leverage OpenAI models for fake/surrogate data generation. It uses Presidio to first de-identify data (as de-identification might be required prior to passing the data to OpenAI), and then uses OpenAI completion models to create synthetic/fake/surrogate data based on real data. OpenAI models could also potentially remove additional PII entities that Presidio did not detect.

Some impressions:
1. GPT-3 sometimes gives additional output, especially if the text is a question or concerns a human/bot interaction. Engineering the prompt can mitigate some of these issues. Post-processing might also be required.
2. GPT-3 sometimes creates fake values even in the absence of placeholders.
3. GPT-3 re-uses context from other sentences, so phone numbers are sometimes generated using a credit card pattern, among other similar mistakes.
4. Co-references are sometimes missed (e.g. two name placeholders that should be filled with the same name, or a he/she pronoun that should agree with a male/female name).
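
The first two impressions suggest a simple post-processing check: align each synthetic sentence with its template and flag the ones whose literal text changed. The following is a minimal sketch under stated assumptions. `extract_fills` is a hypothetical helper, it handles only `{{name}}`-style placeholders, and it tolerates differences in letter case and one trailing punctuation mark, since both appear in the outputs above.

```python
import re
from typing import List, Optional, Tuple

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def extract_fills(template: str, synthetic: str) -> Optional[List[Tuple[str, str]]]:
    """Return (placeholder, generated value) pairs, or None if the literal text differs."""
    # split() with a capturing group alternates literal text and placeholder names.
    parts = PLACEHOLDER.split(template.strip())
    literals, names = parts[0::2], parts[1::2]
    # Literal text must match verbatim; each placeholder matches any non-empty run.
    pattern = "(.+?)".join(re.escape(lit) for lit in literals) + r"[.?!]?"
    match = re.fullmatch(pattern, synthetic.strip(), flags=re.IGNORECASE)
    if match is None:
        return None
    return list(zip(names, match.groups()))


print(extract_fills("We moved here from {{city}}\n", "We moved here from Paris."))
# [('city', 'Paris')]
print(extract_fills("What are my options?\n", "What are my options?"))
# []
# Hypothetical output in which the model answered the question
print(extract_fills("What are my options?\n", "What are my options? Here are a few."))
# None
```

A result of `None` marks a sentence for manual review, while an empty list means the template had no placeholders and came back unchanged, so the result should be compared with `is None` rather than tested for truthiness. Extra text on the same line after a trailing placeholder is absorbed into that placeholder's value, so this check does not catch every case of additional output.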