This notebook demonstrates how to use DBRX models to generate fake (surrogate) data. It first uses Presidio to de-identify the data, since de-identification might be required before passing the data to DBRX, and then uses DBRX chat completion models to create synthetic data based on the real data. DBRX might also remove additional PII entities that Presidio did not detect.

Some impressions:
1. DBRX sometimes gives additional output, especially if the text is a question or concerns a human/bot interaction. Prompt engineering can mitigate some of these issues, but post-processing might still be required.
2. DBRX sometimes creates fake values even in the absence of placeholders.
3. DBRX re-uses context from other sentences, which can cause mistakes such as phone numbers generated using a credit card pattern.
4. Co-references are sometimes missed (e.g. two name placeholders that should be filled with the same name, or he/she pronouns that should match a male/female name).
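
The post-processing mentioned in the first impression could look roughly like the sketch below. It is not part of the original notebook: the function name `strip_trailing_commentary` and the list of commentary prefixes are assumptions chosen for illustration, and real model output may need different rules.

```python
import re

# Hypothetical prefixes that mark model commentary rather than synthetic text.
_COMMENTARY_PREFIXES = ("note:", "please note")


def strip_trailing_commentary(text: str) -> str:
    """Drop trailing paragraphs that start with a known commentary prefix."""
    # Split on blank lines so each paragraph can be inspected separately.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text.strip())]
    # Remove paragraphs from the end for as long as they look like commentary.
    while paragraphs and paragraphs[-1].lower().startswith(_COMMENTARY_PREFIXES):
        paragraphs.pop()
    return "\n\n".join(paragraphs)


raw = (
    "Need to change billing date of my card 4321 0987 6543 2109?\n"
    "\n"
    "Note: the number above is synthetic."
)
print(strip_trailing_commentary(raw))
```

This only handles commentary that appears as a separate trailing paragraph. Output where the model answers the question instead of rewriting it would need a different check.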

In [0]:
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting presidio_analyzer
  Using cached presidio_analyzer-2.2.354-py3-none-any.whl (92 kB)
Collecting presidio_anonymizer
  Using cached presidio_anonymizer-2.2.354-py3-none-any.whl (31 kB)
Collecting tldextract
  Using cached tldextract-5.1.2-py3-none-any.whl (97 kB)
Collecting phonenumbers<9.0.0,>=8.12
  Using cached phonenumbers-8.13.34-py2.py3-none-any.whl (2.6 MB)
Collecting pycryptodome>=3.10.1
  Using cached pycryptodome-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Collecting requests-file>=1.4
  Using cached requests_file-2.0.0-py2.py3-none-any.whl (4.2 kB)
Installing collected packages: phonenumbers, pycryptodome, requests-file, presidio_anonymizer, tldextract, presidio_analyzer
Successfully installed phonenumbers-8.13.34 presidio_analyzer-2.2.354 presidio_anonymizer-2.2.354 pycryptodome-3.20.0 requests-file-2.0.0 tldextract-5.1.2
[4

In [0]:
%pip install openai --upgrade
%pip install databricks-genai
%pip install databricks-genai-inference
%pip install mlflow

dbutils.library.restartPython()

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting openai
  Using cached openai-1.17.1-py3-none-any.whl (268 kB)
Collecting typing-extensions<5,>=4.7
  Using cached typing_extensions-4.11.0-py3-none-any.whl (34 kB)
Collecting httpx<1,>=0.23.0
  Using cached httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.*
  Using cached httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13
  Using cached h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: typing-extensions, h11, httpcore, httpx, openai
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.4.0
    Not uninstalling typing-extensions at /databricks/python3/lib/python3.10/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-750ba81f-578d-4b8c-b6d7-1ea7a95f0f7a
    Can't uninstall 'typing_extensions'. No files were found to uninstall.
  Attempting uninstall: open

Path to original notebook: [https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/GPT3_synth_data.ipynb](https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/GPT3_synth_data.ipynb)

# Use Presidio + DBRX to turn real text into fake text

This notebook uses Presidio to turn text with PII into text where PII entities are replaced with placeholders, e.g. "`My name is David`" turns into "`My name is {{PERSON}}`". Then, it calls the DBRX Foundation Model API to create a fake record which is based on the original one.


Flow:
1. `My friend David lives in Paris. He likes it.`
1. `My friend {{PERSON}} lives in {{CITY}}. He likes it.`
1. `My friend Lucy lives in Beirut. She likes it.`
    
Note that DBRX completion models could possibly detect PII values and replace them in one call, but it is suggested to validate that all PII entities are indeed detected.
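
One simple way to validate a synthetic record is to check that none of the original PII values survive verbatim. The sketch below is an illustration under assumptions, not part of the original notebook: `find_leaked_values` is a hypothetical helper, and the spans are written by hand here. With Presidio they could be taken from the `start` and `end` attributes of the analyzer results.

```python
from typing import List, Tuple


def find_leaked_values(
    original: str, spans: List[Tuple[int, int]], synthetic: str
) -> List[str]:
    """Return original PII substrings that still appear verbatim in the synthetic text."""
    leaked: List[str] = []
    for start, end in spans:
        value = original[start:end]
        # Report each leaked value once, in the order it was detected.
        if value and value in synthetic and value not in leaked:
            leaked.append(value)
    return leaked


original = "My friend David lives in Paris. He likes it."
spans = [(10, 15), (25, 30)]  # character offsets of "David" and "Paris"

print(find_leaked_values(original, spans, "My friend Lucy lives in Beirut. She likes it."))
print(find_leaked_values(original, spans, "My friend Lucy lives in Paris. She likes it."))
```

A verbatim match is a weak test: it does not catch reformatted values (for example a phone number with different separators), and it cannot find PII that was never detected in the first place.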

## Imports and set up the Databricks token

## Define request for the DBRX chat completions endpoint

In [0]:
import pprint
import os
import pandas as pd
from openai import OpenAI
import mlflow

databricks_token = mlflow.utils.databricks_utils.get_databricks_host_creds().token

In [0]:
client = OpenAI(
  api_key=databricks_token,
  base_url="https://e2-demo-field-eng.cloud.databricks.com/serving-endpoints"
)

def call_completion_model(prompt: str, model: str = "databricks-dbrx-instruct", max_tokens: int = 512):
    """
    Send the prompt to the chat completions endpoint and return the first choice.

    The returned object is a choice, not a string; read the text via `.message.content`.
    """
    response = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are an AI system helping detect, classify, and anonymize sensitive PII data"
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        model=model,
        max_tokens=max_tokens
    )

    return response.choices[0]
  


## De-identify data using Presidio Analyzer and Anonymizer

In [0]:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

sample = """
Hello, my name is Juan Lamadrid Martinez Sanchez Ochoa and I lives in mezico area south of the bridge.
My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.

On September 18 I visited microsoft.com and sent an email to test@presidio.site,  from the IP 192.168.0.1.

My passport: 191280342 and my phone number: (212) 555-1234.

This is a valid International Bank Account Number: IL150120690000003111111 . Can you please check the status on bank account 954567876544?

Kate's social security number is 078-05-1126.  Her driver license? it is 1234567A.
"""

results = analyzer.analyze(sample, language="en")

print(results)

anonymized = anonymizer.anonymize(text=sample, analyzer_results=results)
anonymized_text = anonymized.text
print("\n\n" + anonymized_text)


[type: CREDIT_CARD, start: 129, end: 148, score: 1.0, type: CRYPTO, start: 176, end: 210, score: 1.0, type: EMAIL_ADDRESS, start: 274, end: 292, score: 1.0, type: IBAN_CODE, start: 433, end: 456, score: 1.0, type: IP_ADDRESS, start: 307, end: 318, score: 0.95, type: PERSON, start: 19, end: 41, score: 0.85, type: PERSON, start: 42, end: 55, score: 0.85, type: LOCATION, start: 71, end: 82, score: 0.85, type: DATE_TIME, start: 216, end: 228, score: 0.85, type: PERSON, start: 522, end: 526, score: 0.85, type: US_SSN, start: 555, end: 566, score: 0.85, type: PHONE_NUMBER, start: 365, end: 379, score: 0.75, type: PHONE_NUMBER, start: 555, end: 566, score: 0.75, type: US_DRIVER_LICENSE, start: 595, end: 603, score: 0.6499999999999999, type: URL, start: 239, end: 252, score: 0.5, type: URL, start: 279, end: 290, score: 0.5, type: US_PASSPORT, start: 334, end: 343, score: 0.4, type: US_BANK_NUMBER, start: 507, end: 519, score: 0.4, type: IN_PAN, start: 129, end: 139, score: 0.05, type: US_SSN, 
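
The analyzer output above contains overlapping detections, for example `US_SSN` and `PHONE_NUMBER` on the same span (555 to 566), and `IN_PAN` inside the `CREDIT_CARD` span. The sketch below shows one way such conflicts could be resolved by score. It is an illustration only and does not claim to reproduce Presidio's own conflict handling; the `Span` class and `keep_best_non_overlapping` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float


def keep_best_non_overlapping(spans: List[Span]) -> List[Span]:
    """Greedily keep the highest-scoring spans, skipping any that overlap a kept one."""
    kept: List[Span] = []
    # Visit higher scores first; a longer span wins a tie.
    for span in sorted(spans, key=lambda s: (-s.score, -(s.end - s.start))):
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    # Return the survivors in document order.
    return sorted(kept, key=lambda s: s.start)


# Values copied from the analyzer output above.
detections = [
    Span("CREDIT_CARD", 129, 148, 1.0),
    Span("IN_PAN", 129, 139, 0.05),
    Span("US_SSN", 555, 566, 0.85),
    Span("PHONE_NUMBER", 555, 566, 0.75),
]

for span in keep_best_non_overlapping(detections):
    print(span.entity_type, span.start, span.end)
```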

## Create prompt (instructions + text to manipulate)

In [0]:
def create_prompt(anonymized_text: str) -> str:
    """
    Create the prompt with instructions to DBRX.
    
    :param anonymized_text: Text with placeholders instead of PII values, e.g. My name is <PERSON>.
    """

    prompt = f"""
    Your role is to create synthetic text based on de-identified text with placeholders instead of personally identifiable information.
    Replace the placeholders (e.g. <PERSON>, {{DATE}}, {{ip_address}}) with fake values.

    Instructions:

    Use completely random numbers, so every digit is drawn between 0 and 9.
    Use realistic names that come from diverse genders, ethnicities and countries.
    If there are no placeholders, return the text as is and provide an answer.
    input: How do I change the limit on my credit card {{credit_card_number}}?
    output: How do I change the limit on my credit card 2539 3519 2345 1555?
    input: {anonymized_text}
    output:
    """
    return prompt

In [0]:
print("This is the prompt with de-identified values:")
print(create_prompt(anonymized_text))

This is the prompt with de-identified values:

    Your role is to create synthetic text based on de-identified text with placeholders instead of personally identifiable information.
    Replace the placeholders (e.g. <PERSON>, {DATE}, {ip_address}) with fake values.

    Instructions:

    Use completely random numbers, so every digit is drawn between 0 and 9.
    Use realistic names that come from diverse genders, ethnicities and countries.
    If there are no placeholders, return the text as is and provide an answer.
    input: How do I change the limit on my credit card {credit_card_number}?
    output: How do I change the limit on my credit card 2539 3519 2345 1555?
    input: 
Hello, my name is <PERSON> and I lives in <LOCATION> south of the bridge.
My credit card number is <CREDIT_CARD> and my crypto wallet id is <CRYPTO>.

On <DATE_TIME> I visited <URL> and sent an email to <EMAIL_ADDRESS>,  from the IP <IP_ADDRESS>.

My passport: <US_PASSPORT> and my phone number: <PHONE_NUMBER>.

T
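
The printed prompt shows single braces such as `{DATE}` even though the source of `create_prompt` writes `{{DATE}}`. That is standard f-string behaviour: a doubled brace is an escape for one literal brace. The short sketch below illustrates it, including how a literal double-brace placeholder could be produced if the prompt should match the `{{...}}` template style. The variable names are for illustration only.

```python
value = "<PERSON>"

# Doubled braces in an f-string render as one literal brace each.
single = f"{{DATE}} and {value}"

# Four braces are needed to render a literal pair of braces.
double = f"{{{{DATE}}}} and {value}"

print(single)
print(double)
```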

## Call DBRX

In [0]:
dbrx_res = call_completion_model(create_prompt(anonymized_text))

In [0]:
print(dbrx_res.message.content)

Hello, my name is Lea and I live in Paris, France.
My credit card number is 4532 8015 3322 1111 and my crypto wallet id is 0x4a1E5e5267F9a5a5.

On 2022-03-01 14:30 I visited <https://www.example.com> and sent an email to [johndoe@example.com](mailto:johndoe@example.com), from the IP 192.168.1.101.

My passport: US123456789 and my phone number: +1 (123) 456-7890.

This is a valid International Bank Account Number: DE89 3704 0044 0532 0130 00. Can you please check the status on bank account 1234567890?

Lea's social security number is 666-12-1234. Her driver license? it is CA123456789.

Note: I have replaced the placeholders with fake but realistic values, and I have removed any sensitive information.
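
Before using a response like the one above, it can help to confirm that no placeholder was left unfilled. The following is a minimal sketch under assumptions: `find_unfilled_placeholders` is a hypothetical helper, and the pattern only covers the two placeholder styles that appear in this notebook (`<ENTITY>` and `{{entity}}` or `{entity}`).

```python
import re
from typing import List

# Matches Presidio-style <ENTITY> tags and template-style {entity} / {{entity}} tags.
_PLACEHOLDER = re.compile(r"<[A-Z_]+>|\{{1,2}\w+\}{1,2}")


def find_unfilled_placeholders(text: str) -> List[str]:
    """Return every placeholder-looking token that is still present in the text."""
    return _PLACEHOLDER.findall(text)


print(find_unfilled_placeholders("My passport: <US_PASSPORT> and my card {{credit_card_number}}."))
print(find_unfilled_placeholders("I visited <https://www.example.com> on 2022-03-01."))
```

The upper-case restriction inside the angle brackets is deliberate, so that autolinked URLs such as `<https://www.example.com>` in the model output are not reported as placeholders.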


### Alternatively, run on a list of template sentences:

In [0]:
import urllib.request

templates = []

url = "https://raw.githubusercontent.com/microsoft/presidio-research/master/presidio_evaluator/data_generator/raw_data/templates.txt"
for line in urllib.request.urlopen(url):
    templates.append(line.decode('utf-8')) 

In [0]:
print("Example templates:")
templates[:5]

Example templates:


['I want to increase limit on my card # {{credit_card_number}} for certain duration of time. is it possible?\n',
 'My credit card {{credit_card_number}} has been lost, Can I request you to block it.\n',
 'Need to change billing date of my card {{credit_card_number}}\n',
 'I want to update my primary and secondary address to the same: {{address}}\n',
 "In case of my child's account, we need to add {{person}} as guardian\n"]

In [0]:
import time
pp = pprint.PrettyPrinter(indent=2, width=110)
sentences = []
for template in templates:
    synth_sentence = call_completion_model(create_prompt(template))
    sentence_dict = {"original": template, "synthetic":synth_sentence.message.content}
    sentences.append(sentence_dict)
    pp.pprint(sentence_dict)
    time.sleep(5) # wait to not get blocked by service (only applicable for the free tier)
    print("--------------")


{ 'original': 'I want to increase limit on my card # {{credit_card_number}} for certain duration of time. is '
              'it possible?\n',
  'synthetic': 'I want to increase the limit on my card # 4567 8901 2345 6789 for a certain duration of '
               'time. Is it possible?'}
--------------
{ 'original': 'My credit card {{credit_card_number}} has been lost, Can I request you to block it.\n',
  'synthetic': 'Sure, I can help you with that. To request a block on your lost credit card, you can use the '
               'following number as a reference: 4921 5236 8741 2583. This is not a real credit card number, '
               'but it will help us to process your request more quickly.'}
--------------
{ 'original': 'Need to change billing date of my card {{credit_card_number}}\n',
  'synthetic': 'Need to change billing date of my card 4321 0987 6543 2109?\n'
               '\n'
               'Please note that the credit card number provided is a synthetic value and does not r