# 3. Synthetic data generation with presidio-research
## 3.1 Overview 
In this notebook, we will focus on generating synthetic data using Presidio Research, an open-source tool developed by Microsoft.

Synthetic data is artificial data generated with the purpose of preserving privacy, testing systems or creating training data for machine learning algorithms. It is designed to mimic real data in terms of essential characteristics, without containing any personally identifiable information (PII). This makes it a valuable tool in data analysis where privacy is a concern.

Presidio Research offers a framework for generating synthetic data. It takes the entities detected by the Analyzer engine and replaces them with synthetic values, creating a new dataset that keeps the structure and statistical properties of the original data but contains no real PII.

In this hands-on lab, we will walk you through the process of using Presidio to generate synthetic data. We will start by setting up the necessary libraries and initializing the Presidio analyzer and anonymizer engines. We will then define a few sample texts and use Presidio to generate multiple synthetic samples based on these texts.

By the end of this lab, you will have a solid understanding of how to use Presidio for synthetic data generation, and you will be equipped with the skills to use this tool in your data privacy and security projects.

## 3.2. Generate synthetic data from simple examples
Presidio's data generator is based on the Python Faker tool and allows you to generate a synthetic dataset from sentence templates.
Examples of sentence templates:
- I live at {{address}}
- You can email me at {{email}}. Thanks, {{first_name}}
- My phone number is {{phone_number}}

There are two main scenarios in which we can use Presidio Data Generator:
1. To create a fake dataset for evaluation or training purposes, given a list of predefined templates like the examples above.
2. To expand an existing labeled dataset with extra synthetic samples for augmentation.

In this part, we showcase the first scenario: generating N=10 samples from two templates. At a high level, the process is as follows:
1. Define the templates: Write the sentence templates you want to generate fake data from.
2. Initialize the Presidio synthetic data generator: This component is responsible for generating synthetic data based on the sentence templates.
3. Generate fake data: Use the generate_fake_data method of the data generator to create fake data. In this case, we want to generate 10 samples.
4. Review the fake data: The generate_fake_data method returns the synthetic samples. Each sample keeps the structure of its template, with every placeholder replaced by a fake PII entity and the position of that entity recorded as a span.
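
To make the steps above concrete, here is a minimal standard-library sketch of the idea behind template filling: each {{placeholder}} is replaced by a fake value and the character offsets of that value are recorded. This is not how PresidioDataGenerator is implemented; FAKE_VALUES and fill_template are hypothetical names introduced only for illustration, and the hard-coded value lists stand in for Faker providers.

```python
import random
import re

# Hypothetical stand-ins for Faker providers: each returns one fake value.
FAKE_VALUES = {
    "name": lambda rng: rng.choice(["Jessica Wong", "Mark Allen"]),
    "email": lambda rng: rng.choice(["michelle19@example.net", "wendy09@example.org"]),
}

# Matches placeholders such as {{name}} and captures the entity type.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template, rng):
    """Replace each {{placeholder}} and record where the fake value lands."""
    parts, spans = [], []
    cursor = 0   # position in the template
    out_len = 0  # length of the output text built so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        out_len += len(literal)

        entity_type = match.group(1)
        value = FAKE_VALUES[entity_type](rng)
        spans.append(
            {"value": value, "start": out_len, "end": out_len + len(value), "type": entity_type}
        )
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
templates = ["My name is {{name}}. And my email_address is {{email}}."]
samples = [fill_template(rng.choice(templates), rng) for _ in range(10)]

for text, spans in samples[:2]:
    print(text)
    print(spans)
```

Tracking the offsets while the sentence is being built is what makes the generated text usable as labeled data: every fake entity comes with its exact position and type.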

In [18]:
# Import libraries
from presidio_evaluator.data_generator import PresidioDataGenerator

sentence_templates = [
    "My name is {{name}}. And my email_address is {{email}}.",
    "I just moved to {{address}} and my phone number is {{phone_number}}.",
]

data_generator = PresidioDataGenerator()
fake_records = data_generator.generate_fake_data(sentence_templates, 10)
fake_records = list(fake_records)
for record in fake_records:
    print(record.fake)
    print(record.spans)

Sampling: 100%|██████████| 10/10 [00:00<00:00, 3356.52it/s]

My name is Jessica Wong. And my email_address is michelle19@example.net.
[{"value": "michelle19@example.net", "start": 49, "end": 71, "type": "email"}, {"value": "Jessica Wong", "start": 11, "end": 23, "type": "name"}]
My name is Nicholas Coleman. And my email_address is wendy09@example.org.
[{"value": "wendy09@example.org", "start": 53, "end": 72, "type": "email"}, {"value": "Nicholas Coleman", "start": 11, "end": 27, "type": "name"}]
My name is Mark Allen. And my email_address is wernerphilip@example.net.
[{"value": "wernerphilip@example.net", "start": 47, "end": 71, "type": "email"}, {"value": "Mark Allen", "start": 11, "end": 21, "type": "name"}]
I just moved to 94356 Mitchell Walks Suite 148
South Sarahbury, GA 79619 and my phone number is 8702241684.
[{"value": "8702241684", "start": 96, "end": 106, "type": "phone_number"}, {"value": "94356 Mitchell Walks Suite 148\nSouth Sarahbury, GA 79619", "start": 16, "end": 72, "type": "address"}]
My name is Harry Walter. And my email_addre
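
A quick sanity check on the generated records is to confirm that every span really points at its value inside the fake text. The sketch below copies two records from the output above and represents the spans as plain dictionaries, which is an assumption made for illustration; spans_are_aligned is a hypothetical helper, not part of presidio-research.

```python
# Two records copied from the output above: the fake text and its spans.
records = [
    (
        "My name is Jessica Wong. And my email_address is michelle19@example.net.",
        [
            {"value": "michelle19@example.net", "start": 49, "end": 71, "type": "email"},
            {"value": "Jessica Wong", "start": 11, "end": 23, "type": "name"},
        ],
    ),
    (
        "I just moved to 94356 Mitchell Walks Suite 148\n"
        "South Sarahbury, GA 79619 and my phone number is 8702241684.",
        [
            {"value": "8702241684", "start": 96, "end": 106, "type": "phone_number"},
            {
                "value": "94356 Mitchell Walks Suite 148\nSouth Sarahbury, GA 79619",
                "start": 16,
                "end": 72,
                "type": "address",
            },
        ],
    ),
]


def spans_are_aligned(text, spans):
    """Return True if slicing text with each span's offsets yields its value."""
    return all(text[s["start"]:s["end"]] == s["value"] for s in spans)


for text, spans in records:
    print(spans_are_aligned(text, spans))  # prints True for both records
```

Note that the address value contains a newline, which counts as one character in the offsets.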




## 3.3. Enhanced Synthetic Data Generation with Presidio and Faker
With plain Faker, we can generate 10 new fake sentences from the two given templates, but each placeholder is filled independently, so inconsistencies may appear. For example, a template like "My name is {{name}}. And my email_address is {{email}}." might produce a mismatched name and email, such as Jessica Wong with michelle19@example.net. To solve this, the RecordsFaker object from Presidio Data Generator keeps attributes that belong to the same individual consistent (such as pairing Michael Smith with michael.smith@gmail.com).
The fake_name_generator_file can be downloaded from https://www.fakenamegenerator.com/order.php
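
The difference between independent and record-based sampling can be pictured with a small standard-library sketch. It only illustrates the idea and is not the RecordsFaker implementation; RECORDS, independent_fill and record_fill are hypothetical names, and the two rows are made-up stand-ins for the FakeNameGenerator CSV.

```python
import random

# Hypothetical rows standing in for the FakeNameGenerator CSV.
RECORDS = [
    {"person": "Michael Smith", "email": "michael.smith@example.com"},
    {"person": "Debra Neal", "email": "debra.neal@example.com"},
]
FIELDS = ("person", "email")


def independent_fill(rng):
    """Draw every field from its own random row, so values may not match."""
    return {field: rng.choice(RECORDS)[field] for field in FIELDS}


def record_fill(rng):
    """Draw one row and take every field from it, so the values agree."""
    row = rng.choice(RECORDS)
    return {field: row[field] for field in FIELDS}


rng = random.Random(42)
consistent = [record_fill(rng) for _ in range(5)]
mixed = [independent_fill(rng) for _ in range(5)]

print("record-based:", consistent[0])
print("independent: ", mixed[0])
```

Every entry in consistent is a complete row from RECORDS, while entries in mixed can combine one person's name with another person's email.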

Moreover, Presidio Data Generator supports customization and ships extra Faker providers beyond those included in the original Faker. The standard Faker providers are listed here: https://faker.readthedocs.io/en/master/index.html#compatibility. With Presidio Data Generator you can also add providers such as:
- NationalityProvider
- OrganizationProvider
- UsDriverLicenseProvider
- ReligionProvider
- IpAddressProvider
- AgeProvider
- AddressProvider
- PhoneNumberProvider
- HospitalProvider

In the next part, we will demonstrate how you can utilize these features.

In [19]:
# First import the libraries
import pandas as pd
from presidio_evaluator.data_generator.faker_extensions import (
    RecordsFaker,
    AgeProvider,
)

In [20]:
# Read FakeNameGenerator CSV
fake_name_generator_df = pd.read_csv("../data_samples/FakeNameGenerator.com_3000.csv")

# Update to match existing templates
fake_name_generator_df = PresidioDataGenerator.update_fake_name_generator_df(fake_name_generator_df)
fake_name_generator_df.head()

Unnamed: 0,number,gender,nationality,prefix,first_name,middle_initial,last_name,street_name,city,state_abbr,...,company,domain_name,person,name,first_name_female,first_name_male,prefix_female,prefix_male,last_name_female,last_name_male
0,1,female,Czech,Mrs.,Marie,J,Hamanová,P.O. Box 255,Kangerlussuaq,QE,...,Simple Solutions,MarathonDancing.gl,Marie Hamanová,Marie Hamanová,Marie,,Mrs.,,Hamanová,
1,2,female,French,Ms.,Patricia,G,Desrosiers,Avenida Noruega 42,Vila Real,VR,...,Formula Gray,LostMillions.com.pt,Patricia Desrosiers,Patricia Desrosiers,Patricia,,Ms.,,Desrosiers,
2,3,female,American,Ms.,Debra,O,Neal,1659 Hoog St,Brakpan,GA,...,Dahlkemper's,MediumTube.co.za,Debra Neal,Debra Neal,Debra,,Ms.,,Neal,
3,4,male,French,Mr.,Peverell,C,Racine,183 Epimenidou Street,Limassol,LI,...,Quickbiz,ImproveLook.com.cy,Peverell Racine,Peverell Racine,,Peverell,,Mr.,,Racine
4,5,female,Slovenian,Mrs.,Iolanda,S,Tratnik,Karu põik 61,Pärnu,PR,...,Dubrow's Cafeteria,PostTan.com.ee,Iolanda Tratnik,Iolanda Tratnik,Iolanda,,Mrs.,,Tratnik,


In [25]:

sentence_templates = [
    "My name is {{person}}. I'm {{age}} years old and my email is {{email}}",
    "My credit card {{credit_card_number}} has been lost, Can I request you to block it.",
    "Need to change billing date of my card {{credit_card_number}}",
    "In case of my child's account, we need to add {{person}} as guardian",
]

# Create RecordsFaker (extension which handles records instead of independent values) and add additional specific providers
fake = RecordsFaker(records=fake_name_generator_df, locale="en_US")
# Add additional providers, not part of the default Faker
# fake.add_provider(IpAddressProvider)  # Both Ipv4 and IPv6 IP addresses
# fake.add_provider(NationalityProvider)  # Read countries + nationalities from file
# fake.add_provider(OrganizationProvider)  # Read organization names from file
# fake.add_provider(UsDriverLicenseProvider)  # Read US driver license numbers from file
fake.add_provider(AgeProvider)  # Age values (unavailable on Faker)
# fake.add_provider(AddressProviderNew)  # Extend the default address formats

# Generate a new PresidioDataGenerator with the custom faker
data_generator = PresidioDataGenerator(custom_faker=fake, lower_case_ratio=0.05)

fake_records = data_generator.generate_fake_data(sentence_templates, 10)

fake_records = list(fake_records)
for record in fake_records:
    print(record.fake)
    print(record.spans)

Sampling: 100%|██████████| 10/10 [00:00<00:00, 10005.50it/s]

My name is Rasmus H. Katajisto. I'm 48 years old and my email is RasmusKatajisto@rhyta.com
[{"value": "RasmusKatajisto@rhyta.com", "start": 65, "end": 90, "type": "email"}, {"value": "48", "start": 36, "end": 38, "type": "age"}, {"value": "Rasmus H. Katajisto", "start": 11, "end": 30, "type": "person"}]
My credit card 565158332017 has been lost, Can I request you to block it.
[{"value": "565158332017", "start": 15, "end": 27, "type": "credit_card_number"}]
My name is Asma Akhtakhanov. I'm 58 years old and my email is AsmaAkhtakhanov@cuvox.de
[{"value": "AsmaAkhtakhanov@cuvox.de", "start": 62, "end": 86, "type": "email"}, {"value": "58", "start": 33, "end": 35, "type": "age"}, {"value": "Asma Akhtakhanov", "start": 11, "end": 27, "type": "person"}]
My name is Anastasio L Milani. I'm 81 years old and my email is AnastasioMilani@dayrep.com
[{"value": "AnastasioMilani@dayrep.com", "start": 64, "end": 90, "type": "email"}, {"value": "81", "start": 35, "end": 37, "type": "age"}, {"value": "A


