In [None]:
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
from presidio_evaluator.data_generator.main import generate, read_synth_dataset

import datetime
import json

# Generate fake PII data using Presidio's data generator

Presidio's data generator creates a synthetic dataset from two prerequisites:
1. A fake PII csv (we used https://www.fakenamegenerator.com/).
2. A text file with template sentences or paragraphs. In this file, each PII entity placeholder is written in square brackets, and the placeholder name must match one of the columns in the fake PII csv file.

The generator creates fake sentences from the provided fake PII csv, a list of [extension functions](../presidio_evaluator/data_generator/extensions.py), and a few third-party libraries such as `Faker` and `haikunator`.


For example:
1. **A fake PII csv**:

| FIRST_NAME  |  LAST_NAME  |  EMAIL |
|-------------|-------------|-----------|
| David       |  Brown      |  david.brown@jobhop.com |
| Mel         |  Brown      |  melb@hobjob.com |


2. **Templates**:

My name is [FIRST_NAME]

You can email me at [EMAIL]. Thanks, [FIRST_NAME]

What's your last name? It's [LAST_NAME]

Every time I see you falling I get down on my knees and pray
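
To make the template mechanism concrete, here is a minimal sketch of how a bracketed placeholder could be swapped for a value from one csv row while recording where the value lands in the text. It is an illustration only, not Presidio's implementation: `fill_template`, `PLACEHOLDER`, and the in-memory `fake_pii_rows` list are hypothetical names, and the rows simply mirror the table above.

```python
import random
import re

# Hypothetical stand-in for rows of the fake PII csv (column name -> value).
fake_pii_rows = [
    {"FIRST_NAME": "David", "LAST_NAME": "Brown", "EMAIL": "david.brown@jobhop.com"},
    {"FIRST_NAME": "Mel", "LAST_NAME": "Brown", "EMAIL": "melb@hobjob.com"},
]

templates = [
    "My name is [FIRST_NAME]",
    "You can email me at [EMAIL]. Thanks, [FIRST_NAME]",
    "What's your last name? It's [LAST_NAME]",
    "Every time I see you falling I get down on my knees and pray",
]

# Assumption: placeholders are upper-case column names in square brackets.
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill_template(template, row):
    """Replace each [ENTITY] placeholder with the row's value and record its span."""
    text = ""
    spans = []
    last_end = 0
    for match in PLACEHOLDER.finditer(template):
        text += template[last_end:match.start()]
        entity_type = match.group(1)
        value = row[entity_type]
        spans.append({
            "entity_type": entity_type,
            "start": len(text),
            "end": len(text) + len(value),
            "value": value,
        })
        text += value
        last_end = match.end()
    text += template[last_end:]
    return text, spans


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for template in templates:
    sentence, spans = fill_template(template, rng.choice(fake_pii_rows))
    print(sentence, spans)
```

A template without placeholders, like the last one, passes through unchanged with an empty span list, so it yields an example that contains no PII.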


### Generate files
Given these two prerequisites, a requested number of examples, and an output file name, `generate` writes the synthetic dataset to the output file and returns the generated examples:

In [None]:
EXAMPLES = 100
SPAN_TO_TAG = True  # Whether to create tokens + token labels (tags)
TEMPLATES_FILE = '../../presidio_evaluator/data_generator/' \
                 'raw_data/templates.txt'
KEEP_ONLY_TAGGED = False
LOWER_CASE_RATIO = 0.1
IGNORE_TYPES = {"IP_ADDRESS", 'US_SSN', 'URL'}

cur_time = datetime.date.today().strftime("%B_%d_%Y")

OUTPUT = "../../data/generated_size_{}_date_{}.json".format(EXAMPLES, cur_time)

fake_pii_csv = '../../presidio_evaluator/data_generator/' \
               'raw_data/FakeNameGenerator.com_3000.csv'
utterances_file = TEMPLATES_FILE
dictionary_path = None

examples = generate(fake_pii_csv=fake_pii_csv,
                    utterances_file=utterances_file,
                    dictionary_path=dictionary_path,
                    output_file=OUTPUT,
                    lower_case_ratio=LOWER_CASE_RATIO,
                    num_of_examples=EXAMPLES,
                    ignore_types=IGNORE_TYPES,
                    keep_only_tagged=KEEP_ONLY_TAGGED,
                    span_to_tag=SPAN_TO_TAG)
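
The `SPAN_TO_TAG` flag asks for tokens and token labels in addition to character spans. As a rough illustration of what that conversion involves, the sketch below labels whitespace-separated tokens by checking which span each token overlaps. This is a simplification under stated assumptions: the helper `spans_to_token_tags` and the `(start, end, entity_type)` tuple layout are hypothetical, and a real tokenizer would split text differently than whitespace does.

```python
import re


def spans_to_token_tags(text, spans):
    """Label each whitespace token with the entity type of an overlapping span, else "O"."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, entity_type in spans:
            # A token takes a span's label when the two character ranges overlap.
            if match.start() < end and match.end() > start:
                tag = entity_type
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


text = "My name is David Brown"
# Hypothetical spans as (start, end, entity_type), with end exclusive.
spans = [(11, 16, "FIRST_NAME"), (17, 22, "LAST_NAME")]

tokens, tags = spans_to_token_tags(text, spans)
print(list(zip(tokens, tags)))
```

Tokens outside every span receive the conventional `O` (outside) label.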

To read a dataset file into the InputSample format, use `read_synth_dataset`:

In [None]:
input_samples = read_synth_dataset(OUTPUT)

In [None]:
input_samples[0]

Each input sample has the following full structure, which includes per-token feature values computed by spaCy:

In [None]:
input_samples[0].to_dict()

#### Verify randomness of dataset

In [None]:
from collections import Counter
# Count how many samples were generated from each template
count_per_template_id = Counter(sample.metadata['Template#'] for sample in input_samples)
for key in sorted(count_per_template_id):
    print("{}: {}".format(key, count_per_template_id[key]))

print(sum(count_per_template_id.values()))
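
Raw counts can be hard to judge by eye, so one option is to compare each template's share of the samples with the share expected under uniform sampling. The sketch below does this on a small hypothetical list of template ids standing in for `sample.metadata['Template#']`. Whether the generator is meant to sample templates uniformly is an assumption here, not something this notebook states.

```python
from collections import Counter

# Hypothetical template ids, standing in for sample.metadata['Template#'].
template_ids = [0, 1, 2, 1, 0, 2, 2, 1, 0, 1]

counts = Counter(template_ids)
total = sum(counts.values())
# Assumption: templates are expected to be drawn uniformly.
expected_share = 1 / len(counts)

for template_id in sorted(counts):
    share = counts[template_id] / total
    print(
        f"template {template_id}: {counts[template_id]} samples, "
        f"share {share:.2f} (uniform would be {expected_share:.2f})"
    )
```

With only a handful of samples per template, sizeable deviations from the uniform share are expected by chance, so this is a sanity check rather than a statistical test.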

#### Transform to the CoNLL structure:

In [None]:
from presidio_evaluator import InputSample

conll = InputSample.create_conll_dataset(input_samples)
conll.head(5)
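
For readers unfamiliar with CoNLL-style data, the idea is one row per token, with a column identifying the sentence each token belongs to. The sketch below builds such rows from tokenized sentences in plain Python. The function `to_conll_rows` and the column names `sentence`, `text`, and `label` are hypothetical and chosen for illustration; the columns returned by `create_conll_dataset` may differ.

```python
def to_conll_rows(sentences):
    """Flatten (tokens, tags) pairs into one row per token, keyed by sentence index."""
    rows = []
    for sentence_id, (tokens, tags) in enumerate(sentences):
        for token, tag in zip(tokens, tags):
            rows.append({"sentence": sentence_id, "text": token, "label": tag})
    return rows


# Hypothetical tokenized sentences with one tag per token.
sentences = [
    (["My", "name", "is", "David"], ["O", "O", "O", "FIRST_NAME"]),
    (["It's", "Brown"], ["O", "LAST_NAME"]),
]

rows = to_conll_rows(sentences)
for row in rows:
    print(row["sentence"], row["text"], row["label"], sep="\t")
```

Keeping the sentence index on every row makes it possible to regroup tokens into sentences later, for example when training a sequence tagger.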

#### Copyright notice:


Data generated for evaluation was created using Fake Name Generator.

Fake Name Generator identities by the [Fake Name Generator](https://www.fakenamegenerator.com/) 
are licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-sa/3.0/us/). Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.