In [1]:
from tqdm import tqdm_notebook as tqdm
from presidio_evaluator.data_generator.main import generate, read_synth_dataset

import datetime
import json

# Generate fake PII data using Presidio's data generator

Presidio's data generator creates a synthetic dataset from two prerequisites:
1. A fake PII CSV file (we used https://www.fakenamegenerator.com/).
2. A text file with template sentences or paragraphs. In this file, each PII entity placeholder is written in brackets, and the placeholder name must match one of the columns in the fake PII CSV file.

The generator builds fake sentences from the provided fake PII CSV, a list of [extension functions](../presidio_evaluator/data_generator/extensions.py), and a few third-party libraries such as `Faker` and `haikunator`.


For example:
1. **A fake PII CSV**:

| FIRST_NAME  |  LAST_NAME  |  EMAIL |
|-------------|-------------|-----------|
| David       |  Brown      |  david.brown@jobhop.com |
| Mel         |  Brown      |  melb@hobjob.com |


2. **Templates**:

My name is [FIRST_NAME]

You can email me at [EMAIL]. Thanks, [FIRST_NAME]

What's your last name? It's [LAST_NAME]

Every time I see you falling I get down on my knees and pray
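To make the placeholder mechanism concrete, here is a minimal sketch of how a bracketed template could be filled from one row of the fake PII CSV while recording the character span of each inserted value. This is not Presidio's implementation: `fill_template`, `PLACEHOLDER`, and `fake_record` are hypothetical names used only for illustration, and the span keys simply mirror the field names that appear in the dataset output later in this notebook.

```python
import re

# Hypothetical stand-in for one row of the fake PII CSV.
fake_record = {
    "FIRST_NAME": "David",
    "LAST_NAME": "Brown",
    "EMAIL": "david.brown@jobhop.com",
}

# Matches placeholders such as [FIRST_NAME] or [EMAIL].
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill_template(template, record):
    """Replace each [ENTITY] placeholder and record where its value lands."""
    parts, spans = [], []
    cursor = 0   # position in the template
    out_len = 0  # length of the text produced so far
    for match in PLACEHOLDER.finditer(template):
        # Copy the literal text that precedes the placeholder.
        literal = template[cursor:match.start()]
        parts.append(literal)
        out_len += len(literal)

        # Insert the fake value and remember its character offsets.
        entity = match.group(1)
        value = record[entity]
        spans.append({
            "entity_type": entity,
            "entity_value": value,
            "start_position": out_len,
            "end_position": out_len + len(value),
        })
        parts.append(value)
        out_len += len(value)
        cursor = match.end()

    # Copy whatever follows the last placeholder.
    parts.append(template[cursor:])
    return "".join(parts), spans


text, spans = fill_template(
    "You can email me at [EMAIL]. Thanks, [FIRST_NAME]", fake_record
)
print(text)
for span in spans:
    print(span)
```

A template without placeholders, such as the last one above, passes through unchanged and yields an empty span list.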


### Generate files
Given these two prerequisites, a requested number of examples, and an output file name, `generate` writes the synthetic dataset to a JSON file:

In [2]:
EXAMPLES = 100
SPAN_TO_TAG = True  # Whether to create tokens + token labels (tags)
TEMPLATES_FILE = '../presidio_evaluator/data_generator/' \
                 'raw_data/templates.txt'
KEEP_ONLY_TAGGED = False
LOWER_CASE_RATIO = 0.1
IGNORE_TYPES = {"IP_ADDRESS", "US_SSN", "URL"}

cur_time = datetime.date.today().strftime("%B_%d_%Y")

OUTPUT = "../data/generated_size_{}_date_{}.json".format(EXAMPLES, cur_time)

fake_pii_csv = '../presidio_evaluator/data_generator/' \
               'raw_data/FakeNameGenerator.com_3000.csv'
utterances_file = TEMPLATES_FILE
dictionary_path = None

examples = generate(fake_pii_csv=fake_pii_csv,
                    utterances_file=utterances_file,
                    dictionary_path=dictionary_path,
                    output_file=OUTPUT,
                    lower_case_ratio=LOWER_CASE_RATIO,
                    num_of_examples=EXAMPLES,
                    ignore_types=IGNORE_TYPES,
                    keep_only_tagged=KEEP_ONLY_TAGGED,
                    span_to_tag=SPAN_TO_TAG)

Preparing sample sentences for ingestion
Preparing fake PII data for ingestion
Generating address parts
Generating roles
Generating titles
Generating nationalities
Generating IBANs
Generating company names
Finished preparing fake PII data
loading model en_core_web_lg
generated 100 examples
Finished creating generated dataset. File location:../data/generated_size_100_date_March_08_2020.json


100%|██████████| 100/100 [00:11<00:00,  8.34it/s]


To read a dataset file into a list of `InputSample` objects, use `read_synth_dataset`:

In [3]:
input_samples = read_synth_dataset(OUTPUT)

In [4]:
input_samples[0]

Full text: Stm Auto Parts is the brainchild of our 3 founders: Hanna Mattila, Royale Moïse and Emmie Ström.  The idea was born (on the beach) while they were constructing a website to be the basis of another start-up idea.
Spans: [Type: ORGANIZATION, value: Stm Auto Parts, start: 0, end: 14, Type: PERSON, value: Hanna Mattila, start: 52, end: 65, Type: PERSON, value: Royale Moïse, start: 67, end: 79, Type: PERSON, value: Emmie Ström, start: 84, end: 95]
Tokens: [Stm, Auto, Parts, is, the, brainchild, of, our, 3, founders, :, Hanna, Mattila, ,, Royale, Moïse, and, Emmie, Ström, .,  , The, idea, was, born, (, on, the, beach, ), while, they, were, constructing, a, website, to, be, the, basis, of, another, start, -, up, idea, .]
Tags: ['B-ORGANIZATION', 'I-ORGANIZATION', 'L-ORGANIZATION', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'B-PERSON', 'L-PERSON', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O
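The tags above follow a B/I/L/O pattern: the first token of an entity is tagged `B-`, the last `L-`, any tokens in between `I-`, and everything else `O`. The sketch below shows one way character spans could be projected onto tokens to produce such tags. It is an illustration only, not the code the generator uses: `bilou_tags` is a hypothetical helper, tokens are given as `(text, start_index)` pairs, and the `U-` prefix for single-token entities is an assumption borrowed from the common BILOU/BILUO convention, since no single-token entity appears in the output shown.

```python
def bilou_tags(tokens, spans):
    """Project character spans onto tokens.

    tokens: list of (text, start_index) pairs
    spans:  list of (entity_type, start, end) triples, end exclusive
    """
    tags = ["O"] * len(tokens)
    for entity_type, start, end in spans:
        # Indices of the tokens that lie fully inside the span.
        inside = [
            i for i, (text, idx) in enumerate(tokens)
            if idx >= start and idx + len(text) <= end
        ]
        if not inside:
            continue
        if len(inside) == 1:
            # Assumed convention for single-token entities.
            tags[inside[0]] = "U-" + entity_type
            continue
        tags[inside[0]] = "B-" + entity_type
        for i in inside[1:-1]:
            tags[i] = "I-" + entity_type
        tags[inside[-1]] = "L-" + entity_type
    return tags


# First tokens of the sample above, with their character offsets.
tokens = [("Stm", 0), ("Auto", 4), ("Parts", 9),
          ("is", 15), ("the", 18), ("brainchild", 22)]
spans = [("ORGANIZATION", 0, 14)]

print(bilou_tags(tokens, spans))
```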

Each `InputSample` has the following full structure, which includes per-token feature values computed by spaCy:

In [5]:
input_samples[0].to_dict()

{'full_text': 'Stm Auto Parts is the brainchild of our 3 founders: Hanna Mattila, Royale Moïse and Emmie Ström.  The idea was born (on the beach) while they were constructing a website to be the basis of another start-up idea.',
 'masked': None,
 'spans': [{'entity_type': 'ORGANIZATION',
   'entity_value': 'Stm Auto Parts',
   'start_position': 0,
   'end_position': 14},
  {'entity_type': 'PERSON',
   'entity_value': 'Hanna Mattila',
   'start_position': 52,
   'end_position': 65},
  {'entity_type': 'PERSON',
   'entity_value': 'Royale Moïse',
   'start_position': 67,
   'end_position': 79},
  {'entity_type': 'PERSON',
   'entity_value': 'Emmie Ström',
   'start_position': 84,
   'end_position': 95}],
 'tokens': [{'text': 'Stm',
   'idx': 0,
   'tag_': 'NNP',
   'pos_': 'PROPN',
   'dep_': 'compound',
   'lemma_': 'Stm',
   '_': {'is_in_vocabulary': False}},
  {'text': 'Auto',
   'idx': 4,
   'tag_': 'NNP',
   'pos_': 'PROPN',
   'dep_': 'compound',
   'lemma_': 'Auto',
   '_': {'is_in

#### Verify randomness of dataset

In [6]:
from collections import Counter
# Count how many generated samples came from each template
count_per_template_id = Counter(sample.metadata['Template#'] for sample in input_samples)
for key in sorted(count_per_template_id):
    print("{}: {}".format(key, count_per_template_id[key]))

print(sum(count_per_template_id.values()))

0: 1
1: 1
3: 1
4: 1
5: 1
7: 1
9: 2
10: 1
11: 1
15: 1
18: 2
19: 1
20: 1
22: 2
23: 2
24: 1
25: 1
29: 1
30: 1
32: 1
35: 2
38: 3
39: 2
40: 1
42: 2
44: 1
47: 3
50: 1
53: 1
56: 1
59: 1
61: 1
63: 2
64: 2
67: 3
70: 1
71: 1
72: 2
74: 1
75: 2
77: 1
78: 1
79: 2
80: 1
82: 2
83: 1
84: 1
86: 1
87: 1
89: 3
90: 1
91: 2
92: 2
94: 2
95: 1
97: 2
98: 1
102: 2
103: 1
107: 1
108: 1
109: 1
111: 1
114: 1
116: 1
117: 2
119: 2
120: 1
122: 1
124: 1
125: 2
100
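The per-template counts printed above can be condensed into a few summary numbers that are easier to eyeball: how many distinct templates were used, and what share of the samples came from the most frequent one. The sketch below uses a short, made-up list of template IDs (`template_ids` is hypothetical, not the notebook's data) purely to show the calculation.

```python
from collections import Counter

# Hypothetical template IDs, one per generated sample.
template_ids = [0, 1, 3, 9, 9, 38, 38, 38, 47, 47]

counts = Counter(template_ids)
n_samples = sum(counts.values())
n_templates = len(counts)

# The single most frequent template and its share of all samples.
most_common_id, most_common_count = counts.most_common(1)[0]
max_share = most_common_count / n_samples

print("samples:", n_samples)
print("distinct templates:", n_templates)
print("most frequent template:", most_common_id)
print("share of most frequent template:", max_share)
```

A very large share for one template, or very few distinct templates relative to the number of samples, would suggest that sampling is less varied than intended.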


#### Transform to the CoNLL structure:

In [8]:
from presidio_evaluator import InputSample

conll = InputSample.create_conll_dataset(input_samples)
conll.head(5)

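| Unnamed: 0 | text  | pos   | tag  | Template# | gender | country | label | sentence |
|------------|-------|-------|------|-----------|--------|---------|-------|----------|
| 0          | Stm   | PROPN | NNP  | 117       | female | Ukraine | B-ORG | 0        |
| 1          | Auto  | PROPN | NNP  | 117       | female | Ukraine | I-ORG | 0        |
| 2          | Parts | PROPN | NNPS | 117       | female | Ukraine | I-ORG | 0        |
| 3          | is    | AUX   | VBZ  | 117       | female | Ukraine | O     | 0        |
| 4          | the   | DET   | DT   | 117       | female | Ukraine | O     | 0        |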
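The CoNLL-style layout puts one token on each row and carries a sentence index so that rows can be grouped back into samples. As a rough sketch of that flattening step, assuming each sample is a plain dictionary with `tokens`, `tags`, and `metadata` keys (a simplification of `InputSample`, and not how `create_conll_dataset` is actually implemented), the rows could be built like this. `to_conll_rows` and the second sample are hypothetical.

```python
def to_conll_rows(samples):
    """Flatten samples into one row per token, tagged with a sentence index."""
    rows = []
    for sentence_id, sample in enumerate(samples):
        for token, tag in zip(sample["tokens"], sample["tags"]):
            rows.append({
                "text": token,
                "label": tag,
                "Template#": sample["metadata"]["Template#"],
                "sentence": sentence_id,
            })
    return rows


# Two simplified, hypothetical samples.
samples = [
    {
        "tokens": ["Stm", "Auto", "Parts", "is", "the"],
        "tags": ["B-ORG", "I-ORG", "I-ORG", "O", "O"],
        "metadata": {"Template#": 117},
    },
    {
        "tokens": ["My", "name", "is", "David"],
        "tags": ["O", "O", "O", "B-PER"],
        "metadata": {"Template#": 0},
    },
]

for row in to_conll_rows(samples):
    print(row)
```

A list of dictionaries in this shape can be passed directly to `pandas.DataFrame` if a tabular view like the one above is wanted.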


#### Copyright notice:


Data generated for evaluation was created using Fake Name Generator.

Fake Name Generator identities by the [Fake Name Generator](https://www.fakenamegenerator.com/) 
are licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-sa/3.0/us/). Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.
