In [8]:
from tqdm import tqdm_notebook as tqdm
from presidio_evaluator.data_generator.main import generate, read_synth_dataset

import datetime
import json

# Generate fake PII data using Presidio's data generator

Presidio's data generator creates a synthetic dataset from two prerequisites:
1. A fake PII CSV file (we used https://www.fakenamegenerator.com/).
2. A text file with template sentences or paragraphs. In this file, each PII entity placeholder is written in brackets. The name of the PII entity should be one of the columns in the fake PII CSV file.

The generator creates fake sentences from the provided fake PII CSV, a list of [extension functions](../presidio_evaluator/data_generator/extensions.py), and a few third-party libraries such as `Faker` and `haikunator`.


For example:
1. **A fake PII CSV**:

| FIRST_NAME  |  LAST_NAME  |  EMAIL |
|-------------|-------------|-----------|
| David       |  Brown      |  david.brown@jobhop.com |
| Mel         |  Brown      |  melb@hobjob.com |


2. **Templates**:

My name is [FIRST_NAME]

You can email me at [EMAIL]. Thanks, [FIRST_NAME]

What's your last name? It's [LAST_NAME]

Every time I see you falling I get down on my knees and pray
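
To make the placeholder mechanism concrete, here is a minimal sketch of filling a template from one row of the fake PII table. It is not Presidio's implementation: `fill_template`, `PLACEHOLDER`, and `FAKE_PII_ROWS` are hypothetical names introduced for illustration, and the sketch assumes every placeholder matches a column name exactly.

```python
import re

# Hypothetical rows standing in for the fake PII CSV shown above.
FAKE_PII_ROWS = [
    {"FIRST_NAME": "David", "LAST_NAME": "Brown", "EMAIL": "david.brown@jobhop.com"},
    {"FIRST_NAME": "Mel", "LAST_NAME": "Brown", "EMAIL": "melb@hobjob.com"},
]

# Matches placeholders such as [FIRST_NAME] and captures the entity name.
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill_template(template, row):
    """Replace each [ENTITY] placeholder with the matching column value.

    Returns the filled sentence and a list of
    (entity_type, value, start, end) spans, where start and end are
    character offsets into the filled sentence (end is exclusive).
    """
    parts = []
    spans = []
    cursor = 0  # current position in the template
    length = 0  # length of the output built so far
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)
        entity_type = match.group(1)
        value = row[entity_type]
        spans.append((entity_type, value, length, length + len(value)))
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


sentence, spans = fill_template(
    "You can email me at [EMAIL]. Thanks, [FIRST_NAME]", FAKE_PII_ROWS[0]
)
print(sentence)
print(spans)
```

Recording the character offsets while filling is what lets each generated sentence carry labeled spans without a separate annotation step. A template with no placeholders, like the last one above, simply yields an empty span list.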


### Generate files
Generate the dataset from these two prerequisites, a requested number of examples, and an output file name:

In [9]:
EXAMPLES = 3000
SPAN_TO_TAG = True  # Whether to create tokens + token labels (tags)
TEMPLATES_FILE = '../presidio_evaluator/data_generator/' \
                 'raw_data/conll_based_templates.txt'
KEEP_ONLY_TAGGED = False
LOWER_CASE_RATIO = 0.1
IGNORE_TYPES = {"IP_ADDRESS", 'US_SSN', 'URL'}

cur_time = datetime.date.today().strftime("%B_%d_%Y")

OUTPUT = "../data/generated_size_{}_date_{}.json".format(EXAMPLES, cur_time)

fake_pii_csv = '../presidio_evaluator/data_generator/' \
               'raw_data/FakeNameGenerator.com_3000.csv'
utterances_file = TEMPLATES_FILE
dictionary_path = None

examples = generate(fake_pii_csv=fake_pii_csv,
                    utterances_file=utterances_file,
                    dictionary_path=dictionary_path,
                    output_file=OUTPUT,
                    lower_case_ratio=LOWER_CASE_RATIO,
                    num_of_examples=EXAMPLES,
                    ignore_types=IGNORE_TYPES,
                    keep_only_tagged=KEEP_ONLY_TAGGED,
                    span_to_tag=SPAN_TO_TAG)

Preparing sample sentences for ingestion
Preparing fake PII data for ingestion
Generating address parts


  0%|          | 0/3000 [00:00<?, ?it/s]

Generating roles
Generating titles
Generating nationalities
Generating IBANs
Generating company names
Finished preparing fake PII data


100%|██████████| 3000/3000 [00:35<00:00, 83.52it/s]


generated 3000 examples
Finished creating generated dataset. File location:../data/generated_size_3000_date_April_03_2020.json


To read a dataset file into the InputSample format, use `read_synth_dataset`:

In [10]:
input_samples = read_synth_dataset(OUTPUT)

In [11]:
input_samples[0]

Full text: - Your Future Is Now in the hive of American pensions
Spans: [Type: ORGANIZATION, value: Your Future Is Now, start: 2, end: 20, Type: NATIONALITY, value: American, start: 36, end: 44]
Tokens: [-, Your, Future, Is, Now, in, the, hive, of, American, pensions]
Tags: ['O', 'B-ORGANIZATION', 'I-ORGANIZATION', 'I-ORGANIZATION', 'L-ORGANIZATION', 'O', 'O', 'O', 'O', 'U-NATIONALITY', 'O']
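
The tags above follow the BILUO scheme (Begin, Inside, Last, Unit, Outside). The sketch below shows one way character spans could be turned into such tags. It is only an illustration: it uses plain whitespace tokenization rather than the spaCy tokenizer the library relies on, and `whitespace_tokens` and `spans_to_biluo` are hypothetical helpers, not part of `presidio_evaluator`.

```python
def whitespace_tokens(text):
    """Split on whitespace, returning (token, start, end) triples."""
    tokens = []
    position = 0
    for token in text.split():
        start = text.index(token, position)
        end = start + len(token)
        tokens.append((token, start, end))
        position = end
    return tokens


def spans_to_biluo(text, spans):
    """Tag each token given spans of (entity_type, start, end), end exclusive."""
    tokens = whitespace_tokens(text)
    tags = ["O"] * len(tokens)
    for entity_type, span_start, span_end in spans:
        # Indices of tokens that lie entirely inside the span.
        inside = [
            i for i, (_, start, end) in enumerate(tokens)
            if start >= span_start and end <= span_end
        ]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + entity_type
        else:
            tags[inside[0]] = "B-" + entity_type
            for i in inside[1:-1]:
                tags[i] = "I-" + entity_type
            tags[inside[-1]] = "L-" + entity_type
    return tags


text = "- Your Future Is Now in the hive of American pensions"
spans = [("ORGANIZATION", 2, 20), ("NATIONALITY", 36, 44)]
print(spans_to_biluo(text, spans))
```

For this sentence, whitespace tokens happen to align with the span boundaries. Text with punctuation attached to an entity would need a real tokenizer to align correctly.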

The full structure of each input sample is shown below. It includes per-token feature values computed by spaCy.

In [12]:
input_samples[0].to_dict()

{'full_text': '- Your Future Is Now in the hive of American pensions',
 'masked': None,
 'spans': [{'entity_type': 'ORGANIZATION',
   'entity_value': 'Your Future Is Now',
   'start_position': 2,
   'end_position': 20},
  {'entity_type': 'NATIONALITY',
   'entity_value': 'American',
   'start_position': 36,
   'end_position': 44}],
 'tokens': [{'text': '-',
   'idx': 0,
   'tag_': ':',
   'pos_': 'PUNCT',
   'dep_': 'punct',
   'lemma_': '-',
   '_': {'is_in_vocabulary': False}},
  {'text': 'Your',
   'idx': 2,
   'tag_': 'PRP$',
   'pos_': 'DET',
   'dep_': 'poss',
   'lemma_': '-PRON-',
   '_': {'is_in_vocabulary': False}},
  {'text': 'Future',
   'idx': 7,
   'tag_': 'NN',
   'pos_': 'NOUN',
   'dep_': 'nsubj',
   'lemma_': 'future',
   '_': {'is_in_vocabulary': False}},
  {'text': 'Is',
   'idx': 14,
   'tag_': 'VBZ',
   'pos_': 'AUX',
   'dep_': 'ROOT',
   'lemma_': 'be',
   '_': {'is_in_vocabulary': False}},
  {'text': 'Now',
   'idx': 17,
   'tag_': 'RB',
   'pos_': 'ADV',
   'd

#### Verify randomness of dataset

In [13]:
from collections import Counter

# Count how many generated samples came from each template ID
count_per_template_id = Counter(sample.metadata['Template#'] for sample in input_samples)
for key in sorted(count_per_template_id):
    print("{}: {}".format(key, count_per_template_id[key]))

# The total equals the number of samples read
print(sum(count_per_template_id.values()))

1: 1
3: 1
4: 1
5: 1
10: 2
13: 2
16: 2
24: 1
30: 1
36: 1
39: 1
41: 1
43: 1
45: 2
52: 1
55: 1
65: 1
66: 1
67: 1
68: 1
69: 1
70: 1
71: 1
83: 1
86: 1
88: 1
91: 1
94: 2
96: 1
97: 1
100: 1
101: 2
102: 1
104: 1
106: 1
110: 3
111: 1
112: 1
114: 1
119: 1
124: 1
127: 1
128: 1
129: 1
130: 3
133: 1
137: 1
139: 1
140: 1
142: 1
143: 2
146: 1
148: 1
152: 1
157: 2
158: 2
160: 1
162: 1
166: 1
170: 1
179: 1
186: 2
189: 1
192: 2
194: 1
196: 1
197: 1
198: 1
199: 1
200: 1
204: 1
214: 1
217: 1
219: 1
222: 1
228: 1
229: 1
231: 1
241: 1
247: 1
250: 1
255: 1
266: 1
267: 1
272: 1
281: 3
282: 2
284: 1
287: 2
291: 1
292: 1
293: 1
295: 1
298: 1
299: 1
305: 1
310: 1
315: 1
317: 1
318: 1
319: 1
324: 1
325: 1
327: 1
328: 1
329: 1
330: 1
332: 3
333: 1
336: 1
338: 1
341: 1
346: 1
348: 1
351: 1
353: 1
360: 1
362: 1
363: 2
365: 1
366: 1
371: 1
374: 2
377: 1
378: 1
381: 1
384: 1
385: 2
392: 1
396: 1
397: 1
399: 2
410: 1
413: 1
414: 1
416: 1
418: 1
421: 1
426: 1
434: 1
437: 1
444: 1
447: 1
450: 1
451: 1
453: 2
454: 1
456: 

4491: 1
4498: 1
4503: 1
4504: 2
4509: 1
4511: 1
4513: 1
4518: 1
4521: 1
4527: 2
4529: 1
4537: 1
4539: 1
4540: 1
4542: 2
4550: 1
4551: 1
4553: 1
4560: 1
4562: 1
4564: 1
4571: 1
4572: 1
4573: 2
4574: 1
4577: 1
4578: 1
4580: 1
4581: 1
4582: 1
4583: 1
4590: 2
4591: 1
4592: 1
4597: 2
4601: 1
4604: 1
4615: 1
4617: 1
4618: 1
4620: 2
4633: 1
4636: 1
4640: 1
4646: 1
4647: 1
4648: 2
4650: 1
4651: 1
4654: 1
4658: 1
4659: 1
4662: 1
4665: 1
4673: 1
4675: 1
4676: 2
4679: 1
4692: 1
4700: 1
4702: 1
4704: 1
4705: 1
4708: 1
4716: 1
4721: 1
4724: 1
4726: 3
4732: 1
4733: 1
4736: 1
4739: 1
4740: 1
4744: 2
4745: 2
4754: 2
4755: 1
4762: 1
4765: 1
4766: 2
4769: 1
4770: 1
4779: 1
4787: 1
4794: 2
4796: 1
4797: 1
4798: 1
4805: 1
4806: 1
4808: 2
4810: 2
4811: 1
4812: 1
4818: 3
4823: 1
4829: 1
4831: 1
4832: 1
4835: 1
4841: 1
4844: 1
4846: 1
4849: 1
4853: 1
4857: 1
4871: 1
4873: 1
4874: 2
4877: 1
4880: 1
4884: 1
4885: 2
4888: 1
4892: 1
4894: 3
4898: 1
4903: 1
4905: 2
4909: 2
4911: 1
4912: 1
4914: 1
4920: 2
4921: 1
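
A per-template listing this long is hard to judge by eye. As a rough sketch, the same counts can be condensed into a few summary figures. `summarize_counts` is a hypothetical helper, and the IDs below are randomly drawn stand-ins (assuming 5000 templates, a number chosen only for illustration) rather than the notebook's actual `input_samples`.

```python
import random
from collections import Counter


def summarize_counts(template_ids):
    """Condense per-template usage into a few summary figures."""
    counts = Counter(template_ids)
    # How many templates were used once, twice, three times, ...
    histogram = Counter(counts.values())
    return {
        "samples": sum(counts.values()),
        "distinct_templates": len(counts),
        "max_per_template": max(counts.values()),
        "templates_by_frequency": dict(sorted(histogram.items())),
    }


# Hypothetical stand-in data: 3000 draws from 5000 assumed template IDs.
rng = random.Random(0)
hypothetical_ids = [rng.randint(1, 5000) for _ in range(3000)]
summary = summarize_counts(hypothetical_ids)
print(summary)
```

With the real data, the same function could be called on `[sample.metadata['Template#'] for sample in input_samples]`. A low maximum and many distinct templates suggest the sampling is not concentrated on a few templates.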


#### Transform to the CoNLL structure:

In [14]:
from presidio_evaluator import InputSample

conll = InputSample.create_conll_dataset(input_samples)
conll.head(5)

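| Unnamed: 0 | text | pos | tag | Template# | gender | country | label | sentence |
|------------|------|-----|-----|-----------|--------|---------|-------|----------|
| 0 | - | PUNCT | : | 1529 | male | Indonesia | O | 0 |
| 1 | Your | DET | PRP$ | 1529 | male | Indonesia | B-ORG | 0 |
| 2 | Future | NOUN | NN | 1529 | male | Indonesia | I-ORG | 0 |
| 3 | Is | AUX | VBZ | 1529 | male | Indonesia | I-ORG | 0 |
| 4 | Now | ADV | RB | 1529 | male | Indonesia | I-ORG | 0 |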
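
Note that the `label` column above uses BIO-style labels with a shortened entity name (`I-ORG` for "Now"), whereas the earlier sample output tagged the same token `L-ORGANIZATION`. The sketch below shows one way such a conversion could work. It is an assumption about the mapping, not the code behind `create_conll_dataset`: `biluo_to_bio` and `SHORT_NAMES` are hypothetical, and only the `ORGANIZATION` to `ORG` renaming is taken from the output shown here.

```python
# Assumed short names; only ORGANIZATION -> ORG appears in the output above.
SHORT_NAMES = {"ORGANIZATION": "ORG"}


def biluo_to_bio(tags, short_names=SHORT_NAMES):
    """Convert BILUO tags to BIO: U becomes B, L becomes I."""
    bio = []
    for tag in tags:
        if tag == "O":
            bio.append(tag)
            continue
        prefix, entity_type = tag.split("-", 1)
        prefix = {"U": "B", "L": "I"}.get(prefix, prefix)
        bio.append(prefix + "-" + short_names.get(entity_type, entity_type))
    return bio


biluo = ['O', 'B-ORGANIZATION', 'I-ORGANIZATION', 'I-ORGANIZATION', 'L-ORGANIZATION']
print(biluo_to_bio(biluo))
```

Entity types without an entry in `SHORT_NAMES` keep their original name in this sketch.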


#### Copyright notice:


Data generated for evaluation was created using Fake Name Generator.

Fake Name Generator identities by the [Fake Name Generator](https://www.fakenamegenerator.com/) 
are licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-sa/3.0/us/). Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.
