PresidioSentenceFaker #50

Merged
Changes from 25 commits
37 commits
56ab9a5
Map DOMAIN_NAME entity to URL
Robbie-Palmer Aug 2, 2022
f538bc5
Add PresidioFakeRecordGenerator class
Robbie-Palmer Aug 3, 2022
6e8eec3
Fix bug in PresidioAnalyzerWrapper where 'en' is always the chosen la…
Robbie-Palmer Aug 3, 2022
5bfca66
Update PresidioAnalyzerWrapper to use the provided language in the de…
Robbie-Palmer Aug 3, 2022
79f796a
Format span_to_tag.py
Robbie-Palmer Aug 3, 2022
146dc63
Merge branch 'microsoft:master' into fake-record-generator
Robbie-Palmer Aug 5, 2022
6863e19
Map DOMAIN_NAME entity to URL
Robbie-Palmer Aug 2, 2022
b5efd1e
Add PresidioFakeRecordGenerator class
Robbie-Palmer Aug 3, 2022
04a43fe
Fix bug in PresidioAnalyzerWrapper where 'en' is always the chosen la…
Robbie-Palmer Aug 3, 2022
7818b28
Update PresidioAnalyzerWrapper to use the provided language in the de…
Robbie-Palmer Aug 3, 2022
279ca75
Format span_to_tag.py
Robbie-Palmer Aug 3, 2022
ddf8e72
Fix python3.7 support for getting raw data dir path
Robbie-Palmer Dec 7, 2022
a385403
Strip whitespace from ends of template files in PresidioDataGenerator
Robbie-Palmer Dec 7, 2022
6aede14
Test PresidioFakeRecordGenerator
Robbie-Palmer Dec 7, 2022
e6c09fd
Fix mutable default argument problem in PresidioFakeRecordGenerator
Robbie-Palmer Dec 7, 2022
4afd7c0
Unit test PresidioFakeRecordGenerator
Robbie-Palmer Dec 7, 2022
76c24b7
Expose ReligionProvider from faker_extensions package
Robbie-Palmer Dec 7, 2022
fd7bc88
Format tests/__init__.py
Robbie-Palmer Dec 7, 2022
c5e22f3
Add missing religions.csv and us_driver_license_format.yaml to packag…
Robbie-Palmer Dec 7, 2022
c0ea8da
Fix UsDriverLicenseProvider to provide us_driver_license entity
Robbie-Palmer Dec 7, 2022
798c9e5
Simplify Generate_data notebook by using PresidioFakeRecordGenerator
Robbie-Palmer Dec 7, 2022
e2f1e34
Update Data Generator README to include PresidioFakeRecordGenerator u…
Robbie-Palmer Dec 7, 2022
6731399
Merge remote-tracking branch 'origin/fake-record-generator' into fake…
Robbie-Palmer Dec 7, 2022
b7fbec1
Merge branch 'master' into fake-record-generator
omri374 Dec 18, 2022
7fc5bfe
Fix grammar in 1_Generate_data.ipynb
Robbie-Palmer Dec 20, 2022
407718d
Make it possible to use PresidioFakeRecordGenerator without the defau…
Robbie-Palmer Jan 3, 2023
57d3279
Merge remote-tracking branch 'origin/fake-record-generator' into fake…
Robbie-Palmer Jan 3, 2023
ae56da6
Merge remote-tracking branch 'upstream/master' into fake-record-gener…
Robbie-Palmer Jan 3, 2023
ce59376
Add Optional type annotations to parameters
Robbie-Palmer Jan 17, 2023
8041123
Rename PresidioDataGenerator to SentenceFaker
Robbie-Palmer Jan 17, 2023
97049ee
Minimize the responsibilities of SentenceFaker
Robbie-Palmer Jan 17, 2023
63dfa38
Move SentenceFaker into `faker_extensions` package
Robbie-Palmer Jan 17, 2023
3c1cc48
Fix imports
Robbie-Palmer Jan 17, 2023
ba327e8
Rename presidio_data_generator.py to presidio_sentence_faker.py
Robbie-Palmer Jan 17, 2023
a30342a
Fix 1_Generate_data.ipynb
Robbie-Palmer Jan 17, 2023
9795d8c
Add support for providing your own base records for PresidioSentenceF…
Robbie-Palmer Jan 17, 2023
c36f722
Fix SentenceFaker docstring
Robbie-Palmer Jan 20, 2023
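The commit list above mentions a mutable default argument fix and a later switch to Optional type annotations. The sketch below illustrates that general class of Python bug and the usual `None`-sentinel remedy. It is not the code from this PR: the function names `make_aliases_shared` and `make_aliases` and the alias values are hypothetical, chosen only for illustration.

```python
from typing import Dict, Optional


def make_aliases_shared(extra: Dict[str, str] = {}) -> Dict[str, str]:
    # Pitfall: the default dict is created once, at function definition time,
    # so every call without an argument receives the same object.
    extra.setdefault("name", "person")
    return extra


def make_aliases(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    # Remedy: use None as the sentinel and build a fresh dict on every call.
    aliases = dict(extra) if extra is not None else {}
    aliases.setdefault("name", "person")
    return aliases


# Hypothetical usage: mutate the result of one call, then call again.
first = make_aliases_shared()
first["date_of_birth"] = "birthday"
second = make_aliases_shared()  # same object as `first`, mutation leaks

fresh_a = make_aliases()
fresh_a["date_of_birth"] = "birthday"
fresh_b = make_aliases()  # independent dict, no leak
```

With the `None` sentinel, state added by one caller no longer appears in the defaults seen by the next caller.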
246 changes: 58 additions & 188 deletions notebooks/1_Generate_data.ipynb
@@ -19,18 +19,7 @@
"import tqdm\n",
"\n",
"from presidio_evaluator import InputSample\n",
"from presidio_evaluator.data_generator import PresidioDataGenerator\n",
"from presidio_evaluator.data_generator.faker_extensions import (\n",
" FakerSpansResult,\n",
" RecordsFaker,\n",
" IpAddressProvider,\n",
" NationalityProvider,\n",
" OrganizationProvider,\n",
" UsDriverLicenseProvider,\n",
" AgeProvider,\n",
" AddressProviderNew,\n",
" PhoneNumberProviderNew,\n",
")"
"from presidio_evaluator.data_generator import PresidioDataGenerator, PresidioFakeRecordGenerator"
]
},
{
@@ -98,18 +87,12 @@
"source": [
"## Generate a full dataset\n",
"\n",
"In this example we customize the data generator to:\n",
"1. Accept more types of entities (by adding more providers to Faker. see [Faker's documentation](https://faker.readthedocs.io/en/master/index.html#how-to-create-a-provider)\n",
"In this example we use the `PresidioFakeRecordGenerator` which extends the `PresidioDataGenerator` to:\n",
"1. Accept more types of entities (by adding more providers to Faker. see [Faker's documentation](https://faker.readthedocs.io/en/master/index.html#how-to-create-a-provider))\n",
"2. Handle records of multiple PII entities per fake person for a more realistic dataset\n",
"3. Translate the generated entity types to match Presidio's\n",
"\n",
"We then translate the generated entity types to match Presidio's, and save the new dataset in json and CONLL03 formats."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"a. Specify parameters:"
"We then save the new dataset in json and CONLL03 formats."
]
},
{
@@ -121,16 +104,12 @@
"outputs": [],
"source": [
"number_of_samples = 1500\n",
"lower_case_ratio = 0.05\n",
"locale = 'en'\n",
"cur_time = datetime.date.today().strftime(\"%B_%d_%Y\")\n",
"\n",
"raw_data_path = Path(\"../presidio_evaluator/data_generator/raw_data\")\n",
"output_file = f\"../data/generated_size_{number_of_samples}_date_{cur_time}.json\"\n",
"output_conll = f\"../data/generated_size_{number_of_samples}_date_{cur_time}.tsv\"\n",
"\n",
"templates_file_path = Path(raw_data_path, \"templates.txt\").resolve()\n",
"fake_name_generator_file = Path(raw_data_path, \"FakeNameGenerator.com_3000.csv\").resolve()\n",
"\n",
"lower_case_ratio = 0.05"
"output_conll = f\"../data/generated_size_{number_of_samples}_date_{cur_time}.tsv\""
]
},
{
@@ -141,58 +120,40 @@
}
},
"source": [
"b. Read [FakeNameGenerator](https://www.fakenamegenerator.com/) data (optional, extends the set of fake values)\n",
"and create a `RecordsFaker` which returns a fake person record (with multiple values) instead of one value,\n",
"The `PresidioFakeRecordGenerator` loads [FakeNameGenerator](https://www.fakenamegenerator.com/) data to extend the set of fake values\n",
"and creates a `RecordsFaker` which returns a fake person record (with multiple values) instead of one value,\n",
"allowing dependencies between values belonging to the same fake person\n",
"(e.g. name = Michael Smith with the email michael.smith@gmail.com).\n",
"\n",
"The `fake_name_generator_file` can be downloaded from https://www.fakenamegenerator.com/order.php\n",
"The `fake_name_generator_file` is included in the presidio_evaluator package and can be sourced from https://www.fakenamegenerator.com/order.php\n",
"\n",
"> Note that you can create fake records for multiple name sets, allowing you to adapt the fake data to the real data if needed. "
"> Note that by using the lower-level `PresidioDataGenerator` and `RecordsFaker` classes, you can create fake records for multiple name sets, allowing you to adapt the fake data to the real data if needed. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Read FakeNameGenerator CSV\n",
"fake_name_generator_df = pd.read_csv(fake_name_generator_file)\n",
"\n",
"# Update to match existing templates\n",
"fake_name_generator_df = PresidioDataGenerator.update_fake_name_generator_df(fake_name_generator_df)\n",
"fake_name_generator_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": [
"c. Create a Faker object (in this case, a `RecordsFaker`)"
"record_generator = PresidioFakeRecordGenerator(locale, lower_case_ratio)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"metadata": {},
"outputs": [],
"source": [
"# Create RecordsFaker (extension which handles records instead of independent values) and add additional specific providers\n",
"fake = RecordsFaker(records=fake_name_generator_df, locale=\"en_US\")"
"pd.DataFrame(record_generator._data_generator.faker.records).head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"d. Add more providers, not part of the original Faker package"
"`PresidioFakeRecordGenerator` adds additional providers by default, which are not included in the Faker package.\n",
"These can be found in `presidio_evaluator.data_generator.faker_extensions.providers`"
]
},
{
@@ -201,13 +162,16 @@
"metadata": {},
"outputs": [],
"source": [
"fake.add_provider(IpAddressProvider) # Both Ipv4 and IPv6 IP addresses\n",
"fake.add_provider(NationalityProvider) # Read countries + nationalities from file\n",
"fake.add_provider(OrganizationProvider) # Read organization names from file\n",
"fake.add_provider(UsDriverLicenseProvider) # Read US driver license numbers from file\n",
"fake.add_provider(AgeProvider) # Age values (unavailable on Faker)\n",
"fake.add_provider(AddressProviderNew) # Extend the default address formats\n",
"fake.add_provider(PhoneNumberProviderNew) # Extend the default phone number formats"
"from presidio_evaluator.data_generator.faker_extensions.providers import *\n",
"\n",
"IpAddressProvider # Both Ipv4 and IPv6 IP addresses\n",
"NationalityProvider # Read countries + nationalities from file\n",
"OrganizationProvider # Read organization names from file\n",
"UsDriverLicenseProvider # Read US driver license numbers from file\n",
"AgeProvider # Age values (unavailable on Faker)\n",
"AddressProviderNew # Extend the default address formats\n",
"PhoneNumberProviderNew # Extend the default phone number formats\n",
"ReligionProvider # Read religions from file"
]
},
{
@@ -218,7 +182,7 @@
}
},
"source": [
"e. Create the Presidio Data Generator object and add provider aliases if the templates have a different entity name than the Faker object"
"`PresidioFakeRecordGenerator.PROVIDER_ALIASES` can be extended to add additional provider aliases for when templates have a different entity name than the Faker object"
]
},
{
@@ -231,47 +195,28 @@
},
"outputs": [],
"source": [
"# Create Presidio Data Generator\n",
"data_generator = PresidioDataGenerator(\n",
" custom_faker=fake, lower_case_ratio=lower_case_ratio\n",
")\n",
"\n",
"# Create entity aliases (e.g. if faker supports \"name\" but templates contain \"person\").\n",
"data_generator.add_provider_alias(provider_name=\"name\", new_name=\"person\")\n",
"data_generator.add_provider_alias(\n",
" provider_name=\"credit_card_number\", new_name=\"credit_card\"\n",
")\n",
"data_generator.add_provider_alias(provider_name=\"date_of_birth\", new_name=\"birthday\")"
"PresidioFakeRecordGenerator.PROVIDER_ALIASES"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
},
"scrolled": true
}
},
"source": [
"f. Generate data"
"Generate data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"metadata": {},
"outputs": [],
"source": [
"sentence_templates = PresidioDataGenerator.read_template_file(templates_file_path)\n",
"fake_records = data_generator.generate_fake_data(\n",
" templates=sentence_templates, n_samples=number_of_samples\n",
")\n",
"\n",
"fake_records = list(fake_records)\n",
"fake_records = record_generator.generate_new_fake_records(num_samples=number_of_samples)\n",
"pprint.pprint(fake_records[0])"
]
},
@@ -326,85 +271,17 @@
"count_per_entity"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"#### Translate tags from Faker's to Presidio's (optional)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"metadata": {},
"outputs": [],
"source": [
"translator = {\n",
" \"person\": \"PERSON\",\n",
" \"ip_address\": \"IP_ADDRESS\",\n",
" \"us_driver_license\": \"US_DRIVER_LICENSE\",\n",
" \"organization\": \"ORGANIZATION\",\n",
" \"name_female\": \"PERSON\",\n",
" \"address\": \"STREET_ADDRESS\",\n",
" \"country\": \"GPE\",\n",
" \"state\": \"GPE\",\n",
" \"credit_card_number\": \"CREDIT_CARD\",\n",
" \"city\": \"GPE\",\n",
" \"street_name\": \"STREET_ADDRESS\",\n",
" \"building_number\": \"STREET_ADDRESS\",\n",
" \"name\": \"PERSON\",\n",
" \"iban\": \"IBAN_CODE\",\n",
" \"last_name\": \"PERSON\",\n",
" \"last_name_male\": \"PERSON\",\n",
" \"last_name_female\": \"PERSON\",\n",
" \"first_name\": \"PERSON\",\n",
" \"first_name_male\": \"PERSON\",\n",
" \"first_name_female\": \"PERSON\",\n",
" \"phone_number\": \"PHONE_NUMBER\",\n",
" \"url\": \"DOMAIN_NAME\",\n",
" \"ssn\": \"US_SSN\",\n",
" \"email\": \"EMAIL_ADDRESS\",\n",
" \"date_time\": \"DATE_TIME\",\n",
" \"date_of_birth\": \"DATE_TIME\",\n",
" \"day_of_week\": \"DATE_TIME\",\n",
" \"year\": \"DATE_TIME\",\n",
" \"name_male\": \"PERSON\",\n",
" \"prefix_male\": \"TITLE\",\n",
" \"prefix_female\": \"TITLE\",\n",
" \"prefix\": \"TITLE\",\n",
" \"nationality\": \"NRP\",\n",
" \"nation_woman\": \"NRP\",\n",
" \"nation_man\": \"NRP\",\n",
" \"nation_plural\": \"NRP\",\n",
" \"first_name_nonbinary\": \"PERSON\",\n",
" \"postcode\": \"STREET_ADDRESS\",\n",
" \"secondary_address\": \"STREET_ADDRESS\",\n",
" \"job\": \"TITLE\",\n",
" \"zipcode\": \"ZIP_CODE\",\n",
" \"state_abbr\": \"GPE\",\n",
" \"age\": \"AGE\",\n",
"}\n",
"\n",
"def update_entity_types(dataset:List[FakerSpansResult], entity_mapping:Dict[str,str]):\n",
" \"\"\"Replace entity types using a translator dictionary.\"\"\"\n",
"\n",
" for sample in dataset:\n",
" # update entity types on spans\n",
" for span in sample.spans:\n",
" span.type = entity_mapping[span.type]\n",
" # update entity types on the template string\n",
" for key, value in entity_mapping.items():\n",
" sample.template = sample.template.replace(\"{{\" + key + \"}}\", \"{{\" + value + \"}}\")\n",
"\n",
"update_entity_types(fake_records, entity_mapping=translator)"
"import json\n",
"import dataclasses\n",
"def get_json(result) -> dict:\n",
" spans_dict = json.dumps([dataclasses.asdict(span) for span in result.spans])\n",
" return dict(fake=result.fake, spans=spans_dict, template=result.template, template_id=result.template_id)"
]
},
{
@@ -413,33 +290,17 @@
"metadata": {},
"outputs": [],
"source": [
"fake_records[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Frequency of new entity types after mapping"
"len(fake_records)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"metadata": {},
"outputs": [],
"source": [
"\n",
"count_per_entity_new = Counter()\n",
"for record in fake_records:\n",
" for span in record.spans:\n",
" count_per_entity_new[span.type] += 1\n",
"\n",
"count_per_entity_new.most_common()"
"for record in fake_records[:10]:\n",
" print(get_json(record))"
]
},
{
@@ -490,6 +351,15 @@
"InputSample.to_json(dataset=input_samples, output_file=output_file)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"output_file"
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -557,9 +427,9 @@
"hash": "2509fbe9adc3579fd0ef23e6a2c6fb50cb745caa174aafdf017283479e60bc43"
},
"kernelspec": {
"display_name": "presidio",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "presidio"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -571,9 +441,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
"version": "3.7.15"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
"nbformat_minor": 4
}
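
The `get_json` helper added in this diff flattens a generated record into a plain dict by serializing its spans with `dataclasses.asdict`. Here is a minimal standalone sketch of that pattern, assuming the result object exposes `fake`, `spans`, `template` and `template_id` as the notebook code does. The `Span` and `SpansResult` classes are hypothetical stand-ins written for this example, and their field names are assumptions, not the library's actual definitions.

```python
import dataclasses
import json
from typing import List


@dataclasses.dataclass
class Span:
    """Hypothetical stand-in for one labelled span (field names are assumptions)."""
    value: str
    start: int
    end: int
    type: str


@dataclasses.dataclass
class SpansResult:
    """Hypothetical stand-in for one generated record."""
    fake: str
    spans: List[Span]
    template: str
    template_id: int


def get_json(result: SpansResult) -> dict:
    # The spans are serialized to a JSON string; the other fields pass through.
    spans_json = json.dumps([dataclasses.asdict(span) for span in result.spans])
    return dict(
        fake=result.fake,
        spans=spans_json,
        template=result.template,
        template_id=result.template_id,
    )


# Hand-built record, used only to exercise the helper.
record = SpansResult(
    fake="My name is Jane Doe",
    spans=[Span(value="Jane Doe", start=11, end=19, type="PERSON")],
    template="My name is {{PERSON}}",
    template_id=0,
)
row = get_json(record)
```

Because the function returns a dict whose `spans` entry is a JSON string, callers that need the span fields back can call `json.loads(row["spans"])`.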