
PresidioSentenceFaker #50

Merged
37 commits
56ab9a5
Map DOMAIN_NAME entity to URL
Robbie-Palmer Aug 2, 2022
f538bc5
Add PresidioFakeRecordGenerator class
Robbie-Palmer Aug 3, 2022
6e8eec3
Fix bug in PresidioAnalyzerWrapper where 'en' is always the chosen la…
Robbie-Palmer Aug 3, 2022
5bfca66
Update PresidioAnalyzerWrapper to use the provided language in the de…
Robbie-Palmer Aug 3, 2022
79f796a
Format span_to_tag.py
Robbie-Palmer Aug 3, 2022
146dc63
Merge branch 'microsoft:master' into fake-record-generator
Robbie-Palmer Aug 5, 2022
6863e19
Map DOMAIN_NAME entity to URL
Robbie-Palmer Aug 2, 2022
b5efd1e
Add PresidioFakeRecordGenerator class
Robbie-Palmer Aug 3, 2022
04a43fe
Fix bug in PresidioAnalyzerWrapper where 'en' is always the chosen la…
Robbie-Palmer Aug 3, 2022
7818b28
Update PresidioAnalyzerWrapper to use the provided language in the de…
Robbie-Palmer Aug 3, 2022
279ca75
Format span_to_tag.py
Robbie-Palmer Aug 3, 2022
ddf8e72
Fix python3.7 support for getting raw data dir path
Robbie-Palmer Dec 7, 2022
a385403
Strip whitespace from ends of template files in PresidioDataGenerator
Robbie-Palmer Dec 7, 2022
6aede14
Test PresidioFakeRecordGenerator
Robbie-Palmer Dec 7, 2022
e6c09fd
Fix mutable default argument problem in PresidioFakeRecordGenerator
Robbie-Palmer Dec 7, 2022
4afd7c0
Unit test PresidioFakeRecordGenerator
Robbie-Palmer Dec 7, 2022
76c24b7
Expose ReligionProvider from faker_extensions package
Robbie-Palmer Dec 7, 2022
fd7bc88
Format tests/__init__.py
Robbie-Palmer Dec 7, 2022
c5e22f3
Add missing religions.csv and us_driver_license_format.yaml to packag…
Robbie-Palmer Dec 7, 2022
c0ea8da
Fix UsDriverLicenseProvider to provide us_driver_license entity
Robbie-Palmer Dec 7, 2022
798c9e5
Simplify Generate_data notebook by using PresidioFakeRecordGenerator
Robbie-Palmer Dec 7, 2022
e2f1e34
Update Data Generator README to include PresidioFakeRecordGenerator u…
Robbie-Palmer Dec 7, 2022
6731399
Merge remote-tracking branch 'origin/fake-record-generator' into fake…
Robbie-Palmer Dec 7, 2022
b7fbec1
Merge branch 'master' into fake-record-generator
omri374 Dec 18, 2022
7fc5bfe
Fix grammar in 1_Generate_data.ipynb
Robbie-Palmer Dec 20, 2022
407718d
Make it possible to use PresidioFakeRecordGenerator without the defau…
Robbie-Palmer Jan 3, 2023
57d3279
Merge remote-tracking branch 'origin/fake-record-generator' into fake…
Robbie-Palmer Jan 3, 2023
ae56da6
Merge remote-tracking branch 'upstream/master' into fake-record-gener…
Robbie-Palmer Jan 3, 2023
ce59376
Add Optional type annotations to parameters
Robbie-Palmer Jan 17, 2023
8041123
Rename PresidioDataGenerator to SentenceFaker
Robbie-Palmer Jan 17, 2023
97049ee
Minimize the responsibilities of SentenceFaker
Robbie-Palmer Jan 17, 2023
63dfa38
Move SentenceFaker into `faker_extensions` package
Robbie-Palmer Jan 17, 2023
3c1cc48
Fix imports
Robbie-Palmer Jan 17, 2023
ba327e8
Rename presidio_data_generator.py to presidio_sentence_faker.py
Robbie-Palmer Jan 17, 2023
a30342a
Fix 1_Generate_data.ipynb
Robbie-Palmer Jan 17, 2023
9795d8c
Add support for providing your own base records for PresidioSentenceF…
Robbie-Palmer Jan 17, 2023
c36f722
Fix SentenceFaker docstring
Robbie-Palmer Jan 20, 2023
279 changes: 69 additions & 210 deletions notebooks/1_Generate_data.ipynb

Large diffs are not rendered by default.

68 changes: 32 additions & 36 deletions presidio_evaluator/data_generator/README.md
Original file line number Diff line number Diff line change
@@ -1,16 +1,14 @@
# Presidio Data Generator
# Data Generation

This data generator takes a text file with templates (e.g. `my name is {{person}}`)
and creates a list of InputSamples which contain fake PII entities
instead of placeholders. It further creates spans (start and end of each entity)
for model training and evaluation.
The `PresidioSentenceFaker` generates sentences from templates (e.g. `my name is {{person}}`), replacing the placeholders
with fake PII entities and recording the span (start and end) of each entity for model training and evaluation.

## Scenarios

There are two main scenarios for using the Presidio Data Generator:
There are two main scenarios for using the `PresidioSentenceFaker`:

1. Create a fake dataset for evaluation or training purposes, given a list of predefined templates
(see [this file](raw_data/templates.txt) for example)
(uses [this file](raw_data/templates.txt) by default)
2. Augment an existing labeled dataset with additional fake values.

In both scenarios the process is similar. In scenario 2, the existing dataset is first translated into templates,
@@ -20,9 +18,8 @@ and then scenario 1 is applied.

This generator heavily relies on the [Faker package](https://www.github.com/joke2k/faker) with a few differences:

1. `PresidioDataGenerator` returns not only fake text, but also the spans in which fake entities appear in the text.

2. `Faker` samples each value independently.
1. `PresidioSentenceFaker` returns not only fake text, but also the spans in which fake entities appear in the text
2. `Faker` samples each value independently.
In many cases we would want to keep the semantic dependency between two values.
For example, for the template `My name is {{name}} and my email is {{email}}`,
we would prefer a result which has the name within the email address,
@@ -34,46 +31,45 @@ It accepts a dictionary / pandas DataFrame, and favors returning objects from the same record

For a full example, see the [Generate Data Notebook](../../notebooks/1_Generate_data.ipynb).
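The record-level sampling described above can be sketched in plain Python (a hypothetical, minimal illustration of the idea behind `RecordsFaker`; the names, records, and helper below are not part of the package):

```python
import random

# Hypothetical sketch of record-consistent sampling: rather than drawing
# each value independently, pick one record and fill every placeholder from
# it, so the email stays semantically consistent with the name.
records = [
    {"name": "Mike Pauker", "email": "mike.pauker@gmail.com"},
    {"name": "Dana Cruz", "email": "dana.cruz@example.org"},
]

def fill_template(template: str, record: dict) -> str:
    # Replace each {{key}} placeholder with the value from the chosen record
    for key, value in record.items():
        template = template.replace("{{" + key + "}}", value)
    return template

record = random.choice(records)
sentence = fill_template("My name is {{name}} and my email is {{email}}", record)
```

Because both placeholders are filled from the same record, the generated name and email always match, unlike independent `Faker` calls.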

Simple example:
`PresidioSentenceFaker` provides a high level interface for using the full power of the `presidio_evaluator`
package. Its results use the presidio PII entities, not the `Faker` entities.
It is loaded by default with template strings, and the additional Presidio Entity Providers.

```python
from presidio_evaluator.data_generator import PresidioDataGenerator

sentence_templates = [
"My name is {{name}}",
"Please send it to {{address}}",
"I just moved to {{city}} from {{country}}"
]


data_generator = PresidioDataGenerator()
fake_records = data_generator.generate_fake_data(
templates=sentence_templates, n_samples=10
)
from presidio_evaluator.data_generator import PresidioSentenceFaker

fake_records = list(fake_records)
record_generator = PresidioSentenceFaker(locale='en', lower_case_ratio=0.05)
fake_records = record_generator.generate_new_fake_sentences(1500)

# Print the text and spans of the first sample
print(fake_records[0].fake)
print(fake_records[0].spans)
```

The process in high level is the following:

1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of
templates: `My name is John` -> `My name is [PERSON]`
2. (Optional) add new Faker providers to the `PresidioDataGenerator` to support types of PII not returned by Faker
3. (Optional) map dataset entity names into provider equivalents by calling `PresidioDataGenerator.add_provider_alias`.
This will create entity aliases (e.g. faker supports "name" but templates contain "person")
4. Generate samples using the templates list
5. Split the generated dataset to train/test/validation while making sure
2. Construct a `PresidioSentenceFaker` instance by:
   - Choosing your appropriate locale, e.g. `en_US`
   - Choosing the lower-case ratio
   - Passing in your list of templates (or defaulting to those provided)
     - Optionally extend the provided templates, accessible via `from presidio_evaluator.data_generator import presidio_templates_file_path`
   - Passing in any custom entity providers (or defaulting to those provided)
     - Optionally extend the inbuilt Presidio entity providers, accessible via `from presidio_evaluator.data_generator import presidio_additional_entity_providers`
     - Add a mapping from the provider's output entity type to a Presidio-recognised entity type where appropriate,
       e.g. for a `TownProvider` which outputs the entity type `town`, execute `PresidioSentenceFaker.ENTITY_TYPE_MAPPING['town'] = 'GPE'`
   - Passing in a DataFrame representing your underlying PII records (or defaulting to those provided)
     - Optionally extend with the inbuilt fake person records, accessible via `from presidio_evaluator.data_generator.faker_extensions.datasets import load_fake_person_df`
   - Adding any additional aliases required by your dataset to `PresidioSentenceFaker.PROVIDER_ALIASES`,
     e.g. if the entity providers support "name" but your dataset templates contain "person", add this alias via `PresidioSentenceFaker.PROVIDER_ALIASES['name'] = 'person'`
3. Generate sentences
4. Split the generated dataset to train/test/validation while making sure
that samples from the same template would only appear in one set
6. Adapt datasets for the various models (Spacy, Flair, CRF, sklearn)
7. Train models
8. Evaluate using one of the [evaluation notebooks](../../notebooks/models)
5. Adapt datasets for the various models (Spacy, Flair, CRF, sklearn)
6. Train models
7. Evaluate using one of the [evaluation notebooks](../../notebooks/models)
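Step 1 of the process above, turning a labeled NER sentence into a template, can be sketched with a small helper (hypothetical code for illustration, not part of the package):

```python
# Hypothetical sketch of step 1: turn a labeled sentence into a template by
# replacing each labeled span with a {{placeholder}}.
def to_template(text: str, spans: list) -> str:
    # spans are (start, end, entity_name) tuples, assumed non-overlapping;
    # replace right-to-left so earlier character offsets stay valid
    for start, end, entity in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + "{{" + entity + "}}" + text[end:]
    return text

template = to_template("My name is John", [(11, 15, "name")])
print(template)  # My name is {{name}}
```

A sentence with multiple labeled spans works the same way, e.g. `to_template("John lives in Paris", [(0, 4, "name"), (14, 19, "city")])` yields `{{name}} lives in {{city}}`.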

Notes:

15 changes: 13 additions & 2 deletions presidio_evaluator/data_generator/__init__.py
@@ -1,4 +1,14 @@
from .presidio_data_generator import PresidioDataGenerator
from pathlib import Path

from . import raw_data

_raw_data_path = raw_data.__path__
if not hasattr(_raw_data_path, '__getitem__'):
    _raw_data_path = _raw_data_path._path
raw_data_dir = Path(_raw_data_path[0])

from .presidio_sentence_faker import PresidioSentenceFaker, presidio_templates_file_path, \
    presidio_additional_entity_providers
from .presidio_pseudonymize import PresidioPseudonymization


@@ -8,4 +18,5 @@ def read_synth_dataset():
)


__all__ = ["PresidioDataGenerator", "PresidioPseudonymization", "read_synth_dataset"]
__all__ = ["PresidioSentenceFaker", "PresidioPseudonymization", "read_synth_dataset",
           "raw_data_dir", "presidio_templates_file_path", "presidio_additional_entity_providers"]
10 changes: 4 additions & 6 deletions presidio_evaluator/data_generator/faker_extensions/__init__.py
@@ -1,7 +1,4 @@
from .data_objects import FakerSpan, FakerSpansResult
from .span_generator import SpanGenerator
from .record_generator import RecordGenerator
from .records_faker import RecordsFaker
from .providers import (
    NationalityProvider,
    OrganizationProvider,
@@ -10,21 +7,22 @@
    AddressProviderNew,
    PhoneNumberProviderNew,
    AgeProvider,
    ReligionProvider,
    HospitalProvider
)
from .span_generator import SpanGenerator

__all__ = [
    "SpanGenerator",
    "FakerSpan",
    "FakerSpansResult",
    "NationalityProvider",
    "OrganizationProvider",
    "UsDriverLicenseProvider",
    "IpAddressProvider",
    "AddressProviderNew",
    "PhoneNumberProviderNew",
    "AgeProvider",
    "ReligionProvider",
    "HospitalProvider",
]
99 changes: 99 additions & 0 deletions presidio_evaluator/data_generator/faker_extensions/datasets.py
@@ -0,0 +1,99 @@
import random
import re

import pandas as pd
from pandas import DataFrame

from presidio_evaluator.data_generator import raw_data_dir


def _camel_to_snake(name):
    # Borrowed from https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case
    name = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub("([a-z0-9])([A-Z])", r"\1_\2", name).lower()


def _full_name(row):
    if random.random() > 0.2:
        return f'{row.first_name} {row.last_name}'
    else:
        space_after_initials = " " if random.random() > 0.5 else ". "
        return f'{row.first_name} {row.middle_initial}{space_after_initials}{row.last_name}'


def _name_gendered(row):
    first_name_female, prefix_female, last_name_female = (
        (row.first_name, row.prefix, row.last_name)
        if row.gender == "female"
        else ("", "", "")
    )
    first_name_male, prefix_male, last_name_male = (
        (row.first_name, row.prefix, row.last_name)
        if row.gender == "male"
        else ("", "", "")
    )
    return (
        first_name_female,
        first_name_male,
        prefix_female,
        prefix_male,
        last_name_female,
        last_name_male,
    )


def load_fake_person_df() -> DataFrame:
    """
    :return: A DataFrame loaded with data from FakeNameGenerator.com, and cleaned to match faker conventions
    """
    fake_person_data_path = raw_data_dir / "FakeNameGenerator.com_3000.csv"
    fake_person_df = pd.read_csv(fake_person_data_path)
    fake_person_df.columns = [_camel_to_snake(col) for col in fake_person_df.columns]
    # Update some column names to fit Faker
    fake_person_df.rename(
        columns={"country": "country_code", "state": "state_abbr"}, inplace=True
    )
    fake_person_df.rename(
        columns={
            "country_full": "country",
            "name_set": "nationality",
            "street_address": "street_name",
            "state_full": "state",
            "given_name": "first_name",
            "surname": "last_name",
            "title": "prefix",
            "email_address": "email",
            "telephone_number": "phone_number",
            "telephone_country_code": "country_calling_code",
            "birthday": "date_of_birth",
            "cc_number": "credit_card_number",
            "cc_type": "credit_card_provider",
            "cc_expires": "credit_card_expire",
            "occupation": "job",
            "domain": "domain_name",
            "username": "user_name",
            "zip_code": "zipcode",
        },
        inplace=True,
    )
    fake_person_df["person"] = fake_person_df.apply(_full_name, axis=1)
    fake_person_df["name"] = fake_person_df["person"]
    genderized = fake_person_df.apply(
        lambda x: pd.Series(
            _name_gendered(x),
            index=[
                "first_name_female",
                "first_name_male",
                "prefix_female",
                "prefix_male",
                "last_name_female",
                "last_name_male",
            ],
        ),
        axis=1,
        result_type="expand",
    )
    # Remove credit card data, rely on Faker's as it is more realistic
    del fake_person_df["credit_card_number"]
    fake_person_df = pd.concat([fake_person_df, genderized], axis="columns")
    return fake_person_df
33 changes: 13 additions & 20 deletions presidio_evaluator/data_generator/faker_extensions/providers.py
@@ -11,14 +11,14 @@
from faker.providers.address.en_US import Provider as AddressProvider
from faker.providers.phone_number.en_US import Provider as PhoneNumberProvider

from presidio_evaluator.data_generator import raw_data_dir


class NationalityProvider(BaseProvider):
def __init__(self, generator, nationality_file: Union[str, Path] = None):
super().__init__(generator=generator)
if not nationality_file:
nationality_file = Path(
Path(__file__).parent.parent, "raw_data", "nationalities.csv"
).resolve()
nationality_file = (raw_data_dir / "nationalities.csv").resolve()

self.nationality_file = nationality_file
self.nationalities = self.load_nationalities()
@@ -44,17 +44,15 @@ def nation_plural(self):

class OrganizationProvider(BaseProvider):
def __init__(
self,
generator,
organizations_file: Union[str, Path] = None,
self,
generator,
organizations_file: Union[str, Path] = None,
):
super().__init__(generator=generator)
if not organizations_file:
# company names assembled from stock exchange listings (aex, bse, cnq, ger, lse, nasdaq, nse, nyse, par, tyo),
# US government websites like https://www.sec.gov/rules/other/4-460list.htm, and other sources
organizations_file = Path(
Path(__file__).parent.parent, "raw_data", "companies_and_organizations.csv"
).resolve()
organizations_file = (raw_data_dir / "companies_and_organizations.csv").resolve()
self.organizations_file = organizations_file
self.organizations = self.load_organizations()

@@ -71,13 +69,11 @@ def company(self):
class UsDriverLicenseProvider(BaseProvider):
def __init__(self, generator):
super().__init__(generator=generator)
us_driver_license_file = Path(
Path(__file__).parent.parent, "raw_data", "us_driver_license_format.yaml"
).resolve()
us_driver_license_file = (raw_data_dir / "us_driver_license_format.yaml").resolve()
formats = yaml.safe_load(open(us_driver_license_file))
self.formats = formats['en']['faker']['driving_license']['usa']

def driver_license(self) -> str:
def us_driver_license(self) -> str:
# US driver's license patterns vary by state. Here we sample a random state and format
us_state = random.choice(list(self.formats))
us_state_format = random.choice(self.formats[us_state])
@@ -86,15 +82,13 @@ def driver_license(self) -> str:

class ReligionProvider(BaseProvider):
def __init__(
self,
generator,
religions_file: Union[str, Path] = None,
self,
generator,
religions_file: Union[str, Path] = None,
):
super().__init__(generator=generator)
if not religions_file:
religions_file = Path(
Path(__file__).parent.parent, "raw_data", "religions.csv"
).resolve()
religions_file = (raw_data_dir / "religions.csv").resolve()
self.religions_file = religions_file
self.religions = self.load_religions()

@@ -117,7 +111,6 @@ def ip_address(self):


class AgeProvider(BaseProvider):

formats = OrderedDict(
[
("%#", 0.8),