Merge pull request #50 from Robbie-Palmer/fake-record-generator
PresidioSentenceFaker
omri374 committed Jan 20, 2023
2 parents 6e6dbb9 + c36f722 commit 1be5e86
Showing 23 changed files with 803 additions and 930 deletions.
279 changes: 69 additions & 210 deletions notebooks/1_Generate_data.ipynb

Large diffs are not rendered by default.

68 changes: 32 additions & 36 deletions presidio_evaluator/data_generator/README.md
@@ -1,16 +1,14 @@
# Presidio Data Generator
# Data Generation

This data generator takes a text file with templates (e.g. `my name is {{person}}`)
and creates a list of InputSamples which contain fake PII entities
instead of placeholders. It further creates spans (start and end of each entity)
for model training and evaluation.
The `PresidioSentenceFaker` generates sentences from templates (e.g. `my name is {{person}}`) where the placeholders
are replaced with fake PII entities, along with metadata about the spans (the start and end of each entity) for model training and evaluation.

## Scenarios

There are two main scenarios for using the Presidio Data Generator:
There are two main scenarios for using the `PresidioSentenceFaker`:

1. Create a fake dataset for evaluation or training purposes, given a list of predefined templates
(see [this file](raw_data/templates.txt) for example)
(uses [this file](raw_data/templates.txt) by default)
2. Augment an existing labeled dataset with additional fake values.

In both scenarios the process is similar. In scenario 2, the existing dataset is first translated into templates,
@@ -20,9 +18,8 @@ and then scenario 1 is applied.

This generator heavily relies on the [Faker package](https://www.github.com/joke2k/faker) with a few differences:

1. `PresidioDataGenerator` returns not only fake text, but also the spans in which fake entities appear in the text.

2. `Faker` samples each value independently.
1. `PresidioSentenceFaker` returns not only fake text, but also the spans in which fake entities appear in the text
2. `Faker` samples each value independently.
In many cases we would want to keep the semantic dependency between two values.
For example, for the template `My name is {{name}} and my email is {{email}}`,
we would prefer a result which has the name within the email address,
@@ -34,46 +31,45 @@ It accepts a dictionary / pandas DataFrame, and favors returning objects from th

For a full example, see the [Generate Data Notebook](../../notebooks/1_Generate_data.ipynb).
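The record-level sampling described above can be sketched with a small, dependency-free illustration (the records, function name, and logic here are invented for demonstration, not the package's actual implementation): one record is sampled per sentence, so related fields stay consistent.

```python
import random

# Hypothetical records: each row keeps semantically related fields
# together, e.g. an email address derived from the person's name.
RECORDS = [
    {"name": "Jane Doe", "email": "jane.doe@example.com"},
    {"name": "John Roe", "email": "john.roe@example.com"},
]

def fill_template(template: str) -> str:
    record = random.choice(RECORDS)  # sample one whole record per sentence
    for key, value in record.items():
        template = template.replace("{{%s}}" % key, value)
    return template

print(fill_template("My name is {{name}} and my email is {{email}}"))
```

Sampling each placeholder independently would instead pair one record's name with another record's email; sampling the record once avoids that mismatch.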

Simple example:
`PresidioSentenceFaker` provides a high-level interface to the full power of the `presidio_evaluator`
package. Its results use the Presidio PII entity types, not the `Faker` entity types.
By default it is loaded with the built-in sentence templates and the additional Presidio entity providers.

```python
from presidio_evaluator.data_generator import PresidioSentenceFaker

record_generator = PresidioSentenceFaker(locale='en', lower_case_ratio=0.05)
fake_records = record_generator.generate_new_fake_sentences(1500)

# Print the fake sentence and spans of the first sample
print(fake_records[0].fake)
print(fake_records[0].spans)
```
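A custom entity provider is, in spirit, just a class exposing one method per entity type. The sketch below is dependency-free and illustrative only: the real providers in this package subclass Faker's `BaseProvider`, and this `TownProvider` implementation is invented for demonstration.

```python
import random

class TownProvider:
    """Illustrative stand-in for a Faker-style provider; the real
    providers subclass faker.providers.BaseProvider."""
    towns = ["Springfield", "Riverton", "Lakeview"]

    def town(self) -> str:
        # One method per entity type; the method name becomes the
        # placeholder name usable in templates, e.g. {{town}}.
        return random.choice(self.towns)

provider = TownProvider()
print(provider.town())
```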

At a high level, the process is the following:

1. Translate a NER dataset (e.g. CONLL or OntoNotes) into a list of
templates: `My name is John` -> `My name is [PERSON]`
2. Construct a `PresidioSentenceFaker` instance by:
   - Choosing your appropriate locale, e.g. `en_US`
   - Choosing the lower-case ratio
   - Passing in your list of templates (or defaulting to those provided)
     - Optionally extend with the provided templates, accessible via `from presidio_evaluator.data_generator import presidio_templates_file_path`
   - Passing in any custom entity providers (or defaulting to those provided)
     - Optionally extend with the inbuilt Presidio entity providers, accessible via `from presidio_evaluator.data_generator import presidio_additional_entity_providers`
     - Add a mapping from the output provider entity type to a Presidio-recognised entity type where appropriate, e.g. for a `TownProvider` which outputs an entity type of `town`, execute `PresidioSentenceFaker.ENTITY_TYPE_MAPPING['town'] = 'GPE'`
   - Passing in a DataFrame representing your underlying PII records (or defaulting to those provided)
     - Optionally extend with the inbuilt fake person records, accessible via `from presidio_evaluator.data_generator.faker_extensions.datasets import load_fake_person_df`
     - Add any additional aliases required by your dataset to `PresidioSentenceFaker.PROVIDER_ALIASES`, e.g. if the entity providers support "name" but your dataset templates contain "person", add this alias with `PresidioSentenceFaker.PROVIDER_ALIASES['name'] = 'person'`
3. Generate sentences
4. Split the generated dataset into train/test/validation sets while making sure
that samples from the same template only appear in one set
5. Adapt the datasets for the various models (spaCy, Flair, CRF, sklearn)
6. Train models
7. Evaluate using one of the [evaluation notebooks](../../notebooks/models)
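The mapping and alias hooks in step 2 amount to two dictionary lookups at generation time. Below is a minimal stdlib sketch of that resolution order; all names and the fallback rule are illustrative, not the package's real internals.

```python
# Invented stand-ins for PresidioSentenceFaker's class-level dictionaries.
ENTITY_TYPE_MAPPING = {"town": "GPE", "person": "PERSON"}
PROVIDER_ALIASES = {"name": "person"}

def resolve_entity(placeholder: str) -> str:
    # First map dataset aliases onto provider names...
    provider = PROVIDER_ALIASES.get(placeholder, placeholder)
    # ...then map provider output types onto Presidio entity types,
    # falling back to upper-casing the provider name.
    return ENTITY_TYPE_MAPPING.get(provider, provider.upper())

print(resolve_entity("name"))  # PERSON
print(resolve_entity("town"))  # GPE
```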

Notes:

15 changes: 13 additions & 2 deletions presidio_evaluator/data_generator/__init__.py
@@ -1,4 +1,14 @@
from .presidio_data_generator import PresidioDataGenerator
from pathlib import Path

from . import raw_data

_raw_data_path = raw_data.__path__
if not hasattr(_raw_data_path, '__getitem__'):
_raw_data_path = _raw_data_path._path
raw_data_dir = Path(_raw_data_path[0])

from .presidio_sentence_faker import PresidioSentenceFaker, presidio_templates_file_path, \
presidio_additional_entity_providers
from .presidio_pseudonymize import PresidioPseudonymization


@@ -8,4 +18,5 @@ def read_synth_dataset():
)


__all__ = ["PresidioDataGenerator", "PresidioPseudonymization", "read_synth_dataset"]
__all__ = ["PresidioSentenceFaker", "PresidioPseudonymization", "read_synth_dataset",
"raw_data_dir", "presidio_templates_file_path", "presidio_additional_entity_providers"]
10 changes: 4 additions & 6 deletions presidio_evaluator/data_generator/faker_extensions/__init__.py
@@ -1,7 +1,4 @@
from .data_objects import FakerSpan, FakerSpansResult
from .span_generator import SpanGenerator
from .record_generator import RecordGenerator
from .records_faker import RecordsFaker
from .providers import (
NationalityProvider,
OrganizationProvider,
@@ -10,21 +7,22 @@
AddressProviderNew,
PhoneNumberProviderNew,
AgeProvider,
ReligionProvider,
HospitalProvider
)
from .span_generator import SpanGenerator

__all__ = [
"SpanGenerator",
"FakerSpan",
"FakerSpansResult",
"RecordGenerator",
"NationalityProvider",
"OrganizationProvider",
"UsDriverLicenseProvider",
"IpAddressProvider",
"AddressProviderNew",
"PhoneNumberProviderNew",
"AgeProvider",
"RecordsFaker",
"HospitalProvider"
"ReligionProvider",
"HospitalProvider",
]
99 changes: 99 additions & 0 deletions presidio_evaluator/data_generator/faker_extensions/datasets.py
@@ -0,0 +1,99 @@
import random
import re

import pandas as pd
from pandas import DataFrame

from presidio_evaluator.data_generator import raw_data_dir


def _camel_to_snake(name):
# Borrowed from https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case
name = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name)
return re.sub("([a-z0-9])([A-Z])", r"\1_\2", name).lower()


def _full_name(row):
if random.random() > 0.2:
return f'{row.first_name} {row.last_name}'
else:
space_after_initials = " " if random.random() > 0.5 else ". "
return f'{row.first_name} {row.middle_initial}{space_after_initials}{row.last_name}'


def _name_gendered(row):
first_name_female, prefix_female, last_name_female = (
(row.first_name, row.prefix, row.last_name)
if row.gender == "female"
else ("", "", "")
)
first_name_male, prefix_male, last_name_male = (
(row.first_name, row.prefix, row.last_name)
if row.gender == "male"
else ("", "", "")
)
return (
first_name_female,
first_name_male,
prefix_female,
prefix_male,
last_name_female,
last_name_male,
)


def load_fake_person_df() -> DataFrame:
"""
:return: A DataFrame loaded with data from FakeNameGenerator.com, and cleaned to match faker conventions
"""
fake_person_data_path = raw_data_dir / "FakeNameGenerator.com_3000.csv"
fake_person_df = pd.read_csv(fake_person_data_path)
fake_person_df.columns = [_camel_to_snake(col) for col in fake_person_df.columns]
# Update some column names to fit Faker
fake_person_df.rename(
columns={"country": "country_code", "state": "state_abbr"}, inplace=True
)
fake_person_df.rename(
columns={
"country_full": "country",
"name_set": "nationality",
"street_address": "street_name",
"state_full": "state",
"given_name": "first_name",
"surname": "last_name",
"title": "prefix",
"email_address": "email",
"telephone_number": "phone_number",
"telephone_country_code": "country_calling_code",
"birthday": "date_of_birth",
"cc_number": "credit_card_number",
"cc_type": "credit_card_provider",
"cc_expires": "credit_card_expire",
"occupation": "job",
"domain": "domain_name",
"username": "user_name",
"zip_code": "zipcode",
},
inplace=True,
)
fake_person_df["person"] = fake_person_df.apply(_full_name, axis=1)
fake_person_df["name"] = fake_person_df["person"]
genderized = fake_person_df.apply(
lambda x: pd.Series(
_name_gendered(x),
index=[
"first_name_female",
"first_name_male",
"prefix_female",
"prefix_male",
"last_name_female",
"last_name_male",
],
),
axis=1,
result_type="expand",
)
# Remove credit card data, rely on Faker's as it is more realistic
del fake_person_df["credit_card_number"]
fake_person_df = pd.concat([fake_person_df, genderized], axis="columns")
return fake_person_df
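The `_camel_to_snake` helper in the new `datasets.py` above is self-contained and easy to check in isolation; this snippet repeats its two-pass regex (renamed without the leading underscore) and shows it on typical FakeNameGenerator column names:

```python
import re

def camel_to_snake(name: str) -> str:
    # First pass splits before an upper-case letter followed by lower-case
    # letters; second pass handles remaining lower/digit-to-upper
    # transitions; finally everything is lower-cased.
    name = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub("([a-z0-9])([A-Z])", r"\1_\2", name).lower()

print(camel_to_snake("GivenName"))             # given_name
print(camel_to_snake("TelephoneCountryCode"))  # telephone_country_code
```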
33 changes: 13 additions & 20 deletions presidio_evaluator/data_generator/faker_extensions/providers.py
@@ -11,14 +11,14 @@
from faker.providers.address.en_US import Provider as AddressProvider
from faker.providers.phone_number.en_US import Provider as PhoneNumberProvider

from presidio_evaluator.data_generator import raw_data_dir


class NationalityProvider(BaseProvider):
def __init__(self, generator, nationality_file: Union[str, Path] = None):
super().__init__(generator=generator)
if not nationality_file:
nationality_file = Path(
Path(__file__).parent.parent, "raw_data", "nationalities.csv"
).resolve()
nationality_file = (raw_data_dir / "nationalities.csv").resolve()

self.nationality_file = nationality_file
self.nationalities = self.load_nationalities()
@@ -44,17 +44,15 @@ def nation_plural(self):

class OrganizationProvider(BaseProvider):
def __init__(
self,
generator,
organizations_file: Union[str, Path] = None,
self,
generator,
organizations_file: Union[str, Path] = None,
):
super().__init__(generator=generator)
if not organizations_file:
# company names assembled from stock exchange listings (aex, bse, cnq, ger, lse, nasdaq, nse, nyse, par, tyo),
# US government websites like https://www.sec.gov/rules/other/4-460list.htm, and other sources
organizations_file = Path(
Path(__file__).parent.parent, "raw_data", "companies_and_organizations.csv"
).resolve()
organizations_file = (raw_data_dir / "companies_and_organizations.csv").resolve()
self.organizations_file = organizations_file
self.organizations = self.load_organizations()

@@ -71,13 +69,11 @@ def company(self):
class UsDriverLicenseProvider(BaseProvider):
def __init__(self, generator):
super().__init__(generator=generator)
us_driver_license_file = Path(
Path(__file__).parent.parent, "raw_data", "us_driver_license_format.yaml"
).resolve()
us_driver_license_file = (raw_data_dir / "us_driver_license_format.yaml").resolve()
formats = yaml.safe_load(open(us_driver_license_file))
self.formats = formats['en']['faker']['driving_license']['usa']

def driver_license(self) -> str:
def us_driver_license(self) -> str:
# US driver's licenses patterns vary by state. Here we sample a random state and format
us_state = random.choice(list(self.formats))
us_state_format = random.choice(self.formats[us_state])
@@ -86,15 +82,13 @@

class ReligionProvider(BaseProvider):
def __init__(
self,
generator,
religions_file: Union[str, Path] = None,
self,
generator,
religions_file: Union[str, Path] = None,
):
super().__init__(generator=generator)
if not religions_file:
religions_file = Path(
Path(__file__).parent.parent, "raw_data", "religions.csv"
).resolve()
religions_file = (raw_data_dir / "religions.csv").resolve()
self.religions_file = religions_file
self.religions = self.load_religions()

@@ -117,7 +111,6 @@ def ip_address(self):


class AgeProvider(BaseProvider):

formats = OrderedDict(
[
("%#", 0.8),
