Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PresidioSentenceFaker #50

Merged

Conversation

Robbie-Palmer
Copy link
Contributor

@Robbie-Palmer Robbie-Palmer commented Aug 5, 2022

1_Generate_data.ipynb and presidio_data_generator.py's main function contain code for generating fake records utilising the new providers and the mappings defined in this package.
This is a useful higher-level API for consuming the functionality of this package in other projects e.g.

from presidio_evaluator.data_generator import PresidioFakeRecordGenerator

entity_generator = PresidioFakeRecordGenerator(locale="en_US", lower_case_ratio=0.05,
                                               random_seed=42)
fake_patterns = entity_generator.generate_new_fake_records(num_samples=10000)

The version of the code in this PR takes from the notebook, which includes a mapping from Faker entity types to Presidio entity types, which isn't done in presidio_data_generator.py's main function.
Otherwise, it is the same code, just wrapped up into a class.

In this PR, I've also changed what I think are bugs.

  • The presidio_entities_map which maps Faker entity types to Presidio entity types
  • PresidioAnalyzerWrapper's predict method is hard coded to always use "en" instead of the provided language
  • PresidioAnalyzerWrapper's construction logic around language is confusing, especially between providing an analyzer or not
    • Validated that if an analyzer is provided, the chosen language must be supported by the analyzer
    • If an analyzer is not provided, construct a new one, passing in the language as the only supported language

And made some stylistic changes which I think makes the code more readable, though obviously, this is subjective

  • Dict literals to dict constructors
  • Usage of / for creating pathlib Paths
  • Usage of f strings
  • Formatting from PyCharm, just because this is the IDE, I'm using, and it makes it easy to keep things tidy

Map NATIONALITY entity to NRP not to LOCATION
Change dict literals to dict constructors to improve readability
Asa higher level abstraction over PresidioDataGenerator for utilising all templates and providers in this library
…fault AnalyzerEngine

Validate chosen language is available in provided AnalyzerEngine
@omri374
Copy link
Contributor

omri374 commented Aug 22, 2022

Thanks @Robbie-Palmer. We'll review this soon.

@omri374
Copy link
Contributor

omri374 commented Aug 31, 2022

Thanks @Robbie-Palmer, this is a great addition and greatly improves the codebase!
A few mimor comments:

  1. Could you add a few unit-tests for the new class?
  2. The distinction between the previous and suggest class mighit not be 100% clear to the user. Should we update the tutorials or add more clarification? Perhaps rename it to something else? (I don't have a suggestion, just a thought).

@omri374
Copy link
Contributor

omri374 commented Aug 31, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Copy link
Contributor

omri374 commented Oct 11, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Copy link
Contributor

omri374 commented Nov 14, 2022

@Robbie-Palmer would you be interested in completing this PR? we'd be happy to help if needed

@Robbie-Palmer
Copy link
Contributor Author

@omri374 I'm sorry for the really long delay! I've been pulled into a number of unrelated projects these past few months
No availability this coming week, but I'll prioritise addressing your comments asap afterwards

@omri374
Copy link
Contributor

omri374 commented Nov 21, 2022

Thanks @Robbie-Palmer! Happy to collaborate and help on this if needed, there are really good contributions in this PR.

Map NATIONALITY entity to NRP not to LOCATION
Change dict literals to dict constructors to improve readability
Asa higher level abstraction over PresidioDataGenerator for utilising all templates and providers in this library
…fault AnalyzerEngine

Validate chosen language is available in provided AnalyzerEngine
Update PresidioDataGenerator tests to make stronger assertions about contents of results
Update PresidioFakeRecordGenerator to use ReligionProvider
Copy link
Contributor

@omri374 omri374 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you @Robbie-Palmer! this PR introduces really great code improvements and bug fixes.

On the new class topic, I agree we should first align on the right structure. Backward compatibility is not a big issue IMHO as this package was just recently released to PyPI and we can create a deprecation notice for the old class name.

Here are the options as I see it:

  1. Continue having one class
  2. Have two classes, which is proposed in this PR
  3. Have a base class, which is inherited by another class, similar to this PR but maintaining some relation between the two.

Another discussion was on the name(s) of classes.

It makes sense IMHO to have two classes, one serving as a wrapper for Faker and the other as an orchestrator. We should just think of the names in a way that wouldn't confuse, especially in light of the terms used in faker, e.g. Faker and Generator.

Happy to hear your thoughts @Robbie-Palmer and @melmatlis.
We could also set up a short call if that would be the easiest.

Co-authored-by: melmatlis <93650751+melmatlis@users.noreply.github.com>
Rename PresidioFakeRecordGenerator to PresidioSentenceFaker to distinguish it from `RecordGenerator`
Move functions for loading data from FakeNameGenerator.com in faker format into new datasets.py module
Move logic for choosing templates out of SentenceFaker into PresidioSentenceFaker
Remove generic read file function
Add missing HospitalProvider
Update Recognizer tests to use PresidioSentenceProvider
Make single module to hold all sentence semantic dependency logic for Faker, including SentenceFaker, RecordGenerator and RecordsFaker
…aker

Rename faker_to_presidio_entity_type to ENTITY_TYPE_MAPPING
Make presidio_templates_file_path and presidio_additional_entity_providers available from package
Update Data Generation README to outline choices
@omri374 omri374 changed the base branch from master to data-generator-2.1 January 19, 2023 20:26
@omri374
Copy link
Contributor

omri374 commented Jan 19, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@omri374 omri374 changed the base branch from data-generator-2.1 to master January 19, 2023 22:18
@omri374
Copy link
Contributor

omri374 commented Jan 19, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374 omri374 changed the base branch from master to data-generator-2.1 January 20, 2023 07:26
@Robbie-Palmer Robbie-Palmer changed the title Fake Record Generator PresidioSentenceFaker Jan 20, 2023
@omri374 omri374 merged commit 1be5e86 into microsoft:data-generator-2.1 Jan 20, 2023
@omri374 omri374 mentioned this pull request Jan 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants