update address regex in the standard pii policy #150

Merged
merged 2 commits into from
Aug 15, 2023

JustEmrick
Contributor

Description

This update changes the address regex in the Standard PII policy template so that it more accurately finds US-based addresses.

Problem

  1. While testing with a small dataset, I noticed that the Standard PII policy template did not detect any addresses. I created my own policy with an updated regex, and it correctly identified every address variation in the test set.

Testing

Using the test dataset below, which is stored in my Mantium application, I ran both the Standard PII policy template, which includes an address scan, and my own custom policy. My regex correctly identified the addresses in the test dataset, whereas the template identified none.

Standard Policy

image

Custom Policy

This contains the new regex
image

Test Dataset

image

@zimventures changed the title from "this change updates the address regex in the standard pii policy" to "update address regex in the standard pii policy" on Aug 14, 2023
Contributor

@zimventures left a comment

Just need to get that test to pass and this looks good - thanks for the fix!

@JustEmrick
Contributor Author

It looks like the failing test is the one that runs the regex check on addresses and makes sure there is no match on the incorrect address examples. I'm not sure why the first example would be considered incorrect. My recommendation would be a test that matches all or a portion of a correct address, since I'd personally want to catch potential mistakes caused by fat-fingering or by the way a system inputs the data. For example, if street address, city, state, and postal code are in different columns, the data probably wouldn't get loaded into a database correctly.

Failing Test:

def test_address_pattern_invalid(self):
    """Verify that the Address regex pattern does not match invalid addresses."""
    invalid_addresses = [
        '123 Main Street, Los Angeles, CA',
        '987 Elm Avenue, New York 10001',
        '001 Oak Court, San Francisco 94102',
    ]
    self.verify_pattern('Address', invalid_addresses, False)
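
To illustrate why these three strings stop counting as "invalid" once a pattern accepts partial addresses, here is a minimal sketch. The regex below is a hypothetical stand-in, not the pattern from this PR (which is not shown in the conversation), and `find_address_like` is an illustrative helper name. It assumes the policy searches for a match anywhere in the text rather than requiring the whole string to be a complete address.

```python
import re

# Hypothetical street-level pattern for illustration only; this is NOT
# the regex merged in this PR. It looks for a house number, one to four
# words, and a common street suffix.
STREET_PATTERN = re.compile(
    r"\b\d{1,6}\s+(?:[A-Za-z0-9.]+\s+){1,4}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Court|Ct|Drive|Dr|Lane|Ln)\b",
    re.IGNORECASE,
)


def find_address_like(text):
    """Return the first address-like substring in text, or None."""
    match = STREET_PATTERN.search(text)
    return match.group(0) if match else None


# The strings the old test expected NOT to match.
formerly_invalid = [
    "123 Main Street, Los Angeles, CA",
    "987 Elm Avenue, New York 10001",
    "001 Oak Court, San Francisco 94102",
]

for text in formerly_invalid:
    # Each string contains a valid street portion, so search() finds it
    # even though the state or postal code is missing.
    print(repr(text), "->", find_address_like(text))
```

Under these assumptions, every entry yields a match on its street portion, which is the behavior the failing test was tripping over.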

@zimventures
Contributor

I would agree: the entries listed as invalid addresses are still "address-like" and, as such, should be flagged, which apparently they now are! We have two options here:

  1. Remove the test entirely
  2. Change the list of invalid addresses to be truly invalid ("Joe Cool", "123 Marbles", "Place of Business", etc.)

Personally, I feel as though option 2 keeps the spirit of the test alive.
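
Option 2 could look roughly like the sketch below. It is written under assumptions: the `PATTERNS` table, the regex in it, and this `verify_pattern` implementation are hypothetical stand-ins, since the project's actual helper and the merged regex are not shown in this conversation. Only the test name, the call shape `verify_pattern('Address', ..., False)`, and the three "truly invalid" strings come from the thread.

```python
import re
import unittest

# Hypothetical pattern table; the real policy regex is not reproduced here.
PATTERNS = {
    "Address": re.compile(
        r"\b\d{1,6}\s+(?:[A-Za-z0-9.]+\s+){1,4}"
        r"(?:Street|St|Avenue|Ave|Road|Rd|Court|Ct|Drive|Dr|Lane|Ln)\b",
        re.IGNORECASE,
    ),
}


class AddressPatternTest(unittest.TestCase):
    def verify_pattern(self, name, samples, should_match):
        """Hypothetical helper: check that each sample does or does not match."""
        pattern = PATTERNS[name]
        for sample in samples:
            with self.subTest(sample=sample):
                found = pattern.search(sample) is not None
                self.assertEqual(found, should_match)

    def test_address_pattern_invalid(self):
        """Strings that are not address-like at all should never match."""
        invalid_addresses = [
            "Joe Cool",
            "123 Marbles",
            "Place of Business",
        ]
        self.verify_pattern("Address", invalid_addresses, False)

    def test_address_pattern_partial(self):
        """Address-like fragments should be flagged even if incomplete."""
        partial_addresses = [
            "123 Main Street, Los Angeles, CA",
            "987 Elm Avenue, New York 10001",
        ]
        self.verify_pattern("Address", partial_addresses, True)
```

Saved to a file, this can be run with `python -m unittest`. Keeping a negative test with clearly non-address strings preserves the original intent (guarding against false positives) while a separate positive test documents that partial addresses are expected to be flagged.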

@JustEmrick
Contributor Author

I like the second idea. I updated the test criteria to be more incorrect as you recommended.

Contributor Author

@JustEmrick left a comment

Updated so the regex tests pass. The old examples were too address-adjacent, and users could plausibly have values like those in an actual dataset. The new examples are more obviously fake.

Contributor Author

@JustEmrick left a comment

Updated the failing test: it was meant to reject an "incorrect address", but the regex was matching a valid portion of it.

@zimventures merged commit 10b991b into mantiumai:main on Aug 15, 2023
2 checks passed
@zimventures added the "enhancement" (New feature or request) label on Aug 15, 2023
Labels
enhancement New feature or request
Projects
No open projects
Status: 0.2.0

2 participants