Test-suite for data anonymization #2

Open · louislva opened this issue Nov 13, 2022 · 4 comments
Labels: task (A smaller, more concise task)

Comments

@louislva (Owner) commented Nov 13, 2022

We need to develop automatic data anonymization, and to do that sanely, we should have a test-suite that checks for false negatives, i.e. sensitive values that slip through the anonymization.

A simple way to do that: record a number of sessions of humans typing in (fake) sensitive data and save them as JSON files. Then build a test-suite that runs each JSON file through the anonymize() function and checks whether the values to be anonymized are still present afterwards, including inside the concatenated keystrokes. If any are, the test case should fail (see the sketch after the list below).

The kind of sensitive data we should test for:

  • Email address
  • Password
  • Name
  • Home address
  • Phone number
  • Bank account / credit card details
  • Crypto seed phrases
  • API keys
  • Social security / VAT number / passport number
  • ... anything else you can think of! Please drop a comment!
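
A minimal sketch of what such a test-suite could look like (Jest-style; anonymize(), the fixture layout, and the concatKeystrokes() helper are all assumptions here, not existing code):

```ts
import * as fs from "fs";
import * as path from "path";
import { anonymize } from "../src/anonymize"; // hypothetical module
import { concatKeystrokes } from "./helpers"; // hypothetical helper: joins keystroke events into a string

// Each recorded session: the raw events plus the fake sensitive values typed during it.
interface Fixture {
  events: unknown[];
  sensitiveValues: string[]; // fake email, password, card number, ...
}

const fixtureDir = path.join(__dirname, "fixtures");

for (const file of fs.readdirSync(fixtureDir).filter((f) => f.endsWith(".json"))) {
  test(`anonymize() scrubs ${file}`, () => {
    const fixture: Fixture = JSON.parse(
      fs.readFileSync(path.join(fixtureDir, file), "utf8")
    );
    const anonymized = anonymize(fixture.events);

    // Check the serialized events, and also the concatenated keystrokes:
    // a value typed one character at a time never appears in a single event.
    const serialized = JSON.stringify(anonymized);
    const keystrokes = concatKeystrokes(anonymized);

    for (const value of fixture.sensitiveValues) {
      expect(serialized).not.toContain(value); // a hit here is a false negative
      expect(keystrokes).not.toContain(value);
    }
  });
}
```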
@JohannesHa commented
Some of these seem to be covered by the maskInputOptions parameter of the rrweb.record function. The other cases could be handled via the maskInputFn and maskTextFn parameters of rrweb.record.
https://github.com/rrweb-io/rrweb/blob/master/guide.md#options

MaskInputOptions:
https://github.com/rrweb-io/rrweb/blob/588164aa12f1d94576f89ae0210b98f6e971c895/packages/rrweb-snapshot/src/types.ts#L77-L95

It probably still makes sense to build some kind of test-suite with mock events for rrweb.record, to ensure that all edge cases are covered.
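
For reference, a rough sketch of how those options plug into rrweb.record (option names per the guide linked above; exact signatures may differ between rrweb versions):

```ts
import { record } from "rrweb";

const events: unknown[] = [];

record({
  emit(event) {
    events.push(event);
  },
  // Built-in masking for whole categories of inputs (see MaskInputOptions above)
  maskInputOptions: {
    password: true,
    email: true,
    tel: true,
  },
  // Catch-all for inputs the options above don't cover:
  // replace the typed text with same-length asterisks.
  maskInputFn: (text: string) => "*".repeat(text.length),
});
```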

@louislva (Owner) commented

That actually looks pretty suitable! I'm curious whether maskInputFn & maskTextFn can also replace the masked value with a placeholder? Even if they can't, for shipping V1 we just need to censor personal data, not necessarily add the placeholders (although they'd be really useful to train with). I think we'll just put an "anonymization_scheme_version" column in the database, so you can see what's what.

Also, how do you think we'll go about censoring data we don't know is personally identifiable? For example, if I'm logged into Google, it'll display my full name in certain places.

One idea I had was to automatically scrape it (or simply ask the user for all their personal details), save it locally, and then use maskTextFn to look for the data we know to be personal.

@louislva (Owner) commented

Looked into it: you set a maskTextSelector (which could probably be *), and then maskTextFn gets triggered, which basically maps old text to new text. So yes, we can do placeholders 🥳
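
Picking up the scrape-or-ask idea from the earlier comment, a sketch of what that could look like (the knownPersonalData store is hypothetical; maskTextSelector/maskTextFn are per the rrweb guide):

```ts
import { record } from "rrweb";

// Hypothetical local store of values the user has told us are personal,
// mapped to training-friendly placeholders.
const knownPersonalData: Record<string, string> = {
  "Jane Doe": "<FULL_NAME>",
  "jane@example.com": "<EMAIL>",
};

record({
  emit(event) {
    /* ... */
  },
  // Run maskTextFn over every text node in the recording
  maskTextSelector: "*",
  // Replace each known personal value with its placeholder; leave other text alone
  maskTextFn: (text: string) => {
    let masked = text;
    for (const [value, placeholder] of Object.entries(knownPersonalData)) {
      masked = masked.split(value).join(placeholder);
    }
    return masked;
  },
});
```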

@louislva (Owner) commented

Another important test case: profile picture anonymization! (In the top right of GitHub, for example; it's pretty easy to recover someone's identity from a picture of their face.)
