Test-suite for data anonymization #2

Open · louislva opened this issue Nov 13, 2022 · 4 comments
Labels: task (A smaller, more concise task)

Comments

@louislva (Owner) commented Nov 13, 2022

We need to develop automatic data anonymization, and to do that sanely, we should have a test-suite that checks for false negatives, i.e. sensitive values that slip through the anonymization.

A simple way to do that: record a number of sessions of humans typing in (fake) sensitive data and save them as JSON files. Then build a test-suite that runs each JSON file through the anonymize() function and checks whether the values to be anonymized are still present afterwards, including inside the concatenated keystrokes. If any are, the test case should fail (see the sketch after the list below).

The kind of sensitive data we should test for:

  • Email address
  • Password
  • Name
  • Home address
  • Phone number
  • Bank account / credit card details
  • Crypto seed phrases
  • API keys
  • Social security / VAT number / passport number
  • ... anything else you can think of! Please drop a comment!
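
A minimal sketch of what such a test-suite could look like (Jest-style; anonymize(), the fixture layout, and the concatKeystrokes() helper are all assumptions here, not existing code):

```ts
import * as fs from "fs";
import * as path from "path";
import { anonymize } from "../src/anonymize"; // hypothetical module
import { concatKeystrokes } from "./helpers"; // hypothetical helper: joins keystroke events into a string

// Each recorded session: the raw events plus the fake sensitive values typed during it.
interface Fixture {
  events: unknown[];
  sensitiveValues: string[]; // fake email, password, card number, ...
}

const fixtureDir = path.join(__dirname, "fixtures");

for (const file of fs.readdirSync(fixtureDir).filter((f) => f.endsWith(".json"))) {
  test(`anonymize() scrubs ${file}`, () => {
    const fixture: Fixture = JSON.parse(
      fs.readFileSync(path.join(fixtureDir, file), "utf8")
    );
    const anonymized = anonymize(fixture.events);

    // Check the serialized events, and also the concatenated keystrokes:
    // a value typed one character at a time never appears in a single event.
    const serialized = JSON.stringify(anonymized);
    const keystrokes = concatKeystrokes(anonymized);

    for (const value of fixture.sensitiveValues) {
      expect(serialized).not.toContain(value); // a hit here is a false negative
      expect(keystrokes).not.toContain(value);
    }
  });
}
```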
@JohannesHa commented
Some of these seem to be covered by the maskInputOptions parameter of the rrweb.record function. The other cases could be handled via the maskInputFn and maskTextFn parameters of rrweb.record.
https://github.com/rrweb-io/rrweb/blob/master/guide.md#options

MaskInputOptions:
https://github.com/rrweb-io/rrweb/blob/588164aa12f1d94576f89ae0210b98f6e971c895/packages/rrweb-snapshot/src/types.ts#L77-L95

It probably still makes sense to build some kind of test-suite with mock events for rrweb.record, to ensure that all edge cases are covered.
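
For reference, a rough sketch of how those options plug into rrweb.record (option names per the guide linked above; exact signatures may differ between rrweb versions):

```ts
import { record } from "rrweb";

const events: unknown[] = [];

record({
  emit(event) {
    events.push(event);
  },
  // Built-in masking for whole categories of inputs (see MaskInputOptions above)
  maskInputOptions: {
    password: true,
    email: true,
    tel: true,
  },
  // Catch-all for inputs the options above don't cover:
  // replace the typed text with same-length asterisks.
  maskInputFn: (text: string) => "*".repeat(text.length),
});
```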

@louislva (Owner) commented

That actually looks pretty suitable! I'm curious whether maskInputFn & maskTextFn can also replace the masked value with a placeholder? Even if they can't, for shipping V1 we just need to censor personal data, not necessarily add the placeholders (although they'd be really useful to train with). I think we'll just put an "anonymization_scheme_version" column in the database, so you can see what's what.

Also, how do you think we'll go about censoring data we don't know is personally identifiable? For example, if I'm logged into Google, it'll display my full name in certain places.

One idea I had was to automatically scrape it (or simply ask the user for all their personal details), save it locally, and then use maskTextFn to look for the data we know to be personal.

@louislva (Owner) commented

Looked into it: you set a maskTextSelector (which could probably be *), and then maskTextFn gets triggered, which basically maps old text to new text. So yes, we can do placeholders 🥳
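
Picking up the scrape-or-ask idea from the earlier comment, a sketch of what that could look like (the knownPersonalData store is hypothetical; maskTextSelector/maskTextFn are per the rrweb guide):

```ts
import { record } from "rrweb";

// Hypothetical local store of values the user has told us are personal,
// mapped to training-friendly placeholders.
const knownPersonalData: Record<string, string> = {
  "Jane Doe": "<FULL_NAME>",
  "jane@example.com": "<EMAIL>",
};

record({
  emit(event) {
    /* ... */
  },
  // Run maskTextFn over every text node in the recording
  maskTextSelector: "*",
  // Replace each known personal value with its placeholder; leave other text alone
  maskTextFn: (text: string) => {
    let masked = text;
    for (const [value, placeholder] of Object.entries(knownPersonalData)) {
      masked = masked.split(value).join(placeholder);
    }
    return masked;
  },
});
```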

@louislva (Owner) commented

Another important test case: profile picture anonymization! (In the top right of GitHub, for example; it's pretty easy to recover someone's identity from a picture of their face.)
