Enable finer-grained recording of event properties and categories. #46

Zsailer · 2020-05-19T17:44:11Z

This replaces #30 and takes a different approach to handle personal data.

First, it's important to state under this PR (this is in the docs, too)—JT does not collect personal data+personally identifiable information by default.

This PR requires all event schemas to explicitly state whether they collect personal data or not. Further, it requires operators to explicitly opt-in to collecting personal data. Otherwise, no personal data will ever be collected.

Changes to the schema definition

Requires all schemas to include a top-level field, personal-data: <bool>, explicitly stating if the event would collect personal data
Requires all properties to include a category field. unrestricted is a special value for properties that are always safe to collect. Otherwise, category is a free field to schema authors and should define what type of data each event property is.

Changes to the EventLog API.

EventLog ignores all events with personal-data: true by default.
EventLog has a new boolean trait, collect_personal_data. This must be set to True to record events with personal-data: true.
If collect_personal_data=True, the EventLog will collect PII events, but remove all properties that aren't explicitly labeled as category: unrestricted.
EventLog has a new list trait, allowed_categories. EventLog will not remove properties whose categories are listed in the `allowed_categories`` trait.
Handlers can also define their own allowed_categories attribute that supersedes the EventLog trait.

I think this addresses most of the points raised in #30. This does not apply any hashing/encryptions to personal data. I think we should definitely add that in a future PR. This is a step in the right direction that balances security and flexibility.

Example Snippets

Here is a simple example of a schema under this PR:

$id: example.schema
version: 1
personal-data: true
properties:
  name:
    title: Name
    category: unrestricted
    description: Name of event
    type: string
  user:
    title: User name
    category: personal-identifier
    description: Name of the user who initiated the event.
  affiliation:
    title: Affiliation
    category: personally-identifiable-information
    description: Affiliation of the user.

Here is an example of an EventLog configured to collect PII:

e = EventLog(
    ...
    collect_personal_data=True,
    allowed_categories=[
        'personally-identifiable-information',
        ...
    ]
)

Let me know what y'all think. @ellisonbg @blink1073 @yuvipanda @jaipreet-s @minrk

vidartf · 2020-05-22T09:42:56Z

This seems like a good thing! I have one suggestion that I hope will make it even better 😉

The examples seems to use personal information (and degrees of this) as the only measure of sensitivity. I understand why this would be the top priority, and why a shorthand exist on the top-level and server config. That being said, I think this should support other kinds of sensitivity as well (and including them in the examples might make it easier to keep this in mind and reason about):

A level of confidentiality in e.g., a military setting (confidential, restricted, ..., top-secret, cosmic top secret). Like for personally-identifiable information this level is linear (you can set a threshold), but would be a separate axis from personally identifiable.
A bag of unrelated sensitivity categories, e.g. for business-related data ("client data", "employee information", "licensed data", "restricted data level X").

The main relevant points from this is that it might be relevant to assign several categories to one schema field (["top-secret", "personal-identifier"] or ["client-data", "licensed-data"]). It might also be useful to have a special category "unknown" that would encompass all possible classes (as it might be impossible to be exhaustive in the schema). As such, the suggestion would be:

change "category" in the schema to "categories" and make it a list.
add a special value "unknown" that will match all possible categories.

Hope this makes sense, and that it can be added without detracting from achieving the personal data protection goal 👍

blink1073 · 2020-05-22T12:53:38Z

I agree with the concept of multiple categories for the reasons @vidartf states above (having needed to use such qualifiers many times at work).

minrk · 2020-05-25T11:35:58Z

@vidartf good point on the possibility of multiple categories. For an inclusive match, should it be all or any? I.e. if I publish a 'top-secret personal-identifier' message and EventLog is subscribed only to top-secret, should it be included or only if watching both top-secret and personal-identifier? I can imagine either way being desirable.

We can, of course, accomplish most of these with a flat list as long as enough unique tags are applied, but that probably gets untenable.

It seems like maybe there's a case to be made for a more sophisticated label filter with arbitrary user filtering logic, as in kubernetes label filters? Should there be an arbitrary (one-level all strings) dict of labels to enable this?

I think categories is a great start and we probably can solve this in a second pass with more generic 'labels', so we don't hold up progress, but I'm okay, too, if folks want to try to solve this better in the first pass.

Zsailer · 2020-05-26T20:35:13Z

Thanks, @vidartf!

I totally agree with everything you said. Allowing administrators to list multiple categories is probably something we should allow here.

add a special value "unknown" that will match all possible categories.

Is the "unrestricted" category different than the "unknown" label?

"unrestricted" essentially means "always record", which sounds similar to your use-case of "unknown". If these are equivalent, do you think "unknown" is a better label?

Zsailer · 2020-05-26T20:54:37Z

should it be all or any?

@minrk, I think it should be all.

I would like to err on the side of "more explicit" when handling personal data. Because this field is particularly targeted at handling personal data, I would rather data not be collected by mistake, rather than have an unexpected piece of data slip through from an any match.

This forces administrators to list every collected category explicitly in their config. It's more work on administrators, but safer for personal data.

"Label filtering" with arbitrary filtering logic is an interesting idea, but again, I'm not sure we want to allow this when handling personal data. I would rather add more constraints that prevent major accidental breaches of personal data collection.

Keep in mind, because we're using Python's logging standard library, admins can easily add their own filters to the logging handlers they pass to the EventLog (see my example here). It would be up to the schema authors to create separate properties for more clever filtering (preferably not on sensitivity level).

Perhaps the issue is that "category" is too broad of a term. Maybe we should narrow this property to something like "pii-category".

minrk · 2020-05-27T08:05:05Z

Thanks, then I think we can get started with a single category field, and require opt-in to all individual categories for now. Then maybe wait for a scenario where this becomes untenable before working on a next revision?

vidartf · 2020-06-01T11:23:34Z

@vidartf good point on the possibility of multiple categories. For an inclusive match, should it be all or any? I.e. if I publish a 'top-secret personal-identifier' message and EventLog is subscribed only to top-secret, should it be included or only if watching both top-secret and personal-identifier? I can imagine either way being desirable.

Sorry for being slow to respond, I had a week off from coding. Given the example case, I might have misunderstood the PR so please don't stall on me. I was under the impression that the subscription was of the type "everything but allowed_categories", but you seem to be saying it is possible to subscribe to only a specific category (which is reasonable), which opens a bunch of follow-up questions. I'm more than happy to open up a new issue for that discussion.

vidartf · 2020-06-01T11:26:46Z

Is the "unrestricted" category different than the "unknown" label?

It would be close to opposite, indicating "this message might match any and all categories" (e.g. if you for some reason wanted to log all kernel messages, it is hard to be exhaustive about which kinds of restriction categories it will match). Think of it as a * glob. But yeah, I'll write this out more properly in another issue.

Zsailer · 2020-06-01T19:19:57Z

Thanks, @vidartf. I think I understand what you mean. You're asking if there's a way to subscribe to multiple events without explicitly listing every event type—i.e. subscribe to all events matching some category.

This is slightly different. This is mostly a post-processing mechanism that drops properties/fields/attrs from an event that includes personal data. It scans each incoming event (that you're already subscribed to) and drops any properties of that event that includes personal-data. The event itself is still collected, but won't include these properties. (You'll still record the type of event, timestamp, etc., just not any properties with categories not white-listed).

That said, I did add one extra fail-safe switch to prevent collecting the events entirely if they include personal data. That's the collect_personal_data=<bool> config option. This is one more sanity check at the beginning that forces operators to be sure they want to collect such events.

Zsailer · 2020-06-01T19:23:53Z

I've updated this PR to allow multiple categories.

I'd love another review before merging.

yuvipanda

Apologies for the late review, and thank you very much for explicitly pinging me off-github, @Zsailer!

yuvipanda · 2020-06-05T13:23:28Z

docs/pages/personal-data.rst

+
+    from jupyter_telemetry import EventLog
+
+    e = EventLog(collect_personal_data=True)


This will much more likely be set by operators with:

c.EventLog.collect_personal_data = True

If we make operators explicitly specify the list of schemas they want (which we already do), this would possibly unintentionally hide that in slightly hard to debug ways. I would love for us to move everything to a per-schema setting, possibly just using categories. Maybe by default only 'unrestricted' will be recorded unless you explicitly opt in? Should have the same effect as this, but be more explicit.

yuvipanda · 2020-06-05T13:28:14Z

docs/pages/personal-data.rst

+
+    e = EventLog(
+      collect_personal_data=True,
+      allowed_categories=['personally-identifiable-information']


Since this is set at an EventLog level, this doesn't let us tune what kinda information we want to collect from which schema. In one schema, 'personal-identifier' might be a pseudonym, and in another it could be your phone number. Those two should be treated completely differently, and keeping this at the global level doesn't allow admins to do that. Schemas could use something more specific than 'personal-identifier', but that requires co-ordination between different schema writers to make sure they don't overlap. This pseudonym / phone number mixup could happen unintentionally, for example, when you have a different spawner / authenticator combo.

What do you think about moving this instead to allowed_schemas? Instead of strings, you can optionally add a dict with schema name, allowed categories, etc.

I like this idea a lot, @yuvipanda. Let me refactor things a bit and see what we think.

One thing that isn't clear here—should individual handlers dictate what type of data is logged by that handler? Right now, allowed_schemas is set at the EventLog level. Should we, instead, set allowed_schemas (with extra metadata about categories) in each handler?

I think it should be layered - handlers inherit allowed_schemas but can impose their own.

yuvipanda

I don't know how to GitHub, tried to hit 'Request Changes' last time...

Zsailer · 2020-08-21T23:18:20Z

Ok! I've taken another crack at this given @yuvipanda's feedback. I think it's much better!

The allowed_schemas trait now takes a dictionary like the following:

# User config to explicitly list what data should be recorded.
allowed_schemas = {
    "uri.to.schema": {
        # explicitly list properties to record
        "properties": ["name", "email"],
        # and/or list categories to record
        "categories": ["personal-information"]
    }
}

If an event's property is explicitly listed, it will be recorded in that event's logged data. Otherwise, it will only be recorded if the property's categories matches the listed category. (<-- my only question is whether this should be an any or all... my inclination is to make this an any).

Zsailer · 2020-08-24T15:28:58Z

Closing and reopening as a new PR, since this work has changed dramatically.

Zsailer added 12 commits April 30, 2020 13:51

Add PII awareness to event formatter

48922ee

mult-level security

002d116

add default event level to handlers in eventlog initialize

65cf063

fix tests

898a118

various linting issues

add09dd

add docs about sensitive data

38509ee

wip

64088bf

replace multi-level approach to tagging approach

2b17927

add explicit argument to collect personal data

39011bc

more explicit language in docs

1db553d

add sentence to schema docs about personal data

4c3e83a

typo in formatter

6a255be

This was referenced May 19, 2020

[MRG] Add multi-level security awareness to event formatter #30

Closed

Personal data design considerations #31

Open

Zsailer added 2 commits May 19, 2020 10:50

remove unused traits

53d6a2b

linting errors fixed

c247032

Zsailer mentioned this pull request May 19, 2020

Telemetry and Event Logging jupyter-server/jupyter_server#233

Closed

Zsailer requested review from yuvipanda, ellisonbg, jaipreet-s and minrk May 21, 2020 21:06

make categories field a list

4cfcded

linting trouble

2d60081

Zsailer mentioned this pull request Jun 1, 2020

pin jupyter_telemetry dependency jupyterhub/jupyterhub#3067

Merged

yuvipanda reviewed Jun 5, 2020

View reviewed changes

yuvipanda requested changes Jun 5, 2020

View reviewed changes

Zsailer added 2 commits August 19, 2020 15:04

make allowed_schemas trait handle category filtering

483718e

update docs

766ca4f

Zsailer added 2 commits August 21, 2020 16:21

fix keys in tests

a318744

linting cleanup

d0407c6

Zsailer changed the title ~~Handling Personal data~~ Enable finer-grained recording of event properties and categories. Aug 22, 2020

Zsailer closed this Aug 24, 2020

Zsailer mentioned this pull request Aug 24, 2020

Fine grained data collection #49

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable finer-grained recording of event properties and categories. #46

Enable finer-grained recording of event properties and categories. #46

Zsailer commented May 19, 2020 •

edited

vidartf commented May 22, 2020

blink1073 commented May 22, 2020

minrk commented May 25, 2020

Zsailer commented May 26, 2020

Zsailer commented May 26, 2020

minrk commented May 27, 2020

vidartf commented Jun 1, 2020

vidartf commented Jun 1, 2020

Zsailer commented Jun 1, 2020

Zsailer commented Jun 1, 2020

yuvipanda left a comment •

edited

yuvipanda Jun 5, 2020

yuvipanda Jun 5, 2020

yuvipanda Jun 5, 2020

yuvipanda Jun 5, 2020

Zsailer Aug 18, 2020

Zsailer Aug 18, 2020

blink1073 Aug 19, 2020

yuvipanda left a comment

Zsailer commented Aug 21, 2020

Zsailer commented Aug 24, 2020


		from jupyter_telemetry import EventLog

		e = EventLog(collect_personal_data=True)

Enable finer-grained recording of event properties and categories. #46

Enable finer-grained recording of event properties and categories. #46

Conversation

Zsailer commented May 19, 2020 • edited

Example Snippets

vidartf commented May 22, 2020

blink1073 commented May 22, 2020

minrk commented May 25, 2020

Zsailer commented May 26, 2020

Zsailer commented May 26, 2020

minrk commented May 27, 2020

vidartf commented Jun 1, 2020

vidartf commented Jun 1, 2020

Zsailer commented Jun 1, 2020

Zsailer commented Jun 1, 2020

yuvipanda left a comment • edited

Choose a reason for hiding this comment

yuvipanda Jun 5, 2020

Choose a reason for hiding this comment

yuvipanda Jun 5, 2020

Choose a reason for hiding this comment

yuvipanda Jun 5, 2020

Choose a reason for hiding this comment

yuvipanda Jun 5, 2020

Choose a reason for hiding this comment

Zsailer Aug 18, 2020

Choose a reason for hiding this comment

Zsailer Aug 18, 2020

Choose a reason for hiding this comment

blink1073 Aug 19, 2020

Choose a reason for hiding this comment

yuvipanda left a comment

Choose a reason for hiding this comment

Zsailer commented Aug 21, 2020

Zsailer commented Aug 24, 2020

Zsailer commented May 19, 2020 •

edited

yuvipanda left a comment •

edited