Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable finer-grained recording of event properties and categories. #46

Closed
wants to merge 20 commits into from

Conversation

Zsailer
Copy link
Member

@Zsailer Zsailer commented May 19, 2020

This replaces #30 and takes a different approach to handle personal data.

First, it's important to state under this PR (this is in the docs, too)—JT does not collect personal data+personally identifiable information by default.

This PR requires all event schemas to explicitly state whether they collect personal data or not. Further, it requires operators to explicitly opt-in to collecting personal data. Otherwise, no personal data will ever be collected.

Changes to the schema definition

  • Requires all schemas to include a top-level field, personal-data: <bool>, explicitly stating if the event would collect personal data
  • Requires all properties to include a category field. unrestricted is a special value for properties that are always safe to collect. Otherwise, category is a free field to schema authors and should define what type of data each event property is.

Changes to the EventLog API.

  • EventLog ignores all events with personal-data: true by default.
  • EventLog has a new boolean trait, collect_personal_data. This must be set to True to record events with personal-data: true.
  • If collect_personal_data=True, the EventLog will collect PII events, but remove all properties that aren't explicitly labeled as category: unrestricted.
  • EventLog has a new list trait, allowed_categories. EventLog will not remove properties whose categories are listed in the `allowed_categories`` trait.
  • Handlers can also define their own allowed_categories attribute that supersedes the EventLog trait.

I think this addresses most of the points raised in #30. This does not apply any hashing/encryptions to personal data. I think we should definitely add that in a future PR. This is a step in the right direction that balances security and flexibility.

Example Snippets

Here is a simple example of a schema under this PR:

$id: example.schema
version: 1
personal-data: true
properties:
  name:
    title: Name
    category: unrestricted
    description: Name of event
    type: string
  user:
    title: User name
    category: personal-identifier
    description: Name of the user who initiated the event.
  affiliation:
    title: Affiliation
    category: personally-identifiable-information
    description: Affiliation of the user.

Here is an example of an EventLog configured to collect PII:

e = EventLog(
    ...
    collect_personal_data=True,
    allowed_categories=[
        'personally-identifiable-information',
        ...
    ]
)

Let me know what y'all think. @ellisonbg @blink1073 @yuvipanda @jaipreet-s @minrk

@vidartf
Copy link

vidartf commented May 22, 2020

This seems like a good thing! I have one suggestion that I hope will make it even better 😉

The examples seems to use personal information (and degrees of this) as the only measure of sensitivity. I understand why this would be the top priority, and why a shorthand exist on the top-level and server config. That being said, I think this should support other kinds of sensitivity as well (and including them in the examples might make it easier to keep this in mind and reason about):

  • A level of confidentiality in e.g., a military setting (confidential, restricted, ..., top-secret, cosmic top secret). Like for personally-identifiable information this level is linear (you can set a threshold), but would be a separate axis from personally identifiable.

  • A bag of unrelated sensitivity categories, e.g. for business-related data ("client data", "employee information", "licensed data", "restricted data level X").

The main relevant points from this is that it might be relevant to assign several categories to one schema field (["top-secret", "personal-identifier"] or ["client-data", "licensed-data"]). It might also be useful to have a special category "unknown" that would encompass all possible classes (as it might be impossible to be exhaustive in the schema). As such, the suggestion would be:

  • change "category" in the schema to "categories" and make it a list.
  • add a special value "unknown" that will match all possible categories.

Hope this makes sense, and that it can be added without detracting from achieving the personal data protection goal 👍

@blink1073
Copy link
Member

I agree with the concept of multiple categories for the reasons @vidartf states above (having needed to use such qualifiers many times at work).

@minrk
Copy link
Member

minrk commented May 25, 2020

@vidartf good point on the possibility of multiple categories. For an inclusive match, should it be all or any? I.e. if I publish a 'top-secret personal-identifier' message and EventLog is subscribed only to top-secret, should it be included or only if watching both top-secret and personal-identifier? I can imagine either way being desirable.

We can, of course, accomplish most of these with a flat list as long as enough unique tags are applied, but that probably gets untenable.

It seems like maybe there's a case to be made for a more sophisticated label filter with arbitrary user filtering logic, as in kubernetes label filters? Should there be an arbitrary (one-level all strings) dict of labels to enable this?

I think categories is a great start and we probably can solve this in a second pass with more generic 'labels', so we don't hold up progress, but I'm okay, too, if folks want to try to solve this better in the first pass.

@Zsailer
Copy link
Member Author

Zsailer commented May 26, 2020

Thanks, @vidartf!

I totally agree with everything you said. Allowing administrators to list multiple categories is probably something we should allow here.

add a special value "unknown" that will match all possible categories.

Is the "unrestricted" category different than the "unknown" label?

"unrestricted" essentially means "always record", which sounds similar to your use-case of "unknown". If these are equivalent, do you think "unknown" is a better label?

@Zsailer
Copy link
Member Author

Zsailer commented May 26, 2020

should it be all or any?

@minrk, I think it should be all.

I would like to err on the side of "more explicit" when handling personal data. Because this field is particularly targeted at handling personal data, I would rather data not be collected by mistake, rather than have an unexpected piece of data slip through from an any match.

This forces administrators to list every collected category explicitly in their config. It's more work on administrators, but safer for personal data.

"Label filtering" with arbitrary filtering logic is an interesting idea, but again, I'm not sure we want to allow this when handling personal data. I would rather add more constraints that prevent major accidental breaches of personal data collection.

Keep in mind, because we're using Python's logging standard library, admins can easily add their own filters to the logging handlers they pass to the EventLog (see my example here). It would be up to the schema authors to create separate properties for more clever filtering (preferably not on sensitivity level).

Perhaps the issue is that "category" is too broad of a term. Maybe we should narrow this property to something like "pii-category".

@minrk
Copy link
Member

minrk commented May 27, 2020

Thanks, then I think we can get started with a single category field, and require opt-in to all individual categories for now. Then maybe wait for a scenario where this becomes untenable before working on a next revision?

@vidartf
Copy link

vidartf commented Jun 1, 2020

@vidartf good point on the possibility of multiple categories. For an inclusive match, should it be all or any? I.e. if I publish a 'top-secret personal-identifier' message and EventLog is subscribed only to top-secret, should it be included or only if watching both top-secret and personal-identifier? I can imagine either way being desirable.

Sorry for being slow to respond, I had a week off from coding. Given the example case, I might have misunderstood the PR so please don't stall on me. I was under the impression that the subscription was of the type "everything but allowed_categories", but you seem to be saying it is possible to subscribe to only a specific category (which is reasonable), which opens a bunch of follow-up questions. I'm more than happy to open up a new issue for that discussion.

@vidartf
Copy link

vidartf commented Jun 1, 2020

Is the "unrestricted" category different than the "unknown" label?

It would be close to opposite, indicating "this message might match any and all categories" (e.g. if you for some reason wanted to log all kernel messages, it is hard to be exhaustive about which kinds of restriction categories it will match). Think of it as a * glob. But yeah, I'll write this out more properly in another issue.

@Zsailer
Copy link
Member Author

Zsailer commented Jun 1, 2020

Thanks, @vidartf. I think I understand what you mean. You're asking if there's a way to subscribe to multiple events without explicitly listing every event type—i.e. subscribe to all events matching some category.

This is slightly different. This is mostly a post-processing mechanism that drops properties/fields/attrs from an event that includes personal data. It scans each incoming event (that you're already subscribed to) and drops any properties of that event that includes personal-data. The event itself is still collected, but won't include these properties. (You'll still record the type of event, timestamp, etc., just not any properties with categories not white-listed).

That said, I did add one extra fail-safe switch to prevent collecting the events entirely if they include personal data. That's the collect_personal_data=<bool> config option. This is one more sanity check at the beginning that forces operators to be sure they want to collect such events.

@Zsailer
Copy link
Member Author

Zsailer commented Jun 1, 2020

I've updated this PR to allow multiple categories.

I'd love another review before merging.

Copy link
Collaborator

@yuvipanda yuvipanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the late review, and thank you very much for explicitly pinging me off-github, @Zsailer!


from jupyter_telemetry import EventLog

e = EventLog(collect_personal_data=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will much more likely be set by operators with:

c.EventLog.collect_personal_data = True

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make operators explicitly specify the list of schemas they want (which we already do), this would possibly unintentionally hide that in slightly hard to debug ways. I would love for us to move everything to a per-schema setting, possibly just using categories. Maybe by default only 'unrestricted' will be recorded unless you explicitly opt in? Should have the same effect as this, but be more explicit.


e = EventLog(
collect_personal_data=True,
allowed_categories=['personally-identifiable-information']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is set at an EventLog level, this doesn't let us tune what kinda information we want to collect from which schema. In one schema, 'personal-identifier' might be a pseudonym, and in another it could be your phone number. Those two should be treated completely differently, and keeping this at the global level doesn't allow admins to do that. Schemas could use something more specific than 'personal-identifier', but that requires co-ordination between different schema writers to make sure they don't overlap. This pseudonym / phone number mixup could happen unintentionally, for example, when you have a different spawner / authenticator combo.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about moving this instead to allowed_schemas? Instead of strings, you can optionally add a dict with schema name, allowed categories, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea a lot, @yuvipanda. Let me refactor things a bit and see what we think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that isn't clear here—should individual handlers dictate what type of data is logged by that handler? Right now, allowed_schemas is set at the EventLog level. Should we, instead, set allowed_schemas (with extra metadata about categories) in each handler?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be layered - handlers inherit allowed_schemas but can impose their own.

Copy link
Collaborator

@yuvipanda yuvipanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to GitHub, tried to hit 'Request Changes' last time...

@Zsailer
Copy link
Member Author

Zsailer commented Aug 21, 2020

Ok! I've taken another crack at this given @yuvipanda's feedback. I think it's much better!

The allowed_schemas trait now takes a dictionary like the following:

# User config to explicitly list what data should be recorded.
allowed_schemas = {
    "uri.to.schema": {
        # explicitly list properties to record
        "properties": ["name", "email"],
        # and/or list categories to record
        "categories": ["personal-information"]
    }
}

If an event's property is explicitly listed, it will be recorded in that event's logged data. Otherwise, it will only be recorded if the property's categories matches the listed category. (<-- my only question is whether this should be an any or all... my inclination is to make this an any).

@Zsailer Zsailer changed the title Handling Personal data Enable finer-grained recording of event properties and categories. Aug 22, 2020
@Zsailer
Copy link
Member Author

Zsailer commented Aug 24, 2020

Closing and reopening as a new PR, since this work has changed dramatically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants