[MRG] Add multi-level security awareness to event formatter #30

Zsailer · 2019-09-30T18:03:48Z

This PR adds "multi-level security" awareness to the event log.

Background

I've been playing around with different approaches for handling personally identifiable information when event logging. When I've talked to a few different stake-holders, PII alway became a big point of conversation. Some groups requires PII data in their event logs, others are required by law to drop it. It's a challenge to accommodate both groups. Here's what I came up with...

Technical Details

This PR patches the json formatter to look for a log level in the properties of event schemas. If the given handler that's emitting those events has an attribute handler.event_level, then properties with levels > event_level will be dropped from the emitted event. We should discuss what are appropriate level names. In this PR, I've just arbitrarily chosen:

unclassified
confidential
secret
top_secret

This requires operators to explicitly add event_level to each handler to emit sensitive data from that handler. Each handler can dictate its own level. The same event log can emit sensitive data to one handler and not another. This make it extremely flexible for managing this type of data.

This also requires event schemas to include level attribute for each property in their JSON schema. If this attribute is missing from a property, it is assumed that the property is not "unclassified". Here's what an example schema might look like. The name property for any event with this schema is PII and will not be included if level = 'confidential' on the emitting handler.

    $id: url.to.event.schema
    version: 1
    title: My Event
    description: |
      All events must have a name property
    type: object
    properties:
      name:
        title: Name
        description: |
            Name of event
        level: confidential
        type: string
    required:
    - name

For operators

It's simple to configure this option. Simply add an attribute to each handler:

import logging 

handler = logging.FileHandler('events.log')
handler.event_level = 'confidential'

eventlog = EventLog(
    handlers = [handler],
    ...
)

When the event log is initialized, each handler will get the new JSON formatter with PII awareness.

Zsailer · 2019-09-30T18:15:01Z

Still needs tests and documentation.

Zsailer · 2019-10-01T00:30:37Z

I've refactored this PR a bit to allow for "multi-level security".

Zsailer · 2019-10-02T18:21:55Z

Okay added tests and documentation. This is ready to merge if we think this is a reasonable approach.

Zsailer · 2019-10-02T18:30:04Z

Oops, forgot some docs. Working on that now. The code/tests should be ready for review.

Zsailer · 2019-10-02T19:20:37Z

Ok added initial docs, but they likely need work.

ellisonbg · 2019-10-03T05:50:57Z

@Zsailer this is great work. Thanks for diving into this important area of this work. I opened a new issue #31 with some comments on this, but will also add a note here. The GDPR has a useful description and classification of "personal data":

https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/key-definitions/what-is-personal-data/

that I think is helpful here. While I like the direction this is going, I am concerned about us picking security levels that are somewhat arbitrary. What if we just had two levels:

Personal data
Not personal data

ellisonbg · 2019-10-03T05:52:14Z

Another key point - adding a username or identifier to essentially any event causes that entire event to be considered personal data. Because of this, I think usernames/identifiers need special treatment.

Zsailer · 2019-10-03T16:38:57Z

Thanks, @ellisonbg!

adding a username or identifier to essentially any event causes that entire event to be considered personal data

What if we filter out the username/identifier properties of the event using this PRs approach? Is the event still considered personal data?

ellisonbg · 2019-10-04T03:40:24Z

Where do we put the username/identifier on events. Is that a common property on a base event schema? If not, I think it should be - I don't want people trying to get around these things by calling it something different and pretending they aren't collecting it.

You raise a great question though. Let's see if we can find an example of an event that shows how this would work. Let's say there is an event that has:

username
social security number
NEW NOTEBOOK ACTION WAS TRIGGERED!

If username alone is removed, the event would definitely still be personal data. However, if both the usearname and social security number are removed, then I think the event (at least this one!) is no longer personal data. So as long as the filtering logic removes all personal data attributes, I think the event can survive...unless there are a combination of non-personal data that in aggregate becomes personal.

This gets into the land of reidentification. When assessing properties as personal data or not, we need to be extremely skeptical about if they could be used to reidentify the individual. The definition of personal data in GDPR does a decent job of covering this, and it would probably be worthwhile for us to have an expert in data privacy law audit our assessment of properties once we get further along.

yuvipanda · 2019-10-08T10:47:13Z

I hope nobody puts social security numbers in JSON events! :D

yuvipanda · 2019-10-08T10:48:14Z

Link to SELinux where we stole this from: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/security-enhanced_linux/mls

yuvipanda · 2019-10-08T10:49:43Z

jupyter_telemetry/formatter.py

+        key = (log_record['__schema__'], log_record['__version__'])
+        schema = self.logger.schemas[key]['properties']
+        # Logging keys that won't be in the schema.
+        ignored_keys = ['__schema__', '__timestamp__', '__version__', 'message']


We should probably ignore every key that starts with and ends with two underscores, since these will never come from the user schema

yuvipanda · 2019-10-08T10:51:39Z

We should also put __event_classification__ or `event_level`` in all emitted events, so analysts can look at it and figure out what fields they can expect to be there. So you can figure out what attributes exist with a combination of (schema_name, version, event_level)

yuvipanda · 2019-10-08T10:57:00Z

For levels, I think we should go from our experience building schemas for things inside the Jupyter project, and be more descriptive. I propose that currently we only have three levels:

unrestricted (or basic?)
user-identifier (usernames, user id, etc)
user-identifiable-information (uncleaned user agent, IPs, lat/lon (why?), social security number (usernames are not passwords, america!))

We should provide strong guidelines on what kinda data gets set to what level, and encourage people to adopt those. We can enforce that in the events we ship in our projects. However, we can't enforce it on others building custom events, and I hope we limit data collection in the default events we emit to only what's needed.

minrk · 2020-01-30T12:03:48Z

docs/pages/schemas.rst

@@ -33,6 +40,7 @@ Here is a minimal example of a valid JSON schema for an event.
    properties:
      name:
        title: Name
+        level: confidential


FWIW, I don't know that this is a good example. Since the name property is required for all events (right?), I think maybe this should always be unclassified. Maybe a two-field example better illustrates it:

name: title: Name level: unclassified description: Name of event user: title: User name level: confidential description: Name of user who initiated event

👍 good point. I'll update this.

minrk · 2020-01-30T12:05:12Z

docs/pages/sensitive-data.rst

+
+Jupyter Telemetry offers flexibility when logging and handling sensitive data. 
+
+Since events may include data with varying degrees of sensitivity, Jupyter Telemetry uses a multi-level security approach. It exposed four levels of sensitivity:


Is there a default level when unspecified, or is level always required to be explicit for every property?

Right now, the default is unclassified if level is not given, but I think it's an open question before merging this. I actually think it makes sense to require the level to be explicit in each property... what are your thoughts?

I think it would be nice to have it be required.

docs/pages/sensitive-data.rst

minrk · 2020-01-30T12:06:29Z

jupyter_telemetry/eventlog.py

            for handler in self.handlers:
+                # If event_log is not an attribute of handler, patch in.
+                if not hasattr(handler, 'event_level'):


Rather than modifying the object, we can use

event_level = getattr(handler, 'event_level', 'unclassified')

minrk · 2020-01-30T12:15:04Z

jupyter_telemetry/formatter.py

+
+        super(JsonEventFormatter, self).__init__(*args, **kwargs)
+
+    def setEventLevel(self, event_level):


Having a public setter API and a protected attribute for the same is not very Pythonic (much of the stdlib logging module is this way for historical reasons, but it doesn't need to be added to). This can be accomplished with a single property, eliminating the setEventLevel setter:

@property def event_level(self): return self._event_level @event_level.setter def event_level(self, event_level): if event_level not in EVENT_LEVELS: raise Exception("Event level '{}' not understood.".format(event_level)) self._event_level = self.handler.event_level

Yes, that is true— but in this case, we're overriding the method from the JsonFormatter parent class (which is an external dependency). The issue is that this method is called by other places in this parent class. Thus, I think we need to keep this method signature consistent with the parent class. I realize this isn't Pythonic, but it's consistent with the method it overrides.

Looks like this was recently changed! Ignore my prior comment. :)

Updating to use property setter.

Zsailer · 2020-02-07T19:22:45Z

@yuvipanda how would you rank the three levels you mentioned above:

unrestricted

user-identifier

user-identifiable-information

Isn't user-identifier more sensitive than user-identifiable-information? For example, a user's affiliation (e.g. Project Jupyter) can be personally-identifiable information when combined with other PII data, but alone, it's less identifiable/sensitive than actually using their username (e.g. zsailer), right? It's unclear to me what the sensitivity "level" of these categories would be.

ellisonbg · 2020-02-08T22:53:48Z

Thanks everyone for contributing to this discussion...

The user-identifier and user-identifiable-information classes are definitely classes that I see in the wild. From my perspective, the usefulness of having classes is in giving the operator information that can be used to (a) remove the information or (b) hash/anonymize appropriately. Because user identifiers tend to be the link that enables reidentification, I think they deserve special consideration. So, overall, I think the three classes proposed by @yuvipanda make sense.

@Zsailer raises an important question about the ordering of the classes. I tend to think of these as more nominal categoricals, rather than ordinal.

yuvipanda · 2020-02-09T08:37:12Z

Apologies for having stepped away from this a while, and thank you for picking this back up, @Zsailer and @minrk.

I agree with @ellisonbg - these aren't ordinal numerical levels. I think of username as differently sensitive than user-identifiable-information. For example, if the fact that my username is 'yuvipanda' leaks, it is a different issue than if my SSN or whatever leaks. So it is a useful distinction to make, without any comment about wether one is more sensitive or not.

So this means admins would have to explicitly opt-in to each level, rather than a 'this level and below' option.

minrk · 2020-02-09T14:50:16Z

Agreed that they should probably be treated as categorical rather than hierarchical. Think of it as a list of tags/topics to subscribe to, along the lines of:

self.topics = ['user-identifiable', 'location']

def emit(self, record):
    for field in fields:
		# intersection = any match, use .issubset for all match
        if schema[field]['topics'].intersection(self.topics):
            include(field)

A couple questions that follow from that:

is there still a place for level in addition to category for things like user-identifiable, or is category all we really need?
Are categories a set dictated by this repo, or a free field to match against for packages? My inclination would be the latter so deployments can define some of their own categories for events, but that could become a documentation/standardization challenge if there are too many sources for events (notebook package, jupyterhub, spawner, local customization, etc.).
If they are categorical, can an event have more than one category? If so, does the subscription filter match using 'all' or 'any'? An event's request ip address might reasonably be labeled 'user-identifiable' and 'location' for instance. Not sure if that would be a good practice or not.

I think it's a good idea to sketch some real events and labels/filters folks could want to see what's the best fit. I'm honestly not sure at this point.

ellisonbg · 2020-02-09T17:16:30Z

Min brings up great questions. I think the missing piece here is policies about how different categories are handled. Some examples of policies that one might apply to fields/events of a given class: * Remove * Hash using different algorithms (salted rotating hash, etc.) * Allow * Allow but fuzz using differential privacy Which of these policies is applied to which classes has an immediate impact on the bigger picture of data privacy and on legal compliance. I would love to see our library to the next step and implement policies like this, as some of them are really difficult to get right. I want to make it really easy for operators to do the right thing and not make silly mistakes that compromise data privacy/security. Given this background, I will try to explore answers to Mins questions... I think there is room for categories to be a free field that is matched against. In terms of policies, that would roughly look like "apply this policy to this tag." If we want to use these categories to apply policies though, then I don't think it makes sense to be able to apply multiple tags (or at least it isn't obvious with the current tags we are discussing). This is mostly the case because some of the policies are contradictory (allow/remove, hash/don't hash).

…

On Sun, Feb 9, 2020 at 6:50 AM Min RK ***@***.***> wrote: Agreed that they should probably be treated as categorical rather than hierarchical. Think of it as a list of tags/topics to subscribe to, along the lines of: self.topics = ['user-identifiable', 'location'] def emit(self, record): for field in fields: # intersection = any match, use .issubset for all match if schema[field]['topics'].intersection(self.topics): include(field) A couple questions that follow from that: - is there still a place for level *in addition to* category for things like user-identifiable, or is category all we really need? - Are categories a set dictated by this repo, or a free field to match against for packages? My inclination would be the latter so deployments can define some of their own categories for events, but that could become a documentation/standardization challenge if there are too many sources for events (notebook package, jupyterhub, spawner, local customization, etc.). - If they are categorical, can an event have more than one category? If so, does the subscription filter match using 'all' or 'any'? An event's request ip address might reasonably be labeled 'user-identifiable' and 'location' for instance. Not sure if that would be a good practice or not. I think it's a good idea to sketch some real events and labels/filters folks could want to see what's the best fit. I'm honestly not sure at this point. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#30?email_source=notifications&email_token=AAAGXUHUUXCGKUMZQOCUL33RCAJ2TA5CNFSM4I36SYM2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELGOM4A#issuecomment-583853680>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAGXUHMYHJTSHNEREDBKQ3RCAJ2TANCNFSM4I36SYMQ> .

-- Brian E. Granger Principal Technical Program Manager, AWS AI Platform (brgrange@amazon.com) On Leave - Professor of Physics and Data Science, Cal Poly @ellisonbg on GitHub

Zsailer · 2020-02-10T18:52:25Z

Great stuff here! Thanks, everyone.

👍 to categorical over hierarchical.

is there still a place for level in addition to category for things like user-identifiable, or is category all we really need?

I'm finding that levels are hard to prescribe. Maybe we start with categories+policies first; then, add levels if need arises.

I think @minrk is right—sketching out some "real" events around Jupyter apps might help us identify categories, policies, etc. I'll try to draft some example events for the notebook server and post them here. If someone else wants to try to sketching events for jupyterhub, spawner, etc., that would be great!

In the meantime, I've been doing a deep dive into various legislation/documentation around data privacy—e.g. GDPR, CCPA, NIST (SP 800-53 Rev. 4), OMB (Memorandum 07-16)—to clearly define "personal data" under different regulatory regimes and what policies appropriate address privacy concerns in each regime.

If anyone knows of other documents I should review, feel free to add them here. I'll create a summary document for this repo's docs.

minrk · 2020-02-13T15:21:44Z

@Zsailer I think it's probably important, at least initially and for this repo, to only cover those things to the extent that we have suggested/defined categories for the right classes of data (i.e. is PII a single category, or are there more fine-grained classifications that we need). Assigning things to the right categories should be handled in the individual packages where events are being defined.

👍 to only having categories until and unless levels show up as having a real need.

Zsailer · 2020-05-19T17:45:51Z

Closing in favor of #46

Zsailer requested review from yuvipanda and jaipreet-s September 30, 2019 18:03

Zsailer mentioned this pull request Sep 30, 2019

Examples of logging code run in a cell (along with usernames) #28

Open

Zsailer changed the title ~~Add PII awareness to event formatter~~ [WIP] Add PII awareness to event formatter Sep 30, 2019

Zsailer changed the title ~~[WIP] Add PII awareness to event formatter~~ [WIP] Add multi-level security awareness to event formatter Oct 1, 2019

Zsailer force-pushed the pii branch from 4d146a3 to ca5c28b Compare October 2, 2019 19:18

Zsailer changed the title ~~[WIP] Add multi-level security awareness to event formatter~~ [MRG] Add multi-level security awareness to event formatter Oct 2, 2019

Zsailer mentioned this pull request Oct 7, 2019

Personal data design considerations #31

Open

yuvipanda reviewed Oct 8, 2019

View reviewed changes

minrk reviewed Jan 30, 2020

View reviewed changes

Zsailer added 8 commits April 30, 2020 13:51

Add PII awareness to event formatter

48922ee

mult-level security

002d116

add default event level to handlers in eventlog initialize

65cf063

fix tests

898a118

various linting issues

add09dd

add docs about sensitive data

38509ee

wip

64088bf

replace multi-level approach to tagging approach

2b17927

Zsailer force-pushed the pii branch from ec8d5a7 to 2b17927 Compare May 13, 2020 23:25

Zsailer mentioned this pull request May 19, 2020

Enable finer-grained recording of event properties and categories. #46

Closed

Zsailer closed this May 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] Add multi-level security awareness to event formatter #30

[MRG] Add multi-level security awareness to event formatter #30

Zsailer commented Sep 30, 2019 •

edited

Zsailer commented Sep 30, 2019

Zsailer commented Oct 1, 2019

Zsailer commented Oct 2, 2019

Zsailer commented Oct 2, 2019

Zsailer commented Oct 2, 2019

ellisonbg commented Oct 3, 2019

ellisonbg commented Oct 3, 2019

Zsailer commented Oct 3, 2019

ellisonbg commented Oct 4, 2019

yuvipanda commented Oct 8, 2019

yuvipanda commented Oct 8, 2019

yuvipanda Oct 8, 2019

yuvipanda commented Oct 8, 2019

yuvipanda commented Oct 8, 2019

minrk Jan 30, 2020

Zsailer Jan 30, 2020

minrk Jan 30, 2020

Zsailer Jan 30, 2020

yuvipanda Jan 31, 2020

minrk Jan 30, 2020

minrk Jan 30, 2020

Zsailer Jan 30, 2020 •

edited

Zsailer Feb 6, 2020

Zsailer commented Feb 7, 2020 •

edited

ellisonbg commented Feb 8, 2020

yuvipanda commented Feb 9, 2020

minrk commented Feb 9, 2020

ellisonbg commented Feb 9, 2020 via email

Zsailer commented Feb 10, 2020 •

edited

minrk commented Feb 13, 2020

Zsailer commented May 19, 2020


		Jupyter Telemetry offers flexibility when logging and handling sensitive data.

		Since events may include data with varying degrees of sensitivity, Jupyter Telemetry uses a multi-level security approach. It exposed four levels of sensitivity:


		super(JsonEventFormatter, self).__init__(args, *kwargs)

		def setEventLevel(self, event_level):

[MRG] Add multi-level security awareness to event formatter #30

[MRG] Add multi-level security awareness to event formatter #30

Conversation

Zsailer commented Sep 30, 2019 • edited

Background

Technical Details

For operators

Zsailer commented Sep 30, 2019

Zsailer commented Oct 1, 2019

Zsailer commented Oct 2, 2019

Zsailer commented Oct 2, 2019

Zsailer commented Oct 2, 2019

ellisonbg commented Oct 3, 2019

ellisonbg commented Oct 3, 2019

Zsailer commented Oct 3, 2019

ellisonbg commented Oct 4, 2019

yuvipanda commented Oct 8, 2019

yuvipanda commented Oct 8, 2019

Choose a reason for hiding this comment

yuvipanda commented Oct 8, 2019

yuvipanda commented Oct 8, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Zsailer Jan 30, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Zsailer commented Feb 7, 2020 • edited

ellisonbg commented Feb 8, 2020

yuvipanda commented Feb 9, 2020

minrk commented Feb 9, 2020

ellisonbg commented Feb 9, 2020 via email

Zsailer commented Feb 10, 2020 • edited

minrk commented Feb 13, 2020

Zsailer commented May 19, 2020

Zsailer commented Sep 30, 2019 •

edited

Zsailer Jan 30, 2020 •

edited

Zsailer commented Feb 7, 2020 •

edited

Zsailer commented Feb 10, 2020 •

edited