Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Add Sensitive Data Labels #176

Open
MovieStoreGuy opened this issue Sep 15, 2021 · 15 comments
Open

Proposal: Add Sensitive Data Labels #176

MovieStoreGuy opened this issue Sep 15, 2021 · 15 comments

Comments

@MovieStoreGuy
Copy link
Contributor

The Pitch

The idea is rather simple, extending the current OTLP to include an additional dictionary that is intended to store sensitive data.

The Rationale

Making the use of sensitive data being explicit instead implicit means that data vendors, data processors (pre and post) can first of all:

  • Ensure any data regulations are followed (Thinking GDPR, CCPA, FedRamp to a degree)
  • Allow for consumers of this data to handle the data correctly

As we collect more telemetry from user's experiences, the likelihood of including UGC and PII increases and this should help those system that collect it to be clear on the actions they take on it.

The outcome

The idea I have in mind and would love further discussion on is how this is consumed by the client (the application actually sending the data).

The idea being that the OTLP is extending to include another dictionary that is explicit in its sensitive nature and the SDK implement a AddSensitiveLabel method to make it clear on what it is.

From that the otel-collector could filter, drop, or transform those labels if they are known.

From there, the exported data vendors / processes can set up ingestion policies on what should happen when sensitive data arrives.

@jkowall
Copy link

jkowall commented Sep 15, 2021

Very good idea but who and how it's implemented will be a challenge. Do you see this across all otel data types?

@Oberon00
Copy link
Member

Couldn't this be handled with metadata about semantic conventions? E.g. in the YAML definitions we add a field sensitive: <level> and all fields without that annotation are assumed to be sensitive. The collector could then either process the YAML or we could generate a specific data file from the YAML using the semantic convention generator.

@MovieStoreGuy
Copy link
Contributor Author

MovieStoreGuy commented Sep 16, 2021

@jkowall I see this across all types since "background" tasks have the potential to leak data such as appending the SQL query to a metric capturing latency (not my greatest example but more highlighting that it shouldn't be restricted to on spans and metrics that contain a user context).

@Oberon00 potentially? I haven't looked at it in more detail but it would complicate managing data processors that look for special tags so to speak within the labels or have to do some level of regex matching.
Having it in a seperate map at leasts simplifies how data processors / vendors processor the data without much overhead.

Sorry, I am not ruling out sensitive label prefixes in the semantic convention but also understand those become hard to manage once the convention evolves.

@jamesmoessis
Copy link

jamesmoessis commented Sep 16, 2021

I think this is a good idea.

I am not convinced that the semantic convention is the place where an attribute should be defined as sensitive or not. It could easily change between applications. It probably needs to be defined by the user themselves, and the protocol handles the rest.

@Oberon00
Copy link
Member

Oberon00 commented Sep 16, 2021

Sorry, I am not ruling out sensitive label prefixes in the semantic convention but also understand those become hard to manage once the convention evolves.

What do you mean by prefix? And why would it become difficult to manage once they evolve? I was talking about adding an additional field to semantic conventions. For example, this is the current net.peer.ip definition in YAML: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.6.1/semantic_conventions/trace/general.yaml#L36-L41

We could extend it with a field sensitive: pii (personal identifiable information) to declare it as sensitive data. Like

- id: peer.ip
  sensitive: pii
  brief: ...

@Oberon00
Copy link
Member

It could easily change between applications.

Two questions:

  • Can you give a concrete example of an attribute that has different sensitivity depending on application?
  • Would this mean that per-label sensitivity then needs to become a configuration option for the instrumentation?

@jamesmoessis
Copy link

Can you give a concrete example of an attribute that has different sensitivity depending on application?

The peer.ip attribute that you mentioned is an example where the sensitivity depends on the context. Consider the situation where you are running an HTTP server and users from the internet directly request the server. peer.ip is sensitive. Another situation is where you are running a completely internal-facing service in which your only peers are other internal services. Then peer.ip is not sensitive.

Would this mean that per-label sensitivity then needs to become a configuration option for the instrumentation?

Can't answer this fully, but maybe? I imagined it would mean the users could set "attributes" and "sensitive attributes" separately. That's just how I initially imagined it though, haven't put much more thought into it than that.

@Oberon00
Copy link
Member

Oberon00 commented Sep 16, 2021

The peer.ip attribute that you mentioned is an example where the sensitivity depends on the context. [...]

Makes sense. Although if we just want to cover security we could simplify by always treating it as sensitive.

I imagined it would mean the users could set "attributes" and "sensitive attributes" separately.

Who is the "user" in that case? EDIT: See https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/glossary.md#user-roles

@jamesmoessis
Copy link

Sorry, I meant user as in application owner.

@MovieStoreGuy
Copy link
Contributor Author

Hey @Oberon00

What do you mean by prefix?

I had made an assumption that the labels keys would be transformed to something like: user.name --> sensitive.user.name for example.

And why would it become difficult to manage once they evolve?

From trying to adopt the semantic convention internally at Atlassian, there has been a few times that fields had changed (sometimes it was us internally updating the labels, or the semantic convention were were extending hand changed). Moreover, we've had AWS implement a conflicting semantic convention used for the databases (@jamesmoessis correct if I am wrong).

My biggest concern going forward with this approach is that convention changes in a non backwards compatible way that would data that was once considered sensitive (and should still be considered) is no; then allowing for sensitive data into systems that shouldn't contain it.

Can you give a concrete example of an attribute that has different sensitivity depending on application?

An application that is running in different environments could consider internal testing / synthetic users non sensitive compared real life users (this is more applicable for Real User Monitoring).
Moreover, an internal application for internal use only may consider its database query command not sensitive comparing against something like an application that services customer / real user requests.
Furthermore, an application running in Europe would have to keep all of its data about the user only within Europe for regulatory purposes.

Would this mean that per-label sensitivity then needs to become a configuration option for the instrumentation?

I would argue so, I don't think the semantic convention can extend fair enough to cover all cases where data regulations are concerned.

@Oberon00
Copy link
Member

Oberon00 commented Sep 17, 2021

I had made an assumption that the labels keys would be transformed to something like: user.name --> sensitive.user.name for example.

No, my suggestion was to just have this metadata in the YAML and at any point where you would want to hide/remove sensititve data, you would look up the key in the YAML (or probably not directly from the YAML but some data file generated from it) to determine if it is sensitive. So a key would be either sensitive or not, but any particular value for that key would have the same sensitivity, which does not support the use cases brought up above in #176 (comment)

@Oberon00
Copy link
Member

My biggest concern going forward with this approach is that convention changes in a non backwards compatible way that would data that was once considered sensitive (and should still be considered) is no; then allowing for sensitive data into systems that shouldn't contain it.

It is true that we have changed the semantic conventions much more often than the protobuf. But that's because the protobuf is stable, while the semantic conventions are still experimental (with the sole exception of service.name IIRC). For sensitivity in particular, making a key non-sensitive would be impossible for the reason you mentioned. Instead, a new key would have to be used.

@svrnm
Copy link
Member

svrnm commented Sep 20, 2021

@MovieStoreGuy, thank you for opening this proposal: while we all love having our technical observability, the boundaries of privacy concerns are crossed quickly, so having it in mind with OpenTelemetry would be great!

From the discussion it appears that there is no "one-size-fit-all" solution, here's a few thoughts from my end:

Adding a "sensitive" field to the semantic conventions will only work for a very few fields, since "personal identifiable" is often not well defined:

  • @jamesmoessis already brought up the example with the net.peer.ip which is relevant if a person sits in front of the system and can easily be tracked if there is a machine.
  • Interestingly the same is true for the enduser.id: if a cronjob is executed by a system user designed exactly for that, there is no harm for collecting it, but the username at the frontend is a problem.
  • We all know that it's not best practice, but http.url, exception.message and others can contain(!) sensitive data
  • The most problematic case is data that can be connected to uniquely identify a person: Imagine you have collected geo information (country, city) in attributes of a span. Now there is an unrelated(!) system storing user information and unfortunately this one user is the only one living in that country,city in your system. If those two data points are connected, you still have a privacy violation.

This problem is not new and also reflected in many "official" definitions of PII, e.g. https://www.gsa.gov/reference/gsa-privacy-program/rules-and-policies-protecting-pii-privacy-act:

The term “PII,” as defined in OMB Memorandum M-07-1616 refers to information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual. The definition of PII is not anchored to any single category of information or technology. Rather, it requires a case-by-case assessment of the specific risk that an individual can be identified. In performing this assessment, it is important for an agency to recognize that non-PII can become PII whenever additional information is made publicly available - in any medium and from any source - that, when combined with other available information, could be used to identify an individual.

Now, of course one could argue that nothing of that is a concern for the OpenTelemetry standard, but while I think there can not be a simple & complete solution, there are multiple ways to tackle this:

  1. Calling out in the specification that some fields might contain PII and should be handled with care, either through the suggested mechanism of marking them as "sensitive" or with a simple paragraph like it already exists for enduser.id:

    Given the sensitive nature of this information, SDKs and exporters SHOULD drop these attributes by default and then provide a configuration parameter to turn on retention for use cases where the information is required and would not violate any policies or regulations.

  2. Give some guidance (within our outside of the spec) on how to implement things with privacy in mind.

  3. Providing some extensions especially for the collector that help to identify PII data and to remove or to anonymise it based on configurations (e.g. "strip the last bytes from net.peer.ip for service.name X, drop enduser.id if ...)

  4. There is an opportunity to allow end-users (!) to collect OTel data on their own, allowing them to verify and review the data that is emitted from their end-device (think about an OTel collector running on your laptop that receives all traces (et.al.) that are generated during your interaction with an instrumented application. Or, think about a service sending dumps of collected data in a standardised format you can feed into your local collector and process and analyse from there.)

@MovieStoreGuy
Copy link
Contributor Author

Providing some extensions especially for the collector that help to identify PII data and to remove or to anonymise it based on configurations (e.g. "strip the last bytes from net.peer.ip for service.name X, drop enduser.id if ...)

I really like the idea of this, if it were possible to include a privacy module inside the app exporter and the collector, where it has the ability to both observe and anonymise fields based on some provided configuration, it would address my immediate concerns of leaking this type of data to vendors whom aren't prepared to accept it.

I have vague idea this already exists as filter in the pipeline for the collector, I am not sure the same exists for the SDKs.
However, making this more explicit with regards to privacy feels like a very good start to addressing this concern.

@Oberon00
Copy link
Member

Oberon00 commented Sep 27, 2021

For the SDKs, we don't really have this, but there is an issue that would probably be a prerequesite for easier filtering: open-telemetry/opentelemetry-specification#1089. Actually the privacy filtering use case was already brought up there: open-telemetry/opentelemetry-specification#1089 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants