Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support data redaction, so that sensitive data in protos doesn't leak into places like log files #1160

Closed
jhump opened this issue Jan 20, 2016 · 18 comments

Comments

@jhump
Copy link
Contributor

jhump commented Jan 20, 2016

In every server language, it is trivial for someone to include a proto in an unstructured log line:
logger.info("something happened: %s", proto);

It is very common for such log output to be ingested into various systems (like exception tracking / alerting, full-text indexing for search) which suddenly makes it possible to accidentally leak sensitive or secret data into these systems (PII, account numbers and PANs, passwords, etc). Scraping all such sensitive data after it is accidentally leaked can be a huge pain, depending on the tools used. And identifying what should be redacted in the path of ingestion is error-prone. Pattern marching could identify things that look like social security numbers and credit card numbers, but phone numbers can get tricky (especially if supporting international phone number formats) and addresses moreso. Passwords? Forget about it...

So having a way to redact fields from operations that convert messages into strings (Object#toString in Java, for example) is extremely useful to prevent accidental logging of sensitive data. Even if this functionality doesn't belong in the core protobuf runtime libraries, I think there should at least be a hook point in the library so that redaction can be "plugged in" for a given deployment.

This problem is simpler when all logs are structured (e.g. not unstructured text output, log processors can then use custom message and field options to filter/modify the logs to prevent leaking sensitive data). But even when using structured logging for some things, there is likely always a need for unstructured log output from a server, even if just for debugging.


Strawman solution (using Java runtime as an example, but pattern can be applied to other runtimes/languages):

  • Add redacted fields to FieldOptions and MessageOptions in descriptor.proto.
  • TextFormat.Parser.Builder gets a new setting: setRedacted(boolean) (defaults to false, existing behavior). When true, redacted messages result in "{ <redacted> }" and redacted fields, if present, are shown as fieldname: <redacted>.
  • AbstractMessage#toString() updated to use redaction feature of TextFormat.

If adding redacted options into descriptor.proto is going too far, then TextFormat.Parser.Builder#setRedacted could instead take a GeneratedExtension<FieldOptions, Boolean> -- a custom field option that is queried to determine if field values should be skipped. This would require a way to configure a default option that can then be used from AbstractMessage#toString.

@jhump
Copy link
Contributor Author

jhump commented Jan 20, 2016

@lukaszx0 @cconroy @zellyn

@toddlipcon
Copy link

We've just implemented essentially the same idea for Apache Kudu (https://gerrit.cloudera.org/#/c/5553/) and it was relatively painful to do without modifying the Protobuf library.

Another idea that could work (at least in the C++ library implementation) would be to change Message::ShortDebugString and Message::String from constructing a new Printer each time to using a singleton Printer instance each. I believe Printer is itself thread-safe once configured (i.e can be used to print concurrently from multiple threads), and thus having a singleton printer would allow users to configure a custom field printer to handle their own redaction needs.

I haven't looked at the Java library yet to see if the same is doable.

@xfxyjwf xfxyjwf added the P3 label Jun 8, 2018
@ghost
Copy link

ghost commented Jun 11, 2019

@acozzette This seems like a recurring feature request

@elharo
Copy link
Contributor

elharo commented Oct 1, 2021

Can use custom options to do this: https://developers.google.com/protocol-buffers/docs/proto3#customoptions

@elharo elharo closed this as completed Oct 1, 2021
@toddlipcon
Copy link

I don't think that's a satisfactory workaround - as I mentioned above in #1160 (comment) one can do it, but you have to change all callsites away from using the normal DebugString functions to your own (in Kudu we wrote one called SecureDebugString). That was quite a pain and it's easy for people to forget.

@cconroy
Copy link
Contributor

cconroy commented Oct 1, 2021

Chiming in, I can confirm we still have this as a hard requirement at Square more than 5 years later and still have to maintain a fork of protoc to get this functionality.

@fowles
Copy link
Member

fowles commented Oct 1, 2021

Proto cannot be in the business of enforcing corporate policy in this way. There will always be ways to extract data from them. Any company building on it will require their own linting and technical solutions to ensure something end-to-end.

@runlilong
Copy link

Chiming in, I can confirm we still have this as a hard requirement at Square more than 5 years later and still have to maintain a fork of protoc to get this functionality.

https://go.dev/blog/protobuf-apiv2
Did not this blog solve the problem? Though I write the same code but didn't get what I want.
@cconroy

@jhump
Copy link
Contributor Author

jhump commented Nov 30, 2021

@runlilong, that blog is about the new reflection and descriptor support. The issue is that the latest protoc plugin still generates a String() string method on all generated structs that provides no hooks for redaction capabilities (and the same is true of other language runtimes and generated code, e.g. toString() methods in generated Java code). So this latest API provides zero support for the kind of redaction functionality requested.

Proto cannot be in the business of enforcing corporate policy in this way.

@fowles, that is fine. But it sure would be nice if the runtime libraries had a level of pluggability/configurability when it comes to the generated "stringify" functions, so that corporations could actually enforce their own policies. I think @toddlipcon's suggestion could be applied to other language runtimes, allowing code to set an alternate "printer" during program initialization that could handle custom policies.

@elharo
Copy link
Contributor

elharo commented Nov 30, 2021

This feels similar in spirit to #9114 though the details are quite different.

Bottom line: the protobuf libraries do not support custom input or output formats, nor is this a goal. The proto libraries are designed to consume and produce very specific, documented formats as used by gRPC and many other systems. Any other format — protobuf, JSON, or something else — can be created or consumed by writing code to filter the standard inputs and output as needed by a specific application. Indeed this is such a common thing to do that it's become a Meme about SWE careers at Google. However this code is in a separate layer and does not need to be committed to the core libraries in this repo.

@jhump
Copy link
Contributor Author

jhump commented Nov 30, 2021

@elharo, I don't agree that's the request. I think it would be fine for the implementation of this redacted format to be external to the protobuf runtime. The real thing we're talking about is providing some sort of override or hook into the automatic "stringification" of messages -- e.g. the Message::ShortDebugString method in C++, Message.String() method in Go, AbstractMessage.toString() method in Java, etc.

@zellyn
Copy link

zellyn commented Nov 30, 2021

@elharo with respect, I'm not sure you're understanding the request. The problem is that if you have an SSN or credit card number in a protobuf field in a language-specific proto object in Java/Go/whatever, and you accidentally print it, you see the sensitive information in the output/logs.

We're all fully capable of writing code on top of protobufs to do any manner of filtering and conversion — not only are several of us ex-Googlers, but Square also worked with Google on gRPC before it was released.

We've chatted with the protobuf team about the idea of a "sensitive" field option, and they said it made sense. Indeed, they mentioned that the credit-card-processing parts inside Google have a similar mechanism. Unfortunately, they said it was unlikely to get merged: mainly for organizational/process reasons.

I've pondered squirreling away and dropping fully-complete PRs for protoc, the C proto library, the Java proto library, and the Go proto library, and seeing if it would shift things, but that's a lot of speculative work.

A more incremental approach would be to first create a "blessed" field annotation (sensitive seems reasonable) — by "blessed" I mean comparable to deprecated — and teach protoc to understand it. Then adding support to language libraries could be done one at a time. But that would require a vote of confidence for the approach.

@jhump and I initially wrote up a proposal here, but further feedback is welcome.

@fowles
Copy link
Member

fowles commented Nov 30, 2021

The need and desire here is real. Unfortunately, we are pretty far away from being able to prioritize the level of undertaking this would require. It is a pretty large chunk of design and implementation across many languages and bindings.

@zellyn
Copy link

zellyn commented Nov 30, 2021

The need and desire here is real. Unfortunately, we are pretty far away from being able to prioritize the level of undertaking this would require. It is a pretty large chunk of design and implementation across many languages and bindings.

The design is pretty simple: a "sensitive" field option. The implementation can be contributed by the community, language by language. It only affects the text serialization, which, IIRC correctly from my time at Google, is plastered with "not stable; don't depend on this" warnings. All we'd need is a go-ahead, and the initial addition of "sensitive" to protoc and the descriptor proto.

@runlilong
Copy link

that blog is about the new reflection and descriptor support. The issue is that the latest protoc plugin still generates a String() string method on all generated structs that provides no hooks for redaction capabilities (and the same is true of other language runtimes and generated code, e.g. toString() methods in generated Java code). So this latest API provides zero support for the kind of redaction functionality requested.

@jhump If there is a hook to custom String() is the best. Unfortunately, there is not; for the redaction on the log, maybe we can do our custom redact actions with the new reflection feature during the log process. I get some help from this blog and the repository.

@jhump
Copy link
Contributor Author

jhump commented Jan 3, 2023

@fowles, does this commit suggest that this is now being looked at? 9238c48
If so, could we re-open this issue and then later re-close it as resolved once the arc of work is complete?

@zellyn
Copy link

zellyn commented Jan 3, 2023

Oh, nice catch Josh!

@fowles
Copy link
Member

fowles commented Jan 3, 2023

We are doing some internal work in this space. I don't know how much we will have to build on in the near term though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants