protoc allows invalid UTF-8 in source and in option values whose type is string #9175

jhump · 2021-11-01T14:54:02Z

What version of protobuf and what language are you using?
The latest:

> protoc --version
libprotoc 3.19.1

What operating system (Linux, Windows, ...) and version?

OS X 11.6.1

What runtime / compiler are you using (e.g., python version or gcc version)

N/A

What did you do?

I compiled a file with invalid binary data in it. While the specs don't say that the source should be UTF-8 encoded, this issue in GitHub states it to be true: #1418. (Note: the docs should really be updated to explicitly state this, for clarity.)

In order for the source to be valid compilable source, I put it in a string literal for an option whose type is string. I actually created two files that are semantically equivalent:

// tmp.proto
syntax = "proto3";

package foo;

option java_outer_classname= "my?value";

The question mark above is actually binary value 0xbc. To play with the binary data of this example file, you can use xxd -r with this:

00000000: 7379 6e74 6178 203d 2022 7072 6f74 6f33  syntax = "proto3
00000010: 223b 0a0a 7061 636b 6167 6520 666f 6f3b  ";..package foo;
00000020: 0a0a 6f70 7469 6f6e 206a 6176 615f 6f75  ..option java_ou
00000030: 7465 725f 636c 6173 736e 616d 653d 2022  ter_classname= "
00000040: 6d79 bc76 616c 7565 223b 0a              my.value";.

The second file uses an escape sequence:

// tmp2.proto
syntax = "proto3";

package foo;

option java_outer_classname= "my\xbcvalue";

So tmp.proto includes invalid UTF8 input in the source. And tmp2.proto contains valid UTF8 in the source, but it defines a string constant that has invalid UTF8 data.

I compiled the files just to descriptor sets:

protoc -o tmp.protoset tmp.proto
protoc -o tmp2.protoset tmp2.proto

What did you expect to see

I expected protoc to:

Either replace the invalid data with the UTF replacement char � (U+FFFD)
Or complain that the input was malformed UTF-8.

For the first file, it should complain about the input in the source program itself.

For the second file, it should complain about the value for a string option not being valid input (since strings are expected to be UTF-8 or 7-bit ASCII according to the docs).

What did you see instead?

Compilation succeeds just fine with both files. Other than the file name, they produce identical descriptors:

> protoc --decode google.protobuf.FileDescriptorSet google/protobuf/descriptor.proto < tmp.protoset
file {
  name: "tmp.proto"
  package: "foo"
  options {
    java_outer_classname: "my\274value"
  }
  syntax: "proto3"
}

The \274 is an escaped byte, not unicode point 0xbc. This is not valid UTF8, but the java_outer_classname option is defined as type string.

The text was updated successfully, but these errors were encountered:

jhump · 2021-11-01T14:57:10Z

I suspect the invalid UTF-8 values for string options might be allowed due to #7364. But maybe it could also be due to the odd way that options are interpreted (they are first interpreted directly into "unknown bytes" and then an attempt is made to unmarshal them into the options message).

jhump · 2022-08-18T01:55:39Z

@googleberg, Hi! I see this was assigned about a month ago. Is this something that you expect will be fixed? If I looked into it, would you review and potentially accept a pull request for it?

perezd · 2022-08-18T15:57:55Z

I can answer on his behalf: Yes! We'd love to review if you'd be willing to take a stab.

googleberg · 2022-08-18T17:50:13Z

My apologies for the delay. Absolutely!

jhump · 2022-08-18T19:45:22Z

@perezd, @googleberg, cool, I'm looking at it now.

There are really two parts to this issue:

File input is valid UTF8
String constants contain valid UTF8

For the first part, I was going to add a flag to the tokenizer for enabling UTF8 verification. For now, only the parser will set it. I was worried that otherwise the "blast radius" might be too large. (This same class is also used for parsing the text format -- which probably should also require UTF8 input, but that also seems like a bigger risk with regards to backwards compatibility.) There is some subtlety, like if a multi-byte code point crosses the boundary of the tokenizer's buffer, which isn't so tricky to implement right as it is to verify in a test. But this first part is the easier part I think.

The second part looks more complicated. The check itself is easy (there's already a function to do it in the google/protobuf/stubs folder), and I think there are only two places to invoke it -- when processing/type-checking a field's default value and the code that interprets all other options. However, there is a lot of stuff already in the codebase around whether to check if strings are UTF8. That code looks to be referencing file and field options that don't actually exist in descriptor.proto (maybe part of a Google-internal migration)? It defers the UTF8 checks to the various runtimes -- so the C++ runtime can implement its own check as can the Java runtime. So addressing the second part seems less clear as to how to proceed (and maybe higher risk?). I'm not sure if there is an easy way to enable the check in protoc yet still easily allow it to be disabled, such as via an environment variable or maybe a pre-processor definition to turn it off at compile time.

Do you have any advice?

googleberg · 2022-08-18T21:20:46Z

@jhump I believe you are correct in your assessment. Figuring out how to apply UTF-8 enforcement only to those string-type option fields that are safe to guarantee seems murky at best. It appears that the enforcement would need to happen in descriptor.cc: bool DescriptorBuilder::OptionInterpreter::SetOptionValue

I wonder if we can use the explicit field option "enforce_utf8" on those fields. If so, I think we could simply modify descriptor.proto such that string-type options such as java_outer_classname would be declared like this:
optional string java_outer_classname = 8 [enforce_utf8 = true];

enforce_utf8 is documented as deprecated, but that was in the context of proto3. For proto2, it still seems useful. Let me check on that.

jhump · 2022-08-18T21:32:12Z

enforce_utf8 is documented as deprecated

@googleberg, like I mentioned, though the code refers to such an option, it does not actually exist in the open-source version. Take a look:
https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/descriptor.proto#L539

I think you may be looking at an internal version of descriptor.proto. Many comments related to enforcing UTF8 suggest it was for an internal migration (perhaps migrating to actually enforcing UTF8?).

The documentation (for both proto2 and proto3) states that strings must always be valid UTF8.

googleberg · 2022-08-19T16:37:39Z

@jhump Indeed, it is internal only but it might have been worth externalizing except that it only works for turning off UTF8 validation in proto3. It never allowed turning on UTF8 validation in proto2.

The problem with doing a blanket enforcement is that invalid UTF8 has almost certainly been put into string-type custom options within Google (despite the documentation).

We have some options currently under consideration would potentially allow for gradual introduction of enforcement for descriptor extensions. I'll look into the roadmap to see if we could leverage those.

github-actions · 2024-04-03T10:02:07Z

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago.

github-actions · 2024-04-18T10:02:07Z

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please reopen it.

This issue was closed and archived because there has been no new activity in the 14 days since the inactive label was added.

jhump mentioned this issue Nov 1, 2021

protoparse: ParseFiles doesn't escape non-ascii characters in string values jhump/protoreflect#441

Open

elharo added documentation bug protoc labels Nov 2, 2021

jhump mentioned this issue Mar 25, 2022

Failed to parse invalid UTF8 jhump/protocompile-old#5

Closed

deannagarcia assigned googleberg Jul 14, 2022

jhump mentioned this issue Aug 19, 2022

validate UTF8 during tokenization #10430

Closed

jhump mentioned this issue Sep 16, 2022

lenient utf8 parser bufbuild/protocompile#34

Merged

Logofile self-assigned this Jan 3, 2024

github-actions bot added the inactive Denotes the issue/PR has not seen activity in the last 90 days. label Apr 3, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

protoc allows invalid UTF-8 in source and in option values whose type is string #9175

protoc allows invalid UTF-8 in source and in option values whose type is string #9175

jhump commented Nov 1, 2021 •

edited

jhump commented Nov 1, 2021 •

edited

jhump commented Aug 18, 2022

perezd commented Aug 18, 2022

googleberg commented Aug 18, 2022

jhump commented Aug 18, 2022 •

edited

googleberg commented Aug 18, 2022

jhump commented Aug 18, 2022

googleberg commented Aug 19, 2022

github-actions bot commented Apr 3, 2024

github-actions bot commented Apr 18, 2024

protoc allows invalid UTF-8 in source and in option values whose type is string #9175

protoc allows invalid UTF-8 in source and in option values whose type is string #9175

Comments

jhump commented Nov 1, 2021 • edited

jhump commented Nov 1, 2021 • edited

jhump commented Aug 18, 2022

perezd commented Aug 18, 2022

googleberg commented Aug 18, 2022

jhump commented Aug 18, 2022 • edited

googleberg commented Aug 18, 2022

jhump commented Aug 18, 2022

googleberg commented Aug 19, 2022

github-actions bot commented Apr 3, 2024

github-actions bot commented Apr 18, 2024

jhump commented Nov 1, 2021 •

edited

jhump commented Nov 1, 2021 •

edited

jhump commented Aug 18, 2022 •

edited