protoc allows invalid UTF-8 in source and in option values whose type is string #9175
Comments
I suspect the invalid UTF-8 values for string options might be allowed due to #7364. But maybe it could also be due to the odd way that options are interpreted (they are first interpreted directly into "unknown bytes" and then an attempt is made to unmarshal them into the options message). |
@googleberg, Hi! I see this was assigned about a month ago. Is this something that you expect will be fixed? If I looked into it, would you review and potentially accept a pull request for it? |
I can answer on his behalf: Yes! We'd love to review if you'd be willing to take a stab. |
My apologies for the delay. Absolutely! |
@perezd, @googleberg, cool, I'm looking at it now. There are really two parts to this issue:
For the first part, I was going to add a flag to the tokenizer for enabling UTF8 verification. For now, only the parser will set it. I was worried that otherwise the "blast radius" might be too large. (This same class is also used for parsing the text format -- which probably should also require UTF8 input, but that also seems like a bigger risk with regards to backwards compatibility.) There is some subtlety, like if a multi-byte code point crosses the boundary of the tokenizer's buffer, which isn't so tricky to implement right as it is to verify in a test. But this first part is the easier part I think. The second part looks more complicated. The check itself is easy (there's already a function to do it in the Do you have any advice? |
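The buffer-boundary subtlety mentioned above can be illustrated with a small sketch. This is Python rather than the tokenizer's C++, and `chunks_are_valid_utf8` is a hypothetical helper name, not anything in protobuf; it only shows the general idea of carrying an incomplete multi-byte sequence from one buffer to the next instead of validating each buffer in isolation.

```python
import codecs


def chunks_are_valid_utf8(chunks):
    """Return True if the concatenation of `chunks` is valid UTF-8.

    Hypothetical helper for illustration only. The incremental decoder
    keeps the bytes of an unfinished multi-byte sequence between calls,
    so a code point split across two chunks is still accepted.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        for chunk in chunks:
            decoder.decode(chunk)
        # final=True makes the decoder fail if the input ended mid-sequence.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


# "é" is 0xC3 0xA9; here it straddles the boundary between two buffers.
split_ok = chunks_are_valid_utf8([b"caf\xc3", b"\xa9"])
lone_byte = chunks_are_valid_utf8([b"\xbc"])
truncated = chunks_are_valid_utf8([b"caf\xc3"])
```

A test for this kind of check presumably needs inputs like `split_ok` above, where validating each buffer on its own would wrongly reject the first chunk, which is why it is harder to verify than to implement.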
@jhump I believe you are correct in your assessment. Figuring out how to apply UTF-8 enforcement only to those string-type option fields that are safe to guarantee seems murky at best. It appears that the enforcement would need to happen in descriptor.cc: bool DescriptorBuilder::OptionInterpreter::SetOptionValue I wonder if we can use the explicit field option "enforce_utf8" on those fields. If so, I think we could simply modify descriptor.proto such that string-type options such as java_outer_classname would be declared like this: enforce_utf8 is documented as deprecated, but that was in the context of proto3. For proto2, it still seems useful. Let me check on that. |
@googleberg, like I mentioned, though the code refers to such an option, it does not actually exist in the open-source version. Take a look: I think you may be looking at an internal version of The documentation (for both proto2 and proto3) states that strings must always be valid UTF8. |
@jhump Indeed, it is internal only, but it might have been worth externalizing except that it only works for turning off UTF8 validation in proto3. It never allowed turning on UTF8 validation in proto2. The problem with doing a blanket enforcement is that invalid UTF8 has almost certainly been put into string-type custom options within Google (despite the documentation). We have some options currently under consideration that would potentially allow for gradual introduction of enforcement for descriptor extensions. I'll look into the roadmap to see if we could leverage those. |
We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment. This issue is labeled |
We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please reopen it. This issue was closed and archived because there has been no new activity in the 14 days since the |
What version of protobuf and what language are you using?
The latest:
What operating system (Linux, Windows, ...) and version?
OS X 11.6.1
What runtime / compiler are you using (e.g., python version or gcc version)
N/A
What did you do?
I compiled a file with invalid binary data in it. While the specs don't say that the source should be UTF-8 encoded, this issue in GitHub states it to be true: #1418. (Note: the docs should really be updated to explicitly state this, for clarity.)
In order for the source to be valid compilable source, I put it in a string literal for an option whose type is `string`. I actually created two files that are semantically equivalent:
The question mark above is actually binary value `0xbc`. To play with the binary data of this example file, you can use `xxd -r` with this:
The second file uses an escape sequence:
So `tmp.proto` includes invalid UTF8 input in the source. And `tmp2.proto` contains valid UTF8 in the source, but it defines a string constant that has invalid UTF8 data.
I compiled the files just to descriptor sets:
What did you expect to see
I expected `protoc` to:
For the first file, it should complain about the input in the source program itself.
For the second file, it should complain about the value for a string option not being valid input (since strings are expected to be UTF-8 or 7-bit ASCII according to the docs).
What did you see instead?
Compilation succeeds just fine with both files. Other than the file name, they produce identical descriptors:
The `\274` is an escaped byte, not Unicode code point `0xbc`. This is not valid UTF8, but the `java_outer_classname` option is defined as type `string`.
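To make the distinction between the escaped byte and the code point concrete, here is a minimal Python sketch. It is not part of the original report, and `is_valid_utf8` is a hypothetical helper; it only checks the byte-level facts: octal `\274` is the single byte `0xbc`, that byte alone is not well-formed UTF-8, and the code point U+00BC encodes as two bytes.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The octal escape \274 denotes one byte with value 0xbc.
escaped_byte = bytes([0o274])

# The code point U+00BC, by contrast, is two bytes in UTF-8.
code_point_bytes = "\u00bc".encode("utf-8")

print(escaped_byte == b"\xbc")
print(is_valid_utf8(escaped_byte))
print(code_point_bytes)
print(is_valid_utf8(code_point_bytes))
```

A lone `0xbc` has the bit pattern of a UTF-8 continuation byte with no lead byte before it, which is why a strict decoder rejects it.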