Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

editions: things don't work right with message_encoding=DELIMITED #16239

Closed
jhump opened this issue Mar 20, 2024 · 13 comments
Closed

editions: things don't work right with message_encoding=DELIMITED #16239

jhump opened this issue Mar 20, 2024 · 13 comments
Assignees
Labels

Comments

@jhump
Copy link
Contributor

jhump commented Mar 20, 2024

I'm not sure how widespread the issues are, but I've come across a handful in the Java and C++ runtimes. They seem to stem from the implementation "inheriting" a little too much baggage from the old proto2 groups implementation.

Let's take this simple file for example:

// test.proto
edition = "2023";

message Foo {
  string name = 1;
}

message Bar {
  Foo field1 = 1 [features.message_encoding = DELIMITED];
}

When I generate Java code for this, it generates accessors in the message and mutators in the message builder named getFoo and setFoo, instead of getField1 and setField1. Similarly, if I construct a message and then print using the message's toString() method (which uses the text format), it prints:

Foo {
  name: "abc"
}

Again, it's using the message name instead of the field name, which is how proto2 groups worked.

The C++ generated code looks correct, however it, too, has problems with the text format. If I try to simply decode the text format output of the Java runtime, I get an error:

input:1:5: Message type "Bar" has no field named "Foo".

But it also has an error if I try to use the field name, with a string that looks like field1{name:"abc"}:

input:1:7: Message type "Bar" has no field named "field1".

So something squirrelly is going on with the C++ text format and delimited-encoded fields, but it's not quite the same as in the Java runtime.

@jhump jhump added the untriaged auto added to all issues by default when created. label Mar 20, 2024
@jhump
Copy link
Contributor Author

jhump commented Mar 20, 2024

Another example that demonstrates the text-format issue in C++, but this time w/out needing any code gen -- just an example protobuf source file that doesn't compile but seems like it really should (thanks to the fact that protoc uses the text format to process message literal values in options):

// test.proto
edition = "2023";

import "google/protobuf/descriptor.proto";

extend google.protobuf.FileOptions {
  Foo ext1 = 10101 [features.message_encoding = DELIMITED];
  Bar ext2 = 20202;
}

message Foo {
  string name = 1;
  Foo next = 2 [features.message_encoding = DELIMITED];
}

message Bar {
  Foo field1 = 1 [features.message_encoding = DELIMITED];
}


// These two do NOT work. It also won't accept name permutations that resemble group
// behavior, such as "Next" or "Foo" for the first or "Field1" or "Foo" for the second.
// So I don't know how to get it to accept anything in the text format for field1.
// (I haven't traced through the code yet.)

// test.proto:27:17: Error while parsing option value for "ext1": Message type "Foo" has no field named "next".
option (ext1) = { next: {name: "xyz"} };
// test.proto:29:17: Error while parsing option value for "ext2": Message type "Bar" has no field named "field1".
option (ext2) = { field1:{name:"abc"} };

// These two, on the other hand, DO work.
option (ext1).next = {name:"xyz"};
option (ext2).field1 = {name:"abc"};

@mkruskal-google mkruskal-google self-assigned this Mar 20, 2024
@mkruskal-google mkruskal-google added editions and removed untriaged auto added to all issues by default when created. labels Mar 20, 2024
@mkruskal-google
Copy link
Member

mkruskal-google commented Mar 20, 2024

Wow great find, thanks Josh! This appears to have bypassed all of our testing because we were focused primarily on syntax->editions transformations, which preserve both the message and field names of groups. This will likely require additional feature(s) and some cross-language fixes to gencode/textpb to fix the behavior of DELIMITED.

This is a marginally worse state than proto2 (since you need to keep the message and field names synchronized), but it is behavior-preserving. I think this really just blocks us from flipping the default message_encoding and recommending DELIMITED

@jhump
Copy link
Contributor Author

jhump commented Mar 20, 2024

This will likely require additional feature(s) and some cross-language fixes to gencode/textpb to fix the behavior of DELIMITED.

Fixing it to use the field name would also introduce a backwards-compatibility issue -- if it previously required the message name Foo but later wants the field name foo. This could likely be addressed via allowing either the message or the field name, when encoding is delimited, but only when the message name is the field name with the first letter capitalized.

@mkruskal-google
Copy link
Member

mkruskal-google commented Mar 20, 2024

Yep, that's why we'd need to use features. With feature, we keep the old behavior of using the message name in various places. Otherwise we use the correct field name. That would allow DELIMITED to actually be used in new cases, without breaking old cases (where we would enable the feature).

Given that this behavior is part of our textproto spec, I think this needs to be a global feature. We could split out the language-dependent behavior, but I'm not sure that's worth it.

@mkruskal-google
Copy link
Member

We'll try to get this in by 27.0, but if it's going to take too long we may release edition 2023 and just ban desynchronization of field/message type for DELIMITED. It would make it very difficult to use, but it would preserve the old behavior without unexpected textproto breakages

@jhump
Copy link
Contributor Author

jhump commented Mar 21, 2024

BTW, one issue with always using the message name instead of the field name in the text format is that it prevents using the text format if one uses the same message type more than once:

edition = "2023";
message Foo {
    // both of these fields will want to use "Foo" as the field name in the text format
    Foo bar = 1 [features.message_encoding = DELIMITED];
    Foo baz = 2 [features.message_encoding = DELIMITED];
}

I think it would be better to only accept the message name when message_name == capitalize(field_name) and even then to also accept the (lower-cased) field name.

Also, the fact that C++ doesn't seem to accept any name for the delimited field when parsing the text format seems to be an orthogonal bug. Should I file a separate issue for that?

As far as language-specific features, are there any that are needed? The text format concerns seem global since the text format needs cross-language interoperability. And the accessor/mutator method naming in the Java gencode seems like a bug and is not a backwards compatibility concern: it already has to capitalize the field name (foo -> getFoo), so for proto2 groups that are migrated to editions, the gen code won't change (thanks to the way the field and message named were derived in proto2 groups).

@mkruskal-google
Copy link
Member

mkruskal-google commented Mar 21, 2024

Yea if C++ doesn't accept the message names that's just a plain non-conformance bug. Is that editions-only?

As far as the features, it's just a question of how we want to model it and burn it down over time. Groups are probably a small enough edge case that we could bundle all of this into a single feature. The gencode issue is worse than you realize though. While Foo would be backwards compatible, FooGroup would not. The field name becomes foogroup, and the Java getter would be getFoogroup (note capitalization issues on G).

@jhump
Copy link
Contributor Author

jhump commented Mar 22, 2024

I suspect all of the language implementations will have this issue when parsing the text format, though they are able to emit the text format just fine. The issue is that if they don't find a field with a name that matches what's in the text (because it matches a type name, for example) then it falls back to looking for the lower-case version of the text name. This only works if the field name == lower-case(type name).

I'm not exactly following what you're hoping to do to address this using features. I have my own proposal for a fix, but I suspect it may not align with your thinking. Here's my suggestion: main...jhump:jh/proposal-for-textpb-and-delimited-msg-fields
Basically:

  1. When parsing the text format, allow the type name to be used instead of the field name if it's a group (i.e. message encoding features == delimited). If lower-case(name) does not match a field, iterate through fields to find one with group/delimited encoding and a matching type name.
    • ALSO, allow the field name, too. This makes the text format more lenient by allowing the field name (instead of the type name) with proto2 groups, but it seems like the right way to eventually migrate to an edition that requires the field name (and is completely rid of the proto2-group-text-format warts).
    • Note that allowing the field name makes it more internally consistent: extensions that are groups were always indicated by the lower-case field name, not the type name. Only non-extension groups used the type name.
  2. When emitting the text format, only emit the type name when it looks like a proto2 group. Specifically, when the type is group (or message encoding delimited), the field's simple name == lower-case(type name), and the field and type are both defined in the same scope.

Curious where you object to the above or what else you were thinking with how to resolve this. If we're not too far apart, I'm happy to update that branch and open a pull request with a fix.

@mkruskal-google
Copy link
Member

mkruskal-google commented Mar 23, 2024

I should do a full audit, but according to our current text format spec group fields exclusively use the type name. My plan was to introduce a new feature that switches all parsers to use the field name exclusively (and fix any gencode issues) on any messages that set it, which wouldn't be backwards compatible but we could use editions to roll it out.

I think the end-state we want is for groups to be encoded the same as regular messages via the field name only. Since group types are no longer constrained to be from nested messages, there's all kinds of ambiguity that could pop up if we accept the type name. I think your approach is a pretty solid because it seems to be backwards compatible with minimal issues under editions. I do see some issues, but nothing necessarily blocking:

  1. your point 2. doesn't leave us any path to ever eliminate this problematic encoding. So we'd be carrying this tech debt forward where overlap between field and message name suddenly change the serialization behavior
  2. This also doesn't fix the gencode issue in Java (and possibly other languages), where we would likely need to use features to control this.
  3. We would need to implement this in every text format parser in advance of our 27.0 release, which is cutting it close

I think this approach could work for the text-format issue though. Did you track down more details about that separate issue in C++?

@mkruskal-google
Copy link
Member

Quick update: I'm writing up a doc for this, I'll try to get you access to it ASAP before we opensource it. My current feelings are that your approach is the best solution for 27.0 if we can get it in time. If we can't, we should just ban these edge cases entirely. Long-term, we'll likely need a feature to burndown the tech debt your approach is leaving untouched.

Still curious about this separate C++ issue, if there's another bug it would be good to consider it while we make this decision

@jhump
Copy link
Contributor Author

jhump commented Mar 26, 2024

Still curious about this separate C++ issue

I don't think there's anything specific to C++. The issue that I only observed in C++ was that it would not accept any name in the text format for the delimited field in the example. But after reading the implementation code, I understand why this is the case and I suspect this may plague the other implementations, which might use effectively the same logic to find a field name.

Basically, if it finds the field with the same name as what appears in the text format, it later clears it/disallows it if it's a group field. When the name in the text format does not match any field name, it simply computes the type name as lower-case(name) and searches for a field with that name. But when the type name and field names materially differ, it will never find any field this way.

@jhump
Copy link
Contributor Author

jhump commented Apr 1, 2024

@mkruskal-google, out of curiosity, is v27.0 meant to be the "official" release of editions? In other words, is that when edition "2023" is allowed by protoc without using the --experimental_editions flag?

If so, I think a solution to the issue I just filed (#16369) should be included in that release, too.

@mkruskal-google
Copy link
Member

Yes 27.0 will be the official release of editions and edition 2023, and the --experimental_editions flag should no longer be required in main. To close out this thread, we will be going with your proposal for 27.0 to handle delimited more smoothly but remain backwards compatible

gopherbot pushed a commit to protocolbuffers/protobuf-go that referenced this issue May 6, 2024
This is a result of the discussion in [1]. Before editions, a group defined a multiple things:

* a type
* a field
* an encoding scheme

With editions this has changed and groups no longer exist and the different parts have to be defined individually. Most importantly, the field and the type also had the same name (usually and CamelCase name). To keep compatibility with proto2 groups, [2] introduced a concept of group-like fields and adjusted the Text/JSON parsers to accept the type name instead of the field name for such fields. This means you can convert from proto2 groups to editions without changing the semantics.
Furthermore, to avoid suprises with group-like fields (e.g. when a user by coincident specified a field that is group-like) protobuf decided that group-like fields should always accept the type and the field name for group like fields. This also allows us to eventually emit the field name rather than the type name for group like fields in the future.

This change implements this decision in Go.


[1] protocolbuffers/protobuf#16239
[2] https://go.dev/cl/575916

Change-Id: I701c4cd228d2e0867b2a87771b6c6331459c4910
Reviewed-on: https://go-review.googlesource.com/c/protobuf/+/582755
Reviewed-by: Lasse Folger <lassefolger@google.com>
Reviewed-by: Mike Kruskal <mkruskal@google.com>
Commit-Queue: Michael Stapelberg <stapelberg@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Stapelberg <stapelberg@google.com>
Auto-Submit: Michael Stapelberg <stapelberg@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants