Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add proto2 default support to Ruby gem #3606

Closed
wants to merge 4 commits into from

Conversation

zanker
Copy link
Contributor

@zanker zanker commented Sep 7, 2017

(Sorry this is going to be long!)

I'm happy to split the PR up into multiple, but wasn't really sure what your preferences were on this. I intentionally didn't update the proto generator, since this only supports MRI and not the Java side too. I can add the Java one into this PR, or do that after.

One of the main problems with this I ran into was API compatibility. You need to have the syntax set before you do anything, because things like defaults or maps are set when messages are added initially, and a lot of it is easier to gate at add time rather than doing post add validation checks (like is done with enums).

I opted to make FileDescriptor an optional argument you pass to DescriptorPool, and that gets automatically added to anything you attach to it after. If you don't pass a FileDescriptor, it sets it to proto3 as that's the current syntax that's supported when generating so it maintains backwards compatibility.

It means Descriptor and EnumDescriptor end up with breaking changes because they require a FileDescriptor passed. Are instantiating these directly supported APIs? Same with FieldDescriptor, ideally it would be a required init so things like FieldDescriptor_default_set and FieldDescriptor_default can be gated by syntax. As it stands, it seems you can't access the parent from FieldDescriptor until after it's been added.

For setting defaults for enums, you have to use an int, because we can't guarantee an enum is added to the pool before we set our default. The only way I can see supporting adding-by-name is if we do it after the pool is initialized, and we add enums first, but I didn't want to bypass UPB for only default setting. Might be something that UPB3 makes easier @haberman?

While right now, you can only set default for add_field, I made it a hash rather than a single argument to give us the ability to add the other extension type fields without having to make more API changes.

The way hasbits are handled feels a little janky, but I was trying to reuse the existing UPB MSG_WRITERS rather than just copy/pasting the code and storing the offsets as-is. One thing @nerdrew noticed is that upb_msg_has seems to have a bug? It does DEREF(msg, hasbit / 8, char) | (1 << (hasbit % 8)) which will always end up being true, looks like this should be an AND not an OR? The method is never used that I could tell.

For hasbits, I made it so we always use them (even in proto3), but it gates them behind the upb_fielddef_haspresence so under proto3 it's only using them for sub messages. One semantic I'm not sure how you want to handle is unsetting a field though.

I added a clear_<field> method which unsets the hasbit, but it means doing <field>=nil doesn't unset it. If I remember right, in Java this also doesn't (you have to call clear explicitly). One option is to make it so <field>=nil is equivalent to clear_<field>. That feels like the better way of handling it in Ruby, but wanted to get your opinion on it.

For tests, I just added a basic_proto2.rb one. It looks like the proto2 tests for the other libraries are handled via compatibility_tests and copying the old version to preserve it historically. Since there wasn't a prior version that supported proto2, I just opted to inline another test. I can follow that pattern and have a fake v2 version under it if you want.

(cc @nerdrew @djudd-stripe)

@grpc-kokoro
Copy link

Thanks for your pull request. The automated tests will run as soon as one of the admins verifies this change is ok for us to run on our infrastructure.

1 similar comment
@grpc-kokoro
Copy link

Thanks for your pull request. The automated tests will run as soon as one of the admins verifies this change is ok for us to run on our infrastructure.

@bazel-io
Copy link

bazel-io commented Sep 7, 2017

Can one of the admins verify this patch?

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If your company signed a CLA, they designated a Point of Contact who decides which employees are authorized to participate. You may need to contact the Point of Contact for your company and ask to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the project maintainer to go/cla#troubleshoot.
  • In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again.

if (argc > 1) {
rb_raise(rb_eArgError, "Expected 0 or 1 argument, got %d", argc);
} else if (argc == 1) {
if (CLASS_OF(argv[0]) != cFileDescriptor) {
Copy link
Contributor Author

@zanker zanker Sep 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something @eam pointed out, should this be CLASS_OF(argv[0]) != cFileDescriptor or should it be calling the Ruby rb_obj_is_kind_of or rb_obj_is_instance_of instead? I was mostly mirroring the existing code you had.

@zanker
Copy link
Contributor Author

zanker commented Sep 7, 2017

Oops, forgot to get myself added to our corporate CLA. Will get that done today.

@zanker
Copy link
Contributor Author

zanker commented Sep 7, 2017

I should be listed as authorized under Square's CLA with Google now.

@googlebot
Copy link

CLAs look good, thanks!

@googlebot googlebot added cla: yes and removed cla: no labels Sep 7, 2017
@zanker zanker force-pushed the zanker/syntax2 branch 4 times, most recently from 1bc43f3 to 4b57eca Compare September 15, 2017 15:48
@haberman
Copy link
Member

ok to test

@haberman
Copy link
Member

Thanks for the PR! Did you forget some changes to ruby_generator.cc?

@zanker
Copy link
Contributor Author

zanker commented Sep 25, 2017

Sorry about the delay, was out for a few days last week.

I wasn't able to find the file to generate those in, I added setting of the syntax to the generator. For setting defaults, should I just mimic what it does now? Attempt to generate a proto2 file, and if it uses something unsupported like extensions then error?

@@ -451,8 +466,18 @@ bool GenerateFile(const FileDescriptor* file, io::Printer* printer,
}
}

printer->Print("Google::Protobuf::DescriptorPool.generated_pool.build(\n");
printer->Indent();
printer->Print("Google::Protobuf::FileDescriptor.new.tap do |f|\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the syntax could be a parameter to the constructor? The syntax is a fundamental property of a file, and in an actual .proto file it has to be the first statement.

Making the syntax a constructor parameter and then defining all messages/enums inside the block makes sense to me.

If we are planning to go farther down the DSL route, we'll be wanting to add the ability to specify options of files, messages, enums, etc. To do that we'll need to mirror the structure of actual proto descriptors as closely as possible. Files contain messages, and every message belongs to a file.

Copy link
Contributor Author

@zanker zanker Oct 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable, but could you clarify what you mean by "constructor parameter", so something like (super rough):

require 'other_pb.rb'

Google::Protobuf::FileDescriptor.start do
  syntax :proto2
  option :java_package,  "com.example.foo"
 
  pool.build do
    add_message ...
  end
end

Is what you're talking about?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking more along the lines of:

Google::Protobuf::DescriptorPool.generated_pool.build do
  add_file(syntax => :proto2) do
    add_message ...
  end
end

I am not exactly sure yet how we should specify options. The problem is that options can get really tricky. You can define custom options that use your own types. But even weirder, you can use the option on the same type that is defining the option:

syntax = "proto2";

import "google/protobuf/descriptor.proto";

package pkg;

message M {
  // This will become an option on message M, but its value is also
  // an instance of the M message.  This is really tricky because it
  // implies that you need to be able to create an M message while
  // you are defining it.
  //
  // The protobuf library generally deals with this by encoding options
  // as serialized protos, and only parsing them once the message
  // types are defined.
  option (pkg.m) = {val: 999};
  optional int32 val = 1;
}

extend google.protobuf.MessageOptions {
  optional M m = 1234567;
}

Whatever path we go down with options, I want to make sure it can accommodate the trickiest cases, like the above. This is one of the reasons the DSL worries me: it's easy to design yourself into a corner if you're not careful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, that is fun to deal with. Do you have an example somewhere of what it looks like in another language?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can try compiling my snippet above into other languages to see how it is handled. Python might be a good model since it's similar to Ruby. Python defines the types, then registers extensions (including custom options), then sets options again. The options are specified as a serialized binary string, which will find any extensions that are registered:

_M = _descriptor.Descriptor(
  name='M',
  full_name='pkg.M',
  filename=None,
  file=DESCRIPTOR,
  containing_type=None,
  fields=[
    _descriptor.FieldDescriptor(
      name='val', full_name='pkg.M.val', index=0,
      number=1, type=5, cpp_type=1, label=1,
      has_default_value=False, default_value=0,
      message_type=None, enum_type=None, containing_type=None,
      is_extension=False, extension_scope=None,
      options=None),
  ],
  extensions=[
  ],
  nested_types=[],
  enum_types=[
  ],
  options=_descriptor._ParseOptions(descriptor_pb2.MessageOptions(), _b('\272\350\332\004\003\010\347\007')),
  is_extendable=False,
  syntax='proto2',
  extension_ranges=[],
  oneofs=[
  ],
  serialized_start=53,
  serialized_end=79,
)

DESCRIPTOR.message_types_by_name['M'] = _M
DESCRIPTOR.extensions_by_name['m'] = m

M = _reflection.GeneratedProtocolMessageType('M', (_message.Message,), dict(
  DESCRIPTOR = _M,
  __module__ = 'test_pb2'
  # @@protoc_insertion_point(class_scope:pkg.M)
  ))
_sym_db.RegisterMessage(M)

m.message_type = _M
google_dot_protobuf_dot_descriptor__pb2.MessageOptions.RegisterExtension(m)

_M.has_options = True
_M._options = _descriptor._ParseOptions(descriptor_pb2.MessageOptions(), _b('\272\350\332\004\003\010\347\007'))

This approach could work for Ruby. We should freeze the descriptor after we set options, so users don't try to add or mutate options on the descriptor after it has been built.

@haberman
Copy link
Member

haberman commented Oct 9, 2017

Sorry about the delay from me. There are some merge conflicts to resolve, but overall I think this is a promising approach. I think the biggest thing to nail down is how to express this in the DSL. I think we should mirror the structure of proto descriptors as closely as possible.

@zanker
Copy link
Contributor Author

zanker commented Nov 20, 2017

Sorry for the delay, been on vacations. I left Square, and won't be able to continue the work on this PR. When I talked to @nerdrew last, he was going to potentially take a look at it, but not sure if that's still the case. Sorry about that, thanks for your help on the PR.

If someone else ends up wanting to take it over the finish line, I'm happy to help walk them through the PR and how things are organized.

@zanker zanker closed this Nov 20, 2017
@djudd-stripe
Copy link

FWIW - I am unlikely to be able to take this up; we have a workaround which involves parsing protoc's descriptor_set_out to get annotation values, and so I can't justify a lot of time on it. Also my C is pretty rusty. Happy to provide more details on our workaround to anyone interested, though.

@djudd-stripe
Copy link

Also, @zanker thanks for updating us!

@hrsht
Copy link
Contributor

hrsht commented Jun 13, 2018

@haberman I am going to take over this work of adding proto2 support for Ruby from @zanker and hopefully see to completion. I plan on adding support for extensions and options in the coming weeks.

Meanwhile, do you think this PR is more or less ready to be merged for basic proto2 syntax support (except of course for changes in ruby_generator.cc to not error out for proto2 files)? If so, I can send out a new PR with all of the changes here, plus a couple bug fixes and the required changes in ruby_generator.cc. Please let me know.

Thanks!

cc: @Yurokle

@zanker
Copy link
Contributor Author

zanker commented Jun 13, 2018 via email

@@ -447,6 +428,22 @@ MessageLayout* create_layout(const upb_msgdef* msgdef) {

layout->fields = ALLOC_N(MessageField, nfields);

size_t hasbit = sizeof(void*) * 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zanker do you remember the reason for setting this to sizeof(void*) * 8 and not 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it had to do with pointer alignment, but apparently past me was smart and wrote a helpful comment. From encode_decode.c

// The offset we pass to UPB points to the start of the Message, rather than the start of where our data is stored.
// Which works out to the size of a pointer. Since we can't pass arbitrary pointers to UPB, we bump it by the size*8
// which lets us reuse the UPB writers for primitives.
int32_t hasbit = hasbit_off == MESSAGE_FIELD_NO_HASBIT ? -1 : hasbit_off + sizeof(void*) * 8;

The general gist of the presence checking, is that it's a bit hackish because I'm trying to reuse as much of the existing UPB code as possible. There's definitely nicer ways to do it, but you end up reimplementing more code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, starting at sizeof(void*) * 8 sets the first hasbit at offset 64, which will allocate the 9th byte in the data part of MessageHeader. So effectively, there is no difference whether you start from 0, or 64 or any other value.

As for the code

int32_t hasbit = hasbit_off == MESSAGE_FIELD_NO_HASBIT ? -1 : hasbit_off + sizeof(void*) * 8;

it should be sizeof(MessageHeader) instead of sizeof(void*) as you want the offset to the data part of the message type and skip the number of bytes needed for MessageHeader

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're able to get it to work by simplifying it, then I would definitely do so. You should also add some tests that encode and decode a message then check the has_foo? fields to verify. The main thing, is that it's all setup to align to the offsets used by: https://github.com/zanker/protobuf-1/blob/af38326c9a7ceab44b148a80ed8e1e734821733f/ruby/ext/google/protobuf_c/upb.c#L3282, not by MessageLayout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh.. I see. Looks like that has changed in the recent version of UPB. https://github.com/google/protobuf/blob/master/ruby/ext/google/protobuf_c/upb.c#L4311

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Awesome. That should make life easier then.

It looks like the version of UPB got upgraded with the unknown field change, and so you now have things like upb_sethasbit and upb_readhasbit. That should make it a lot easier and let you clean up the code. A lot of the issues I had were that I had to reimplement them to match what UPB did.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know if I can use upb_sethasbit or upb_readhasbit without switching to using UPB for message layout and the in-memory representation of a message. And that to me sounds like more work that this PR is intended for.

Between, @haberman is the eventual plan to use UPB for message layout and the in-memory representation?

@hrsht
Copy link
Contributor

hrsht commented Jun 21, 2018

@haberman ping!

@xfxyjwf
Copy link
Contributor

xfxyjwf commented Jun 21, 2018

@hrsht Can you create a new issue and/or pull request for this?

@hrsht
Copy link
Contributor

hrsht commented Jun 22, 2018

@xfxyjwf created #4816. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants