Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vote For New Feature: Add Extra fields at the bottom of a Message #175

Closed
AudriusButkevicius opened this issue May 11, 2016 · 27 comments
Closed

Comments

@AudriusButkevicius
Copy link

AudriusButkevicius commented May 11, 2016

I think this is impossible, but just want to double check.

I am interested in generating additional fields on the structs that gogoproto generates, yet I don't want these fields to be serialized into the write protocol, no matter what the value is.

The use case is as follows:
We slice a file into list of blocks, each block is potentially of a variable size.
We'd like to store the offset of the block within the file on the block struct, yet it's not something we'd want to send over the write as this is something that can be easily computed on the other side, especially if we can have 10 million blocks in a single message.

A few hacks I can think of:

  1. Postprocess generated code
  2. Wrapper struct that embeds generated struct, and boilerplate code that performs conversion after unmarshalling off the wire..
  3. Message option to generate these?
@awalterschulze
Copy link
Member

What is this boilerplate code in option 2?
Maybe a small example would help explain it better?

@AudriusButkevicius
Copy link
Author

AudriusButkevicius commented May 11, 2016

To be honest, now that I've thought about it, it's very unreasonable, due to blocks being quite deep within other messages, it would be something like this, producing a stupid amount of garbage:

I'd imagine the following proto:

message WireUpdate {
  int64 Version = 1;
  repeated WireFile Files = 2;
}

message WireFile {
  string FileName = 1;
  int64 Size = 2;
  repeated WireBlock Blocks = 3;
}

message WireBlock {
  bytes Hash = 1;
  int64 Size = 2;
}

Boilerplate code:

type Update struct {
  Version int64
  Files []File
}

type File struct {
  FileName string
  Size int64
  Blocks []Block
}

type Block struct {
  generated.WireBlock
  Offset uint64
}

func WireUpdateToUpdate(update generated.WireUpdate) Update {
  u := Update{
    Version: update.Version,
    Files: make([]File, len(update.Files)),
  }
  for i := update.Files {
    u.Files[i] = WireFiletoFile(update.Files[i])
  }
  return u
}

func WireFiletoFile(file generated.WireFile) Update {
  f := File{
    FileName: file.Name,
    Size: file.Size,
    Blocks: make([]Block, len(file.Blocks))
  }
  offset := 0
  for i, b := update.Blocks{
    u.Blocks[i] = Block{
      WireBlock: b,
      Offset: offset,
    }
    offset += b.Size
  }
  return f
}

Our current XDR generator allows just ignoring the fields, and we have a helper method which populates offsets on demand, only when they are needed, so the boilerplate code doesn't necessarily need to compute the offset at the point it converts, and could use a helper method for computing them later, but given we're looping through them, and producing a lot of garbage anyway, I just added computation in the boilerplate code.

@awalterschulze
Copy link
Member

Yes that is a lot of boilerplate and a lot of waste.

Looks like this would have also been able to help
#125
Since then you could declare WireBlock as a customtype.

You could put Offset as a field into the WireBlock message, but not send it.
And then have the helper method compute them later.
But then the API is a little misleading.

So if we have a message option which allows you to add some string at the end of the message fields.

message WireBlock {
  option (gogoproto.extra_message_stuff="\nOffset int64\n";
  bytes Hash = 1;
  int64 Size = 2;
}

This would generate go code:

type WireBlock {
  Hash []byte
  Size int64
  Offset int64
}

But then all other generated methods would ignore these fields.
So for example Equal would not Compare the Offset field.

Any more ideas?

@AudriusButkevicius
Copy link
Author

For my specific case nor Compare nor Equal are used, so having something like you suggested would be perfect.

@awalterschulze
Copy link
Member

Ok I don't have time now, but I also want to hear about another use case for this before creating a feature.
So lets let this issue hang out for a while.
I hope thats ok?

@awalterschulze awalterschulze changed the title Feature: Ignored fields Vote For New Feature: Add Extra fields at the bottom of a Message May 12, 2016
@AudriusButkevicius
Copy link
Author

AudriusButkevicius commented May 12, 2016

So I am willing to work on this feature, yet some guidance where this could be wedged in would be nice, as this code is still a maze to me.

I guess I need three locations:
Where to add the option needs to be added
Where options get parsed
Where code gets generated

@awalterschulze
Copy link
Member

awalterschulze commented May 13, 2016

I think their might be usecases where we want to specifically specify each extra field, but I don't know. So I am a little hesitant to move forward at this point.

@AudriusButkevicius
Copy link
Author

Right, I'll fork and implement it anyway, as it will save me from a lot of hassle, so if you could give me some guidance that (where it needs to be plugged in) would be nice.

I will make a PR proposing the feature, but I do not expect you to merge it, if you feel there are downsides.

@awalterschulze
Copy link
Member

That sounds great :)

Ok so this is where you needs to add the option
https://github.com/gogo/protobuf/blob/master/gogoproto/gogo.proto

Its also nice to add a little helper function here
https://github.com/gogo/protobuf/blob/master/gogoproto/helper.go

The options get parsed by protoc, they are extensions on top of the descriptor. Its all legal protocol buffers.

Then I would dive into the generator
https://github.com/gogo/protobuf/blob/master/protoc-gen-gogo/generator/generator.go
Since hopefully that is the only place you'll need to edit generator code.
This is probably the place where you'll use your helper function and print in the string
https://github.com/gogo/protobuf/blob/master/protoc-gen-gogo/generator/generator.go#L2090

@AudriusButkevicius
Copy link
Author

Tyvm

@awalterschulze
Copy link
Member

Pull Request #179

Please if anyone likes this feature, vote for it by making a comment.

@awalterschulze
Copy link
Member

If anyone has a comment about a change or extra use case here, please make also make a comment.

@stevvooe
Copy link
Contributor

Couldn't most of these use cases be solved through struct embedding?

@AudriusButkevicius
Copy link
Author

I've provided explanation and the boiler plate code involved. Embedding does not work if you struct with extra field has to be embeded in another proto message

@stevvooe
Copy link
Contributor

Embedding does not work if you struct with extra field has to be embeded in another proto message

In general, protobuf types don't behave well when mutually embedded but I guess I am not seeing how the example demonstrates that embedding is problematic.

To tell you the truth, it seems like this could be solved by simply sending the offset of the block over. Most of the problems goes away when you just send the extra field. Also, the offset can be extremely helpful in identifying corruption or bugs in your chunking algorithm. I've done the analysis and experimentation on such protocols and removing the offset provides a minimal, if not unrealizable, optimization, that trades off a lot of complexity. In this presented case, there is merely a 20% savings in size. When you add compression and varint encoding, this is even less. Either way, this seems like a lot of work for an optimization that will most definitely not be a bottleneck in this protocol. The bottleneck will be calculating hashes.

My 2 cents. Sorry for jumping in here.

@AudriusButkevicius
Copy link
Author

We use the same wire format for storing stuff in the database, not only sending it on the write, which means wasting disk space too.

We've been doing fine without offsets stored or sent anywhere for 3 years so even though I appreciate your analysis and experimentation, I don't think it has any meaning in our context when we genuinely know we don't need it.

Also, your suggested merely 20% saving translates to something less "merely" for the guy with 20TB of data (https://data.syncthing.net/).

@stevvooe
Copy link
Contributor

@AudriusButkevicius No need for the dismissive tone. I'm just trying to suggest a simpler solution that may save you time and avoid a solution that looks that can create other, more subtle problems.

We use the same wire format for storing stuff in the database, not only sending it on the write, which means wasting disk space too.

You could remove the field before writing to the database. The records could also be compressed and the extra storage requirements may be negligible.

We've been doing fine without offsets stored or sent anywhere for 3 years so even though I appreciate your analysis and experimentation, I don't think it has any meaning in our context when we genuinely know we don't need it.

I've been doing this for awhile, as well, so this isn't just "experimentation".

In general, I'm sure it is workable, but the current design from this issue has no way to verify the complete file hash or the offsets of the individual blocks, unless you hashed the entire structure. If offset and block layout were incorrect, it would be impossible to detect without comparing to the source file. For example, if File.Size did not match the sum of block.Size, there would be no way to detect which block introduced the error without the offset. I'm not sure how common this is in practice.

Also, your suggested merely 20% saving translates to something less "merely" for the guy with 20TB of data (https://data.syncthing.net/).

This is an upper bound of 20% for metadata, with probably way less than 10% in practice. Once you take into account all data, the savings is negligible (like <0.0001%, with a 1MB block size). This really only provides benefits when total data is at the exabyte scale, but at this scale, the metadata is still massive, even without the offset, so relative benefit may be low. If you have smaller block sizes, removing the offset will provide more and more benefit.

Either way, I think relying on protobuf generation for this solution isn't a great approach when a package-local struct would suffice.

@AudriusButkevicius
Copy link
Author

AudriusButkevicius commented May 16, 2016

You could remove the field before writing to the database. The records could also be compressed and the extra storage requirements may be negligible.

But that would produce tons of garbage, as I'd have to copy every single block, or use unsafe hacks for it not to consume more memory.

The data is already compressed, but you have to understand that these block structs take up probably 99% of what's in the database.

This really only provides benefits when total data is at the exabyte scale, but at this scale, the metadata is still massive, even without the offset, so relative benefit may be low. If you have smaller block sizes, removing the offset will provide more and more benefit.

Our block size is 128kb, and we exchange the whole of metadata every time we connect - literally, every time, yes it's not ideal but that's not a topic for discussion here.

If you are on a mobile somewhere in southern africa where data rates are crazy expensive, and you happen to move between antennas, every byte counts.

In general, I'm sure it is workable, but the current design from this issue has no way to verify the complete file hash or the offsets of the individual blocks, unless you hashed the entire structure. If offset and block layout were incorrect, it would be impossible to detect without comparing to the source file.

We have no verification. The database we use provides checksumming itself, so we are sure the data that lives there should not be corrupt, as well as TCP checksums should ensure integrity of the packets.

Either way, I think relying on protobuf generation for this solution isn't a great approach when a package-local struct would suffice.

As I already said above, the amount of garbage local structs would produce is unacceptable. Users run our software on stupidly small devices and yet expect acceptable performance (PI's with 256MB of RAM handling 4TB drive arrays). I am sure that by doing this we'd put the entry bar higher in terms of how much memory you need in order to handle a working set of files sized Y.

@awalterschulze
Copy link
Member

Hey where in Southern Africa are you?
I am in South Africa, Stellenbosch myself, well not at the moment, but generally.

Sorry, but now I am going to make this a serious suggestion.
Please feel free to tell me how I am wrong. Even with a dismissive tone :)

How about just putting the Offset field in the WireBlock and not populating it?

Or you could have two protobuf definitions.
One without the Offset (as is), this provides the API and another with the offset, this is used by the client.
Because you define these protobuf definitions in a compatible way you can Unmarshal the API (server) protobufs into the client protobufs. And now the client can simply populate the offsets.

@awalterschulze
Copy link
Member

I am suggesting this because maintaining a fork is not the funnest thing to do and it might be a while before I see more usecases for this feature.

@AudriusButkevicius
Copy link
Author

So I am not in south africa, but it's not a rare occurrence of people with stupidly slow links complaining how much bandwidth the software consumes.

As far as I understand not populating a field still consumes at least 3 bytes (2 bytes for the type information, 1 byte for a value of 0 if I understand correctly) on the wire.

Regarding having 2 messages, it either becomes a one way trip or produces garbage:

  1. Offsetless on the wire to offsetful struct, ok
  2. Offsetful struct to offsetful on the write - not ok, extra bytes that I don't want
  3. Offsetful struct to offsetless struct to offsetless on the wirte - not ok, produces garbage by having to construct offsetless struct.

@awalterschulze
Copy link
Member

Are you using proto3?

@awalterschulze
Copy link
Member

Because then you can have even less bytes for zero populated fields.

@AudriusButkevicius
Copy link
Author

Yes, it is proto3, but I assume zero populated fields are still not free?

@awalterschulze
Copy link
Member

zero populated fields are free in proto3.
They do not send any bytes.

@AudriusButkevicius
Copy link
Author

Ok, so I guess I can just zero them out.

Case closed then.

Also, you've mentioned that I could define two identical structs, one with less fields to perform partial deserialization. Does this work on proto3, and are there any specific requirements (such as matching field names/ids?)

@awalterschulze
Copy link
Member

The fieldnumber and types should be the same.
It is basically the backwards compatibility garauntee that protobufs gives you.
But remember that if you deserialize with proto3 and serialize again the unrecognized fields will be removed.

server.proto

message Update {
  int64 Version = 1;
  repeated File Files = 2;
}

message File {
  string FileName = 1;
  int64 Size = 2;
  repeated Block Blocks = 3;
}

message Block {
  bytes Hash = 1;
  int64 Size = 2;
}

client.proto

message Update {
  int64 Version = 1;
  repeated File Files = 2;
}

message File {
  string FileName = 1;
  int64 Size = 2;
  repeated Block Blocks = 3;
}

message Block {
  bytes Hash = 1;
  int64 Size = 2;
  int64 Offset = 3;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants