Vote For New Feature: Add Extra fields at the bottom of a Message #175

AudriusButkevicius · 2016-05-11T19:42:17Z

I think this is impossible, but just want to double check.

I am interested in generating additional fields on the structs that gogoproto generates, yet I don't want these fields to be serialized into the write protocol, no matter what the value is.

The use case is as follows:
We slice a file into list of blocks, each block is potentially of a variable size.
We'd like to store the offset of the block within the file on the block struct, yet it's not something we'd want to send over the write as this is something that can be easily computed on the other side, especially if we can have 10 million blocks in a single message.

A few hacks I can think of:

Postprocess generated code
Wrapper struct that embeds generated struct, and boilerplate code that performs conversion after unmarshalling off the wire..
Message option to generate these?

awalterschulze · 2016-05-11T19:47:06Z

What is this boilerplate code in option 2?
Maybe a small example would help explain it better?

AudriusButkevicius · 2016-05-11T20:05:33Z

To be honest, now that I've thought about it, it's very unreasonable, due to blocks being quite deep within other messages, it would be something like this, producing a stupid amount of garbage:

I'd imagine the following proto:

message WireUpdate {
  int64 Version = 1;
  repeated WireFile Files = 2;
}

message WireFile {
  string FileName = 1;
  int64 Size = 2;
  repeated WireBlock Blocks = 3;
}

message WireBlock {
  bytes Hash = 1;
  int64 Size = 2;
}

Boilerplate code:

type Update struct {
  Version int64
  Files []File
}

type File struct {
  FileName string
  Size int64
  Blocks []Block
}

type Block struct {
  generated.WireBlock
  Offset uint64
}

func WireUpdateToUpdate(update generated.WireUpdate) Update {
  u := Update{
    Version: update.Version,
    Files: make([]File, len(update.Files)),
  }
  for i := update.Files {
    u.Files[i] = WireFiletoFile(update.Files[i])
  }
  return u
}

func WireFiletoFile(file generated.WireFile) Update {
  f := File{
    FileName: file.Name,
    Size: file.Size,
    Blocks: make([]Block, len(file.Blocks))
  }
  offset := 0
  for i, b := update.Blocks{
    u.Blocks[i] = Block{
      WireBlock: b,
      Offset: offset,
    }
    offset += b.Size
  }
  return f
}

Our current XDR generator allows just ignoring the fields, and we have a helper method which populates offsets on demand, only when they are needed, so the boilerplate code doesn't necessarily need to compute the offset at the point it converts, and could use a helper method for computing them later, but given we're looping through them, and producing a lot of garbage anyway, I just added computation in the boilerplate code.

awalterschulze · 2016-05-12T08:15:18Z

Yes that is a lot of boilerplate and a lot of waste.

Looks like this would have also been able to help
#125
Since then you could declare WireBlock as a customtype.

You could put Offset as a field into the WireBlock message, but not send it.
And then have the helper method compute them later.
But then the API is a little misleading.

So if we have a message option which allows you to add some string at the end of the message fields.

message WireBlock {
  option (gogoproto.extra_message_stuff="\nOffset int64\n";
  bytes Hash = 1;
  int64 Size = 2;
}

This would generate go code:

type WireBlock {
  Hash []byte
  Size int64
  Offset int64
}

But then all other generated methods would ignore these fields.
So for example Equal would not Compare the Offset field.

Any more ideas?

AudriusButkevicius · 2016-05-12T08:21:29Z

For my specific case nor Compare nor Equal are used, so having something like you suggested would be perfect.

awalterschulze · 2016-05-12T08:44:48Z

Ok I don't have time now, but I also want to hear about another use case for this before creating a feature.
So lets let this issue hang out for a while.
I hope thats ok?

AudriusButkevicius · 2016-05-12T09:13:42Z

So I am willing to work on this feature, yet some guidance where this could be wedged in would be nice, as this code is still a maze to me.

I guess I need three locations:
Where to add the option needs to be added
Where options get parsed
Where code gets generated

awalterschulze · 2016-05-13T09:36:41Z

I think their might be usecases where we want to specifically specify each extra field, but I don't know. So I am a little hesitant to move forward at this point.

AudriusButkevicius · 2016-05-13T09:40:04Z

Right, I'll fork and implement it anyway, as it will save me from a lot of hassle, so if you could give me some guidance that (where it needs to be plugged in) would be nice.

I will make a PR proposing the feature, but I do not expect you to merge it, if you feel there are downsides.

awalterschulze · 2016-05-13T09:50:36Z

That sounds great :)

Ok so this is where you needs to add the option
https://github.com/gogo/protobuf/blob/master/gogoproto/gogo.proto

Its also nice to add a little helper function here
https://github.com/gogo/protobuf/blob/master/gogoproto/helper.go

The options get parsed by protoc, they are extensions on top of the descriptor. Its all legal protocol buffers.

Then I would dive into the generator
https://github.com/gogo/protobuf/blob/master/protoc-gen-gogo/generator/generator.go
Since hopefully that is the only place you'll need to edit generator code.
This is probably the place where you'll use your helper function and print in the string
https://github.com/gogo/protobuf/blob/master/protoc-gen-gogo/generator/generator.go#L2090

AudriusButkevicius · 2016-05-13T09:51:27Z

Tyvm

awalterschulze · 2016-05-15T05:17:16Z

Pull Request #179

Please if anyone likes this feature, vote for it by making a comment.

awalterschulze · 2016-05-15T05:18:10Z

If anyone has a comment about a change or extra use case here, please make also make a comment.

stevvooe · 2016-05-16T19:04:27Z

Couldn't most of these use cases be solved through struct embedding?

AudriusButkevicius · 2016-05-16T19:08:33Z

I've provided explanation and the boiler plate code involved. Embedding does not work if you struct with extra field has to be embeded in another proto message

stevvooe · 2016-05-16T21:03:45Z

Embedding does not work if you struct with extra field has to be embeded in another proto message

In general, protobuf types don't behave well when mutually embedded but I guess I am not seeing how the example demonstrates that embedding is problematic.

To tell you the truth, it seems like this could be solved by simply sending the offset of the block over. Most of the problems goes away when you just send the extra field. Also, the offset can be extremely helpful in identifying corruption or bugs in your chunking algorithm. I've done the analysis and experimentation on such protocols and removing the offset provides a minimal, if not unrealizable, optimization, that trades off a lot of complexity. In this presented case, there is merely a 20% savings in size. When you add compression and varint encoding, this is even less. Either way, this seems like a lot of work for an optimization that will most definitely not be a bottleneck in this protocol. The bottleneck will be calculating hashes.

My 2 cents. Sorry for jumping in here.

AudriusButkevicius · 2016-05-16T21:13:26Z

We use the same wire format for storing stuff in the database, not only sending it on the write, which means wasting disk space too.

We've been doing fine without offsets stored or sent anywhere for 3 years so even though I appreciate your analysis and experimentation, I don't think it has any meaning in our context when we genuinely know we don't need it.

Also, your suggested merely 20% saving translates to something less "merely" for the guy with 20TB of data (https://data.syncthing.net/).

stevvooe · 2016-05-16T22:04:53Z

@AudriusButkevicius No need for the dismissive tone. I'm just trying to suggest a simpler solution that may save you time and avoid a solution that looks that can create other, more subtle problems.

We use the same wire format for storing stuff in the database, not only sending it on the write, which means wasting disk space too.

You could remove the field before writing to the database. The records could also be compressed and the extra storage requirements may be negligible.

We've been doing fine without offsets stored or sent anywhere for 3 years so even though I appreciate your analysis and experimentation, I don't think it has any meaning in our context when we genuinely know we don't need it.

I've been doing this for awhile, as well, so this isn't just "experimentation".

In general, I'm sure it is workable, but the current design from this issue has no way to verify the complete file hash or the offsets of the individual blocks, unless you hashed the entire structure. If offset and block layout were incorrect, it would be impossible to detect without comparing to the source file. For example, if File.Size did not match the sum of block.Size, there would be no way to detect which block introduced the error without the offset. I'm not sure how common this is in practice.

Also, your suggested merely 20% saving translates to something less "merely" for the guy with 20TB of data (https://data.syncthing.net/).

This is an upper bound of 20% for metadata, with probably way less than 10% in practice. Once you take into account all data, the savings is negligible (like <0.0001%, with a 1MB block size). This really only provides benefits when total data is at the exabyte scale, but at this scale, the metadata is still massive, even without the offset, so relative benefit may be low. If you have smaller block sizes, removing the offset will provide more and more benefit.

Either way, I think relying on protobuf generation for this solution isn't a great approach when a package-local struct would suffice.

AudriusButkevicius · 2016-05-16T22:42:07Z

You could remove the field before writing to the database. The records could also be compressed and the extra storage requirements may be negligible.

But that would produce tons of garbage, as I'd have to copy every single block, or use unsafe hacks for it not to consume more memory.

The data is already compressed, but you have to understand that these block structs take up probably 99% of what's in the database.

This really only provides benefits when total data is at the exabyte scale, but at this scale, the metadata is still massive, even without the offset, so relative benefit may be low. If you have smaller block sizes, removing the offset will provide more and more benefit.

Our block size is 128kb, and we exchange the whole of metadata every time we connect - literally, every time, yes it's not ideal but that's not a topic for discussion here.

If you are on a mobile somewhere in southern africa where data rates are crazy expensive, and you happen to move between antennas, every byte counts.

In general, I'm sure it is workable, but the current design from this issue has no way to verify the complete file hash or the offsets of the individual blocks, unless you hashed the entire structure. If offset and block layout were incorrect, it would be impossible to detect without comparing to the source file.

We have no verification. The database we use provides checksumming itself, so we are sure the data that lives there should not be corrupt, as well as TCP checksums should ensure integrity of the packets.

Either way, I think relying on protobuf generation for this solution isn't a great approach when a package-local struct would suffice.

As I already said above, the amount of garbage local structs would produce is unacceptable. Users run our software on stupidly small devices and yet expect acceptable performance (PI's with 256MB of RAM handling 4TB drive arrays). I am sure that by doing this we'd put the entry bar higher in terms of how much memory you need in order to handle a working set of files sized Y.

awalterschulze · 2016-05-19T18:51:15Z

Hey where in Southern Africa are you?
I am in South Africa, Stellenbosch myself, well not at the moment, but generally.

Sorry, but now I am going to make this a serious suggestion.
Please feel free to tell me how I am wrong. Even with a dismissive tone :)

How about just putting the Offset field in the WireBlock and not populating it?

Or you could have two protobuf definitions.
One without the Offset (as is), this provides the API and another with the offset, this is used by the client.
Because you define these protobuf definitions in a compatible way you can Unmarshal the API (server) protobufs into the client protobufs. And now the client can simply populate the offsets.

awalterschulze · 2016-05-19T18:51:57Z

I am suggesting this because maintaining a fork is not the funnest thing to do and it might be a while before I see more usecases for this feature.

AudriusButkevicius · 2016-05-19T19:07:21Z

So I am not in south africa, but it's not a rare occurrence of people with stupidly slow links complaining how much bandwidth the software consumes.

As far as I understand not populating a field still consumes at least 3 bytes (2 bytes for the type information, 1 byte for a value of 0 if I understand correctly) on the wire.

Regarding having 2 messages, it either becomes a one way trip or produces garbage:

Offsetless on the wire to offsetful struct, ok
Offsetful struct to offsetful on the write - not ok, extra bytes that I don't want
Offsetful struct to offsetless struct to offsetless on the wirte - not ok, produces garbage by having to construct offsetless struct.

awalterschulze · 2016-05-19T19:16:14Z

Are you using proto3?

awalterschulze · 2016-05-19T19:16:36Z

Because then you can have even less bytes for zero populated fields.

AudriusButkevicius · 2016-05-19T19:26:29Z

Yes, it is proto3, but I assume zero populated fields are still not free?

awalterschulze · 2016-05-19T19:37:25Z

zero populated fields are free in proto3.
They do not send any bytes.

AudriusButkevicius · 2016-05-19T19:46:06Z

Ok, so I guess I can just zero them out.

Case closed then.

Also, you've mentioned that I could define two identical structs, one with less fields to perform partial deserialization. Does this work on proto3, and are there any specific requirements (such as matching field names/ids?)

awalterschulze · 2016-05-20T14:03:46Z

The fieldnumber and types should be the same.
It is basically the backwards compatibility garauntee that protobufs gives you.
But remember that if you deserialize with proto3 and serialize again the unrecognized fields will be removed.

server.proto

message Update {
  int64 Version = 1;
  repeated File Files = 2;
}

message File {
  string FileName = 1;
  int64 Size = 2;
  repeated Block Blocks = 3;
}

message Block {
  bytes Hash = 1;
  int64 Size = 2;
}

client.proto

message Update {
  int64 Version = 1;
  repeated File Files = 2;
}

message File {
  string FileName = 1;
  int64 Size = 2;
  repeated Block Blocks = 3;
}

message Block {
  bytes Hash = 1;
  int64 Size = 2;
  int64 Offset = 3;
}

awalterschulze changed the title ~~Feature: Ignored fields~~ Vote For New Feature: Add Extra fields at the bottom of a Message May 12, 2016

awalterschulze added enhancement help wanted labels May 12, 2016

awalterschulze closed this as completed May 20, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vote For New Feature: Add Extra fields at the bottom of a Message #175

Vote For New Feature: Add Extra fields at the bottom of a Message #175

AudriusButkevicius commented May 11, 2016 •

edited

awalterschulze commented May 11, 2016

AudriusButkevicius commented May 11, 2016 •

edited

awalterschulze commented May 12, 2016

AudriusButkevicius commented May 12, 2016

awalterschulze commented May 12, 2016

AudriusButkevicius commented May 12, 2016 •

edited

awalterschulze commented May 13, 2016 •

edited

AudriusButkevicius commented May 13, 2016

awalterschulze commented May 13, 2016

AudriusButkevicius commented May 13, 2016

awalterschulze commented May 15, 2016

awalterschulze commented May 15, 2016

stevvooe commented May 16, 2016

AudriusButkevicius commented May 16, 2016

stevvooe commented May 16, 2016

AudriusButkevicius commented May 16, 2016

stevvooe commented May 16, 2016

AudriusButkevicius commented May 16, 2016 •

edited

awalterschulze commented May 19, 2016

awalterschulze commented May 19, 2016

AudriusButkevicius commented May 19, 2016

awalterschulze commented May 19, 2016

awalterschulze commented May 19, 2016

AudriusButkevicius commented May 19, 2016

awalterschulze commented May 19, 2016

AudriusButkevicius commented May 19, 2016

awalterschulze commented May 20, 2016

Vote For New Feature: Add Extra fields at the bottom of a Message #175

Vote For New Feature: Add Extra fields at the bottom of a Message #175

Comments

AudriusButkevicius commented May 11, 2016 • edited

awalterschulze commented May 11, 2016

AudriusButkevicius commented May 11, 2016 • edited

awalterschulze commented May 12, 2016

AudriusButkevicius commented May 12, 2016

awalterschulze commented May 12, 2016

AudriusButkevicius commented May 12, 2016 • edited

awalterschulze commented May 13, 2016 • edited

AudriusButkevicius commented May 13, 2016

awalterschulze commented May 13, 2016

AudriusButkevicius commented May 13, 2016

awalterschulze commented May 15, 2016

awalterschulze commented May 15, 2016

stevvooe commented May 16, 2016

AudriusButkevicius commented May 16, 2016

stevvooe commented May 16, 2016

AudriusButkevicius commented May 16, 2016

stevvooe commented May 16, 2016

AudriusButkevicius commented May 16, 2016 • edited

awalterschulze commented May 19, 2016

awalterschulze commented May 19, 2016

AudriusButkevicius commented May 19, 2016

awalterschulze commented May 19, 2016

awalterschulze commented May 19, 2016

AudriusButkevicius commented May 19, 2016

awalterschulze commented May 19, 2016

AudriusButkevicius commented May 19, 2016

awalterschulze commented May 20, 2016

AudriusButkevicius commented May 11, 2016 •

edited

AudriusButkevicius commented May 11, 2016 •

edited

AudriusButkevicius commented May 12, 2016 •

edited

awalterschulze commented May 13, 2016 •

edited

AudriusButkevicius commented May 16, 2016 •

edited