
Exploring Serialization via Protobuf and Others #150

Closed

prestonvanloon opened this issue Jun 5, 2018 · 17 comments
Labels
Enhancement New feature or request

Comments

@prestonvanloon
Member

prestonvanloon commented Jun 5, 2018

This issue exists to track progress on exploration of other serialization strategies for sharding and Ethereum. We'll likely want to move this into a new repository once work has been started.

Motivation

With RLP and other custom serialization mechanisms for Ethereum, it feels a bit like reinventing the wheel when a better-supported open source library may already exist.

The main motivation for RLP:

The alternative to RLP would have been using an existing algorithm such as protobuf or BSON; however, we prefer RLP because of (1) simplicity of implementation, and (2) guaranteed absolute byte-perfect consistency.

The question we want to answer is whether this problem is already solved by protocol buffers or other existing mechanisms.

Challenges with Hashing in Different Languages

Key/value maps in many languages don't have an explicit ordering, and floating point formats have many special cases, so the same data can produce different encodings and thus different hashes.
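To make this concrete, here's a small Go sketch (illustrative only, not from any client implementation) showing how hashing the result of naive map iteration can produce different digests for the same logical data, since Go deliberately randomizes map iteration order:

package main

import (
    "crypto/sha256"
    "fmt"
)

func main() {
    data := map[string]string{"a": "1", "b": "2", "c": "3"}
    for i := 0; i < 3; i++ {
        var buf []byte
        // Iteration order is unspecified and randomized by the Go runtime,
        // so the serialized bytes can differ between runs or iterations.
        for k, v := range data {
            buf = append(buf, k...)
            buf = append(buf, v...)
        }
        fmt.Printf("attempt %d: %x\n", i, sha256.Sum256(buf))
    }
}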

See RLP design rationale for more context.

Google Protobuf

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

How to test consistency across all languages?

One option is to write a gRPC service definition and implement the test in each popular language. The test would be easy to extend to another language, provided that it implements the service.

gRPC server for each language

Example service definition:

service SerializerTest {
  rpc TestHash(Block) returns (Hash) {}
}

message Block {
  Header header = 1;
  repeated Header uncles = 2;
  repeated Transaction transactions = 3;

  message Header {
    bytes parent_hash = 1;
    bytes uncles_hash = 2;
    ...
  }
}

message Transaction {
  uint64 nonce = 1;
  uint64 price = 2;
  ...
}

// Hash result 
message Hash {
  bytes hash = 1;
  Block block = 2;
}

The request proto carries an object resembling a block, and the service responds with the resulting hash. The test then compares this against the expected hash.

The tests can and should be populated with real Ethereum blocks that have been mined, along with their associated hashes. This provides solid evidence that the test cases are valid.
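For illustration, here is a minimal Go sketch of one such service. The generated package import path and the hashing scheme (Keccak-256 over the protobuf encoding) are assumptions for the sketch, not settled choices:

package main

import (
    "context"
    "log"
    "net"

    "github.com/golang/protobuf/proto"
    "golang.org/x/crypto/sha3"
    "google.golang.org/grpc"

    pb "example.com/serializertest/proto" // hypothetical: package generated from the proto above
)

// server implements the SerializerTest service from the proto definition.
type server struct{}

// TestHash re-encodes the block and hashes the resulting bytes.
// Keccak-256 over the protobuf encoding is illustrative only.
func (s *server) TestHash(ctx context.Context, block *pb.Block) (*pb.Hash, error) {
    enc, err := proto.Marshal(block)
    if err != nil {
        return nil, err
    }
    h := sha3.NewLegacyKeccak256()
    h.Write(enc)
    return &pb.Hash{Hash: h.Sum(nil), Block: block}, nil
}

func main() {
    lis, err := net.Listen("tcp", ":5000") // each container serves :5000 internally
    if err != nil {
        log.Fatal(err)
    }
    srv := grpc.NewServer()
    pb.RegisterSerializerTestServer(srv, &server{}) // generated registration helper
    log.Fatal(srv.Serve(lis))
}

Each language's server would implement the same TestHash RPC against its own protobuf runtime, which is exactly where encoding differences would surface.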

Why set up this infrastructure of gRPC services?

The main idea is that we can run these tests against each language with a language-agnostic client, in isolation.

Why gRPC?

Because of its low boilerplate, code generation, and structured payloads.

List of officially supported languages

  • C++
  • Java
  • Python
  • Go
  • Ruby
  • C#
  • Node.js
  • Android Java
  • Objective-C
  • PHP
  • Dart

List of 3rd party supported languages

There are probably many more languages...

How does the test client work?

The test client will act as a command line tool and most likely read from a series of config files.

We can imagine at least one config for the services to hit and another for the test cases.

The client will send the test proto to each of the services listed, in parallel. At the end of test execution, the client will print and/or write a report of pass/fail for test cases.

Example output of the client:

./run_tests

Running 5 test cases

Test 1
Java - PASS
Go - PASS
JavaScript - FAIL - Wanted hash ... got ...
Python - PASS

Test 2
...

Example services config:

services = [
   ["java", "127.0.0.1:5001"],
   ["go", "127.0.0.1:5002"],
   # ...
]
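A rough Go sketch of the client's fan-out logic (again assuming the hypothetical generated pb package from the server sketch; error handling kept minimal):

package main

import (
    "bytes"
    "context"
    "fmt"
    "sync"
    "time"

    "google.golang.org/grpc"

    pb "example.com/serializertest/proto" // hypothetical generated package
)

// runCase sends one test block to every service in parallel and prints
// a PASS/FAIL line per language, mirroring the example output above.
func runCase(block *pb.Block, want []byte, services map[string]string) {
    var wg sync.WaitGroup
    for lang, addr := range services {
        wg.Add(1)
        go func(lang, addr string) {
            defer wg.Done()
            conn, err := grpc.Dial(addr, grpc.WithInsecure())
            if err != nil {
                fmt.Printf("%s - FAIL - dial: %v\n", lang, err)
                return
            }
            defer conn.Close()
            ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
            defer cancel()
            res, err := pb.NewSerializerTestClient(conn).TestHash(ctx, block)
            switch {
            case err != nil:
                fmt.Printf("%s - FAIL - rpc: %v\n", lang, err)
            case !bytes.Equal(res.Hash, want):
                fmt.Printf("%s - FAIL - Wanted hash %x got %x\n", lang, want, res.Hash)
            default:
                fmt.Printf("%s - PASS\n", lang)
            }
        }(lang, addr)
    }
    wg.Wait()
}

func main() {
    services := map[string]string{"java": "127.0.0.1:5001", "go": "127.0.0.1:5002"}
    runCase(&pb.Block{}, nil, services) // placeholder block and expected hash
}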

Example test protos:

TODO: Real blocks with hashes in a proto-supported format.

What about service orchestration?

Maybe using docker compose?

It would be annoying to start many gRPC services locally without a single command.

What about benchmarks?

Benchmarks are important, but we already know RLP is not competitive on serialization performance.

We can add language specific benchmarks after we answer the question: will this work at all?

@prestonvanloon
Member Author

cc: @rawfalafel

@terencechain terencechain added the Enhancement New feature or request label Jun 5, 2018
@terencechain terencechain added this to To do in Documentation and Tooling via automation Jun 5, 2018
@rauljordan rauljordan changed the title from "Exploring serialization via protobuf and others" to "Exploring Serialization via Protobuf and Others" Jun 5, 2018
@rauljordan
Contributor

I am getting more sold on protobufs, especially with how they leave decoding up to each client.

What about service orchestration?
Maybe using docker compose?
It would be annoying to start many gRPC services locally without a single command.

Do you have examples of other projects using gRPC that do this via docker compose? Orchestration seems to be the only big question that arises from this proposal.

@prestonvanloon
Member Author

My thoughts on orchestration: we build the containers for each service then set up something to manage those containers.

I'm not too familiar with docker compose, but we need something that achieves the following:

  1. Builds image containers for each gRPC service
  2. Starts all of the containers on a port mapping (i.e. they could all serve on :5000 in their container)
  3. Runs the test suite or allows the test suite to be run locally

Here's an example of how I envision this workflow:

# Build all of the service container images
./build

# Start up the test service infrastructure
./start

# Then run the tests against those services
./run_tests

These shell scripts (or whatever) above would read a config file to outline port mappings for the test.
It might look like the "service config" that I mentioned in the original post.
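For reference, a docker-compose.yml along these lines could cover steps 1 and 2 (service names and build paths are hypothetical):

# Hypothetical docker-compose.yml: one container per gRPC serializer service,
# each serving :5000 internally and mapped to a unique host port.
version: "3"
services:
  serializer-java:
    build: ./java
    ports:
      - "5001:5000"
  serializer-go:
    build: ./go
    ports:
      - "5002:5000"
  # ...one entry per language

./start could then just be docker-compose up, and ./run_tests would read the same port mappings.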

@rawfalafel
Contributor

We discussed this on gitter but I'll recap here:

Protobuf was originally evaluated and passed on as a serialization mechanism because it doesn't provide byte-perfect consistency. With protobuf, the same object can be encoded multiple ways, and different encodings can be deserialized into the same object.

@prestonvanloon mentioned that this isn't an issue once a proposer commits to a chunkRoot, which is true, but this is still an issue after a transaction gets broadcast and before a proposer commits. Moreover, RLP is the de facto encoding scheme throughout the entire Ethereum protocol, and the assumption is that the encoding scheme provides byte-perfect consistency.

Hate to be the naysayer, especially because I'd also like to see a faster encoding scheme replace RLP, but I don't think we can use protobuf as-is.

@tfalencar

tfalencar commented Jun 8, 2018

Did you guys consider Fleece? It seems to have the properties needed, while being much simpler than protobuf.

https://github.com/couchbaselabs/fleece/blob/master/README.md

@prestonvanloon
Member Author

@tfalencar No we haven't, but a quick 15-second scan of the project turned this up:

Can I use it in $LANGUAGE? [where $LANGUAGE not in ("C++", "C")]
Not currently. ...

To be a reasonable replacement for RLP, it should preferably work for all modern languages.

With that said, nothing is out of the question here. If you have ideas or would like to explore Fleece and share your results, the community would be interested!

@terencechain
Member

It might be worth revisiting this now that sharding is breaking away from the main chain to the beacon chain; with a different consensus protocol, it's more feasible to switch from RLP to protobuf. The likely case is using protobuf to replace RLP for blob serialization.

@adamdrake

I've been exploring this topic as well, with the thought of using FlatBuffers over Protocol Buffers. The main benefit (IMO) is that FlatBuffers allows accessing the serialized data in a record without having to unpack it first. This has very large performance implications, of course.

https://google.github.io/flatbuffers/

Thoughts @prestonvanloon?

@mratsim

mratsim commented Aug 2, 2018

Another potential alternative: Cap'n Proto https://capnproto.org/ by the guy who implemented Protobuf at Google in the first place.

It seems to fit:

But doesn’t that mean the encoding is platform-specific?

NO! The encoding is defined byte-for-byte independent of any platform. However, it is designed to be efficiently manipulated on common modern CPUs. Data is arranged like a compiler would arrange a struct – with fixed widths, fixed offsets, and proper alignment. Variable-sized elements are embedded as pointers. Pointers are offset-based rather than absolute so that messages are position-independent. Integers use little-endian byte order because most CPUs are little-endian, and even big-endian CPUs usually have instructions for reading little-endian data.

@rauljordan
Contributor

Also, Cap'n Proto has tons of language support. Perhaps we can put together a small repo where we play around with these different schema-based serialization protocols across their different language implementations?

@prestonvanloon prestonvanloon moved this from To do to In progress in Documentation and Tooling Sep 18, 2018
@zjshen14

@rawfalafel, thanks for raising the concern about byte-perfect consistency in protobuf. I've been exploring a similar problem recently. Do you still recall the concrete example where the same object can be encoded multiple ways and different encodings can be deserialized into the same object, so I can evaluate whether it affects my case? Thanks!

@zjshen14

@prestonvanloon have you conducted the cross-language experiments? How was the consistency? Thanks!

@rawfalafel
Contributor

Yep, take an encoding of a protobuf object and reorder the fields. proto.Unmarshal should decode both encodings into the same object.

@rawfalafel
Contributor

BTW, we're exploring a new encoding described here: https://github.com/ethereum/beacon_chain/blob/master/ssz/README.md

@zjshen14

zjshen14 commented Sep 20, 2018

Yep, take an encoding of a protobuf object and reorder the fields. proto.Unmarshal should decode both encodings into the same object.

@rawfalafel, thanks for answering my question. I want to clarify whether you mean

message Foo {
  uint64 a = 1;
  bytes b = 2;
}

And

message Foo {
  bytes b = 1;
  uint64 a = 2;
}

will be marshaled into the same bytes? If my understanding is correct, I have some follow-up questions:

  • What's the use case for reordering? I assume the convention is that once we define a proto message, we don't change the field order.

  • And taking one step back, does the order even matter? In the case above, Foo hasn't meaningfully changed, so it should be okay for it to have the same serialized footprint in memory, no?

@rawfalafel
Contributor

Protobuf allows fields to be encoded in any order to facilitate merging two messages.

And taking one step back, does the order even matter? In the case above, Foo hasn't meaningfully changed, so it should be okay for it to have the same serialized footprint in memory, no?

Honest nodes should never encode in a different order. The problem, though, is when a malicious user purposefully encodes in the wrong order. In this scenario, the same message can have multiple encodings, and therefore multiple hashes, which breaks consensus.
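To make that concrete, here's a self-contained Go sketch; the wire bytes are hand-rolled from the protobuf encoding rules rather than taken from any client:

package main

import (
    "crypto/sha256"
    "fmt"
)

func main() {
    // Two hand-rolled protobuf wire encodings of the same logical message,
    // Transaction{nonce: 1, price: 2}, using the field numbers from the
    // proto above (tag = field_number<<3 | wire_type; wire type 0 = varint).
    a := []byte{0x08, 0x01, 0x10, 0x02} // field 1, then field 2
    b := []byte{0x10, 0x02, 0x08, 0x01} // field 2, then field 1
    // A conforming protobuf decoder accepts both and yields the same message,
    // but the byte streams (and therefore their hashes) differ:
    fmt.Printf("hash(a) = %x\nhash(b) = %x\n", sha256.Sum256(a), sha256.Sum256(b))
}

Running it prints two different digests, even though any conforming decoder yields the same message from both inputs.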

@rauljordan
Contributor

This seems to have been resolved, as every team is going with Simple Serialize (SSZ) at the moment. Thoughts on closing this, @prestonvanloon?
