
Exploring Serialization via Protobuf and Others #150

Closed

prestonvanloon opened this issue Jun 5, 2018 · 17 comments
Labels
Enhancement New feature or request

Comments

@prestonvanloon
Member

prestonvanloon commented Jun 5, 2018

This issue exists to track progress on exploration of other serialization strategies for sharding and Ethereum. We'll likely want to move this into a new repository once work has been started.

Motivation

With RLP and other custom serialization mechanisms for Ethereum, it feels a bit like reinventing the wheel when a better-supported open source library may already exist.

The main motivation for RLP:

The alternative to RLP would have been using an existing algorithm such as protobuf or BSON; however, we prefer RLP because of (1) simplicity of implementation, and (2) guaranteed absolute byte-perfect consistency.

The question we want to answer is whether this problem is already solved by protocol buffers or other existing mechanisms.

Challenges with Hashing in Different Languages

Key/value maps in many languages don't have an explicit ordering, and floating point formats have many special cases, so the same data can produce different encodings and thus different hashes.
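To make this concrete, here's a small Go sketch (illustrative only, not from any client implementation) showing how hashing the result of naive map iteration can produce different digests for the same logical data, since Go deliberately randomizes map iteration order:

package main

import (
    "crypto/sha256"
    "fmt"
)

func main() {
    data := map[string]string{"a": "1", "b": "2", "c": "3"}
    for i := 0; i < 3; i++ {
        var buf []byte
        // Iteration order is unspecified and randomized by the Go runtime,
        // so the serialized bytes can differ between runs or iterations.
        for k, v := range data {
            buf = append(buf, k...)
            buf = append(buf, v...)
        }
        fmt.Printf("attempt %d: %x\n", i, sha256.Sum256(buf))
    }
}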

See RLP design rationale for more context.

Google Protobuf

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

How to test consistency across all languages?

One option is to write a gRPC service definition and implement the test in each popular language. The test would be easy to extend to another language, provided that it implements the service.

gRPC server for each language

Example service definition:

service SerializerTest {
  rpc TestHash(Block) returns (Hash) {}
}

message Block {
  Header header = 1;
  repeated Header uncles = 2;
  repeated Transaction transactions = 3;

  message Header {
    bytes parent_hash = 1;
    bytes uncles_hash = 2;
    ...
  }
}

message Transaction {
  uint64 nonce = 1;
  uint64 price = 2;
  ...
}

// Hash result 
message Hash {
  bytes hash = 1;
  Block block = 2;
}

The request proto carries an object resembling a block, and the service responds with the resulting hash. The test then compares this against the expected hash.

The tests can and should be populated with real Ethereum blocks that have been mined, along with their associated hashes. This provides solid evidence that the test cases are valid.
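For illustration, here is a minimal Go sketch of one such service. The generated package import path and the hashing scheme (Keccak-256 over the protobuf encoding) are assumptions for the sketch, not settled choices:

package main

import (
    "context"
    "log"
    "net"

    "github.com/golang/protobuf/proto"
    "golang.org/x/crypto/sha3"
    "google.golang.org/grpc"

    pb "example.com/serializertest/proto" // hypothetical: package generated from the proto above
)

// server implements the SerializerTest service from the proto definition.
type server struct{}

// TestHash re-encodes the block and hashes the resulting bytes.
// Keccak-256 over the protobuf encoding is illustrative only.
func (s *server) TestHash(ctx context.Context, block *pb.Block) (*pb.Hash, error) {
    enc, err := proto.Marshal(block)
    if err != nil {
        return nil, err
    }
    h := sha3.NewLegacyKeccak256()
    h.Write(enc)
    return &pb.Hash{Hash: h.Sum(nil), Block: block}, nil
}

func main() {
    lis, err := net.Listen("tcp", ":5000") // each container serves :5000 internally
    if err != nil {
        log.Fatal(err)
    }
    srv := grpc.NewServer()
    pb.RegisterSerializerTestServer(srv, &server{}) // generated registration helper
    log.Fatal(srv.Serve(lis))
}

Each language's server would implement the same TestHash RPC against its own protobuf runtime, which is exactly where encoding differences would surface.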

Why set up this infrastructure of gRPC services?

The main idea is that we can run these tests against each language with a language-agnostic client, in isolation.

Why gRPC?

Because of its low boilerplate, code generation, and structured payloads.

List of officially supported languages

  • C++
  • Java
  • Python
  • Go
  • Ruby
  • C#
  • Node.js
  • Android Java
  • Objective-C
  • PHP
  • Dart

List of 3rd party supported languages

There are probably many more languages...

How does the test client work?

The test client will act as a command line tool and most likely read from a series of config files.

We can imagine at least one config for the services to hit and another for the test cases.

The client will send the test proto to each of the services listed, in parallel. At the end of test execution, the client will print and/or write a report of pass/fail for test cases.

Example output of the client:

./run_tests

Running 5 test cases

Test 1
Java - PASS
Go - PASS
JavaScript - FAIL - Wanted hash ... got ...
Python - PASS

Test 2
...

Example services config:

services = [
   ["java", "127.0.0.1:5001"],
   ["go", "127.0.0.1:5002"],
   # ...
]
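A rough Go sketch of the client's fan-out logic (again assuming the hypothetical generated pb package from the server sketch; error handling kept minimal):

package main

import (
    "bytes"
    "context"
    "fmt"
    "sync"
    "time"

    "google.golang.org/grpc"

    pb "example.com/serializertest/proto" // hypothetical generated package
)

// runCase sends one test block to every service in parallel and prints
// a PASS/FAIL line per language, mirroring the example output above.
func runCase(block *pb.Block, want []byte, services map[string]string) {
    var wg sync.WaitGroup
    for lang, addr := range services {
        wg.Add(1)
        go func(lang, addr string) {
            defer wg.Done()
            conn, err := grpc.Dial(addr, grpc.WithInsecure())
            if err != nil {
                fmt.Printf("%s - FAIL - dial: %v\n", lang, err)
                return
            }
            defer conn.Close()
            ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
            defer cancel()
            res, err := pb.NewSerializerTestClient(conn).TestHash(ctx, block)
            switch {
            case err != nil:
                fmt.Printf("%s - FAIL - rpc: %v\n", lang, err)
            case !bytes.Equal(res.Hash, want):
                fmt.Printf("%s - FAIL - Wanted hash %x got %x\n", lang, want, res.Hash)
            default:
                fmt.Printf("%s - PASS\n", lang)
            }
        }(lang, addr)
    }
    wg.Wait()
}

func main() {
    services := map[string]string{"java": "127.0.0.1:5001", "go": "127.0.0.1:5002"}
    runCase(&pb.Block{}, nil, services) // placeholder block and expected hash
}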

Example test protos:

TODO: Real blocks with hashes in a proto-supported format.

What about service orchestration?

Maybe using docker compose?

It would be annoying to start many gRPC services locally without a single command.

What about benchmarks?

Benchmarks are important, but we already know RLP is not competitive on serialization performance.

We can add language specific benchmarks after we answer the question: will this work at all?

@prestonvanloon
Member Author

cc: @rawfalafel

@terencechain terencechain added the Enhancement New feature or request label Jun 5, 2018
@terencechain terencechain added this to To do in Documentation and Tooling via automation Jun 5, 2018
@rauljordan rauljordan changed the title from "Exploring serialization via protobuf and others" to "Exploring Serialization via Protobuf and Others" Jun 5, 2018
@rauljordan
Contributor

I am getting more sold on protobufs, especially with how they leave decoding up to each client.

What about service orchestration?
Maybe using docker compose?
It would be annoying to start many gRPC services locally without a single command.

Do you have examples of other projects using gRPC that do this via docker compose? Orchestration seems to be the only big question that arises from this proposal.

@prestonvanloon
Member Author

My thoughts on orchestration: we build the containers for each service then set up something to manage those containers.

I'm not too familiar with docker compose, but we need something that achieves the following:

  1. Builds image containers for each gRPC service
  2. Starts all of the containers on a port mapping (i.e. they could all serve on :5000 in their container)
  3. Runs the test suite or allows the test suite to be run locally

Here's an example of how I envision this workflow:

# Build all of the service container images
./build

# Start up the test service infrastructure
./start

# Then run the tests against those services
./run_tests

These shell scripts (or whatever) above would read a config file to outline port mappings for the test.
It might look like the "service config" that I mentioned in the original post.
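For reference, a docker-compose.yml along these lines could cover steps 1 and 2 (service names and build paths are hypothetical):

# Hypothetical docker-compose.yml: one container per gRPC serializer service,
# each serving :5000 internally and mapped to a unique host port.
version: "3"
services:
  serializer-java:
    build: ./java
    ports:
      - "5001:5000"
  serializer-go:
    build: ./go
    ports:
      - "5002:5000"
  # ...one entry per language

./start could then just be docker-compose up, and ./run_tests would read the same port mappings.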

@rawfalafel
Contributor

We discussed this on gitter but I'll recap here:

Protobuf was originally evaluated and passed on as a serialization mechanism because it doesn't provide byte-perfect consistency. With protobuf, the same object can be encoded multiple ways, and different encodings can be deserialized into the same object.

@prestonvanloon mentioned that this isn't an issue once a proposer commits to a chunkRoot, which is true, but this is still an issue after a transaction gets broadcast and before a proposer commits. Moreover, RLP is the de facto encoding scheme throughout the entire Ethereum protocol, and the assumption is that the encoding scheme provides byte-perfect consistency.

Hate to be the naysayer, especially because I'd also like to see a faster encoding scheme replace RLP, but I don't think we can use protobuf as-is.

@tfalencar

tfalencar commented Jun 8, 2018

Did you guys consider Fleece? It seems to have the properties needed, while being much simpler than protobuf.

https://github.com/couchbaselabs/fleece/blob/master/README.md

@prestonvanloon
Member Author

@tfalencar No we haven't, but a quick 15-second scan of the project turned this up:

Can I use it in $LANGUAGE? [where $LANGUAGE not in ("C++", "C")]
Not currently. ...

To be a reasonable replacement for RLP, it should preferably work for all modern languages.

With that said, nothing is out of the question here. If you have ideas or would like to explore Fleece and share your results, the community would be interested!

@terencechain
Member

It might be worth revisiting this now that sharding is breaking away from the main chain to the beacon chain; with a different consensus protocol, it's more feasible to switch from RLP to protobuf. The likely case is using protobuf to replace RLP for blob serialization.

@adamdrake

I've been exploring this topic as well, with the thought of using FlatBuffers over Protocol Buffers. The main benefit (IMO) is that FlatBuffers allows accessing the serialized data in a record without having to unpack it first. This has very large performance implications, of course.

https://google.github.io/flatbuffers/

Thoughts @prestonvanloon?

@mratsim

mratsim commented Aug 2, 2018

Another potential alternative: Cap'n Proto https://capnproto.org/ by the guy who implemented Protobuf at Google in the first place.

It seems to fit:

But doesn’t that mean the encoding is platform-specific?

NO! The encoding is defined byte-for-byte independent of any platform. However, it is designed to be efficiently manipulated on common modern CPUs. Data is arranged like a compiler would arrange a struct – with fixed widths, fixed offsets, and proper alignment. Variable-sized elements are embedded as pointers. Pointers are offset-based rather than absolute so that messages are position-independent. Integers use little-endian byte order because most CPUs are little-endian, and even big-endian CPUs usually have instructions for reading little-endian data.

@rauljordan
Contributor

Also, Cap'n Proto has tons of language support. Perhaps we can put together a small repo where we play around with these different schema-based serialization protocols across their different language implementations?

@prestonvanloon prestonvanloon moved this from To do to In progress in Documentation and Tooling Sep 18, 2018
@zjshen14

@rawfalafel, thanks for raising the concern about byte-perfect consistency in protobuf. I've been exploring a similar problem recently. Do you still recall the concrete example where the same object can be encoded multiple ways and different encodings can be deserialized into the same object, so I can evaluate whether it affects my case? Thanks!

@zjshen14

@prestonvanloon have you conducted the cross-language experiments? How was the consistency? Thanks!

@rawfalafel
Contributor

Yep, take an encoding of a protobuf object and reorder the fields. proto.Unmarshal should decode both encodings into the same object.

@rawfalafel
Contributor

BTW, we're exploring a new encoding described here: https://github.com/ethereum/beacon_chain/blob/master/ssz/README.md

@zjshen14

zjshen14 commented Sep 20, 2018

Yep, take an encoding of a protobuf object and reorder the fields. proto.Unmarshal should decode both encodings into the same object.

@rawfalafel, thanks for answering my question. I want to clarify whether you mean

message Foo {
  uint64 a = 1;
  bytes b = 2;
}

And

message Foo {
  bytes b = 1;
  uint64 a = 2;
}

will be marshaled into the same bytes? If my understanding is correct, I have some follow-up questions:

  • What's the use case for reordering? I assume the convention is that once we define a proto message, we don't change the field order.

  • And taking one step back, does the order even matter? In the case above, Foo hasn't meaningfully changed, so it should be okay for it to have the same serialized footprint in memory, no?

@rawfalafel
Contributor

Protobuf allows fields to be encoded in any order to facilitate merging two messages.

And taking one step back, does the order even matter? In the case above, Foo hasn't meaningfully changed, so it should be okay for it to have the same serialized footprint in memory, no?

Honest nodes should never encode in a different order. The problem, though, is when a malicious user purposefully encodes in the wrong order. In this scenario, the same message can have multiple encodings, and therefore multiple hashes, which breaks consensus.
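To make that concrete, here's a self-contained Go sketch; the wire bytes are hand-rolled from the protobuf encoding rules rather than taken from any client:

package main

import (
    "crypto/sha256"
    "fmt"
)

func main() {
    // Two hand-rolled protobuf wire encodings of the same logical message,
    // Transaction{nonce: 1, price: 2}, using the field numbers from the
    // proto above (tag = field_number<<3 | wire_type; wire type 0 = varint).
    a := []byte{0x08, 0x01, 0x10, 0x02} // field 1, then field 2
    b := []byte{0x10, 0x02, 0x08, 0x01} // field 2, then field 1
    // A conforming protobuf decoder accepts both and yields the same message,
    // but the byte streams (and therefore their hashes) differ:
    fmt.Printf("hash(a) = %x\nhash(b) = %x\n", sha256.Sum256(a), sha256.Sum256(b))
}

Running it prints two different digests, even though any conforming decoder yields the same message from both inputs.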

@rauljordan
Contributor

This seems to have been resolved, as every team is going with Simple Serialize (SSZ) at the moment. Thoughts on closing this, @prestonvanloon?
