Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tendermint data structures and serialization #654

Open
1 of 4 tasks
thanethomson opened this issue Oct 28, 2020 · 17 comments
Open
1 of 4 tasks

Tendermint data structures and serialization #654

thanethomson opened this issue Oct 28, 2020 · 17 comments
Labels
serialization technical debt Issues that are important, but not urgent, that should eventually receive attention

Comments

@thanethomson
Copy link
Contributor

thanethomson commented Oct 28, 2020

After working through #639, we're pretty convinced that it's best to maintain 2 categories of data structures:

  1. Domain types, where we should not be able to construct incorrect data structures.
  2. Serialization types, which facilitate serialization/deserialization to/from other formats such as Protobuf (for future wire-level communication and signature generation) and JSON (for use in the RPC). We could consider these to be "unvalidated" data structures, where their domain type counterparts would be the "validated" versions of the data.

Previously we've mixed our domain types and serialization types when it comes to JSON, and we'd like to split them up. A rough plan to implement this would be the following:

  • Domain types, with constructors that do not allow for incorrect construction. These should continue to live in the tendermint crate.
  • Serialization types, with TryFrom converters to corresponding domain types. These should probably be in a single crate together (we could rename the proto crate at some point to something more generic, and feature-guard each set of data structures).
@greg-szabo greg-szabo added this to the v0.17.0 milestone Nov 6, 2020
@greg-szabo
Copy link
Member

I was able to generate JSON structs (or Serialization types, as the issue calls it) from the OpenAPI definition in Tendermint Go. We need to decide how/where we want to store them.

Currently, I've created a json folder for the compiled structs and named the crate tendermint-json which is in line with the protobuf folder structure. But I'm happy to name them differently.

The OpenAPI generator creates a full crate for the json structs (including a Cargo.toml and src folder) so merging the JSON structs with the protobuf structs would be counterintuitive.

The generation is also done differently. Instead of a json-compiler crate, the generation can be done with a one-liner docker call. I propose we keep that and not write additional Rust code for JSON generation.

@thanethomson
Copy link
Contributor Author

I was able to generate JSON structs (or Serialization types, as the issue calls it) from the OpenAPI definition in Tendermint Go. We need to decide how/where we want to store them.

Nice! 👍

Currently, I've created a json folder for the compiled structs and named the crate tendermint-json which is in line with the protobuf folder structure. But I'm happy to name them differently.

Wouldn't it be better, in the long run, to have a single crate (e.g. tendermint-serialization or tendermint-wire) that contains both the Protobuf and serde-compatible structs and put each set of structures behind a feature flag?

@greg-szabo
Copy link
Member

greg-szabo commented Nov 9, 2020

That's what I thought originally, but the openapi compiler is quite opinionated: it generates a full directory structure of a complete crate.
Creating a full tendermint-wire crate would look something like this:

  1. Start from no directory
  2. Run docker openapi-generator ... -o wire command to generate the JSON structs and the tendermint-wire crate into the wire folder.
  3. Run proto-compiler to generate the protobuf types into the wire/src/prost folder (similar to what it does now).
  4. Manually add the new files domaintype.rs and error.rs into wire/src.
  5. Manually edit lib.rs to invoke the protobuf-generated files. This file is quite complex already in the proto crate and unfortunately not generated automatically.

The last two steps make a normally automated process somewhat manual/cumbersome. We could try coming up with an automated method (especially for the lib.rs generation) but I feel it's weird.

If you like this idea, I'll look into generating the lib.rs for protobuf files automatically.

The real problem is not automation but that all files are overwritten by openapi during generation, because it populates the whole crate folder. In the prost generation we only overwrite the "prost" folder within the source code, which is designated for this automated generation.

Edit: also makes it impossible to update the JSON structs without recreating the protobuf structs. Not unusable but definitely inconvenient.

@thanethomson thanethomson added the technical debt Issues that are important, but not urgent, that should eventually receive attention label Dec 2, 2020
@thanethomson thanethomson removed this from the v0.17.0 milestone Dec 2, 2020
@greg-szabo
Copy link
Member

I've created a diagram from all the public structs in tendermint-rs that are serializable: https://imgur.com/a/YhWa7Hb

Current state

We have three different types of serialization implemented for our structs.

green: implicit serialization using annotations

This is the basic thinking in serde for serialization: annotate your struct with a derive and possibly a few modifiers (like renaming some fields or skipping a few for serialization) and let the serialization take place recursively: all fields are serialized through their own serialization methods.

The problem is that this way no validation is done for the fields during deserialization.

yellow: explicit serialization using custom derive

This is the advanced thinking in serde for serialization: custom implementation for the Serialize and Deserialize traits. The benefit is that validation can be implemented for the struct but in a lot of cases, the code is copied over from other structs implementing similar serialization/deserialization.

The problem is that the serialization is "hidden" together with the struct: it's not clear if it's implemented or if the serialization code could be reused for other structs. Also, this method is explicit: if a field in the struct has its own serialization/deserialization implemented, this method will ignore it and use the default serialization/deserialization instead. (For example a hexstring->Vec deserialization might serialize into a bytes array, instead of a string.)

blue: explicit serialization using raw types / JSON types

This is the advanced thinking we would like to get to in this issue. The struct is serialized/deserialized using TryFrom/Into traits, through a more open JSON struct. This has the benefit, that it requires minimal annotations on the domain struct and it does validation of the incoming data during conversion. Current implementation is using the protobuf structs defined in the tendermint-proto crate as underlying JSON structs.

Desired end-state

  • Separate JSON structs crate (or other separation) implemented
  • All structs are "blue": they have the TryFrom/Into implemented for validation and the serde(tryfrom/into) annotations.

Ongoing work

  • The current implementation mostly works. We have seen a few issues surfacing that we were able to resolve in one-off issues. A complete overhaul would yield only minor benefits in the short term.
  • We can use the implemented serializers modules in tendermint_proto::serializers and tendermint::serializers to gather the different serialization methods and keep them reusable. The goal should be to implement all new serialization using these modules.
  • We can decide on separating the tendermint_proto module's serialization related work into a separate crate later.
  • The most important work would be to get rid of "yellow" structs: custom, not reusable implementations. The serializers modules can help with that a lot.

I propose we take these last few items as take-aways and make this issue an "ongoing work". Possibly we can close it and refer back to it any time we implement new serialization or touch current serialization.

@tony-iqlusion
Copy link
Collaborator

tony-iqlusion commented May 31, 2021

We have a request to add JSON support to cosmos-sdk-rs: cosmos/cosmos-rust#83

Any advice you can give on the best way to do that in a greenfield capacity? Is there anything we can/should reuse from tendermint-rs?

@thanethomson
Copy link
Contributor Author

Any advice you can give on the best way to do that in a greenfield capacity? Is there anything we can/should reuse from tendermint-rs?

That's a tough one to answer. Our experience with Protobuf/JSON serialization has involved far more labour than we initially anticipated.

The core challenges with serialization are best understood when thinking about the problem from different users' perspectives:

  1. Developers need an easy way to debug wire-level data structures. JSON is the preferred format for this.
  2. Developers need ways of storing static representations of data structures for testing purposes. Again, JSON is the preferred format for this.
  3. Implementers of wire-level protocols need a reliable, common specification for data structures and RPC endpoints. Ideally a shared one. Protobuf seems to be the best option for this.
  4. Users of our software want performant software that keeps their costs down, and Protobuf provides greater performance and has lower bandwidth requirements than JSON.

Since we want to fulfill all these needs, we ideally want to use the best of both Protobuf and JSON. It's unfortunate (and strange) that Protobuf doesn't meet all of these needs.

Another issue is that, in tendermint-rs, we don't have simple 1:1 mappings of Protobuf <-> JSON data structures. We have all kinds of custom serializations for different fields. Our proto-compiler is a way of reducing the amount of data structure duplication across Protobuf/JSON.

Ultimately such a situation lends itself to maintaining two distinct sets of data structures just for serialization (one set for Protobuf and another for JSON), as well as mappings between the two types. Or at least mappings from each set to our domain types and then back again. With domain types, we would thus end up with 3 sets of data structures we need to maintain.

It's a lot of code (and a lot of slog work) if you have lots of data structures, but it's probably the best route to meet all of our users' needs.

@mzabaluev
Copy link
Contributor

mzabaluev commented May 6, 2022

The same problem exists in ibc-rs: for a lot of structures derived from protobuf descriptions, serde is customized to produce human-friendly JSON. It would be tedious to implement serde by hand, or maintain a copy of every struct and define conversions, just to get serde and other trait implementations that are not supported out of the box in the types generated by prost-build.

I see two ways to ultimately resolve this.

The pbjson-types approach: single crate, all(?) included

Each proto module gets a "canonical" Rust crate with all necessary additional trait implementations derived or otherwise auto-generated. Domain types, in many cases, can embed the generated struct types to easily derive the implementations.

Advantages:

  • Everyone gets a single set of struct types representing protobuf messages to work with at the basic level. No conversions required between parallel sets of types.

Disadvantages:

  • This does not look logistically feasible: even with serde alone, there may not be a single serialization schema that every application agrees on. When an application wants any other derives that are not currently generated (e.g. schemars, or just throwing in a derive(Eq)), they have to convince the maitainers of the "canonical" crate to add those.
  • Rich crates, rich ground for fragmentation: already there are pbjson-types, prost-wkt, and possibly others on the most basic level of Google WKTs, not counting the homespun variety thereof that this project currently uses.

The prost-types approach: slim prost crate, possible sidecar generators

The "canonical" line of generated crates (in our stack it's tendermint-proto, cosmos-sdk-proto, etc.) is solely dedicated to protobuf and reuse prost-types (and possibly other ecosystem crates) with vanilla prost-build code generation. The applications are free to generate additional proto-derived types as necessary for their specific needs using custom code generators, and define, preferably also generated, conversions between these additional types and their canonical counterparts. protoc-gen-prost demonstrates feasibility of sidecar code generation plugins and their use in buf.

Advantages:

  • There exists a single Rust crate corresponding to each proto module that's dedicated to use in the prost/tonic stack. No disagreements as to which additional functionality this crate should support, the answer is "um, none?"

Disadvantages:

  • The sidecar types add to code complexity and decrease convenience. As there is a need to define idiomatic/domain types to actually work with protobuf message data, this can be mitigated by making the sidecar types private and wrapping them in the domain types that need to be substantially hand-written anyway.

@xla
Copy link
Contributor

xla commented May 16, 2022

@mzabaluev This write-up is very helpful. On the side of the rich approach another potential outcome could be an extracted and agreed upon "compiler" stack which would determine the richness of extra functionality and types and could be shared between projects.

@mzabaluev
Copy link
Contributor

@xla If I understand this right, it potentially results in vertical stacks of incompatible, partially redundant "proto+" types, as exemplified by prost-wkt vs pbjson-types. Maybe we can at least agree on one set of richly generated types for Tendermint/Cosmos-SDK/IBC and use that throughout, but then the maintenance responsibilities on the "proto+" crates should be spelled out and include all the rich functionality required by the dependent projects.

In our case that would mean, for example, that serialization generated in cosmos-rust has to correspond to the JSON format choices made by ibc-rs.

@tony-iqlusion
Copy link
Collaborator

Maybe we can at least agree on one set of richly generated types for Tendermint/Cosmos-SDK/IBC and use that throughout

Radical idea: have a single repo just for the *-proto crates where tendermint-proto, cosmos-sdk-proto, and ibc-proto can all live, so that changes to conventions or WKT-related dependencies can be made in a cross-cutting and consistent manner.

In our case that would mean, for example, that serialization generated in cosmos-rust has to correspond to the JSON format choices made by ibc-rs.

We have no JSON support at all as yet, so we're totally open to supporting whatever conventions ibc-rs wants to use.

@xla
Copy link
Contributor

xla commented May 17, 2022

Radical idea: have a single repo just for the *-proto crates where tendermint-proto, cosmos-sdk-proto, and ibc-proto can all live, so that changes to conventions or WKT-related dependencies can be made in a cross-cutting and consistent manner.

This approach is sensible and is in line with my angle of a shared compiler stack.

@mzabaluev Following through with the consolidation would result in consistent, non-redundant, compatible types and would address your concerns.

@thanethomson Keen to hear your thoughts on this?

@mzabaluev
Copy link
Contributor

We have no JSON support at all as yet, so we're totally open to supporting whatever conventions ibc-rs wants to use.

The format choices were made to work with files used by the relayer. They may be somewhat ad-hoc and not necessarily even follow a consistent convention; I can't tell at present if this serialization is universally usable (which is, to say, a common problem with serde).

More importantly, outsourcing this format away from ibc-rs would complicate the maintenance loop for Hermes. Consider the situation when we decide to make a breaking change in the schema. Currently it only requires a version bump for the relayer and its libraries, easy to manage in ibc-rs. If this schema is maintained in cosmos-rust or the future all-protos repository, we'd have to wait for a new major version there, which looks like an unnecessary entanglement; after all, it's just the relayer's own file format.

@mzabaluev
Copy link
Contributor

Radical idea: have a single repo just for the *-proto crates where tendermint-proto, cosmos-sdk-proto, and ibc-proto can all live, so that changes to conventions or WKT-related dependencies can be made in a cross-cutting and consistent manner.

Versioning this will be... interesting. Would there be a synchronized release cycle for changes in Tendermint, Cosmos SDK, and IBC specifications?

Somewhat less radical, I think, would be to have a repo with a buf.gen.yaml and possible compiler plugins, to reuse across the Rust projects generating their *-proto crates in their disparate workspaces. Could this be used to realize the "compiler" idea brought up by @xla?

@thanethomson
Copy link
Contributor Author

thanethomson commented May 17, 2022

See also #1128 for an additional complication 🙂

I've been thinking about this for a while now and it's still not totally obvious what the best possible solution here is. It's probably a good idea to break the problem down into sub-problems.

Problems

Problem 1: Protobuf file (.proto) dependency management

This is something that higher-level libraries, like cosmos-rust and ibc-rs encounter. This is also something that buf aims to solve.

According to @mzabaluev, buf unfortunately doesn't work for our purposes.

Problem 2: Rust type generation from .proto files

Do we do this on-the-fly, or do we check the generated Rust code into version control? The drawback of on-the-fly code generation is that IDE support for this is patchy and degrades developer experience.

Problem 3: JSON serialization

As per the original comment in this issue, because Tendermint Core doesn't offer a simple, consistent mapping of Protobuf fields to JSON fields in the RPC, and because there are some types that don't overlap between the two sets of types, we need to explicitly cater for JSON serialization in our Rust code.

In the Rust case, do we apply the serde annotations to the Protobuf stubs, or to our domain types (creates problems with validation), or do we create a separate hand-rolled set of types specifically for handling JSON serialization?

Problem 4: Validation

We introduced the idea of "domain types" in order to facilitate validation of serialization types. This requires hand-rolling conversions between the serialization and domain types.

Problem 5: Compatibility with multiple versions of Tendermint Core

As per #1128, ibc-rs will eventually need to support interaction with multiple versions of Tendermint Core to facilitate IBC over heterogeneous networks. There's no way around this, as we can't always ensure the entire ecosystem upgrades to a new version of Tendermint Core simultaneously.

We currently have two possibilities in terms of versioning tendermint-rs:

  1. Release multiple sets of crates, each set catering to a different version of Tendermint Core (e.g. the tendermint crate would be superseded by tendermint-v034 and tendermint-v035 crates, and similarly for all the other crates).
  2. Release a single, semantically versioned set of crates with support for different versions of Tendermint Core embedded within modules (e.g. tendermint::v034 and tendermint::v035, with common functionality within a tendermint::common module).

Either of these options requires having to support multiple (as many as the versions of Tendermint Core we support) different sets of .proto files, proto-generated Rust stubs, JSON serialization formats, domain types and conversions between all of these types.

Personally, I'd prefer option 2 because it provides for the case where we have common domain types across versions of Tendermint Core that potentially have different serialization types and substantially increases the possibility of code reuse.

Possible solutions

Perhaps I'm missing something important here in the discussion, so please let me know if I am. The major possible solutions I'm seeing right now are as follows.

Solution 1: Canonical set of -proto crates in a single repo

As mentioned in #654 (comment). Does not explicitly address the JSON serialization problem (although it could) and would be tricky to implement with option 2 mentioned in problem 5.

This solution also potentially imposes additional coordination overhead among the projects.

Solution 2: Do away with -proto crates altogether

Not a solution. I originally thought it was a solution, but it doesn't solve the .proto dependency management problem (problem 1).

@mzabaluev
Copy link
Contributor

mzabaluev commented May 18, 2022

According to @mzabaluev, buf unfortunately doesn't work for our purposes.

@thanethomson You were probably referring to my comment in an ibc-rs status update meeting yesterday. That only reflected on the current difficulties I encountered while working on informalsystems/hermes#2213. If each of the projects whose .proto files we require for relayer's own code generation published those files as BSR modules, our work could be made much easier. I have filed an issue about that: cosmos/ibc-go#1345

@mzabaluev
Copy link
Contributor

Release multiple sets of crates, each set catering to a different version of Tendermint Core (e.g. the tendermint crate would be superseded by tendermint-v034 and tendermint-v035 crates, and similarly for all the other crates).

I think it should be possible to link two different versions of the tendermint crate using aliases or intermediary crates. However, there would be a lot of code duplication in the relayer.

Release a single, semantically versioned set of crates with support for different versions of Tendermint Core embedded within modules (e.g. tendermint::v034 and tendermint::v035, with common functionality within a tendermint::common module).

I support this. Perhaps for better ergonomics towards "ordinary" Tendermint developers, you don't need a sub-path for the latest supported version and common code, only the modules supporting older versions should be named (and feature-gated) v034 and so on.

For code generated by prost-build though, there is no way around generating a separate set of types from the .proto sources of every distinct Tendermint version.

@tony-iqlusion
Copy link
Collaborator

Release a single, semantically versioned set of crates with support for different versions of Tendermint Core embedded within modules (e.g. tendermint::v034 and tendermint::v035, with common functionality within a tendermint::common module).

+1 for this, especially if the Tendermint version(s) in use were gated by crate features.

It seems like this would also make it possible to remove the v1beta1 from the respective modules if it could be assumed there was only one such namespace per Tendermint version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
serialization technical debt Issues that are important, but not urgent, that should eventually receive attention
Projects
None yet
Development

No branches or pull requests

5 participants