better API design for schema based formats #4

Closed
hohwille opened this issue Nov 5, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@hohwille (Member) commented Nov 5, 2023

With mmm-marshall we support a variety of structured data formats. While JSON and YAML are fine and do not require external information, it is trickier for formats such as gRPC/protobuf or Avro. Those rely on a given schema, e.g. to map property names to numeric values and vice versa.
To support gRPC we started changing our APIs in various places, but this approach seems odd and flawed. Especially ugly is that gRPC requires the size of an object to be known before the object is written.
This way, every new format of this kind that we want to support might force unforeseeable breaking API changes.

Instead, we should redesign our API now so that such changes are prevented in the future. We can then also remove the already introduced gRPC changes in mmm-property and mmm-bean that seem misplaced.

The general idea is that we need to receive additional information about the object (e.g. the bean) that is currently being read or written.
Implementations, e.g. for gRPC, can use this information to find the according ID mapping (e.g. by accessing cached and lazily loaded *.proto files).
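
To make this concrete, here is a purely illustrative sketch of what such a lookup could look like. The interface and method names are assumptions for illustration and not necessarily the actual mmm-marshall API:

```java
/**
 * Illustrative sketch only: a bidirectional mapping between property names
 * and numeric IDs, as needed by schema-based formats like gRPC/protobuf.
 */
public interface StructuredIdMapping {

  /** @return the numeric ID for the given property name (used when writing). */
  int id(String propertyName);

  /** @return the property name for the given numeric ID (used when reading). */
  String name(int id);
}

/**
 * Illustrative sketch only: resolves the ID mapping for the object that is
 * currently marshalled or unmarshalled, e.g. from cached, lazily loaded
 * *.proto files.
 */
public interface StructuredIdMappingProvider {

  StructuredIdMapping getMapping(Object object);
}
```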

@hohwille (Member Author) commented

After thinking this through, there are two separate aspects:

  1. mapping of IDs to names and vice versa
  2. computing the size of an object upfront

Actually, 1. is rather complex. If you do not want to go for a one-size-fits-all approach (e.g. supporting this only via generated code, or defining a strict way to map schemas from *.proto files read from the classpath), things can get tricky.

Should we assign IDs to properties? Our new bean approach with interfaces allows multiple inheritance, so how would you ensure that IDs stay unique across all potential descendants? On the other hand, if every serializable object has to provide its own custom mapping, everything is flexible, but a lot of manual work would be required to implement this for every bean. In the end we would have new boilerplate code, while the entire point of our beans is to prevent boilerplate code.

Also, if we have to provide the object to be marshalled or unmarshalled upfront, and that object itself is the key from which the ID mapping is derived, how would you implement a custom marshaller for the same object that produces a different payload? For example, we reuse properties in criteria that can currently also be (un)marshalled, but there the payload does not represent the value of the property but a reference to it, so rather its name. In that case we would have to pass a different object (a wrapper, or the marshaller itself).

I came to the idea of making a trade-off: what if we offer the ability to build a unique ID mapping for (technically an array of) all available properties of all beans in the app providing a service, and allow a service consumer to retrieve that list during bootstrapping (the result can be hashed/versioned and cached to minimize overhead)? Then we have a global ID mapping for that service and can use it for any involved bean/object. To optimize this even a little further, we can ship with a predefined list of common property names (id, name, key, value, type, argument, parent, child, root, path, file, revision, version, state, data, payload, url, uri, street, country, city, zip, mail, etc.) that do not have to be negotiated and transferred from service provider to consumer. This also ensures that such common properties always get IDs that fit into one byte of payload. The trade-off is that custom properties will very likely require an extra byte, as a large data model can quickly grow beyond 2047 IDs. However, compared to actually transmitting names as strings like "CustomerContractNumber", two bytes are still a significant improvement. There is also still the option to implement a custom mapping for an object where the generic approach is not suitable (e.g. you want to fulfill the contract of a given *.proto file, or you want to optimize every byte and ensure all IDs are as small as possible).

Further thinking in this direction could lead to a suitable trade-off between simple, generic usage and advanced usage with more effort but full control.
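
The one-byte/2047 numbers above follow from protobuf-style varint tag encoding; the following arithmetic is my illustration of that assumption (tag = `(id << 3) | wireType`, each varint byte carrying 7 payload bits), not code from mmm-marshall:

```java
// Assumes a protobuf-style tag: tag = (id << 3) | wireType, encoded as a
// varint where every byte carries 7 payload bits.
static int tagSizeInBytes(int id) {
  int tag = id << 3; // the 3 low bits are reserved for the wire type
  int bits = 32 - Integer.numberOfLeadingZeros(tag | 1);
  return (bits + 6) / 7; // ceil(bits / 7) varint bytes
}

// tagSizeInBytes(15)   == 1  -> common names (id, name, key, ...) fit in one byte
// tagSizeInBytes(2047) == 2  -> most custom properties need two bytes
// tagSizeInBytes(2048) == 3  -> beyond 2047 IDs an extra byte is required
```

Even at two bytes per tag, this clearly beats transmitting a 22-character name like "CustomerContractNumber" as a string.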

For 2. I think it is very simple: if you want to fully go for gRPC, you would use Google's approach and not our marshalling. Google relies on code generation and this way ensures the code is efficient and consistent in computing the correct size before the actual object is written. Also, this is an asymmetric problem that only occurs when marshalling data but not during unmarshalling.
My personal opinion is that the gRPC design is flawed by this stupid decision to include the size of every object upfront as extra information. This fully makes sense for strings, to support an efficient encoding and to know where a string terminates, but it does not make sense for arrays and objects. Skipping them can still be done easily without knowing the size, as proven by our proprietary custom mRPC implementation.
The major issue we had with computing the size on the fly is that we currently create a buffer (ByteArrayOutputStream) for every array or object, and only after it is terminated do we know the size and can write the real data to the actual stream. With nested arrays and objects this causes huge overhead. However, if we create only a single buffer and additionally store a "stack" of the start positions of arrays/objects within that buffer, we can easily optimize this and take away the burden of implementing size computation in beans and whatever other kinds of objects are to be marshalled.
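
A minimal sketch of that single-buffer idea, assuming protobuf-style varint length prefixes (all names here are illustrative, not the actual mmm-marshall implementation). One subtlety the sketch has to handle: an enclosing object's size must also cover the length prefixes that will be inserted for its nested children, which the parent frame accounts for whenever a child closes:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class LengthPrefixingBuffer {

  private byte[] buffer = new byte[64];
  private int position;

  /** Open frames: { startOffset, prefixBytesOfClosedChildren }. */
  private final Deque<int[]> open = new ArrayDeque<>();

  /** Closed frames as { startOffset, payloadSize }, recorded child-before-parent. */
  private final List<int[]> closed = new ArrayList<>();

  public void write(int b) {
    if (position == buffer.length) {
      buffer = Arrays.copyOf(buffer, buffer.length * 2);
    }
    buffer[position++] = (byte) b;
  }

  /** Called when a nested object or array starts. */
  public void start() {
    open.push(new int[] { position, 0 });
  }

  /** Called when the current object or array ends - now its size is known. */
  public void end() {
    int[] frame = open.pop();
    // raw bytes since start() plus the prefix bytes of already closed children
    int payload = (position - frame[0]) + frame[1];
    closed.add(new int[] { frame[0], payload });
    if (!open.isEmpty()) {
      // the enclosing frame must also count this child's own length prefix
      open.peek()[1] += frame[1] + varintSize(payload);
    }
  }

  /** Copies the buffer to the real stream, inserting the varint length prefixes. */
  public void flush(OutputStream out) throws IOException {
    for (int offset = 0; offset <= position; offset++) {
      // frames recorded later are the outer ones, so iterate in reverse to
      // emit the outermost prefix first at identical offsets (a simple
      // O(n*m) scan to keep the sketch short)
      for (int i = closed.size() - 1; i >= 0; i--) {
        if (closed.get(i)[0] == offset) {
          writeVarint(out, closed.get(i)[1]);
        }
      }
      if (offset < position) {
        out.write(buffer[offset]);
      }
    }
  }

  private static int varintSize(int value) {
    int bits = 32 - Integer.numberOfLeadingZeros(value | 1);
    return (bits + 6) / 7;
  }

  private static void writeVarint(OutputStream out, int value) throws IOException {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }
}
```

This way only one buffer exists regardless of nesting depth, and no bean or property has to implement any size computation itself.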
Here are the links showing how gRPC support currently impacts our beans and properties:
https://github.com/m-m-m/bean/blob/83f76fa4fe7995a7ecf2d0e7292e088e3f211544/core/src/main/java/io/github/mmm/bean/AbstractBean.java#L336-L352
https://github.com/m-m-m/bean/blob/83f76fa4fe7995a7ecf2d0e7292e088e3f211544/core/src/main/java/io/github/mmm/bean/property/BeanProperty.java#L114-L120
We think this is flawed and ugly and want to get rid of it.
With the optimized approach of computing sizes on the fly, we can still support gRPC for those who think it is great, without flawing our design for a feature that is not our main focus.

@hohwille (Member Author) commented

All done and implemented now. (For the record: the previous comment was written by me over a week ago but seems to have gotten stuck in my browser, so it only went out today after this story was completed.)
Also for the record: the central collection of all properties and a handshake protocol to exchange this list are not yet implemented, but this can be done with a custom implementation of StructuredIdMappingProvider.
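
A purely hypothetical sketch of what such a negotiated mapping could look like (the class, its methods, and the ID layout are my inventions for illustration; only StructuredIdMappingProvider is the real extension point mentioned above):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class NegotiatedIdMapping {

  /** Assumes IDs 1..15 are reserved for the predefined common property names. */
  private static final int FIRST_CUSTOM_ID = 16;

  private final Map<String, Integer> nameToId = new HashMap<>();
  private final List<String> idToName = new ArrayList<>();

  public NegotiatedIdMapping(Collection<String> allPropertyNames) {
    // sorting makes the ID assignment deterministic on provider and consumer
    for (String name : new TreeSet<>(allPropertyNames)) {
      nameToId.put(name, FIRST_CUSTOM_ID + idToName.size());
      idToName.add(name);
    }
  }

  /** Hash over the negotiated list so a consumer can cache and verify it. */
  public int version() {
    return idToName.hashCode();
  }

  public int id(String name) {
    return nameToId.get(name);
  }

  public String name(int id) {
    return idToName.get(id - FIRST_CUSTOM_ID);
  }
}
```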
