Pluggable serialization #151
In Crux one can already configure Kafka to control what goes on the wire, since one can provide an extra Kafka properties file as a Crux option. This is also meant to enable various authentication and other Kafka options (including serialisation) which we cannot foresee. It's not necessarily elegant, but doable. This obviously needs to be configured consistently throughout all participating systems; there's no central authority.

By default, Crux sends data over Kafka using Nippy in the messages, and has Kafka apply Snappy compression on top. Kafka supports end-to-end compression, see: https://kafka.apache.org/documentation/#design_compression The idea (or hope) is that these things together should be a "good enough" default for most use cases. This is partly to avoid the confusion and the chance of configuring Crux in an inconsistent way, as mentioned above.

A case can also be made for the opposite: that the Kafka topics should contain raw edn (or, even more generally, JSON), as this makes the data easier to consume and deal with without relying on Crux or specific libraries. Kafka could still compress this, and this can also already be configured. We should add a default edn implementation of the Kafka serdes to support this more easily.

It's worth pointing out that Crux also stores the documents locally in the KV store in Nippy format, and this is currently not configurable. The content hashes Crux uses for the documents are also based on this format, so Nippy indirectly touches many things.

Just a few reflections and thoughts; not sure if this advances or solves the issue directly. Keeping it open for further discussion.
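To illustrate why the serialization format "indirectly touches many things": if content hashes are computed over the serialized bytes, the same document hashes differently under different codecs. A stand-alone sketch in Python, with JSON and pickle standing in for edn and Nippy; the hashing scheme here is illustrative, not Crux's actual one:

```python
import hashlib
import json
import pickle

doc = {"crux.db/id": "person-1", "name": "Ada"}

# Hash over the serialized bytes -- the "content hash" depends on the codec.
json_hash = hashlib.sha1(json.dumps(doc, sort_keys=True).encode()).hexdigest()
pickle_hash = hashlib.sha1(pickle.dumps(doc)).hexdigest()

# The two codecs produce different bytes, hence different hashes,
# so switching codecs would change every document's identity.
print(json_hash == pickle_hash)  # False
```

This is why making the document codec pluggable is more invasive than it first appears: document identities are downstream of the byte format.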
…t's easy to configure it without needing other dependencies. #151.
Another reflection: something like protocol buffers could be added to Crux's transaction topic, but not easily to the document topic, as the messages there are simply maps without a schema. A user could of course conceivably have a very strict set of documents, with schemas, that they allow to be transacted into Crux, and have protocol buffers supporting that. But it points to the fact that the transaction topic (which is much smaller) and the document topic would potentially have to be treated differently if one goes down this path.
Maybe something on top of Crux would be more sensible/usable, working kind of like: if keys x, y, z are in the map, also store it in Kafka topic q using serializer f. This way you have some categories of data in sync, and easily usable both from Crux and from anything else using Kafka?
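The routing idea above could be sketched as a small layer in front of the Kafka producer. This is just a sketch in Python; the rule table, topic names, and serializers are all hypothetical:

```python
import json

# Hypothetical routing rules: if a document contains all of these keys,
# also publish it to the given topic using the given serializer f.
ROUTES = [
    ({"x", "y", "z"}, "topic-q", lambda doc: json.dumps(doc, sort_keys=True).encode()),
]


def topics_for(doc, default_topic="crux-docs",
               default_ser=lambda d: json.dumps(d, sort_keys=True).encode()):
    """Return (topic, payload) pairs: the primary topic plus any matching mirrors."""
    out = [(default_topic, default_ser(doc))]
    for keys, topic, ser in ROUTES:
        if keys <= doc.keys():  # all required keys present in the map
            out.append((topic, ser(doc)))
    return out


# A doc containing x, y, z is mirrored to topic-q alongside the default topic.
doc = {"x": 1, "y": 2, "z": 3}
print([topic for topic, _ in topics_for(doc)])  # ['crux-docs', 'topic-q']
```

Each `(topic, payload)` pair would then be handed to an ordinary Kafka producer, keeping the mirrored categories of data consumable without Crux.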
Just my 2c on some stuff. First, I think protobufs are known to be a lot slower than alternative binary serialization protocols like flatbuffers. Second, I don't know how useful/feasible pluggable serialization would be; I feel it should be coupled with the txlog backend choice to be useful as a general pluggable part (serialization for crux-jdbc should be different than for crux-kafka or crux-rocksdb, for instance). Disclaimer: I am not familiar at all with Crux, but those came to my mind when reading here. EDIT: flatbuffers benchmark (on C++): https://google.github.io/flatbuffers/flatbuffers_benchmarks.html
[copying my question here; I expand a bit, so I will not copy your answer here]
Do you think it would be worthwhile allowing pluggable serialization? Nippy is good enough sometimes, but if users want to get their hands dirty they can get map key caching, varints, and other things like that with other formats. It might also be important for interop with other languages (at a low level).
I am mostly thinking about protobuf usage; as much as I dislike it, it can be quite efficient and it's widely used. I am not advocating using that, but I think allowing control over which serialization format is used is very important, even at a low level, since it can have a real impact on storage/bandwidth costs.
Then it's the responsibility of the user to know/use the correct, potentially custom "codec" against a cluster.
It could be a pluggable bit, behind protocols like other parts of crux.
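A pluggable codec "behind protocols" reduces, in spirit, to a byte round-trip contract. A minimal sketch in Python rather than Clojure, with JSON standing in for an edn codec; the `Codec` interface and class names are hypothetical, not Crux's actual protocols:

```python
import json
from abc import ABC, abstractmethod


class Codec(ABC):
    """Hypothetical pluggable codec: the only contract is a byte round-trip."""

    @abstractmethod
    def encode(self, value) -> bytes: ...

    @abstractmethod
    def decode(self, data: bytes): ...


class JsonCodec(Codec):
    # Stands in for an edn codec here; Nippy or protobuf codecs would plug in
    # the same way, as long as decode(encode(v)) == v for transacted values.
    def encode(self, value) -> bytes:
        return json.dumps(value, sort_keys=True).encode("utf-8")

    def decode(self, data: bytes):
        return json.loads(data.decode("utf-8"))


codec: Codec = JsonCodec()
doc = {"crux.db/id": "person-1", "name": "Ada"}
assert codec.decode(codec.encode(doc)) == doc
```

In Clojure this would naturally be a `defprotocol` with `encode`/`decode` implementations, selected by the node's topology options like other pluggable parts of Crux.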