Replies: 1 comment 2 replies
-
Yes, it is a good point that once you're invested in an ecosystem it would be great to use it even for cases it is not particularly good at, simply to leverage your investment across code-bases. I was originally against the idea of the Object API, for example, simply because it is exactly what FlatBuffers was meant to replace, until I realized that it's a great tool for people who have efficient FlatBuffer usage elsewhere. Similarly, the big problem with compression is offsets and vtables, but doing away with them would defeat the point of FlatBuffers. But whether there could be an auxiliary encoding that is not random access but is all about compressibility is an interesting question!
Then you code-gen efficient functions to map between such a buffer and a standard FlatBuffer. This format is interesting because it stores no types, so it should be able to win against Protobuf for size, but of course it will be utterly unparsable without a schema, just like FlatBuffers itself. I don't think doing the same for FlexBuffers makes as much sense, since FlexBuffers does need types, and is already more compact in some cases than FlatBuffers. Making the above changes to FlexBuffers would result in an encoding that is too similar to the existing FlexBuffers, and in fact could be much worse if it doesn't support sharing of key values. An interesting tradeoff is uncompressed size vs. compressed size. If we'd go through the trouble of making a format that is very compressible, we might also make the format small without even needing compression, which may be useful in some cases. Like the above example of using LEBs: if it doesn't hurt the compressed size, it would certainly help the uncompressed size.
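To make the "code-gen functions to map between such a buffer and a standard FlatBuffer" idea a bit more concrete, here is a minimal hand-written sketch of what generated pack/unpack functions for one table could look like, assuming LEB128 (varint) integers and a made-up `Monster` struct standing in for the decoded table. None of this is an existing FlatBuffers API; a real generator would emit these functions from the schema and would also have to handle optional fields, vectors, and nested tables.

```cpp
// Hypothetical sketch only: fields are written in schema order with no
// offsets, no vtables and no type tags, using LEB128 (varint) integers,
// so the stream is meaningless without the schema.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Append an unsigned LEB128 (varint) value to the output stream.
static void WriteLEB(std::vector<uint8_t> &out, uint64_t v) {
  do {
    uint8_t byte = v & 0x7F;
    v >>= 7;
    if (v) byte |= 0x80;  // More bytes follow.
    out.push_back(byte);
  } while (v);
}

// Read an unsigned LEB128 value back; advances `pos`.
static uint64_t ReadLEB(const std::vector<uint8_t> &in, size_t &pos) {
  uint64_t result = 0;
  int shift = 0;
  uint8_t byte;
  do {
    byte = in[pos++];
    result |= uint64_t(byte & 0x7F) << shift;
    shift += 7;
  } while (byte & 0x80);
  return result;
}

// Stand-in for a decoded table from some schema (field names are made up).
struct Monster {
  uint64_t hp = 0;
  uint64_t mana = 0;
  std::string name;
};

// What a generated "pack" function might look like: fields in schema order.
static std::vector<uint8_t> PackMonster(const Monster &m) {
  std::vector<uint8_t> out;
  WriteLEB(out, m.hp);
  WriteLEB(out, m.mana);
  WriteLEB(out, m.name.size());  // Length-prefixed string.
  out.insert(out.end(), m.name.begin(), m.name.end());
  return out;
}

// The matching generated "unpack" function.
static Monster UnpackMonster(const std::vector<uint8_t> &in) {
  size_t pos = 0;
  Monster m;
  m.hp = ReadLEB(in, pos);
  m.mana = ReadLEB(in, pos);
  size_t len = ReadLEB(in, pos);
  m.name.assign(in.begin() + pos, in.begin() + pos + len);
  return m;
}

int main() {
  Monster orc{300, 150, "Orc"};
  std::vector<uint8_t> packed = PackMonster(orc);
  Monster back = UnpackMonster(packed);
  // 300 and 150 each take 2 LEB bytes, "Orc" takes 1 + 3 bytes: 8 bytes total.
  std::printf("packed size: %zu, hp: %llu, name: %s\n", packed.size(),
              (unsigned long long)back.hp, back.name.c_str());
  return 0;
}
```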
-
At the company I work for, we did several experiments comparing JSON, Protobuf, and FlatBuffers using real data.
FlatBuffers/FlexBuffers unquestionably have a huge advantage on some of the observed metrics, like parsing time, partial access time, and even writing the buffers (although for the last one the improvements are more modest).
But one metric on which FlatBuffers/FlexBuffers are consistently outperformed, by a wide margin, is compressed data size. In constrained environments such as mobile, this plays a major role in deciding which data encoding to use.
Obviously you should pick the most appropriate tool for the metrics you want to optimize. But my question is: can we make FlatBuffers/FlexBuffers compress well enough to compete with text+gz?
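For concreteness, here is a minimal sketch of the kind of size comparison described above, assuming zlib is available. This is not the actual benchmark code, and the payloads are placeholders; in a real measurement they would be the serialized FlatBuffer and the equivalent JSON text for the same data.

```cpp
// Sketch only: compare raw vs. deflate-compressed size of two encodings.
#include <cstdio>
#include <string>
#include <vector>
#include <zlib.h>

// Returns the size of `data` after zlib/deflate compression (a reasonable
// proxy for gzip, whose framing only adds a few bytes).
static size_t CompressedSize(const std::vector<unsigned char> &data) {
  uLongf dest_len = compressBound(static_cast<uLong>(data.size()));
  std::vector<unsigned char> dest(dest_len);
  int rc = compress2(dest.data(), &dest_len, data.data(),
                     static_cast<uLong>(data.size()), Z_BEST_COMPRESSION);
  return rc == Z_OK ? static_cast<size_t>(dest_len) : 0;
}

int main() {
  // Placeholder payloads standing in for real serialized data.
  std::vector<unsigned char> flatbuffer_bytes(1024, 0x42);
  std::string json_text = "{\"hp\":300,\"mana\":150,\"name\":\"Orc\"}";
  std::vector<unsigned char> json_bytes(json_text.begin(), json_text.end());

  std::printf("flatbuffer: %zu raw, %zu compressed\n",
              flatbuffer_bytes.size(), CompressedSize(flatbuffer_bytes));
  std::printf("json:       %zu raw, %zu compressed\n",
              json_bytes.size(), CompressedSize(json_bytes));
  return 0;
}
```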
I recall @aardappel mentioning that a code-generated compression API could be a way to improve in this area, but I have no clue what that would look like.
Ideas?