What is the fundamental difference between PB and FB? #4372
The key difference is that FlatBuffers doesn't de-serialize at all, ever. It allows you to access the serialized data in-place. FlatBuffers only supports writing to a ByteBuffer. Your timing doesn't test anything: you're measuring the time it takes to write to the stream. You should instead measure the time from where you create the builder to where you finish it. You can't time de-serialization, because there is none.
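The "no de-serialization" point can be illustrated with plain JDK code. The sketch below is emphatically not the real FlatBuffers wire format or API; it is a toy little-endian layout whose accessors read fields directly at known offsets inside the serialized buffer, so there is no up-front parse step to time.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustration only: NOT the real FlatBuffers format. It just demonstrates
// "in-place access": fields are read at fixed offsets in the serialized
// buffer, so there is no separate de-serialization step to measure.
public class InPlaceAccess {
    // Toy layout (little-endian): [int32 hp][int32 nameLen][name bytes]
    static ByteBuffer serialize(int hp, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + nameBytes.length)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(hp).putInt(nameBytes.length).put(nameBytes);
        buf.flip();
        return buf;
    }

    // "Accessors" read straight from the buffer; nothing is parsed up front.
    static int hp(ByteBuffer buf) { return buf.getInt(0); }

    static String name(ByteBuffer buf) {
        int len = buf.getInt(4);
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) out[i] = buf.get(8 + i);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer buf = serialize(300, "Orc");
        // "Deserialization time" is zero: cost is paid only per field touched.
        System.out.println(hp(buf));    // 300
        System.out.println(name(buf));  // Orc
    }
}
```

The real library works the same way in spirit: `getRootAsXXXX(byteBuffer)` only wraps the buffer, and each generated accessor reads through offsets on demand.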
Hmm, that is why the deserialization time in my results for FlatBuffers basically always comes back as almost 0. Here I give a set of test results for how much disk space is used when serializing a text file with FlatBuffers and Protobuf.
I will test accessing 1 field, 2 fields, 14 fields, and all fields (27 fields in total within an object) under different cases. Last but not least, it would be better if FlatBuffers could provide a compression API, so that compressed data could be sent over the network for transmission. What do you think? Flink, for example, is on the way to using compression (e.g. Snappy) for full checkpoints/savepoints: https://issues.apache.org/jira/browse/FLINK-6773
FYI, I put together a report from recent testing at work, included here as a reference. The input data comes from Kafka production environments. I have two sets of data: one is 36,879 KB and the other is 145,157 KB. I parsed the input data into 27 fields in total, and below are the results under different cases, covering access to 1, 2, 14 and 27 fields. Times are in milliseconds.
You can use any existing compressor on a FlatBuffer, which will give some savings, but it is not optimal because offsets are relatively random and thus not very compressible. A special-purpose compression scheme could be invented, but I don't think anyone has tried that yet. FlatBuffers was deliberately designed for speed of access over size.
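Applying an existing compressor to the finished buffer before it goes on the wire is straightforward. A minimal JDK-only sketch, assuming gzip via `java.util.zip` (the achievable ratio depends entirely on the payload; as noted above, FlatBuffers offsets tend to compress poorly):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressBuffer {
    // Gzip an arbitrary serialized buffer (e.g. the bytes behind
    // fbb.dataBuffer()) before sending it over the network.
    static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(input);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Highly repetitive payload: compresses very well. A real
        // FlatBuffer with many offsets would shrink much less.
        byte[] payload = "field=value;".repeat(1000)
                                       .getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(payload);
        System.out.println(packed.length < payload.length); // true
    }
}
```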
One option is to use an encoder like …
@zhangminglei, can you publish your benchmark code and data?
@aardappel one pushback I've gotten when suggesting FlatBuffers is that it's not optimized for size, and therefore (since IO, not CPU, typically dominates resources) "why would we want to do that?" I have to admit I'm not sure how to answer that. Is there a way to tune it for size?
@binary132 if wire-format size is your concern above all else, then indeed I would not choose FlatBuffers. You can run a compressor on top of FlatBuffers, but since offsets don't compress well, this doesn't gain as much as you'd get from using, say, Protobuf. FlatBuffers was designed for use cases where memory usage and speed matter, such as games loading lots of data, or high-performance RPC between services in a data center (where CPU can often be a bigger bottleneck than the network!). Other than using a good compressor, I'm not sure what to do if you have already bought into FlatBuffers and want it to be smaller. I could imagine a special-purpose transform that would make FlatBuffers more compressible; not sure if that's worth it.
I did some investigation and found that the JSON representation of a FlatBuffer compresses much better than the binary version. For example, take two serialized FlatBuffers that contain identical contents, one in JSON and the other in standard FlatBuffers binary format:

```
$ ls
savegame.json savegame.bin
$ wc -c *
  881167 savegame.json
  179216 savegame.bin
 1060383 total
$ cat savegame.bin | gzip -9 | wc -c
53737
$ cat savegame.json | gzip -9 | wc -c
22415  # <== !!
```

So if deserialization speed is not important to you, but you want to minimize space, it may be beneficial to serialize as JSON and compress that.
@dpacbach Also note how big the uncompressed JSON is. With this path, you're going to decompress into roughly 5x more memory, then run a JSON parser on all of that (which is slow, and likely allocates further copies of all of it). That's a big price to pay in efficiency when FlatBuffers can be accessed as-is.
I recently used FB to refactor a project of mine (written in Scala/Java). After the refactoring, I wanted to compare the performance of both. I got much better performance, though I won't quantify the specific figures here. My question is about the fundamental difference between the TWO. My understanding is as follows; please tell me whether it is correct or incorrect.
FB performance is strong because it is a memory-based serialization framework, whereas PB is not: with PB we must call object.build().writeTo(output) for encoding, and decode with parseFrom from a byte array or an input stream. PB does not support an API like FB's fbb.dataBuffer(). When FB calls fbb.dataBuffer(), it serializes the object into memory as a ByteBuffer; after that, FB calls Object.getRootAsXXXX(byteBuffer) to "deserialize" from the ByteBuffer. PB, by contrast, must call parseFrom to deserialize. That is the key reason the two have a relatively large difference in performance: PB is not memory-based and needs to deserialize an object from an input stream, while for FB it is enough to read the object from a ByteBuffer. Do I understand this right?
Another question I would like to ask: does FB support an API that writes its data to an output stream? I think it does not at the moment. Why? Is it just because it is memory-based?
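On the output-stream question: the finished ByteBuffer already is the wire format, so writing it out is just a raw byte copy, which the JDK can do without any help from the library. A sketch using only the JDK (the `writeTo` helper here is hypothetical, not part of the FlatBuffers API):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

public class BufferToStream {
    // Hypothetical helper: copy the remaining bytes of a finished buffer
    // (e.g. the result of fbb.dataBuffer()) into any OutputStream.
    static void writeTo(ByteBuffer buf, OutputStream out) throws IOException {
        WritableByteChannel ch = Channels.newChannel(out);
        while (buf.hasRemaining()) {
            ch.write(buf);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeTo(buf, out);
        System.out.println(out.toString(StandardCharsets.UTF_8)); // hello
    }
}
```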
Other than that, the time I counted in my FB performance test includes writing the ByteBuffer to an output stream, and I think that is incorrect in a way; I am not very sure about what I actually tested. Maybe it is enough to measure only how long it takes to produce the ByteBuffer from fbb.dataBuffer(), since FB is memory-based, so I don't need to include the time from ByteBuffer to output stream. For PB, on the other hand, I do need to include the writeTo method. Anyway, the serialization tests still show very good performance; even if the methodology is somewhat off, they still show the power of FB.
As for FB decode time, I timed the call that gets the root object from the ByteBuffer.
As for counting the PB test time, it is easier: for serialization it is enough to measure how much time the writeTo method consumes, and for deserialization it is enough to measure the parseFrom method.
@aardappel @rw It would be great if you could both take a look at my thinking about these two awesome projects, FB and PB. I would very much appreciate it, thanks! And if I am wrong, please help me out here.