Python protobuf performance comparison #880
Something you can try:
I haven't done Python benchmarks before, but less than 4 MB/s parsing throughput does seem ridiculously slow. @haberman do you have any Python performance numbers?
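A throughput number like "4 MB/s" can be reproduced with a small timing harness. This is a hedged sketch (not from the thread): `throughput_mb_s` times any parse callable, and the stdlib `array` stand-in below would be replaced by a real `YourMessage().ParseFromString(payload)` call.

```python
import time
from array import array

def throughput_mb_s(parse, payload, repeats=5):
    """Time parse(payload) and report throughput in MB/s (best of `repeats` runs)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        parse(payload)
        best = min(best, time.perf_counter() - start)
    return len(payload) / best / 1e6

# Stand-in payload: one million doubles as raw bytes (8 MB). A real protobuf
# benchmark would call YourMessage().ParseFromString(payload) instead.
payload = array("d", range(1_000_000)).tobytes()
rate = throughput_mb_s(lambda b: array("d", b), payload)
print(f"{rate:.0f} MB/s")
```

Taking the best of several runs reduces noise from warm-up and scheduling.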
Thanks @xfxyjwf for your suggestions.
1. I have actually used both packed=true and packed=false in the schema for the array field (the one that contains the data), and the difference in deserialization speed is very small. What does change is the size of the serialized buffer, which is significantly smaller when using packing.
2. The array data is actually of type double; int32 is only used to store the shape of the array, which is normally just a couple of integers.
3. I'm going to try the Python-wrapped C++ version and will post results when ready.
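The observation that packing shrinks the buffer without much speed change is consistent with the wire format: doubles are fixed 8-byte values either way, and packing only removes the per-element tag byte. A rough size model (my own sketch, assuming a field number ≤ 15 so the tag fits in one byte):

```python
def nonpacked_size(n):
    # Non-packed repeated double: one tag byte (wire type 1, 64-bit)
    # plus 8 data bytes per element.
    return n * (1 + 8)

def packed_size(n):
    # Packed repeated double: one tag byte (wire type 2), a varint length
    # prefix, then 8 bytes per element with no per-element tags.
    payload = n * 8
    varint_len = max(1, (payload.bit_length() + 6) // 7)
    return 1 + varint_len + payload

n = 1_000_000
print(nonpacked_size(n))  # 9000000
print(packed_size(n))     # 8000005
```

So packing saves roughly 11% of the bytes for doubles, but the parser still decodes the same million fixed-width values, which is why deserialization speed barely moves.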
Any updates on this? Also, a comparison with protobuf 3.0 would be nice. We are seeing quite slow deserialisation in Python here.
@monkeybutter thanks for the comparison--very helpful. Simply put, I'm shocked that it would be so much slower. @xfxyjwf has anyone revisited this? I would love for this to get a bump. I like protobuf, and it seems to have good support, but an order of magnitude difference serializing our data from Python is most likely going to be a deal breaker. |
@jeffrey-cochran Sorry, no one has looked into this particular case. I actually expect protobuf Python performance to be reasonably fast because that's what Google uses in YouTube, TensorFlow, etc. Though all of them are using the Python C++ implementation now, so the issue observed here is probably for the pure Python implementation only, and none of the other performance-sensitive users have observed it.
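For anyone wanting to check which implementation they are on: the backend is selected with the `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION` environment variable, which must be set before `google.protobuf` is first imported. A hedged sketch (behavior when the C++ extension is missing varies across protobuf versions, so this degrades gracefully):

```python
import os

# Must be set before google.protobuf is first imported.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "cpp"

def protobuf_backend():
    """Return the active protobuf backend name, or None if protobuf is absent."""
    try:
        from google.protobuf.internal import api_implementation
        return api_implementation.Type()  # 'cpp' when the extension is available
    except ImportError:
        return None

print(protobuf_backend())
```

If this prints a pure-Python backend, the order-of-magnitude slowdown discussed above is the expected cost.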
I've been doing some comparisons between different serialization formats for n-dimensional arrays (numpy arrays). The comparison includes formats such as numpy's native serialization to disk, HDF5, bcolz, and Protocol Buffers 2. Results can be seen here:
https://gist.github.com/monkeybutter/b91004077be5d73a478a
I wonder why the numbers I get from protocol buffers are so high. It seems that both the parsing and the conversion into a numpy object take orders of magnitude longer than the rest. Are there other ways of packaging the data that could improve these numbers?
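One common workaround for the per-element decoding cost is to store the array payload as a single raw `bytes` field plus a few shape integers, then decode it in one bulk copy on the receiving side (with numpy, via `numpy.frombuffer`). A minimal stdlib sketch of that layout, with `array` standing in for numpy (the field layout here is my own illustration, not the poster's schema):

```python
import struct
from array import array

def pack_ndarray(shape, values):
    """Serialize as: uint32 ndim, uint32 dims..., then raw float64 bytes."""
    header = struct.pack(f"<I{len(shape)}I", len(shape), *shape)
    return header + array("d", values).tobytes()

def unpack_ndarray(buf):
    """Decode the header, then recover the data in one bulk copy."""
    (ndim,) = struct.unpack_from("<I", buf, 0)
    shape = struct.unpack_from(f"<{ndim}I", buf, 4)
    data = array("d")
    data.frombytes(buf[4 + 4 * ndim:])  # no per-element parsing
    return shape, data

buf = pack_ndarray((2, 3), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
shape, data = unpack_ndarray(buf)
print(shape, data[:3])  # (2, 3) array('d', [1.0, 2.0, 3.0])
```

The bulk copy sidesteps the per-element field decoding that dominates when a repeated scalar field holds millions of values.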