Python protobuf performance comparison #880

Closed
monkeybutter opened this issue Oct 14, 2015 · 5 comments

@monkeybutter

I've been doing some comparisons between different serialization formats for n-dimensional arrays (numpy arrays). The comparison covers numpy's native serialization to disk, HDF5, bcolz, and Protocol Buffers 2. The results can be seen here:

https://gist.github.com/monkeybutter/b91004077be5d73a478a

I wonder why the numbers I get from Protocol Buffers are so high. It seems that both parsing and conversion into a numpy object are orders of magnitude slower than the rest. Are there other ways of packaging the data that could improve these numbers?
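For readers who don't open the gist, this is roughly the kind of round-trip being timed, sketched from the details given later in the thread (a repeated double field for the data and int32 for the shape). The schema, message, and module names here are assumptions for illustration, not taken from the gist:

```python
import numpy as np

# Hypothetical generated module for a proto2 schema along the lines of:
#   message NDArray {
#     repeated int32  shape = 1;
#     repeated double data  = 2;   // optionally [packed=true]
#   }
# Both the schema and the module name are assumptions for illustration.
import ndarray_pb2


def serialize(arr):
    msg = ndarray_pb2.NDArray()
    msg.shape.extend(arr.shape)
    msg.data.extend(arr.ravel().tolist())  # element-by-element copy into the repeated field
    return msg.SerializeToString()


def deserialize(buf):
    msg = ndarray_pb2.NDArray()
    msg.ParseFromString(buf)
    # Converting the repeated field back into a numpy array is also element by element,
    # which is a large part of the cost being measured.
    return np.array(msg.data, dtype=np.float64).reshape(tuple(msg.shape))
```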

@xfxyjwf (Contributor) commented Oct 14, 2015

Something you can try:

  1. Mark all repeated fields as packed.
  2. Use fixed32 instead of int32.
  3. Use the Python C++ implementation (described at the end of README.md).

I haven't done Python benchmarks before, but less than 4 MB/s parsing throughput does seem ridiculously slow.
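For reference, a minimal sketch of what those suggestions look like in practice, assuming the C++ extension is built and installed and reusing the hypothetical schema sketched above:

```python
# Suggestions 1 and 2 are schema edits, e.g.:
#   repeated fixed32 shape = 1 [packed=true];
#   repeated double  data  = 2 [packed=true];
# Suggestion 3 is selected with an environment variable that must be set
# before any protobuf module is imported (in the shell or at the top of the script):
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "cpp"

import ndarray_pb2  # hypothetical generated module; imported after the switch, it uses the C++ backend
```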

@haberman do you have any Python performance numbers?

@monkeybutter (Author)

Thanks @xfxyjwf for your suggestions.

1. I have actually tried both packed=true and packed=false in the schema for the array field (the one that contains the data), and the difference in deserialization speed is very small. What changes is the size of the serialized buffer, which is significantly smaller when packing is used.

2. The array data is actually of type double; int32 is only used to store the shape of the array, which is normally just a couple of integers.

3. I'm going to try the Python-wrapped C++ version and will post the results when they're ready.
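If it helps anyone reproducing this, here is a rough timing harness for comparing the two backends, again using the hypothetical NDArray message sketched earlier; run it once with PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python and once with =cpp:

```python
import time
import numpy as np
import ndarray_pb2  # hypothetical generated module

arr = np.random.rand(512, 512)  # roughly 2 MB of float64 data
msg = ndarray_pb2.NDArray()
msg.shape.extend(arr.shape)
msg.data.extend(arr.ravel().tolist())
buf = msg.SerializeToString()

runs = 20
start = time.time()
for _ in range(runs):
    parsed = ndarray_pb2.NDArray()
    parsed.ParseFromString(buf)
elapsed = time.time() - start
print("parse throughput: %.1f MB/s" % (runs * len(buf) / elapsed / 1e6))
```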

@erwindassen

Any updates on this? A comparison with protobuf 3.0 would also be nice. We are seeing quite slow deserialization in Python here.

@jeffrey-cochran

@monkeybutter thanks for the comparison, it's very helpful. Simply put, I'm shocked that it would be so much slower. @xfxyjwf, has anyone revisited this? I would love for this to get a bump. I like protobuf, and it seems to have good support, but an order-of-magnitude difference when serializing our data from Python is most likely going to be a deal breaker.

@xfxyjwf (Contributor) commented Nov 1, 2016

@jeffrey-cochran Sorry, no one has looked into this particular case. I actually expect protobuf Python performance to be reasonably fast, because that's what Google uses in YouTube, TensorFlow, etc. All of them use the Python C++ implementation now, though, so the issue observed here is probably specific to the pure-Python implementation, and none of the other performance-sensitive users have run into it.
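For anyone who wants to confirm which implementation their installation actually uses, there is an internal helper in the protobuf package; treat it as a convenience check rather than a stable API:

```python
from google.protobuf.internal import api_implementation
print(api_implementation.Type())  # "python" for the pure-Python implementation, "cpp" for the C++ one
```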
