Python protobuf performance comparison #880

Closed
monkeybutter opened this issue Oct 14, 2015 · 5 comments

@monkeybutter

I've been doing some comparisons between different serialization formats for n-dimensional arrays (numpy arrays). The comparison covers numpy's native serialization to disk, HDF5, bcolz, and Protocol Buffers 2. The results can be seen here:

https://gist.github.com/monkeybutter/b91004077be5d73a478a

I wonder why the numbers I get from Protocol Buffers are so high. It seems that both parsing and conversion into a numpy object are orders of magnitude slower than the rest. Are there other ways of packaging the data that could improve these numbers?
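For readers who don't open the gist, this is roughly the kind of round-trip being timed, sketched from the details given later in the thread (a repeated double field for the data and int32 for the shape). The schema, message, and module names here are assumptions for illustration, not taken from the gist:

```python
import numpy as np

# Hypothetical generated module for a proto2 schema along the lines of:
#   message NDArray {
#     repeated int32  shape = 1;
#     repeated double data  = 2;   // optionally [packed=true]
#   }
# Both the schema and the module name are assumptions for illustration.
import ndarray_pb2


def serialize(arr):
    msg = ndarray_pb2.NDArray()
    msg.shape.extend(arr.shape)
    msg.data.extend(arr.ravel().tolist())  # element-by-element copy into the repeated field
    return msg.SerializeToString()


def deserialize(buf):
    msg = ndarray_pb2.NDArray()
    msg.ParseFromString(buf)
    # Converting the repeated field back into a numpy array is also element by element,
    # which is a large part of the cost being measured.
    return np.array(msg.data, dtype=np.float64).reshape(tuple(msg.shape))
```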

@xfxyjwf (Contributor) commented Oct 14, 2015

Something you can try:

  1. Mark all repeated fields as packed.
  2. Use fixed32 instead of int32.
  3. Use the Python C++ implementation (described at the end of README.md).

I haven't done Python benchmarks before, but less than 4 MB/s parsing throughput does seem ridiculously slow.
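For reference, a minimal sketch of what those suggestions look like in practice, assuming the C++ extension is built and installed and reusing the hypothetical schema sketched above:

```python
# Suggestions 1 and 2 are schema edits, e.g.:
#   repeated fixed32 shape = 1 [packed=true];
#   repeated double  data  = 2 [packed=true];
# Suggestion 3 is selected with an environment variable that must be set
# before any protobuf module is imported (in the shell or at the top of the script):
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "cpp"

import ndarray_pb2  # hypothetical generated module; imported after the switch, it uses the C++ backend
```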

@haberman do you have any Python performance numbers?

@monkeybutter (Author)

Thanks @xfxyjwf for your suggestions.

1. I have actually tried both packed=true and packed=false in the schema for the array field (the one that contains the data), and the difference in deserialization speed is very small. What changes is the size of the serialized buffer, which is significantly smaller when packing is used.

2. The array data is actually of type double; int32 is only used to store the shape of the array, which is normally just a couple of integers.

3. I'm going to try the Python-wrapped C++ version and will post the results when they're ready.
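If it helps anyone reproducing this, here is a rough timing harness for comparing the two backends, again using the hypothetical NDArray message sketched earlier; run it once with PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python and once with =cpp:

```python
import time
import numpy as np
import ndarray_pb2  # hypothetical generated module

arr = np.random.rand(512, 512)  # roughly 2 MB of float64 data
msg = ndarray_pb2.NDArray()
msg.shape.extend(arr.shape)
msg.data.extend(arr.ravel().tolist())
buf = msg.SerializeToString()

runs = 20
start = time.time()
for _ in range(runs):
    parsed = ndarray_pb2.NDArray()
    parsed.ParseFromString(buf)
elapsed = time.time() - start
print("parse throughput: %.1f MB/s" % (runs * len(buf) / elapsed / 1e6))
```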

@erwindassen

Any updates on this? A comparison with protobuf 3.0 would also be nice. We are seeing quite slow deserialization in Python here.

@jeffrey-cochran

@monkeybutter thanks for the comparison, it's very helpful. Simply put, I'm shocked that it would be so much slower. @xfxyjwf, has anyone revisited this? I would love for this to get a bump. I like protobuf, and it seems to have good support, but an order-of-magnitude difference when serializing our data from Python is most likely going to be a deal breaker.

@xfxyjwf (Contributor) commented Nov 1, 2016

@jeffrey-cochran Sorry, no one has looked into this particular case. I actually expect protobuf Python performance to be reasonably fast, because that's what Google uses in YouTube, TensorFlow, etc. All of them use the Python C++ implementation now, though, so the issue observed here is probably specific to the pure-Python implementation, and none of the other performance-sensitive users have run into it.
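For anyone who wants to confirm which implementation their installation actually uses, there is an internal helper in the protobuf package; treat it as a convenience check rather than a stable API:

```python
from google.protobuf.internal import api_implementation
print(api_implementation.Type())  # "python" for the pure-Python implementation, "cpp" for the C++ one
```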
