
Protobuf lacks framed stream of messages #54

Closed
brianolson opened this issue Oct 15, 2014 · 10 comments
Labels: enhancement, inactive (no activity in the last 90 days), P3, python

@brianolson

Lots of applications want a stream of protobuf messages in a file or a network stream.

It could be as simple as exposing the internal utility functions to write a varint to a stream. An application could then write a varint length prefix and then the blob of serialized protobuf.
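
To sketch what that could look like (the write_varint helper below is illustrative, not part of the protobuf API; only SerializeToString is):

def write_varint(out, value):
    # Base-128 varint: 7 payload bits per byte, high bit set on all but the last.
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.write(bytes([byte | 0x80]))
        else:
            out.write(bytes([byte]))
            return

def write_framed(out, msg):
    # Length-prefix the serialized message so a reader knows where it ends.
    data = msg.SerializeToString()
    write_varint(out, len(data))
    out.write(data)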

@xfxyjwf (Contributor) commented Oct 15, 2014

To be clear, protobuf does support framed streams of messages. Are you talking specifically about protobuf in Python? In C++ and Java, you can use CodedInputStream/CodedOutputStream to read/write varints or any other protobuf wire-format data.

@brianolson (Author)

Yes, Python. Can we get a stable, officially blessed interface to the underlying operations in Python?

@hzeller (Contributor) commented Oct 16, 2014

On 15 October 2014 09:20, Brian Olson notifications@github.com wrote:

Lots of applications want a stream of protobuf messages in a file or a
network stream.

It could be as simple as exposing the internal utility functions to write
a varint to a stream.

Doesn't CodedOutputStream already provide that?

@cbsmith commented Feb 4, 2015

It does. Have you seen the Python implementation?

@brianolson (Author)

CodedOutputStream is in the C++ library. In Python, I think what I want is buried in google.protobuf.internal.encoder._EncodeVarint and google.protobuf.internal.decoder._DecodeVarint.
I think it would be useful to promote these to the public API. If the equivalent is already public in C++, there's no reason not to do the same in Python.
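
A sketch of the read side built on that private helper, assuming its current (value, new_position) return convention and a message class passed in by the caller — both the helper and its signature could change, since nothing here is public API:

from google.protobuf.internal import decoder

def read_framed(data, message_type):
    # Parse messages out of a buffer of varint-length-prefixed serializations.
    # decoder._DecodeVarint is private; it returns (value, new_position).
    pos = 0
    while pos < len(data):
        size, pos = decoder._DecodeVarint(data, pos)
        msg = message_type()
        msg.ParseFromString(data[pos:pos + size])
        pos += size
        yield msg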

@acozzette acozzette added the P3 label Jun 8, 2018
TeBoring pushed a commit to TeBoring/protobuf that referenced this issue Jan 19, 2019
@Ubehebe (Contributor) commented Apr 10, 2019

I recently implemented a Bazel persistent worker in Python. The lack of varint-delimited reading/writing APIs was an obstacle; I worked around it by using the private APIs.

This is a case of two Google products not working well together. Is it possible to publish these APIs?
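
For a long-lived stream like a persistent worker's stdin, you end up hand-rolling even the varint read, since the private helpers want an in-memory buffer. A sketch of that workaround (request_type stands in for whatever message class the protocol defines; it is named hypothetically here):

import sys

def read_varint(stream):
    # Read a base-128 varint byte by byte; returns None on clean EOF.
    shift = 0
    result = 0
    while True:
        b = stream.read(1)
        if not b:
            return None
        result |= (b[0] & 0x7F) << shift
        if not (b[0] & 0x80):
            return result
        shift += 7

def read_requests(request_type, stream=sys.stdin.buffer):
    # Yield length-prefixed messages until the stream is closed.
    while (size := read_varint(stream)) is not None:
        msg = request_type()
        msg.ParseFromString(stream.read(size))
        yield msg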

@mishas (Contributor) commented Jun 6, 2020

I would like to revive this thread by sharing our experience.

We have a distributed system which communicates using protobuf messages.
To make things a bit faster, some of the processes aggregate those messages and send them in batches (i.e. hold a buffer for 30 seconds, then send out a list of everything received within those 30 seconds).

Since those aggregation points aggregate a LOT of protobufs, it's important for the aggregation to be fast.

We've tried different ways of creating a list of protobufs. The fastest by far: for each of our message types, we also define a MessageTypeList message with a single repeated field of type MessageType. This approach has its drawbacks, though:

  • For one, it creates a huge mess in our protos (every message type has to be defined twice: once for the message itself and once for its list wrapper).
  • Secondly, the output is larger than a real stream would be.

Here's our benchmark and results:

my_proto.proto:

syntax = "proto3";

message MyMsg {
    int32 n = 1;
}

message MyMsgList {
    repeated MyMsg msgs = 1;
}

benchmark.py:

import io

from google.protobuf.internal import encoder
import my_proto_pb2

def serialize_list(l):
    # Wrap the messages in the repeated field and serialize once.
    msgs = my_proto_pb2.MyMsgList(msgs=l)
    return msgs.SerializeToString()

def serialize_stream(l):
    # Varint-length-prefix each message, as a framed stream would.
    iob = io.BytesIO()
    for x in l:
        encoder._EncodeVarint(iob.write, x.ByteSize())
        iob.write(x.SerializeToString())
    return iob.getvalue()
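
(The driver is not shown above; appended to benchmark.py, something along these lines reproduces the comparison — the message contents and timing code here are assumptions:)

import time

def run(name, fn, msgs):
    start = time.monotonic()
    out = fn(msgs)
    print("Done %s in %f seconds" % (name, time.monotonic() - start))
    print("List size %d bytes" % len(out))

msgs = [my_proto_pb2.MyMsg(n=i) for i in range(1000000)]
run("serialize_list", serialize_list, msgs)
run("serialize_stream", serialize_stream, msgs)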

Results for a list of 1M protos:
With pure python protos:

  • Done serialize_list in 13.030245 seconds
  • List size 5983486 bytes
  • Done serialize_stream in 7.377367 seconds (0.57x time)
  • List size 4983486 bytes, (16.71% smaller)

As you can see, with pure-Python protos the stream code not only produces smaller output but is also almost twice as fast. We needed to do better still, so we use the CPP protos:

With CPP python protos:

  • Done serialize_list in 0.650137 seconds
  • List size 5983486 bytes
  • Done serialize_stream in 1.002155 seconds (1.54x time)
  • List size 4983486 bytes, (16.71% smaller)

Here you can see that the stream code is more than 1.5 times slower :(.

Can we please get streams as part of this package? It seems that doing it any other way will not give us good enough speed.

saurabhs2501 added a commit to saurabhs2501/protobuf that referenced this issue Sep 8, 2020
eme-p added a commit to eme-p/protobuf that referenced this issue Sep 21, 2020
@elharo elharo assigned haberman and unassigned anandolee Oct 1, 2021
@ericsalo ericsalo assigned ericsalo and unassigned haberman Sep 1, 2022
@ericsalo (Member) commented Sep 1, 2022

Now that Python is implemented on top of upb, this has become a upb issue. First up is to implement a proto text parser, which is something I am doing now. The initial implementation will be limited to contiguous buffers; after that we will look into adding support for stream I/O. I can't say yet when (or even whether) this may float to the top of the work queue, but it is definitely on my radar, so I am reassigning this to myself.

bithium pushed a commit to bithium/protobuf that referenced this issue Sep 4, 2023
@github-actions (bot) commented Jun 23, 2024

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago.

@github-actions github-actions bot added the inactive Denotes the issue/PR has not seen activity in the last 90 days. label Jun 23, 2024
@haberman (Member)

Python is officially getting a public API for length-prefixed streams of messages: #16965

It is not released yet, but it will be included in the next minor version.

If you have any performance issues with this API, please open a separate issue for it.
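
For anyone landing here later, usage should look roughly like this — the serialize_length_prefixed/parse_length_prefixed names in google.protobuf.proto are taken from the linked PR and may differ in the release, so treat them as an assumption and check the current docs:

# Assumed API from the linked PR; verify names against the released docs.
import io

from google.protobuf import proto
import my_proto_pb2  # generated module from the benchmark above

out = io.BytesIO()
for i in range(3):
    proto.serialize_length_prefixed(my_proto_pb2.MyMsg(n=i), out)

data = io.BytesIO(out.getvalue())
while (msg := proto.parse_length_prefixed(my_proto_pb2.MyMsg, data)) is not None:
    print(msg.n)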
