Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: protodelim: add delimited-message serialization support package #1382

Closed
neild opened this issue Nov 4, 2021 · 7 comments
Closed

Comments

@neild
Copy link
Contributor

neild commented Nov 4, 2021

Several other protobuf implementations include first-class support for (de)serializing varint-delimited messages:

Both of these handle messages encoded as a varint size in bytes, followed by exactly that many bytes of a wire-format encoded message.

Perhaps we should provide a similar API.

A possible sketch:

// Package protodelim marshals and unmarshals size-delimited messages.
//
// Delimited format allows a single file or stream to contain multiple
// messages. A delimited message is a varint encoding the message size
// followed by the wire encoding of a message of exactly that size.
package protodelim

// MarshalOptions configures the marshaler.
type MarshalOptions struct {
        proto.MarshalOptions
}

// MarshalAppend appends a single size-delimited, wire-format message to b, returning the result.
func (o MarshalOptions) MarshalAppend(b []byte, m proto.Message) ([]byte, error) {}

// WriteTo writes a single size-delimited, wire-format message to w.
func (o MarshalOptions) WriteTo(w io.Writer, m proto.Message) error {}

// UnmarshalOptions configures the unmarshaler.
type UnmarshalOptions struct {
        proto.UnmarshalOptions
}

// Unmarshal parses a single size-delimited, wire-format message from b
// and places the result in m. It returns the unconsumed portion of the buffer.
// The provided message must be mutable (e.g., a non-nil pointer to a message).
func (o UnmarshalOptions) Unmarshal(b []byte, m proto.Message) ([]byte, error) {}

// ReadFrom parses a single size-delimited, wire-format message from r
// and places the result in m.
// The provided message must be mutable (e.g., a non-nil pointer to a message).
//
// ReadFrom returns an io.EOF error only if no bytes are read.
// If an EOF happens after reading some but not all the bytes,
// ReadFrom returns a non-io.EOF error.
// If r returns an error having read all bytes in the message, the error is dropped.
func (o UnmarshalOptions) ReadFrom(r io.Reader, m proto.Message) error {}

I propose including WriteTo/ReadFrom functions operating on an io.Writer/io.Reader, because reading from a stream seems like a common use case for delimited messages.

The C++ and Java APIs do not provide any way to limit the size of messages read. Perhaps we could have a protodelim.MarshalOptions.MaxSize option to avoid trying to process arbitrarily large messages.

@dsnet
Copy link
Member

dsnet commented Nov 4, 2021

Should this instead be:

type MarshalOptions proto.MarshalOptions

Usage wise, it would be the difference between:

protodelim.MarshalOptions{proto.MarshalOptions{...}}

and

protodelim.MarshalOptions{...}

golang/go#9859 would be nice here.

Another bike-shedding nit, given that Unmarshal has a slightly different signature than what we're used to, perhaps call it UnmarshalNext given that it unmarshals the next one in the sequence?

@dsnet
Copy link
Member

dsnet commented Nov 4, 2021

Implementation-wise for ReadFrom, does this mean that we keep passing it a 1-byte buffer until we read the entirety of the length varint? We need to guarantee that we never read more bytes than necessary from the io.Reader.

@neild
Copy link
Contributor Author

neild commented Nov 4, 2021

The reason for the separate struct type is to allow us to add delimiter-specific options if we need them. e.g.,

type UnmarshalOptions struct {
  proto.UnmarshalOptions

  // MaxSize is the maximum size in wire-format bytes of a single message.
  // Unmarshaling a message larger than MaxSize will return an error.
  MaxSize int
}

Implementation-wise for ReadFrom, does this mean that we keep passing it a 1-byte buffer until we read the entirety of the length varint?

Yes, ReadFrom needs to read a byte at a time until it finds the length prefix, although it could try to get clever by seeing if the io.Reader has a Peek method.

We could alternatively have a Reader type that takes ownership of the io.Reader and does its own buffering.

@neild
Copy link
Contributor Author

neild commented Nov 4, 2021

MaxSize seems like a good safety feature (don't try to read 1GiB of protos!), but opens the question of what to do when exceeding the limit.

Return a distinguishable error?

Guarantee that we've only read the size prefix, so the user can skip the next N bytes if they want? (Requires the error contain the number of bytes to skip.)

Have another option to skip the too-large message? Probably don't want to skip it by default, because the user may not want us to try to read that 1GiB of protos.

Instead of a MaxSize option, we could detect that the io.Reader is an *io.LimitedReader and produce an error if the message would overrun the limit.

@dsnet
Copy link
Member

dsnet commented Nov 4, 2021

although it could try to get clever by seeing if the io.Reader has a Peek method.

Another option: compress/flate.Reader specializes for any io.Reader that also implements io.ByteReader.

Instead of a MaxSize option, we could detect that the io.Reader is an *io.LimitedReader and produce an error if the message would overrun the limit.

I'd recommend against specializing for *io.LimitedReader since there are other "limited reader"-like implementations in stdlib, such as net/http.MaxBytesReader.

@dsnet
Copy link
Member

dsnet commented Nov 4, 2021

Another argument for checking for io.ByteReader is that it works with binary.ReadUvarint. The ReadUvarint uses the same format as varints in protobuf.

@neild
Copy link
Contributor Author

neild commented Dec 13, 2021

I'm not really happy with any of the ways for providing a read source that I've come up with.

A concrete *bufio.Reader lets us efficiently read length prefixes, and also lets us skip an allocation and a copy for messages that fit in the buffer via (*bufio.Reader).Peek. However, accepting a concrete type is limiting.

Defining our own interface { io.Reader ; io.ByteReader ; Peek(int) ([]byte, error) } works, but is unusual and seems quite coupled to *bufio.Reader.

A plain io.Reader is hopelessly inefficient. We can use type assertions to check for ReadByte and Peek methods, but then the user gets terrible performance if they don't know that we're really looking for a ByteReader. (I'm also not thrilled about an interface->interface type assertion on every read, although it's probably not terribly important.)

An interface { io.Reader ; io.ByteReader } with a type assertion to look for Peek is less coupled to *bufio.Reader, but we're again silently downgrading performance if you don't give us the right type.

Another possibility might be:

package protodelim

// NewReader returns a Reader which reads messages from r.
// The Reader may read more data than necessary from r.
func NewReader(r io.Reader, opts UnmarshalOptions) *Reader {}

func (r *Reader) Read(m proto.Message) error {}

Then we wrap r in a *bufio.Reader if necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants