Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: io: add ReadPeeker and implement Peek in bytes.Reader #63548

Closed
mhr3 opened this issue Oct 14, 2023 · 37 comments
Closed

proposal: io: add ReadPeeker and implement Peek in bytes.Reader #63548

mhr3 opened this issue Oct 14, 2023 · 37 comments
Labels
Milestone

Comments

@mhr3
Copy link

mhr3 commented Oct 14, 2023

Is there any reason why bytes.Reader doesn't have a Bytes() or Next(int) method similar to bytes.Buffer?

Such a method would allow zero-copy usage of bytes.Reader, cause currently all public methods require a separate buffer to copy the data into. Plus these methods already exist on bytes.Buffer, so gophers are already familiar with them.

@mhr3 mhr3 added the Proposal label Oct 14, 2023
@gopherbot gopherbot added this to the Proposal milestone Oct 14, 2023
@ianlancetaylor
Copy link
Contributor

The point of bytes.Reader is that you have a function that takes an io.Reader, and you have a []byte, and you want to pass the []byte to the function. Therefore, bytes.Reader has all the methods that that function might want to call. That function isn't going to call a Bytes or Next method, because those aren't implemented by other io.Reader values.

To put it another way, if you feel a need to call a Bytes or Next method, why not just work with the original []byte directly?

@mhr3
Copy link
Author

mhr3 commented Oct 14, 2023

It's a common optimisation pattern to check for specific types inside functions that work with interfaces, therefore any function which accepts an io.Reader could have a fast-path for bytes.Reader if it were possible to get to the byte slice from the Reader.

@ianlancetaylor
Copy link
Contributor

Can you point us to some existing code that would take advantage of this optimization? Thanks.

@mhr3
Copy link
Author

mhr3 commented Oct 15, 2023

Pretty much any streaming parser, even stdlib's bufio.Reader could have a special case for the reader being bytes.Reader avoiding copying everything.

@Jorropo
Copy link
Member

Jorropo commented Oct 15, 2023

You could implement this today using .WriteTo, it has a few ergonomics issue, mainly require moving some processing in a callback and does not work if you don't plan on consuming the full data. However it is already widely deployed solution for zero-copy.
Depending how disruptive you allow it to be you can always refactor your data pipeline to be inverted and fit in there.

@mhr3
Copy link
Author

mhr3 commented Oct 15, 2023

So I've tried to showcase this in master...mhr3:go:bytes-reader-next

I've added bytes.Reader.Next and adapted bufio.Reader to use the byte slice directly, while not exactly pretty (and I'm not proposing stdlib should do this), it shows how such an API could be used and mainly this is not possible without this new API.

Also result of benchstat of a very simple benchmark in the bufio package (master vs this branch):

name                old time/op    new time/op    delta
ReaderReadString-8    50.8ns ± 0%    44.9ns ± 0%  -11.74%  (p=0.000 n=10+10)

name                old alloc/op   new alloc/op   delta
ReaderReadString-8      144B ± 0%      144B ± 0%     ~     (all equal)

name                old allocs/op  new allocs/op  delta
ReaderReadString-8      1.00 ± 0%      1.00 ± 0%     ~     (all equal)

While preparing this benchmark result, I also realised strings.Reader should have the same Next() method.

@mhr3 mhr3 changed the title proposal: bytes: Reader should allow zero-copy working with the byte slice proposal: bytes: Reader should allow zero-copy operations with its byte slice Oct 15, 2023
@dsnet
Copy link
Member

dsnet commented Oct 16, 2023

The github.com/dsnet/compress module looks for types that implement Buffered, Peek, and Discard methods to perform zero-copy decompression from some input. This provided quite a substantial speed-up for decompression.

At present, the bufio.Reader and bytes.Buffer types provide some mechanism to at least implement zero-copy optimizations, but bytes.Reader provides no API to get at the underlying []byte. Adding it would make the API less safe, but not any less safe than bytes.Buffer and bufio.Reader. Also, it would make the bytes.Reader API inconsistent with strings.Reader.

That said, it'd be nice if there was a unified API for copy-less reads. Since Buffered, Peek, and Discard methods already exist, maybe we should rally around those?

@mhr3
Copy link
Author

mhr3 commented Oct 16, 2023

I think the Next() method suits better for this (given no IO needs to happen, so there's no potential for an error happening, plus the simplicity), but ultimately I don't have a strong opinion, as long as there's a way to do this - Bufferred() + Peek() + Discard() would do the trick too.

@rsc
Copy link
Contributor

rsc commented Oct 27, 2023

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc changed the title proposal: bytes: Reader should allow zero-copy operations with its byte slice proposal: bytes: add Reader methods Buffered, Peek, and Discard, matching bufio.Reader Nov 2, 2023
@rsc rsc changed the title proposal: bytes: add Reader methods Buffered, Peek, and Discard, matching bufio.Reader proposal: bytes: add Reader methods Buffered, Peek, and Discard Nov 2, 2023
@rsc
Copy link
Contributor

rsc commented Nov 2, 2023

@dsnet, is Buffered really necessary? Can we use just Peek and Discard?

@dsnet
Copy link
Member

dsnet commented Nov 2, 2023

Looking through my code, I couldn't convince myself Buffered was strictly necessary, but it's useful for performance since it tells us how much we can peek. Without Buffered, callers will just have to make a guess at how much is peekable, but then what's the right amount?

The Discard method is not strictly necessary if the implementation of Read when passed with a buffer retrieved from Peek is essentially a noop. This is a similar optimization technique to the AvailableBuffer+Write pattern.

Thus, there is more utility to Buffered then there is to Discard.

@dsnet
Copy link
Member

dsnet commented Nov 2, 2023

We could alter the behavior of bufio.Reader.Peek to allow -1 to mean "peek at everything that is already buffered and don't read any additional bytes; this never fails" and codify this behavior in the interface.

@dsnet
Copy link
Member

dsnet commented Nov 2, 2023

Here's a concrete proposal:

  • We define a new io.ReadPeeker interface type:
    type ReadPeeker interface {
        io.Reader
    
        // Peek returns the next n bytes without advancing the reader.
        // To advance the reader, call Read with the bytes retrieved from Peek.
        // The returned bytes is only valid until the next call
        // that changes the reader state, such as by calling a Read method.
        //
        // If Peek returns fewer than n bytes,
        // it also returns an error explaining why the read was short.
        // The error is ErrBufferFull if n is larger than b's buffer size.
        // If the provided n is negative, then it returns the entire contents of the buffer
        // without performing any read operations on any possible underlying Reader.
        // Peek never returns an error for negative counts.
        Peek(n int) ([]byte, error)
    }
  • We add io.ErrBufferFull which is copied over from today's bufio.ErrBufferFull.
  • We make bufio.ErrBufferFull identical to io.ErrBufferFull.
  • We modify bufio.Reader.Peek to support the new negative count semantics.
  • We add bytes.Buffer.Peek.
  • We add bytes.Reader.Peek.

I decided to put io.Reader in the io.ReadPeeker interface since the documentation of Peek references how it behaves relative to Read.

@mhr3
Copy link
Author

mhr3 commented Nov 2, 2023

I believe Peek should definitely be coupled with Discard (/Advance/Skip), special behaviour for passing a peeked buffer into Read sounds very strange. Would rather have the interface specify that a peeked buffer is only valid until next Read().

@josharian
Copy link
Contributor

Related: https://dave.cheney.net/paste/gophercon-sg-2023.html / https://github.com/pkg/json/blob/main/reader.go which draws on https://philpearl.github.io/post/reader/

I wonder whether, in addition to new bytes.Reader methods and an io interface, there should be something in package io that uses the optimized path if it exists and if not has a good fallback, maybe along the lines of Dave Cheney's byteReader.

@rsc
Copy link
Contributor

rsc commented Nov 7, 2023

It doesn't feel like we've found the right path forward yet. This all feels very kludgy.

@josharian
Copy link
Contributor

@rsc agreed re: the proposed solutions. But do you think that the problem at least is clear?

@mhr3
Copy link
Author

mhr3 commented Nov 8, 2023

As @dsnet pointed out, the two methods that are necessary for these use cases are:

  • knowing the length of the peekable buffer
  • consuming part (or all) of it

bufio.Reader satisfies these in the form of three methods: Buffered and Peek+Discard
bytes.Buffer allows this with Len and Next
pkg/json.byteReader (see above) can use len(window()) and release()
bytes.Reader already has Len, but is missing the consume part

Therefore, how about we scope this down to just adding bytes.Reader.Next(int) []byte. No new interfaces for peeked reads for now, just that one method, which aligns it closer with bytes.Buffer.

And one last point - strings.Reader also has Len, so why not add Next there too? The purpose here is to have zero-copy access to the slice - given that strings.Reader references a string and not a byte slice, for Next to be safe it would have to make a copy in the first place, that's why this proposal is NOT seeking changes to strings.Reader.

@dsnet
Copy link
Member

dsnet commented Nov 8, 2023

I don't think a next call is sufficient.

Some form of peek+discard pattern is needed by many applications beyond just optimization purposes.

For example, in compress/flate the implementation tries to only decompress exactly the DEFLATE stream and not one byte further. Similar type of problems exist for JSON as well, where the json.Decoder.Buffered API exist only exists because we may have read beyond the last JSON token. Ideally, we can avoid that API to begin with since we can discard exactly the right number of bytes.

@mhr3
Copy link
Author

mhr3 commented Nov 8, 2023

Well, bytes.Reader is also an io.Seeker.

@dsnet
Copy link
Member

dsnet commented Nov 8, 2023

An io.Seeker is unfortunately too powerful, which means it's not possible to wrap a number of readers and convert it efficiently into an io.Seeker. For example, you can't wrap a gzip.Reader and convert it into an io.Seeker since decompression moved in a forward direction unless you buffer everything read so far.

On the other hand, most implementation of an io.Reader already have some form of internal buffer on hand. The peek+discard pattern is exposing that fact.

@mhr3
Copy link
Author

mhr3 commented Nov 8, 2023

My point was that Len+Next+io.Seeker definitely satisfies what you'd need to do. I understand that you'd still want to have a proper interface for these PeekReaders, but a consensus on that wasn't reached, so I'm suggesting something simpler.

@dsnet
Copy link
Member

dsnet commented Nov 8, 2023

I think the discrepancy stems from the fact that I'm trying to tackle this problem for io.Readers in general, while this issue was originally narrowly scoped to just the bytes.Reader concrete type (which supports io.Seeker).

I'm arguing against bytes.Reader.Next because it's too limited in flexibility. Next becomes redundant if we provide some Peek+Discard-like interface. I'd rather see us come up with a generalized approach rather than ad-hoc solutions for each concrete type.

@rsc
Copy link
Contributor

rsc commented Dec 6, 2023

I agree we need an established convention here.
In #63548 (comment), @dsnet suggested adding only Peek, and have the Read implementation optimize away copying from a buffer to itself. That keeps the method count small, and if the Read implementation doesn't optimize away the copying, no real harm done. It's just an optimization. This seems reasonable to me.

Does anyone see any strong objections to ReadPeeker?

@dsnet
Copy link
Member

dsnet commented Dec 12, 2023

Should we rename this issue if we're thinking more broadly about a generic interface?

@rsc
Copy link
Contributor

rsc commented Dec 13, 2023

Looks like bufio.Reader.Read does not do the Peek optimization. That's fine - a future version can. Are there any Readers with the Peek optimization now?

@rsc rsc changed the title proposal: bytes: add Reader methods Buffered, Peek, and Discard proposal: io: add ReadPeeker and implement Peek in bytes.Reader Dec 13, 2023
@rsc
Copy link
Contributor

rsc commented Jan 10, 2024

This discussion has died out a bit over the holidays. Does ReadPeeker address the original need raised in the proposal? Is it good enough?

Also maybe it should be PeekReader.

@Merovius
Copy link
Contributor

TBH I'm not super happy with the idea of adding Peek to bytes.Reader. Currently, if I pass a bytes.Reader to a function, I can be somewhat confident that the underlying bytes don't get modified. That guarantee would be lost. And I don't really understand why use cases where the performance gains of Peek, where required, can't just be gained by passing a bytes.Buffer instead of a bytes.Reader. Having two separate types, one which is immutable and one which isn't, seems good enough?

@dsnet
Copy link
Member

dsnet commented Jan 10, 2024

IIUC, you're arguing against the addition of Peek on bytes.Reader, but not necessarily the interface method itself, right?

I also feel uncomfortable exposing a degree of type unsafety to bytes.Reader that didn't exist before [*] and makes the API asymmetrical with regard to strings.Reader. I wouldn't mind seeing Peek on bytes.Buffer since it already exposes the underlying bytes through various methods.

That said, I'd still like to see a common interface for copy-less and exact read operations. There is both a performance and semantic benefit to this API. A number of APIs (e.g., "compress/flate") assert for io.ByteReader to ensure it never reads past the end of a variable-length segment. Yet other APIs (e.g., "encoding/json") expose API to access the buffered data after the end of a JSON value. The later API is faster, but annoying to use when piecing readers together. The Peek method would make such usages both faster and cleaner.

[*] Technically, you can access the underlying the []byte by calling bytes.Reader.WriteTo with an io.Writer that leaks the provided buffer. You'd need to jump through a few hoops to do this though and it'd be an obvious code smell.

@Merovius
Copy link
Contributor

IIUC, you're arguing against the addition of Peek on bytes.Reader, but not necessarily the interface method itself, right?

Correct.

@Jorropo
Copy link
Member

Jorropo commented Jan 11, 2024

@Merovius we can already get a reference to the buffer of a bytes.Reader without using unsafe:

type badWriter struct{}

func (badWriter) Write(b []byte) (int, error) {
 return copy(b, "bonjour"), nil
}

func main() {
 b := []byte("hello !")
 r := bytes.NewReader(b)
 _, err := r.WriteTo(badWriter{})
 if err != nil { panic(err) }
 fmt.Println(string(b))
}

I'm not arguing this code is somehow valid, but that Peek doesn't do anything new.

@rsc
Copy link
Contributor

rsc commented Jan 18, 2024

Looking back at #63548 (comment), it is a bit annoying for clients that Peek can return ErrBufferFull or whatever for a "too big" n where "too big" is not defined. So code trying to use this seems like it has to always wrap Peek in a potential fall back to Read. That seems awkward. What if an implementation can only Peek 2 bytes? Or 1024 bytes but you need 1536?

If we had a way to get the maximum Peek value maybe that would be less awkward, but looking back at the original proposal with Buffered and Discard, in bufio.Reader, Buffered is not an upper bound on the amount you can Peek, because Peek will try to refill the buffer if it needs to. So Buffered wouldn't work as that method.

This all seems not quite there yet.

@rsc
Copy link
Contributor

rsc commented Feb 7, 2024

It seems like maybe we should decline this and wait for more inspiration, maybe try again in a year or so.

@dsnet
Copy link
Member

dsnet commented Feb 14, 2024

So code trying to use this seems like it has to always wrap Peek in a potential fall back to Read. That seems awkward.

Looking through my programs using a PeekReader-like API, this isn't a problem since the public API just takes in an io.Reader and then type-asserts to a PeekReader, which is either used for performance and/or guarantees on how far can be read. In either case, a fall back implementation was always present regardless.

If we had a way to get the maximum Peek value maybe that would be less awkward

I'm not sure what the utility of this API would be as an implementation usually doesn't care about what the maximum peek value may be, but just that it can peek enough bytes (perhaps everything available) and lets the underlying PeekReader deal with buffering.

Also, being overly concerned with the "maximum" Peek amount seems misguided as it somewhat implies that the maximum is static. Suppose there was an API to obtain the maximum, but the underlying PeekReader might reduce its buffer size dynamically due to memory pressure. How is that change signaled after the fact? Thus, I think the inability to get the maximum Peek amount is a feature, not a bug.

@rsc
Copy link
Contributor

rsc commented Feb 14, 2024

So code trying to use this seems like it has to always wrap Peek in a potential fall back to Read. That seems awkward.

This still seems awkward. I don't think we've found the right API yet.

@rsc
Copy link
Contributor

rsc commented Feb 14, 2024

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Mar 1, 2024

No change in consensus, so declined.
— rsc for the proposal review group

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Declined
Development

No branches or pull requests

8 participants