Backport of safe readInt from streaming-bytestring #309
Conversation
Force-pushed from 54df7a3 to 0130a39 (compare)
This version detects integer overflow and returns 'Nothing'. With lazy bytestrings it will not read new chunks after an "unreasonable" number (more than ~32kB) of leading zeros and will instead return 'Nothing'. Tests check that maxBound and minBound are handled correctly, possibly with explicit signs (for maxBound), and that overflowing by 1, by 10, or by many appended zeros always fails.
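A quick illustration of the behaviour described above, assuming a 64-bit `Int` and the safe `readInt` from this PR (the overflow comment reflects the PR's semantics, not older releases):

```haskell
import qualified Data.ByteString.Char8 as B8

main :: IO ()
main = do
  -- A valid number parses and the remainder of the input is returned.
  print $ B8.readInt (B8.pack "123abc")
  -- Just (123,"abc")

  -- One past maxBound: with this PR's safe readInt the overflow is
  -- detected and the result is Nothing instead of a wrapped-around value.
  print $ B8.readInt (B8.pack "9223372036854775808")
```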
Force-pushed from 0130a39 to b74fb60 (compare)
Data/ByteString/Lazy/Char8.hs
Outdated
```haskell
-- zero digits when trying to decode a number. When reading the first
-- non-zero digit would require requesting a new chunk and ~32KB of
-- leading zeros have already been read, the conversion is aborted and
-- 'Nothing' is returned.
```
This basically means that the result of `readInt` depends on the chunk structure of the string. This is undesirable from my perspective, because chunks are implicit and should be indistinguishable from the user's perspective.
But there's never any latent lazy IO in the middle of a chunk. The distinction only matters under extremely unusual circumstances: 32k or more of leading zeros read so far, and the current chunk (more than one may have been consumed) has ended. My take was that once a chunk is already in memory, most of the cost has been paid, and we may as well keep reading.

Granted, with `streaming-bytestring` the chunk structure and monadic effects are more explicit, and I could perhaps even tweak the policy to ignore chunk boundaries and only stop when a new monadic effect is required to fetch the next chunk.

Here the chunk structure is our only signal that more data may need to be brought into memory. Also, and perhaps more importantly, checking the zero-count limit at a smaller granularity can have a noticeable performance impact; chunk boundaries give us an opportunity to make the check infrequently.
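A simplified sketch of this per-chunk policy (not the PR's actual code; `skipZeros` and `zeroLimit` are illustrative names) might count zeros chunk by chunk and consult the limit only when crossing a chunk boundary:

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Char8 as S
import Data.ByteString.Lazy.Internal (ByteString (..))

-- Illustrative threshold; the PR uses roughly 32kB.
zeroLimit :: Int
zeroLimit = 32768

-- Skip leading zeros, giving up (Nothing) only when the running count
-- exceeds the limit at a chunk boundary; within a chunk we keep going,
-- since the data is already in memory.
skipZeros :: ByteString -> Maybe (Int, ByteString)
skipZeros = go 0
  where
    go !n Empty = Just (n, Empty)
    go !n (Chunk c cs)
      | n > zeroLimit    = Nothing              -- checked once per chunk
      | zs == S.length c = go (n + zs) cs       -- all zeros: next chunk
      | otherwise        = Just (n + zs, Chunk (S.drop zs c) cs)
      where
        zs = S.length (S.takeWhile (== '0') c)
```

The limit check runs once per chunk rather than once per byte, which is what keeps it out of the hot digit loop.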
I do not suggest checking the zero-count limit at a smaller granularity. I'd rather not check it at all. Otherwise a) the behaviour of `readInt` depends on the internal structure of chunks, and b) lazy `readInt` behaves differently from strict `readInt`.
I see, that's a tenable policy. Your take is that, unlike the `streaming-bytestring` variant, here you'd prefer to just keep reading unconditionally. If someone is unlucky enough to read an `Int` from an unbounded stream (a network socket) of zeros, then they should have taken steps to limit the amount of data that might be consumed...

So it boils down to whether protection against unbounded reads is part of the "safety" in this new safer `readInt`. I can live with removing the limit, though for users who do want to avoid spinning forever the work-around is a bit messy. One would have to use something like `readInt . take 32768`, but then under normal conditions, with the integer taking just a few bytes of the input, one is left with not the rest of the bytestring but the rest of the first 32k, which is not particularly convenient.

If the bound is wanted, doing it around `readInt` is messy and inefficient, so I think your suggestion amounts to not offering the (lazy I/O) bound. If that's acceptable (it is what you get with `readInteger` at the moment) then OK, provided others agree that's what the API should do.
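For reference, the messy work-around amounts to something like this (`boundedReadInt` is a hypothetical helper name, not part of the API):

```haskell
import qualified Data.ByteString.Lazy.Char8 as L8

-- Hypothetical helper: bound the input before parsing, so an unbounded
-- run of zeros cannot force unlimited data into memory. The catch is
-- that the returned remainder is the rest of the first 32k of input,
-- not the rest of the original stream.
boundedReadInt :: L8.ByteString -> Maybe (Int, L8.ByteString)
boundedReadInt = L8.readInt . L8.take 32768
```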
So long as "lazy" bytestrings are not actually of the lazy-IO variety, but are rather data already in memory, arranged in chunks with perhaps some deferred pure computations in the tail, one might argue that it is OK not to bound the `readInt` input buffer.

But even with pure lazy `ByteString`s pathology is still possible, and it is avoided by this PR:

```
λ> import qualified Data.ByteString.Lazy.Char8 as L8
λ> let x = L8.repeat '0'
λ> L8.readInt x
Nothing
λ> L8.readInteger x
Just [... time goes by ...]^CInterrupted.
```
Well, `length (L8.repeat '0')` diverges as well, but nobody sets an upper bound for it.

Yes, my take is that `streaming-bytestring` has a more explicit notion of chunks and effects, does not have a strict counterpart, and is not a core package, which allows it to be more "opinionated" in design decisions. But in `bytestring` I would prefer the lazy API to be invariant modulo internal representation and to mimic the strict API as closely as possible. @sjakobi @hsyl20 thoughts?
No feedback seems likely, unless more prodding will help... How shall we proceed?
Sorry for the delay! I'll try to make up my mind by Sunday.
Thanks for arguing your points so clearly, @Bodigrim and @vdukhovni, and apologies for the late response! :)
I agree with @Bodigrim's points:
> But in `bytestring` I would prefer the lazy API to be invariant modulo internal representation and to mimic the strict API as closely as possible.
Data/ByteString/Lazy/Char8.hs
Outdated
```haskell
-- in n' `seq` c' `seq` JustS n' c'

{-# INLINABLE readInt #-}
readInt = start
```
Could we possibly share more with the strict `readInt` implementation? Maybe move some functions to internal modules, or whatever.
It is possible to do that, but I am not convinced it is worthwhile. I see more value in keeping the entire implementation in one place. The common bits are, I think, not large enough to worry about repetition.

The one thing I'm inclined to share is the maxBound- and minBound-related constants. I can move those, if we agree on that. The accumulator functions are, I think, likely better optimised in place, and not worth exporting.
I moved the constants into `Data.ByteString.Internal`. If you feel really strongly about moving the `digits` loop, that can be done, but if it were only up to me, I would not.
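For the record, constants of this kind boil down to the quotient and remainder of the bounds by 10. A sketch of how a digit accumulator uses them to rule out overflow before it happens (names here are illustrative, not necessarily those used in `Data.ByteString.Internal`):

```haskell
-- Precomputed bounds for the positive-accumulation case.
intMaxQuot10, intMaxRem10 :: Int
intMaxQuot10 = maxBound `quot` 10
intMaxRem10  = maxBound `rem`  10

-- Accept the next decimal digit d only when acc * 10 + d cannot exceed
-- maxBound; otherwise report overflow with Nothing. Comparing against
-- the quotient/remainder avoids ever computing an overflowing product.
stepPositive :: Int -> Int -> Maybe Int
stepPositive acc d
  | acc < intMaxQuot10 || (acc == intMaxQuot10 && d <= intMaxRem10)
      = Just (acc * 10 + d)
  | otherwise = Nothing
```

A symmetric pair of constants handles the minBound (negative) case, whose magnitude is one larger than maxBound's.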
Force-pushed from 53370e1 to ec1393c (compare)
I'll squash the fixups once we agree that all's well. [ Or whoever decides to "merge" can do it. By the way, please, please avoid "merge" commits; rebase and fast-forward only. ]
Force-pushed from 6c26a95 to 693012b (compare)
Should I squash now, or wait for reviews to complete?
I'll squash it myself, when ready for merge.
I've been trying to get a coverage report that's not empty, so far with no luck... All the produced HTML files are identical stubs with no data for any functions at all. [ Edit: scratch that, finally made progress. ]
FWIW, to get non-empty coverage reports, I had to copy the "library" section from "bytestring.cabal" into "bytestring-test", remove all the ".." paths, and just symlink [...]. With that, I finally got a non-empty coverage report. It looks like coverage reports cover the "library" dependency modules of the test executables, so without a library dependency all the reports are empty. If we want to do this regularly, it'll take some more delicate surgery of the [...]
I hope that with #316 we'll be able to abandon [...]
Force-pushed from 8edf683 to edf517f (compare)
LGTM and thanks, @vdukhovni! :)
@vdukhovni https://travis-ci.com/github/haskell/bytestring/jobs/427223096#L298
I pushed a fix for that. Perhaps you're looking at an outdated snapshot?
I re-ran the Travis build; it still fails.
Force-pushed from edf517f to 22ca0ec (compare)
Sorry about that. I failed to actually commit the fixes. Should be good now.
Arrrrgh, I'm terribly sorry, I merged with rebase instead of squash. My bad. Created a branch for the 0.11 series: https://github.com/haskell/bytestring/tree/bytestring-0.11
The commits did not get squashed... Oh well. It does not matter, I guess.
Thanks for getting this merged. |