Backport of safe readInt from streaming-bytestring #309

vdukhovni · 2020-10-21T07:27:45Z

This version detects integer overflow and returns 'Nothing'.

With lazy bytestrings it will not read new chunks after an "unreasonable" number
(more than ~32kB) of leading zeros and will instead return 'Nothing'.

Tests check that maxBound and minBound are handled correctly possibly with
explicit signs (for maxBound), and that overflowing by 1, 10 and many
appended zeros all fail.

This version detects integer overflow and returns 'Nothing'. With lazy bytestrings it will not new chunks after an "unreasonable" number (more than ~32kB) of leading zeros and will instead return 'Nothing'. Tests check that maxBound and minBound are handled correctly possibly with explicit signs (for maxBound), and that overflowing by 1, 10 and many appended zeros all fail.

Data/ByteString/Char8.hs

Bodigrim · 2020-10-25T12:54:25Z

Data/ByteString/Lazy/Char8.hs

+-- zero digits when trying to decode a number.  When reading the first
+-- non-zero digit would require requesting a new chunk and ~32KB of
+-- leading zeros have already been read, the conversion is aborted and
+-- 'Nothing' is returned.


This basically means that the result of readInt depends on the chunk structure of the string. This is undesirable from my perspective, because chunks are implicit and should be indistinguishible from user perspective.

But there's never any latent lazy IO in the middle of a chunk. The distinction is only important under extremely unusual circumstances, 32k or more of leading zeros read so far, and the current chunk (more than one might have been consumed) has ended. My take was that once the chunk is already in memory, most of the cost has been paid, and we may as well keep reading.

Granted, with streaming-bytestring the chunk structure and monadic effects are more explicit, and I could perhaps even tweak the policy to ignore chunk boundaries, and only stop when a new monadic effect is required to fetch the next chunk.

Here the chunk structure is our only signal that more data may need to be brought into memory. Also, and perhaps more importantly, checking the zero count limit at a smaller granularity can have a noticeable performance impact. Chunk boundaries give us an opportunity to make the check infrequently.

I do not suggest to check zero count limit at a smaller granularity. I'd rather do not check it at all. Otherwise a) the behaviour of readInt depends on the internal structure of chunks, b) lazy readInt behaves differently from strict readInt.

I see, that's a tenable policy. Your take is that unlike the streaming-bytestring variant, here you'd prefer to just keep reading unconditionally. If someone is unlucky enough to read an Int from an unbounded stream (network socket) of zeros, then they should have taken steps to limit the amount of data that might be consumed...

So it boils down to whether protection against unbounded reads is part of the "safety" in this new safer readInt. I can live with removing the limit, though for users who do want to avoid spinning forever the work-around is a bit messy.

One would have to use something like readInt . take 32768, but then under normal conditions with the integer taking just a few bytes of the input one is left with not the rest of the bytestring, but the rest of the first 32k, which is not particularly convenient.

If the bound is wanted, doing it around readInt is messy and inefficient, so I think your suggestion amounts to not offering the (lazy I/O) bound. If that's acceptable (it is what you get with readInteger at the moment) then OK, if others agree that's what the API should do.

So long as "lazy" bytestrings are not actually of the lazy-io variety, but are rather data already in memory, arranged in chunks with perhaps some deferred pure computations in the tail, one might argue that it is OK to not bound the readInt input buffer.

But even with pure lazy ByteStrings pathology is still possible, and avoided by this PR:

λ> import qualified Data.ByteString.Lazy.Char8 as L8 λ> let x = L8.repeat '0' λ> L8.readInt x Nothing λ> L8.readInteger x Just [... time goes by ...]^CInterrupted.

Well, length (L8.repeat '0') diverges as well, but nobody sets an upper bound for it.

Yes, my take is that streaming-bytestring has more explicit notion of chunks and effects, does not have a strict counterpart and is not a core package, which allows to be more "opinionated" in design decisions. But in bytestring I would prefer lazy API be invariant modulo internal representation and mimic strict API as close as possible. @sjakobi @hsyl20 thoughts?

No feedback seems likely, unless more prodding will help... How shall we proceed?

Sorry for the delay! I'll try to make up my mind until Sunday.

Thanks for arguing your points so clearly, @Bodigrim and @vdukhovni, and apologies for the late response! :)

I agree with @Bodigrim's points:

But in bytestring I would prefer lazy API be invariant modulo internal representation and mimic strict API as close as possible.

Bodigrim · 2020-10-25T12:55:37Z

Data/ByteString/Lazy/Char8.hs

-         --                  in n' `seq` c' `seq` JustS n' c'
-
+{-# INLINABLE readInt #-}
+readInt = start


Could we possibly share more with strict readInt implementation? Maybe move some functions to internal modules or whatever.

It is possible to do that, but I am not convinced it is worthwhile. I see more value in keeping the entire implementation in one place. The common bits are I think not large enough to worry about repetition.

The one thing I'm inclined to share is the maxBound and minBound-related constants. I can move those, if we agree on that. The accumulator functions I think are likely better optimised in place, and not worth exporting.

I moved the constants into Data.ByteString.Internal. If you feel really strongly about moving the digits loop, that can be done, but if it were only up to me, I would not.

vdukhovni · 2020-10-29T22:14:10Z

I'll squash the fixups once we agree that all's well.

[ Or whoever decides to "merge" can do it. By the way, please, please avoid "merge" commits, rebase and fast-forward only. ]

vdukhovni · 2020-11-02T00:22:14Z

Should I squash now, or wait for reviews to complete?

Bodigrim · 2020-11-02T18:16:20Z

I'll squash it myself, when ready for merge.
Do you have a coverage report?

vdukhovni · 2020-11-02T22:16:42Z

I'll squash it myself, when ready for merge.
Do you have a coverage report?

I've been trying to get a coverage report that's not empty, so far with no luck... All the produced HTML files are identical stubs with no data for any functions at all.

[ Edit: Scratch that, finally made progress
Other than not explicitly testing empty initial chunks in the lazy ByteString case, coverage looks to be essentially
100% for both readInt variants. ]

vdukhovni · 2020-11-02T23:04:05Z

FWIW, to get non-empty coverage reports, I had to copy the "library" section from "bytestring.cabal" into "byteesting-test", removed all the ".." paths and just symlinked Data, cbits, include and LICENCE from the parent directory into tests, and then added "bytestring-test" as an explicit dependency of "prop-compiled".

With that, I finally got a non-empty coverage report. It looks like, coverage reports cover the "library" dependency modules of the test executables, so without a library dependency, all the reports are empty.

If we want to do this regularly, it'll take some more delicate surgery of the bytestring-test cabal file. If we had symlinks on Windows, it would be easier to make this portable. Without that it is a bit of a pain, we basically need a package with the library and tests that does not call itself bytestring to avoid all the dependency conflicts. So the library sources need to appear to be in both places. Perhaps with ".." paths I can work the library into the test suite as a permanent feature...

Bodigrim · 2020-11-02T23:12:56Z

If we want to do this regularly, it'll take some more delicate surgery of the bytestring-test cabal file.

I hope that with #316 we'll be able to abandon bytestring-test package for good.

Data/ByteString/Char8.hs

Data/ByteString/Lazy/Char8.hs

Data/ByteString/Char8.hs

sjakobi

LGTM and thanks, @vdukhovni! :)

Bodigrim · 2020-11-04T18:10:25Z

@vdukhovni https://travis-ci.com/github/haskell/bytestring/jobs/427223096#L298

Data/ByteString/Char8.hs:1047:36: Not in scope: ‘<$>’

vdukhovni · 2020-11-04T18:11:42Z

@vdukhovni https://travis-ci.com/github/haskell/bytestring/jobs/427223096#L298

Data/ByteString/Char8.hs:1047:36: Not in scope: ‘<$>’

I pushed a fix for that. Perhaps you're looking at an outdated snapshot?

Bodigrim · 2020-11-04T18:19:25Z

I re-run Travis build, it still fails.

vdukhovni · 2020-11-04T19:23:29Z

I re-run Travis build, it still fails.

Sorry about that. I failed to actually commit the fixes. Should be good now.

Bodigrim · 2020-11-07T11:10:25Z

Arrrrgh, I'm terribly sorry, merged with rebase instead of squash. My bad.

Created a branch for 0.11 series: https://github.com/haskell/bytestring/tree/bytestring-0.11

vdukhovni · 2020-11-07T18:34:01Z

The commits did not get squashed... Oh well. It does not matter I guess...

vdukhovni · 2020-11-07T19:24:17Z

Thanks for getting this merged.

vdukhovni mentioned this pull request Oct 21, 2020

readInt does not check for overflow. #144

Closed

vdukhovni force-pushed the safe-readInt branch from 54df7a3 to 0130a39 Compare October 21, 2020 14:23

vdukhovni force-pushed the safe-readInt branch from 0130a39 to b74fb60 Compare October 21, 2020 15:47

Bodigrim linked an issue Oct 21, 2020 that may be closed by this pull request

readInt does not check for overflow. #144

Closed

Bodigrim reviewed Oct 25, 2020

View reviewed changes

fixup! Backport of safe readInt from streaming-bytestring

ec1393c

vdukhovni force-pushed the safe-readInt branch from 53370e1 to ec1393c Compare October 25, 2020 18:19

fixup! fixup! Backport of safe readInt from streaming-bytestring

693012b

vdukhovni force-pushed the safe-readInt branch from 6c26a95 to 693012b Compare October 29, 2020 22:49

Bodigrim approved these changes Nov 2, 2020

View reviewed changes

Bodigrim added this to the 0.12.0.0 milestone Nov 2, 2020

Bodigrim requested a review from sjakobi November 2, 2020 22:42

sjakobi reviewed Nov 3, 2020

View reviewed changes

Data/ByteString/Char8.hs Outdated Show resolved Hide resolved

Data/ByteString/Lazy/Char8.hs Outdated Show resolved Hide resolved

Data/ByteString/Char8.hs Show resolved Hide resolved

Data/ByteString/Char8.hs Outdated Show resolved Hide resolved

vdukhovni force-pushed the safe-readInt branch 2 times, most recently from 8edf683 to edf517f Compare November 3, 2020 20:52

sjakobi approved these changes Nov 4, 2020

View reviewed changes

fixup! fixup! fixup! Backport of safe readInt from streaming-bytestring

22ca0ec

vdukhovni force-pushed the safe-readInt branch from edf517f to 22ca0ec Compare November 4, 2020 19:23

Bodigrim merged commit b274158 into haskell:master Nov 7, 2020

vdukhovni deleted the safe-readInt branch November 7, 2020 18:19

Bodigrim mentioned this pull request Dec 18, 2021

Implement readNatural plus readInt and readWord for 8, 16, 32, 64 bit and native machine bit sizes #438

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backport of safe readInt from streaming-bytestring #309

Backport of safe readInt from streaming-bytestring #309

vdukhovni commented Oct 21, 2020 •

edited

Bodigrim Oct 25, 2020

vdukhovni Oct 25, 2020

Bodigrim Oct 25, 2020

vdukhovni Oct 25, 2020

vdukhovni Oct 25, 2020

Bodigrim Oct 25, 2020

vdukhovni Oct 28, 2020

sjakobi Oct 28, 2020

sjakobi Oct 29, 2020

Bodigrim Oct 25, 2020

vdukhovni Oct 25, 2020

vdukhovni Oct 25, 2020

vdukhovni commented Oct 29, 2020 •

edited

vdukhovni commented Nov 2, 2020

Bodigrim commented Nov 2, 2020

vdukhovni commented Nov 2, 2020 •

edited

vdukhovni commented Nov 2, 2020

Bodigrim commented Nov 2, 2020

sjakobi left a comment

Bodigrim commented Nov 4, 2020

vdukhovni commented Nov 4, 2020

Bodigrim commented Nov 4, 2020

vdukhovni commented Nov 4, 2020

Bodigrim commented Nov 7, 2020 •

edited

vdukhovni commented Nov 7, 2020

vdukhovni commented Nov 7, 2020

Backport of safe readInt from streaming-bytestring #309

Backport of safe readInt from streaming-bytestring #309

Conversation

vdukhovni commented Oct 21, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vdukhovni commented Oct 29, 2020 • edited

vdukhovni commented Nov 2, 2020

Bodigrim commented Nov 2, 2020

vdukhovni commented Nov 2, 2020 • edited

vdukhovni commented Nov 2, 2020

Bodigrim commented Nov 2, 2020

sjakobi left a comment

Choose a reason for hiding this comment

Bodigrim commented Nov 4, 2020

vdukhovni commented Nov 4, 2020

Bodigrim commented Nov 4, 2020

vdukhovni commented Nov 4, 2020

Bodigrim commented Nov 7, 2020 • edited

vdukhovni commented Nov 7, 2020

vdukhovni commented Nov 7, 2020

vdukhovni commented Oct 21, 2020 •

edited

vdukhovni commented Oct 29, 2020 •

edited

vdukhovni commented Nov 2, 2020 •

edited

Bodigrim commented Nov 7, 2020 •

edited