Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of safe readInt from streaming-bytestring #309

Merged
merged 4 commits into from Nov 7, 2020

Conversation

vdukhovni
Copy link
Contributor

@vdukhovni vdukhovni commented Oct 21, 2020

This version detects integer overflow and returns 'Nothing'.

With lazy bytestrings it will not read new chunks after an "unreasonable" number
(more than ~32kB) of leading zeros and will instead return 'Nothing'.

Tests check that maxBound and minBound are handled correctly possibly with
explicit signs (for maxBound), and that overflowing by 1, 10 and many
appended zeros all fail.

This version detects integer overflow and returns 'Nothing'.

With lazy bytestrings it will not new chunks after an "unreasonable" number
(more than ~32kB) of leading zeros and will instead return 'Nothing'.

Tests check that maxBound and minBound are handled correctly possibly with
explicit signs (for maxBound), and that overflowing by 1, 10 and many
appended zeros all fail.
@Bodigrim Bodigrim linked an issue Oct 21, 2020 that may be closed by this pull request
Data/ByteString/Char8.hs Outdated Show resolved Hide resolved
Data/ByteString/Char8.hs Outdated Show resolved Hide resolved
-- zero digits when trying to decode a number. When reading the first
-- non-zero digit would require requesting a new chunk and ~32KB of
-- leading zeros have already been read, the conversion is aborted and
-- 'Nothing' is returned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This basically means that the result of readInt depends on the chunk structure of the string. This is undesirable from my perspective, because chunks are implicit and should be indistinguishible from user perspective.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there's never any latent lazy IO in the middle of a chunk. The distinction is only important under extremely unusual circumstances, 32k or more of leading zeros read so far, and the current chunk (more than one might have been consumed) has ended. My take was that once the chunk is already in memory, most of the cost has been paid, and we may as well keep reading.

Granted, with streaming-bytestring the chunk structure and monadic effects are more explicit, and I could perhaps even tweak the policy to ignore chunk boundaries, and only stop when a new monadic effect is required to fetch the next chunk.

Here the chunk structure is our only signal that more data may need to be brought into memory. Also, and perhaps more importantly, checking the zero count limit at a smaller granularity can have a noticeable performance impact. Chunk boundaries give us an opportunity to make the check infrequently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not suggest to check zero count limit at a smaller granularity. I'd rather do not check it at all. Otherwise a) the behaviour of readInt depends on the internal structure of chunks, b) lazy readInt behaves differently from strict readInt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that's a tenable policy. Your take is that unlike the streaming-bytestring variant, here you'd prefer to just keep reading unconditionally. If someone is unlucky enough to read an Int from an unbounded stream (network socket) of zeros, then they should have taken steps to limit the amount of data that might be consumed...

So it boils down to whether protection against unbounded reads is part of the "safety" in this new safer readInt. I can live with removing the limit, though for users who do want to avoid spinning forever the work-around is a bit messy.

One would have to use something like readInt . take 32768, but then under normal conditions with the integer taking just a few bytes of the input one is left with not the rest of the bytestring, but the rest of the first 32k, which is not particularly convenient.

If the bound is wanted, doing it around readInt is messy and inefficient, so I think your suggestion amounts to not offering the (lazy I/O) bound. If that's acceptable (it is what you get with readInteger at the moment) then OK, if others agree that's what the API should do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So long as "lazy" bytestrings are not actually of the lazy-io variety, but are rather data already in memory, arranged in chunks with perhaps some deferred pure computations in the tail, one might argue that it is OK to not bound the readInt input buffer.

But even with pure lazy ByteStrings pathology is still possible, and avoided by this PR:

λ> import qualified Data.ByteString.Lazy.Char8 as L8
λ> let x = L8.repeat '0'
λ> L8.readInt x
Nothing
λ> L8.readInteger x
Just [... time goes by ...]^CInterrupted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, length (L8.repeat '0') diverges as well, but nobody sets an upper bound for it.

Yes, my take is that streaming-bytestring has more explicit notion of chunks and effects, does not have a strict counterpart and is not a core package, which allows to be more "opinionated" in design decisions. But in bytestring I would prefer lazy API be invariant modulo internal representation and mimic strict API as close as possible. @sjakobi @hsyl20 thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No feedback seems likely, unless more prodding will help... How shall we proceed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay! I'll try to make up my mind until Sunday.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for arguing your points so clearly, @Bodigrim and @vdukhovni, and apologies for the late response! :)

I agree with @Bodigrim's points:

But in bytestring I would prefer lazy API be invariant modulo internal representation and mimic strict API as close as possible.

-- in n' `seq` c' `seq` JustS n' c'

{-# INLINABLE readInt #-}
readInt = start
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we possibly share more with strict readInt implementation? Maybe move some functions to internal modules or whatever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to do that, but I am not convinced it is worthwhile. I see more value in keeping the entire implementation in one place. The common bits are I think not large enough to worry about repetition.

The one thing I'm inclined to share is the maxBound and minBound-related constants. I can move those, if we agree on that. The accumulator functions I think are likely better optimised in place, and not worth exporting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the constants into Data.ByteString.Internal. If you feel really strongly about moving the digits loop, that can be done, but if it were only up to me, I would not.

@vdukhovni
Copy link
Contributor Author

vdukhovni commented Oct 29, 2020

I'll squash the fixups once we agree that all's well.

[ Or whoever decides to "merge" can do it. By the way, please, please avoid "merge" commits, rebase and fast-forward only. ]

@vdukhovni
Copy link
Contributor Author

Should I squash now, or wait for reviews to complete?

@Bodigrim
Copy link
Contributor

Bodigrim commented Nov 2, 2020

I'll squash it myself, when ready for merge.
Do you have a coverage report?

@vdukhovni
Copy link
Contributor Author

vdukhovni commented Nov 2, 2020

I'll squash it myself, when ready for merge.
Do you have a coverage report?

I've been trying to get a coverage report that's not empty, so far with no luck... All the produced HTML files are identical stubs with no data for any functions at all.

[ Edit: Scratch that, finally made progress
Other than not explicitly testing empty initial chunks in the lazy ByteString case, coverage looks to be essentially
100% for both readInt variants. ]

@Bodigrim Bodigrim added this to the 0.12.0.0 milestone Nov 2, 2020
@Bodigrim Bodigrim requested a review from sjakobi November 2, 2020 22:42
@vdukhovni
Copy link
Contributor Author

FWIW, to get non-empty coverage reports, I had to copy the "library" section from "bytestring.cabal" into "byteesting-test", removed all the ".." paths and just symlinked Data, cbits, include and LICENCE from the parent directory into tests, and then added "bytestring-test" as an explicit dependency of "prop-compiled".

With that, I finally got a non-empty coverage report. It looks like, coverage reports cover the "library" dependency modules of the test executables, so without a library dependency, all the reports are empty.

If we want to do this regularly, it'll take some more delicate surgery of the bytestring-test cabal file. If we had symlinks on Windows, it would be easier to make this portable. Without that it is a bit of a pain, we basically need a package with the library and tests that does not call itself bytestring to avoid all the dependency conflicts. So the library sources need to appear to be in both places. Perhaps with ".." paths I can work the library into the test suite as a permanent feature...

@Bodigrim
Copy link
Contributor

Bodigrim commented Nov 2, 2020

If we want to do this regularly, it'll take some more delicate surgery of the bytestring-test cabal file.

I hope that with #316 we'll be able to abandon bytestring-test package for good.

Data/ByteString/Char8.hs Outdated Show resolved Hide resolved
Data/ByteString/Lazy/Char8.hs Outdated Show resolved Hide resolved
Data/ByteString/Char8.hs Show resolved Hide resolved
Data/ByteString/Char8.hs Outdated Show resolved Hide resolved
@vdukhovni vdukhovni force-pushed the safe-readInt branch 2 times, most recently from 8edf683 to edf517f Compare November 3, 2020 20:52
Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and thanks, @vdukhovni! :)

@Bodigrim
Copy link
Contributor

Bodigrim commented Nov 4, 2020

@vdukhovni https://travis-ci.com/github/haskell/bytestring/jobs/427223096#L298

Data/ByteString/Char8.hs:1047:36: Not in scope: ‘<$>’

@vdukhovni
Copy link
Contributor Author

@vdukhovni https://travis-ci.com/github/haskell/bytestring/jobs/427223096#L298

Data/ByteString/Char8.hs:1047:36: Not in scope: ‘<$>’

I pushed a fix for that. Perhaps you're looking at an outdated snapshot?

@Bodigrim
Copy link
Contributor

Bodigrim commented Nov 4, 2020

I re-run Travis build, it still fails.

@vdukhovni
Copy link
Contributor Author

I re-run Travis build, it still fails.

Sorry about that. I failed to actually commit the fixes. Should be good now.

@Bodigrim Bodigrim merged commit b274158 into haskell:master Nov 7, 2020
@Bodigrim
Copy link
Contributor

Bodigrim commented Nov 7, 2020

Arrrrgh, I'm terribly sorry, merged with rebase instead of squash. My bad.

Created a branch for 0.11 series: https://github.com/haskell/bytestring/tree/bytestring-0.11

@vdukhovni vdukhovni deleted the safe-readInt branch November 7, 2020 18:19
@vdukhovni
Copy link
Contributor Author

The commits did not get squashed... Oh well. It does not matter I guess...

@vdukhovni
Copy link
Contributor Author

Thanks for getting this merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

readInt does not check for overflow.
4 participants