readInt does not check for overflow. #144

SeanRBurton · 2017-12-09T19:25:27Z

readInt will happily consume as many digits as it can, without checking if the result is actually representable as an int. This seems like a straightforward correctness bug, and there are a few possible alternatives here:

Truncate the result if the next digit would cause an overflow. This does not seem very intuitive.
Return Nothing on overflow. This seems reasonable, but it is not very informative.
Call error. The most informative option, but this does not really feel like an exceptional condition, and this choice might break existing code.

We could also provide an alternative function which returns a structured result.

Also, I think readInt and readInteger are significantly more complicated than they need to be, and readInteger should probably work in int-sized chunks for performance.

The text was updated successfully, but these errors were encountered:

nh2 · 2018-11-28T20:39:37Z

I also ran into this, this is problematic.

My opinion is that we should keep the current behaviour, document it, and add a new function that returns Either.

nh2 · 2018-11-28T20:57:42Z

I'm using this as a temporary, probably slower workaround now:

-- | Parses a `ByteString` into an Integral.
--
-- Uses "Data.ByteString.Char8" so it cannot handle multi-byte encodings.
--
-- As opposed to `BS8.readIntegral`, it has overflow detection.
-- See https://github.com/haskell/bytestring/issues/144
readIntegral :: forall a . (Integral a, Bounded a) => ByteString -> Either String a
readIntegral str = case BS8.readInteger str of
  Nothing -> Left "unexpected non-integral"
  Just (n, rest)
    | not (BS8.null rest) -> Left "trailing text behind integral"
    | n > fromIntegral (maxBound :: a) -> Left "integral out of range (too large)"
    | n < fromIntegral (minBound :: a) -> Left "integral out of range (too small)"
    | otherwise -> Right (fromIntegral n)

I found this still faster than going via text's decimal decoder for my use case.

Bodigrim · 2020-09-26T11:20:28Z

I do not think it is a bug, sounds more like an expected behaviour. For example, Prelude.read does exactly the same:

> read "123456789012345678901234567890" :: Int
-4362896299872285998

One could use readInteger and convert it to Int using toIntegralSized, so the case for a new "safe" function does not look convincing to me.

@vdukhovni what do you think? Cf. haskell-streaming/streaming-bytestring#29

qrilka · 2020-09-26T15:02:40Z

I would argue that read is also broken

Bodigrim · 2020-09-26T15:25:47Z

Haskell Report requires Numeric.readInt to be polymorphic in its return type, so it cannot really detect an overflow.

vdukhovni · 2020-09-26T23:01:02Z

I do not think it is a bug, sounds more like an expected behaviour. For example, [...]
One could use readInteger and convert it to Int [...], so the case for a new "safe" function does not look convincing to me.

@vdukhovni what do you think? Cf. haskell-streaming/streaming-bytestring#29

As you correctly pointed out, readInt, like many other similar functions, silently overflows when the input contains unexpectedly long integral sequences. This is documented behaviour, so there's no bug as such.

While a safer readInt could indeed return Nothing when the computed value overflows, this would impose a noticeable performance penalty, and some subtle issues.

We have readInt implementations for both Data.ByteString.Char8 and Data.ByteString.Lazy.Char8
Lazy ByteStrings support lazy IO (a separate, recently somewhat contentious topic...)
Reading the value as an Integer and only testing for overflow after the last digit, would introduce a new memory exhaustion exposure to readers of lazy ByteStrings who are just trying to read an Int from a lazy-io stream.
One might be tempted to stop reading the stream early, at the first digit that causes the conversion to overflow, but this introduces considerable subtleties
- What is the appropriate return value on overflow?
- Is (Nothing, original input) really the right answer? (With lazy-io, and reading into an Integer first, we'd have to "rewind" the stream all the way to the start, which is a second major space leak.)
- Space leaks aside, should the return value allow one to distinguish between "not an integer" and "integer overflow"?
- Checking for overflow digit-by-digit (which avoids space leaks) would make the conversion much slower, so it would probably make sense to check only after each run of digits that can be converted with no overflow (either 9 or 18 depending on whether Int values are 32-bit or 64-bit) and with the overflow check done only when combining groups.
- One would need to be careful with minBound, which has no positive representation.

So doing this correctly, with consistent behaviour between the strict and lazy interfaces is non-trivial, especially if one also wants to retain most of the performance of the code path that does not check for overflow (in the reasonable expectation that in many cases the data is known to be free of problem inputs).

vdukhovni · 2020-09-27T02:48:03Z

@vdukhovni what do you think? Cf. haskell-streaming/streaming-bytestring#29

Returning to your question after handling an edge case in that PR that this helped me realise (thanks!), I should perhaps mention that the original bytestring-streaming readInt indeed never overflowed, but at the cost of failing to read some valid integers! I don't think that's at all better.

However, since it never did overflow, there's an opportunity here, to keep that behaviour and return Nothing on overflow, ideally without sacrificing too much performance. (The new PR29 performance is roughly 5x the original in my tests, I'd rather not give that up). So perhaps as a potential reviewer of PR29 in streaming-bytestring we can discuss this topic also there, where there's no prior expectation of overflow in results. That could also remove some of the considerations around unbounded streams of digits documented in that PR...

vdukhovni · 2020-09-30T07:22:24Z

I have opened haskell-streaming/streaming-bytestring#31 which implements overflow checks (for decimal Int).

vdukhovni · 2020-10-01T23:56:11Z

I do not think it is a bug, sounds more like an expected behaviour. For example, Prelude.read does exactly the same:
> read "123456789012345678901234567890" :: Int
-4362896299872285998
One could use readInteger and convert it to Int using toIntegralSized, so the case for a new "safe" function does not look convincing to me.

@vdukhovni what do you think? Cf. haskell-streaming/streaming-bytestring#29

Now that haskell-streaming/streaming-bytestring#31 is about to be merged, my take is that yes, Data.ByteString.Char8.readInt and Data.ByteString.Lazy.Char8.readInt could detect overflow at performance no worse than the original, provided the lazy version has protection to avoid reading unbounded runs of leading zeros. (basically a port of the streaming code, that should quite similar). The only difference between streaming and lazy, is that lazy is just the Chunk case of streaming and there's no residual value at the end, but Chunks can be implicitly monadic behind one's back (lazy io). So the port would mostly be a matter of deleting some code...

Bodigrim · 2020-10-03T16:51:38Z

I wonder whether replacing existing readInt with a safer version requires a major version bump.

vdukhovni · 2020-10-03T21:52:39Z

I wonder whether replacing existing readInt with a safer version requires a major version bump.

Is a Nothing answer any worse than a random integral value (well, congruent to the real value mod 2^n for n ∈ {32,64})?
The cautious approach might be to require a major version bump. It is not clear who might be affected.

My best guess is that the main users affected would be users running CI-tests on edge-cases, who artificially create unexpected inputs, and see how the code behaves.
The only other use-case that comes to mind would be "input validation" where after checking that the input is all decimal digits with some regexp, the code uses readInt and does not expect Nothing.

No idea whether either of these is an issue for anyone.

[ Has 0.11 been released yet? If not, perhaps do this now, for 0.11? ]

Bodigrim · 2020-10-03T22:07:33Z

I don't think any of the issues you mentioned are of practical concern. But, strictly speaking, this is a change of the documented behaviour (not UB and not a bug). @sjakobi @hsyl20 what do you think?

[0.11 has been already released. But I do not think that we must wait another decade for 0.12]

sjakobi · 2020-10-05T08:20:41Z

I wonder whether replacing existing readInt with a safer version requires a major version bump.

But, strictly speaking, this is a change of the documented behaviour (not UB and not a bug). @sjakobi @hsyl20 what do you think?

Yeah, such a change would require a major version bump.

Bodigrim · 2020-10-20T21:38:41Z

Following our discussion about branching in #241, I am +1 for porting haskell-streaming/streaming-bytestring#31 to bytestring.

vdukhovni · 2020-10-20T21:54:31Z

Following our discussion about branching in #241, I am +1 for porting haskell-streaming/streaming-bytestring#31 to bytestring.

Should I wait for more +1 indications? Also, I don't mind if someone else volunteers to do the port. Perhaps more eyes on the code will help to gain confidence in its correctness (assuming no bugs are found). But, if the port is up to me, I'll do it...

sjakobi · 2020-10-20T22:20:25Z

Following our discussion about branching in #241, I am +1 for porting haskell-streaming/streaming-bytestring#31 to bytestring.

👍 from me too.

vdukhovni · 2020-10-21T03:38:08Z

I have a question about the desired behaviour of readInt with lazy ByteStrings. In the streaming module, the function fails once ~32 kB of consecutive leading zeros (perhaps after a [+-] sign) have been read, and a new chunk needs to be accessed to perhaps finally find the start of the number.

Should the same sort of cut-off be implemented for lazy ByteString? The idea is the same, don't spin forever reading ridiculous quantities from a stream (that even be unbounded), but it is not as obviously applicable to lazy bytestrings where I/O is either absent entirely (just a "rope" in memory), or implicit (interleaved lazy I/O).

What should the port do?

[ Note: "fails", means returns (Nothing, original input) ]

vdukhovni · 2020-10-21T07:28:59Z

I've opened #309. Pending feedback it keeps the "reasonable" bound on the number of leading zeros. This bound can be removed if it is deemed not useful for lazy ByteStrings.

vdukhovni · 2020-10-21T08:22:08Z

One more thing to think about is whether there should be a variant for reading unsigned Word without overflow. Previously you could write out a word, and read it back with readInt, but now if the word is larger than maxBound :: Int the the conversion will fail, even though a conversion back from Int to Word would have been lossless...

Bodigrim · 2020-10-21T17:37:30Z

Should the same sort of cut-off be implemented for lazy ByteString?

I do not have a strong opinion here. On one side, this is an additional divergence from the current behaviour. On the other side, I struggle to imagine a legitimate use case for 32k-long sequence of zeros representing Int (and not Integer).

One more thing to think about is whether there should be a variant for reading unsigned Word without overflow.

I often work with Word, readWord routine would be useful for me.

vdukhovni · 2020-10-21T17:59:31Z

Should the same sort of cut-off be implemented for lazy ByteString?

I do not have a strong opinion here. On one side, this is an additional divergence from the current behaviour. On the other side, I struggle to imagine a legitimate use case for 32k-long sequence of zeros representing Int (and not Integer).

And even for Integer, why prefix 32k or more unnecessary leading 0's?

One more thing to think about is whether there should be a variant for reading unsigned Word without overflow.

I often work with Word, readWord routine would be useful for me.

That can be done, but my point is perhaps also that there may be existing users who are currently using readInt to read Word and then just fromIntegral on the result. Provided the input does not overflow Word, this works for them now, but it won't with the new readInt which enforces Int bounds. Is that OK? Such users would have to switch to the new readWord.

One might make the case, that the new function needs a new name: readSafeInt or similar, and the old one be kept around to avoid runtime breakage?

Bodigrim · 2020-10-21T18:26:26Z

And even for Integer, why prefix 32k or more unnecessary leading 0's?

Dunno, it could potentially be a fixed-width column of giant integers.

Is that OK? Such users would have to switch to the new readWord.

Writing Word, but reading it as Int (and then cast to Word) is basically unsafeCoerce and an abuse. Regardless of this discussion, users should stop doing so and instead cast Word to Int, write it, then readInt, then cast to Word back.

One might make the case, that the new function needs a new name: readSafeInt or similar, and the old one be kept around to avoid runtime breakage?

I'm very much against having functions with the same type signature, similar names, but only slightly different semantics. I think it is acceptable to make a function "safer" in a major release.

vdukhovni · 2020-10-21T18:55:41Z

I agree that making things safer is generally good, but I'm not a fan of API changes that lead to only runtime errors and not compile-time errors... :-(

vdukhovni · 2020-10-21T19:21:49Z

In a complex system how is one expected to know that all the call sites are dealt with, including call sites in some library one is using...

vdukhovni · 2020-10-22T12:50:31Z

I'm very much against having functions with the same type signature, similar names, but only slightly different semantics. I think it is acceptable to make a function "safer" in a major release.

One way forward is to rename the new function, and leave a deprecated alias in place. That way you get a compile-time warning when using the old name, but we don't end up with the two similar function problem:

-- new
-- | ... Document overflow checks, and perhaps mention a new safeReadWord...
safeReadInt = ...
-- old
{-# DEPRECATED readInt, changed to check integer bounds, use safeReadInt instead #-}
readInt = safeReadInt

Thoughts? Anyone else?

vdukhovni · 2020-10-23T08:15:24Z

What next steps would you like to see for this PR?

Bodigrim · 2020-10-24T00:27:45Z

Well, this is a balancing act.

From PVP point of view, we are free to make any breaking changes in a major release.
I do not see any legitimate use cases, where the current behaviour of readInt is desirable.
It is more likely that the current behaviour is a source of subtle bugs. If my software is accidentaly relying on it, I would actually be grateful to learn about it.
There remains an easy way to emulate current behaviour via fromInteger and readInteger.
New behaviour does not introduce any runtime errors! It returns a regular Nothing. It is reasonable to expect that a parsing application is ready to process Nothing, because it can naturally arise from many other situations.

That said, I find it acceptable to make readInt safer in the next major release, without introducing new functions.

One way forward is to rename the new function, and leave a deprecated alias in place.

What would happen when readInt is finally removed? This would require all clients to update their code, even if they have never depended on overflowing semantics of readInt.

What next steps would you like to see for this PR?

Sorry, got a tough week. I'll review it over the weekend.

nh2 · 2020-11-25T20:16:33Z

closed this in #309 18 days ago

Could somebody in the knows summarise the outocme of the rather long PR, or perhaps link me to a changelog that says what came out of this?

Thanks!

vdukhovni · 2020-11-25T20:39:49Z

the readInt functions in both strict and lazy bytestrings return Nothing when the value would otherwise overflow maxBound or underflow minBound.

The only downside is that Word values larger than maxBound @Int can no longer be read with readInt (and then converted with fromIntegral), you'll need readInteger for those, unless/until we introduce readWord.

nh2 · 2020-11-25T20:52:17Z

@vdukhovni That sounds great and is exactly what I'd expect from that function name, thank you!

nh2 added a commit to nh2/bytestring that referenced this issue Nov 28, 2018

readInt: Document overflowing behaviour. See haskell#144

70648c8

nh2 mentioned this issue Nov 28, 2018

readInt: Document overflowing behaviour. #169

Merged

nh2 added a commit to nh2/bytestring that referenced this issue Feb 23, 2019

readInt: Document overflowing behaviour. See haskell#144

58c734e

hvr pushed a commit that referenced this issue Feb 23, 2019

readInt: Document overflowing behaviour. See #144

4ed96fb

sjakobi added the bug label Jun 25, 2020

Bodigrim mentioned this issue Oct 5, 2020

Release planning #241

Closed

Bodigrim linked a pull request Oct 21, 2020 that will close this issue

Backport of safe readInt from streaming-bytestring #309

Merged

Bodigrim added this to the 0.12.0.0 milestone Nov 7, 2020

Bodigrim closed this as completed in #309 Nov 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

readInt does not check for overflow. #144

readInt does not check for overflow. #144

SeanRBurton commented Dec 9, 2017

nh2 commented Nov 28, 2018

nh2 commented Nov 28, 2018 •

edited

Loading

Bodigrim commented Sep 26, 2020

qrilka commented Sep 26, 2020

Bodigrim commented Sep 26, 2020

vdukhovni commented Sep 26, 2020

vdukhovni commented Sep 27, 2020

vdukhovni commented Sep 30, 2020

vdukhovni commented Oct 1, 2020

Bodigrim commented Oct 3, 2020

vdukhovni commented Oct 3, 2020 •

edited

Loading

Bodigrim commented Oct 3, 2020

sjakobi commented Oct 5, 2020

Bodigrim commented Oct 20, 2020

vdukhovni commented Oct 20, 2020 •

edited

Loading

sjakobi commented Oct 20, 2020

vdukhovni commented Oct 21, 2020 •

edited

Loading

vdukhovni commented Oct 21, 2020

vdukhovni commented Oct 21, 2020

Bodigrim commented Oct 21, 2020

vdukhovni commented Oct 21, 2020

Bodigrim commented Oct 21, 2020

vdukhovni commented Oct 21, 2020

vdukhovni commented Oct 21, 2020

vdukhovni commented Oct 22, 2020 •

edited

Loading

vdukhovni commented Oct 23, 2020

Bodigrim commented Oct 24, 2020

nh2 commented Nov 25, 2020

vdukhovni commented Nov 25, 2020

nh2 commented Nov 25, 2020

readInt does not check for overflow. #144

readInt does not check for overflow. #144

Comments

SeanRBurton commented Dec 9, 2017

nh2 commented Nov 28, 2018

nh2 commented Nov 28, 2018 • edited Loading

Bodigrim commented Sep 26, 2020

qrilka commented Sep 26, 2020

Bodigrim commented Sep 26, 2020

vdukhovni commented Sep 26, 2020

vdukhovni commented Sep 27, 2020

vdukhovni commented Sep 30, 2020

vdukhovni commented Oct 1, 2020

Bodigrim commented Oct 3, 2020

vdukhovni commented Oct 3, 2020 • edited Loading

Bodigrim commented Oct 3, 2020

sjakobi commented Oct 5, 2020

Bodigrim commented Oct 20, 2020

vdukhovni commented Oct 20, 2020 • edited Loading

sjakobi commented Oct 20, 2020

vdukhovni commented Oct 21, 2020 • edited Loading

vdukhovni commented Oct 21, 2020

vdukhovni commented Oct 21, 2020

Bodigrim commented Oct 21, 2020

vdukhovni commented Oct 21, 2020

Bodigrim commented Oct 21, 2020

vdukhovni commented Oct 21, 2020

vdukhovni commented Oct 21, 2020

vdukhovni commented Oct 22, 2020 • edited Loading

vdukhovni commented Oct 23, 2020

Bodigrim commented Oct 24, 2020

nh2 commented Nov 25, 2020

vdukhovni commented Nov 25, 2020

nh2 commented Nov 25, 2020

nh2 commented Nov 28, 2018 •

edited

Loading

vdukhovni commented Oct 3, 2020 •

edited

Loading

vdukhovni commented Oct 20, 2020 •

edited

Loading

vdukhovni commented Oct 21, 2020 •

edited

Loading

vdukhovni commented Oct 22, 2020 •

edited

Loading