Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

readInt does not check for overflow. #144

Closed
SeanRBurton opened this issue Dec 9, 2017 · 30 comments · Fixed by #309
Closed

readInt does not check for overflow. #144

SeanRBurton opened this issue Dec 9, 2017 · 30 comments · Fixed by #309
Labels
Milestone

Comments

@SeanRBurton
Copy link
Contributor

readInt will happily consume as many digits as it can, without checking if the result is actually representable as an int. This seems like a straightforward correctness bug, and there are a few possible alternatives here:

  • Truncate the result if the next digit would cause an overflow. This does not seem very intuitive.
  • Return Nothing on overflow. This seems reasonable, but it is not very informative.
  • Call error. The most informative option, but this does not really feel like an exceptional condition, and this choice might break existing code.

We could also provide an alternative function which returns a structured result.

Also, I think readInt and readInteger are significantly more complicated than they need to be, and readInteger should probably work in int-sized chunks for performance.

@nh2
Copy link
Member

nh2 commented Nov 28, 2018

I also ran into this, this is problematic.

My opinion is that we should keep the current behaviour, document it, and add a new function that returns Either.

@nh2
Copy link
Member

nh2 commented Nov 28, 2018

I'm using this as a temporary, probably slower workaround now:

-- | Parses a `ByteString` into an Integral.
--
-- Uses "Data.ByteString.Char8" so it cannot handle multi-byte encodings.
--
-- As opposed to `BS8.readIntegral`, it has overflow detection.
-- See https://github.com/haskell/bytestring/issues/144
readIntegral :: forall a . (Integral a, Bounded a) => ByteString -> Either String a
readIntegral str = case BS8.readInteger str of
  Nothing -> Left "unexpected non-integral"
  Just (n, rest)
    | not (BS8.null rest) -> Left "trailing text behind integral"
    | n > fromIntegral (maxBound :: a) -> Left "integral out of range (too large)"
    | n < fromIntegral (minBound :: a) -> Left "integral out of range (too small)"
    | otherwise -> Right (fromIntegral n)

I found this still faster than going via text's decimal decoder for my use case.

nh2 added a commit to nh2/bytestring that referenced this issue Feb 23, 2019
@sjakobi sjakobi added the bug label Jun 25, 2020
@Bodigrim
Copy link
Contributor

I do not think it is a bug, sounds more like an expected behaviour. For example, Prelude.read does exactly the same:

> read "123456789012345678901234567890" :: Int
-4362896299872285998

One could use readInteger and convert it to Int using toIntegralSized, so the case for a new "safe" function does not look convincing to me.

@vdukhovni what do you think? Cf. haskell-streaming/streaming-bytestring#29

@qrilka
Copy link
Contributor

qrilka commented Sep 26, 2020

I would argue that read is also broken

@Bodigrim
Copy link
Contributor

Haskell Report requires Numeric.readInt to be polymorphic in its return type, so it cannot really detect an overflow.

@vdukhovni
Copy link
Contributor

I do not think it is a bug, sounds more like an expected behaviour. For example, [...]
One could use readInteger and convert it to Int [...], so the case for a new "safe" function does not look convincing to me.

@vdukhovni what do you think? Cf. haskell-streaming/streaming-bytestring#29

As you correctly pointed out, readInt, like many other similar functions, silently overflows when the input contains unexpectedly long integral sequences. This is documented behaviour, so there's no bug as such.

While a safer readInt could indeed return Nothing when the computed value overflows, this would impose a noticeable performance penalty, and some subtle issues.

  1. We have readInt implementations for both Data.ByteString.Char8 and Data.ByteString.Lazy.Char8
  2. Lazy ByteStrings support lazy IO (a separate, recently somewhat contentious topic...)
  3. Reading the value as an Integer and only testing for overflow after the last digit, would introduce a new memory exhaustion exposure to readers of lazy ByteStrings who are just trying to read an Int from a lazy-io stream.
  4. One might be tempted to stop reading the stream early, at the first digit that causes the conversion to overflow, but this introduces considerable subtleties
    • What is the appropriate return value on overflow?
    • Is (Nothing, original input) really the right answer? (With lazy-io, and reading into an Integer first, we'd have to "rewind" the stream all the way to the start, which is a second major space leak.)
    • Space leaks aside, should the return value allow one to distinguish between "not an integer" and "integer overflow"?
    • Checking for overflow digit-by-digit (which avoids space leaks) would make the conversion much slower, so it would probably make sense to check only after each run of digits that can be converted with no overflow (either 9 or 18 depending on whether Int values are 32-bit or 64-bit) and with the overflow check done only when combining groups.
    • One would need to be careful with minBound, which has no positive representation.

So doing this correctly, with consistent behaviour between the strict and lazy interfaces is non-trivial, especially if one also wants to retain most of the performance of the code path that does not check for overflow (in the reasonable expectation that in many cases the data is known to be free of problem inputs).

@vdukhovni
Copy link
Contributor

@vdukhovni what do you think? Cf. haskell-streaming/streaming-bytestring#29

Returning to your question after handling an edge case in that PR that this helped me realise (thanks!), I should perhaps mention that the original bytestring-streaming readInt indeed never overflowed, but at the cost of failing to read some valid integers! I don't think that's at all better.

However, since it never did overflow, there's an opportunity here, to keep that behaviour and return Nothing on overflow, ideally without sacrificing too much performance. (The new PR29 performance is roughly 5x the original in my tests, I'd rather not give that up). So perhaps as a potential reviewer of PR29 in streaming-bytestring we can discuss this topic also there, where there's no prior expectation of overflow in results. That could also remove some of the considerations around unbounded streams of digits documented in that PR...

@vdukhovni
Copy link
Contributor

I have opened haskell-streaming/streaming-bytestring#31 which implements overflow checks (for decimal Int).

@vdukhovni
Copy link
Contributor

I do not think it is a bug, sounds more like an expected behaviour. For example, Prelude.read does exactly the same:

> read "123456789012345678901234567890" :: Int
-4362896299872285998

One could use readInteger and convert it to Int using toIntegralSized, so the case for a new "safe" function does not look convincing to me.

@vdukhovni what do you think? Cf. haskell-streaming/streaming-bytestring#29

Now that haskell-streaming/streaming-bytestring#31 is about to be merged, my take is that yes, Data.ByteString.Char8.readInt and Data.ByteString.Lazy.Char8.readInt could detect overflow at performance no worse than the original, provided the lazy version has protection to avoid reading unbounded runs of leading zeros. (basically a port of the streaming code, that should quite similar). The only difference between streaming and lazy, is that lazy is just the Chunk case of streaming and there's no residual value at the end, but Chunks can be implicitly monadic behind one's back (lazy io). So the port would mostly be a matter of deleting some code...

@Bodigrim
Copy link
Contributor

Bodigrim commented Oct 3, 2020

I wonder whether replacing existing readInt with a safer version requires a major version bump.

@vdukhovni
Copy link
Contributor

vdukhovni commented Oct 3, 2020

I wonder whether replacing existing readInt with a safer version requires a major version bump.

Is a Nothing answer any worse than a random integral value (well, congruent to the real value mod 2^n for n ∈ {32,64})?
The cautious approach might be to require a major version bump. It is not clear who might be affected.

My best guess is that the main users affected would be users running CI-tests on edge-cases, who artificially create unexpected inputs, and see how the code behaves.
The only other use-case that comes to mind would be "input validation" where after checking that the input is all decimal digits with some regexp, the code uses readInt and does not expect Nothing.

No idea whether either of these is an issue for anyone.

[ Has 0.11 been released yet? If not, perhaps do this now, for 0.11? ]

@Bodigrim
Copy link
Contributor

Bodigrim commented Oct 3, 2020

I don't think any of the issues you mentioned are of practical concern. But, strictly speaking, this is a change of the documented behaviour (not UB and not a bug). @sjakobi @hsyl20 what do you think?

[0.11 has been already released. But I do not think that we must wait another decade for 0.12]

@sjakobi
Copy link
Member

sjakobi commented Oct 5, 2020

I wonder whether replacing existing readInt with a safer version requires a major version bump.

But, strictly speaking, this is a change of the documented behaviour (not UB and not a bug). @sjakobi @hsyl20 what do you think?

Yeah, such a change would require a major version bump.

@Bodigrim
Copy link
Contributor

Following our discussion about branching in #241, I am +1 for porting haskell-streaming/streaming-bytestring#31 to bytestring.

@vdukhovni
Copy link
Contributor

vdukhovni commented Oct 20, 2020

Following our discussion about branching in #241, I am +1 for porting haskell-streaming/streaming-bytestring#31 to bytestring.

Should I wait for more +1 indications? Also, I don't mind if someone else volunteers to do the port. Perhaps more eyes on the code will help to gain confidence in its correctness (assuming no bugs are found). But, if the port is up to me, I'll do it...

@sjakobi
Copy link
Member

sjakobi commented Oct 20, 2020

Following our discussion about branching in #241, I am +1 for porting haskell-streaming/streaming-bytestring#31 to bytestring.

👍 from me too.

@vdukhovni
Copy link
Contributor

vdukhovni commented Oct 21, 2020

I have a question about the desired behaviour of readInt with lazy ByteStrings. In the streaming module, the function fails once ~32 kB of consecutive leading zeros (perhaps after a [+-] sign) have been read, and a new chunk needs to be accessed to perhaps finally find the start of the number.

Should the same sort of cut-off be implemented for lazy ByteString? The idea is the same, don't spin forever reading ridiculous quantities from a stream (that even be unbounded), but it is not as obviously applicable to lazy bytestrings where I/O is either absent entirely (just a "rope" in memory), or implicit (interleaved lazy I/O).

What should the port do?

[ Note: "fails", means returns (Nothing, original input) ]

@vdukhovni
Copy link
Contributor

I've opened #309. Pending feedback it keeps the "reasonable" bound on the number of leading zeros. This bound can be removed if it is deemed not useful for lazy ByteStrings.

@vdukhovni
Copy link
Contributor

One more thing to think about is whether there should be a variant for reading unsigned Word without overflow. Previously you could write out a word, and read it back with readInt, but now if the word is larger than maxBound :: Int the the conversion will fail, even though a conversion back from Int to Word would have been lossless...

@Bodigrim Bodigrim linked a pull request Oct 21, 2020 that will close this issue
@Bodigrim
Copy link
Contributor

Should the same sort of cut-off be implemented for lazy ByteString?

I do not have a strong opinion here. On one side, this is an additional divergence from the current behaviour. On the other side, I struggle to imagine a legitimate use case for 32k-long sequence of zeros representing Int (and not Integer).

One more thing to think about is whether there should be a variant for reading unsigned Word without overflow.

I often work with Word, readWord routine would be useful for me.

@vdukhovni
Copy link
Contributor

Should the same sort of cut-off be implemented for lazy ByteString?

I do not have a strong opinion here. On one side, this is an additional divergence from the current behaviour. On the other side, I struggle to imagine a legitimate use case for 32k-long sequence of zeros representing Int (and not Integer).

And even for Integer, why prefix 32k or more unnecessary leading 0's?

One more thing to think about is whether there should be a variant for reading unsigned Word without overflow.

I often work with Word, readWord routine would be useful for me.

That can be done, but my point is perhaps also that there may be existing users who are currently using readInt to read Word and then just fromIntegral on the result. Provided the input does not overflow Word, this works for them now, but it won't with the new readInt which enforces Int bounds. Is that OK? Such users would have to switch to the new readWord.

One might make the case, that the new function needs a new name: readSafeInt or similar, and the old one be kept around to avoid runtime breakage?

@Bodigrim
Copy link
Contributor

And even for Integer, why prefix 32k or more unnecessary leading 0's?

Dunno, it could potentially be a fixed-width column of giant integers.

Is that OK? Such users would have to switch to the new readWord.

Writing Word, but reading it as Int (and then cast to Word) is basically unsafeCoerce and an abuse. Regardless of this discussion, users should stop doing so and instead cast Word to Int, write it, then readInt, then cast to Word back.

One might make the case, that the new function needs a new name: readSafeInt or similar, and the old one be kept around to avoid runtime breakage?

I'm very much against having functions with the same type signature, similar names, but only slightly different semantics. I think it is acceptable to make a function "safer" in a major release.

@vdukhovni
Copy link
Contributor

I agree that making things safer is generally good, but I'm not a fan of API changes that lead to only runtime errors and not compile-time errors... :-(

@vdukhovni
Copy link
Contributor

In a complex system how is one expected to know that all the call sites are dealt with, including call sites in some library one is using...

@vdukhovni
Copy link
Contributor

vdukhovni commented Oct 22, 2020

I'm very much against having functions with the same type signature, similar names, but only slightly different semantics. I think it is acceptable to make a function "safer" in a major release.

One way forward is to rename the new function, and leave a deprecated alias in place. That way you get a compile-time warning when using the old name, but we don't end up with the two similar function problem:

-- new
-- | ... Document overflow checks, and perhaps mention a new safeReadWord...
safeReadInt = ...
-- old
{-# DEPRECATED readInt, changed to check integer bounds, use safeReadInt instead #-}
readInt = safeReadInt

Thoughts? Anyone else?

@vdukhovni
Copy link
Contributor

What next steps would you like to see for this PR?

@Bodigrim
Copy link
Contributor

Well, this is a balancing act.

  • From PVP point of view, we are free to make any breaking changes in a major release.
  • I do not see any legitimate use cases, where the current behaviour of readInt is desirable.
  • It is more likely that the current behaviour is a source of subtle bugs. If my software is accidentaly relying on it, I would actually be grateful to learn about it.
  • There remains an easy way to emulate current behaviour via fromInteger and readInteger.
  • New behaviour does not introduce any runtime errors! It returns a regular Nothing. It is reasonable to expect that a parsing application is ready to process Nothing, because it can naturally arise from many other situations.

That said, I find it acceptable to make readInt safer in the next major release, without introducing new functions.

One way forward is to rename the new function, and leave a deprecated alias in place.

What would happen when readInt is finally removed? This would require all clients to update their code, even if they have never depended on overflowing semantics of readInt.

What next steps would you like to see for this PR?

Sorry, got a tough week. I'll review it over the weekend.

@Bodigrim Bodigrim added this to the 0.12.0.0 milestone Nov 7, 2020
@nh2
Copy link
Member

nh2 commented Nov 25, 2020

closed this in #309 18 days ago

Could somebody in the knows summarise the outocme of the rather long PR, or perhaps link me to a changelog that says what came out of this?

Thanks!

@vdukhovni
Copy link
Contributor

the readInt functions in both strict and lazy bytestrings return Nothing when the value would otherwise overflow maxBound or underflow minBound.

The only downside is that Word values larger than maxBound @Int can no longer be read with readInt (and then converted with fromIntegral), you'll need readInteger for those, unless/until we introduce readWord.

@nh2
Copy link
Member

nh2 commented Nov 25, 2020

@vdukhovni That sounds great and is exactly what I'd expect from that function name, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants