Unexpected input-sharing behaviour of parsers #492

Closed
EggBaconAndSpam opened this issue Oct 13, 2022 · 3 comments · Fixed by #495
@EggBaconAndSpam
Contributor

I finally managed to hunt down a memory leak which arose from the following arguably surprising behaviour:

Running either of these parsers and holding on to their results will keep the entire input in memory:

import Data.Text (Text)
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Internal.Lazy as TL -- for TL.defaultChunkSize
import Data.Void (Void)
import Text.Megaparsec

testSharingStrict :: Parsec Void Text Text
testSharingStrict = takeP Nothing 1

testSharingLazy :: Parsec Void TL.Text TL.Text
testSharingLazy =
  fmap TL.concat . many . try $ takeP Nothing 1 <* takeP Nothing (TL.defaultChunkSize - 1)

This is due to the sharing semantics of both strict and lazy text. (I believe the same applies to strict and lazy bytestrings.)

Having been bitten by this, I'm struggling to come up with a scenario where we actually gain anything from sharing memory with the input stream. In the strict case sharing always keeps the entire input in memory, while in the lazy case we still keep a whole chunk (16K by default) alive for each 'output' (which may be much smaller than that).
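To make the sharing concrete, here is a small standalone illustration using only the text package (the names `big`, `slice`, `copied` are mine): a tiny slice of a strict Text is a view into the original array, while T.copy gives it its own buffer.

```haskell
import qualified Data.Text as T

main :: IO ()
main = do
  -- A large strict Text backed by a single underlying array.
  let big = T.replicate 100000 (T.pack "x")
      -- 'T.take' slices without copying: the result is a view into
      -- big's array, so retaining 'slice' retains all of 'big'.
      slice = T.take 1 big
      -- 'T.copy' allocates a fresh one-character array, so 'big' can
      -- be garbage-collected once only 'copied' is retained.
      copied = T.copy slice
  print (slice == copied) -- True: equal as values, different buffers
  print (T.length copied) -- 1
```

The two values are indistinguishable by Eq; the difference is only in which underlying array they keep alive.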

It seems to me we could get rid of the input-sharing behaviour by adding a few copy :: Text -> Text calls in the right places:

instance Stream T.Text where
  ...
  takeN_ n s
    | n <= 0 = Just (T.empty, s)
    | T.null s = Nothing
    | otherwise = Just (first T.copy $ T.splitAt n s)
  takeWhile_ p = first T.copy . T.span p

Why is that a bad idea, and/or what am I missing?
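For concreteness, the copying variant of takeN_ can be sketched as a plain function over strict Text, testable with only the text package (the name takeNCopying is mine; in megaparsec this logic would live in the Stream method itself):

```haskell
import Data.Bifunctor (first)
import qualified Data.Text as T

-- Standalone sketch of a non-sharing takeN_ (hypothetical name).
takeNCopying :: Int -> T.Text -> Maybe (T.Text, T.Text)
takeNCopying n s
  | n <= 0    = Just (T.empty, s)
  | T.null s  = Nothing
  -- 'T.copy' gives the consumed prefix its own array, so holding on
  -- to it no longer pins the whole input buffer.
  | otherwise = Just (first T.copy (T.splitAt n s))

main :: IO ()
main = print (takeNCopying 2 (T.pack "hello"))
```

The remainder of the input is still a shared slice, which is fine: the parser consumes it anyway, and only the returned chunks escape to user code.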

@EggBaconAndSpam
Contributor Author

EggBaconAndSpam commented Oct 17, 2022

One way to get rid of this foot-gun:

  • add copy in the right places of instance Stream Text etc., so that sharing is no longer the default
  • add a Shared newtype wrapper such that instance Stream (Shared Text) retains the current sharing behaviour, as an optional optimisation (with a disclaimer)

What do you think?
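A minimal sketch of such a wrapper, with the sharing takeN_ written as a plain function so it only needs the text package (the names Shared and takeNShared are mine; the real change would be a Stream instance in megaparsec):

```haskell
import qualified Data.Text as T

-- Hypothetical opt-in wrapper: parsers over 'Shared' would keep the
-- current zero-copy slicing, with its memory-retention caveat.
newtype Shared = Shared T.Text
  deriving (Eq, Show)

-- Sharing variant of takeN_, sketched as a plain function.
takeNShared :: Int -> Shared -> Maybe (T.Text, Shared)
takeNShared n (Shared s)
  | n <= 0    = Just (T.empty, Shared s)
  | T.null s  = Nothing
  | otherwise =
      -- No copy here: the returned chunk aliases the input buffer.
      let (pre, post) = T.splitAt n s in Just (pre, Shared post)

main :: IO ()
main = print (takeNShared 2 (Shared (T.pack "hi!")))
```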

@mrkkrp
Owner

mrkkrp commented Oct 21, 2022

This is a valid concern, thanks for bringing it up. I think I originally wanted combinators like takeN_ to be as fast as possible, and for that reason I used splitAt. I was much more preoccupied with raw speed than with the potential danger of sharing long input streams. I'd be curious to know how adding copy affects the benchmarks. If the slowdown is not too bad, which may well be the case, we might do better by just adding copy in the right places and avoiding sharing.

Would you have time to add copy and run the benchmarks that are included with the library (before/after the change)?

@mrkkrp mrkkrp added this to the 9.3.0 milestone Oct 21, 2022
@EggBaconAndSpam
Contributor Author

Thanks for having a look at this!

I'll get you a PR / more data by the end of this week (this is mainly a deadline for myself so that I don't postpone this indefinitely).
