Unexpected input-sharing behaviour of parsers #492

Closed
EggBaconAndSpam opened this issue Oct 13, 2022 · 3 comments · Fixed by #495
@EggBaconAndSpam
Contributor

I finally managed to hunt down a memory leak which arose from the following arguably surprising behaviour:

Running either of these parsers and holding on to their results will keep the entire input in memory:

import Data.Text (Text)
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Internal.Lazy as TL -- for TL.defaultChunkSize
import Data.Void (Void)
import Text.Megaparsec

testSharingStrict :: Parsec Void Text Text
testSharingStrict = takeP Nothing 1

testSharingLazy :: Parsec Void TL.Text TL.Text
testSharingLazy =
  fmap TL.concat . many . try $ takeP Nothing 1 <* takeP Nothing (TL.defaultChunkSize - 1)

This is due to the sharing semantics of both strict and lazy text. (I believe the same applies to strict and lazy bytestrings.)

Having been bitten by this, I'm struggling to come up with a scenario where we actually gain anything from sharing memory with the input stream. In the strict case sharing always keeps the entire input in memory, while in the lazy case we still keep a whole chunk (16K by default) alive for each 'output' (which may be much smaller than that).
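To make the sharing concrete, here is a small standalone illustration using only the text package (the names `big`, `slice`, `copied` are mine): a tiny slice of a strict Text is a view into the original array, while T.copy gives it its own buffer.

```haskell
import qualified Data.Text as T

main :: IO ()
main = do
  -- A large strict Text backed by a single underlying array.
  let big = T.replicate 100000 (T.pack "x")
      -- 'T.take' slices without copying: the result is a view into
      -- big's array, so retaining 'slice' retains all of 'big'.
      slice = T.take 1 big
      -- 'T.copy' allocates a fresh one-character array, so 'big' can
      -- be garbage-collected once only 'copied' is retained.
      copied = T.copy slice
  print (slice == copied) -- True: equal as values, different buffers
  print (T.length copied) -- 1
```

The two values are indistinguishable by Eq; the difference is only in which underlying array they keep alive.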

It seems to me we could get rid of the input-sharing behaviour by adding a few copy :: Text -> Text calls in the right places:

instance Stream T.Text where
  ...
  takeN_ n s
    | n <= 0 = Just (T.empty, s)
    | T.null s = Nothing
    | otherwise = Just (first T.copy $ T.splitAt n s)
  takeWhile_ p = first T.copy . T.span p

Why is that a bad idea, and/or what am I missing?
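For concreteness, the copying variant of takeN_ can be sketched as a plain function over strict Text, testable with only the text package (the name takeNCopying is mine; in megaparsec this logic would live in the Stream method itself):

```haskell
import Data.Bifunctor (first)
import qualified Data.Text as T

-- Standalone sketch of a non-sharing takeN_ (hypothetical name).
takeNCopying :: Int -> T.Text -> Maybe (T.Text, T.Text)
takeNCopying n s
  | n <= 0    = Just (T.empty, s)
  | T.null s  = Nothing
  -- 'T.copy' gives the consumed prefix its own array, so holding on
  -- to it no longer pins the whole input buffer.
  | otherwise = Just (first T.copy (T.splitAt n s))

main :: IO ()
main = print (takeNCopying 2 (T.pack "hello"))
```

The remainder of the input is still a shared slice, which is fine: the parser consumes it anyway, and only the returned chunks escape to user code.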

@EggBaconAndSpam
Contributor Author

EggBaconAndSpam commented Oct 17, 2022

One way to get rid of this foot-gun:

  • add copy in the right places of instance Stream Text etc., so that sharing is no longer the default
  • add a Shared newtype wrapper such that instance Stream (Shared Text) retains the current sharing behaviour, as an optional optimisation (with a disclaimer)

What do you think?
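A minimal sketch of such a wrapper, with the sharing takeN_ written as a plain function so it only needs the text package (the names Shared and takeNShared are mine; the real change would be a Stream instance in megaparsec):

```haskell
import qualified Data.Text as T

-- Hypothetical opt-in wrapper: parsers over 'Shared' would keep the
-- current zero-copy slicing, with its memory-retention caveat.
newtype Shared = Shared T.Text
  deriving (Eq, Show)

-- Sharing variant of takeN_, sketched as a plain function.
takeNShared :: Int -> Shared -> Maybe (T.Text, Shared)
takeNShared n (Shared s)
  | n <= 0    = Just (T.empty, Shared s)
  | T.null s  = Nothing
  | otherwise =
      -- No copy here: the returned chunk aliases the input buffer.
      let (pre, post) = T.splitAt n s in Just (pre, Shared post)

main :: IO ()
main = print (takeNShared 2 (Shared (T.pack "hi!")))
```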

@mrkkrp
Owner

mrkkrp commented Oct 21, 2022

This is a valid concern, thanks for bringing it up. I think I originally wanted combinators like takeN_ to be as fast as possible, and for that reason I used splitAt. I was much more preoccupied with raw speed than with the potential danger of sharing long input streams. I'd be curious to know how adding copy affects the benchmarks. If the slowdown is not too bad, which may well be the case, we might do better by just adding copy in the right places and avoiding sharing.

Would you have time to add copy and run the benchmarks that are included with the library (before/after the change)?

@mrkkrp mrkkrp added this to the 9.3.0 milestone Oct 21, 2022
@EggBaconAndSpam
Contributor Author

Thanks for having a look at this!

I'll get you a PR / more data by the end of this week (this is mainly a deadline for myself so that I don't postpone this indefinitely).
