Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast scan primitive, like attoparsec's #280

Closed
wants to merge 3 commits into from

Conversation

typedrat
Copy link

@typedrat typedrat commented Mar 7, 2018

Attoparsec has the scan :: s -> (s -> Word8 -> Maybe s) -> Parser ByteString function to efficiently do a stateful scan through the input text. I implemented an equivalent scanP parser for Megaparsec, and also brought in some dirty tricks to help it be faster than just writing the parser in a more contorted way.

megaparsec-scan-bench is the benchmark I made to show how much of a benefit it provides. It's a standard criterion benchmark, build and run the executable to generate your own report or you can look at the results from my desktop here.

The executive summary is that it's effectively even with string when written in terms of the existing interface, but has consistent 2x to 10x speedups for both lazy and strict Text and ByteStrings, even beating attoparsec in some cases.

I have done enough that it's suiting my needs, but I'd be happy to develop it further/improve it if that means getting it upstream.

@mrkkrp
Copy link
Owner

mrkkrp commented Mar 7, 2018

This PR should be against master, not against the megaparsec-6 branch, because if we decide to add this, it'll be for version 7 (the change is breaking).


The reason I omitted this when I worked on similar stuff for version 6 is that Stream has too many methods already, so just like then, I'm not sure that going into this territory is right for Megaparsec. Yes we want to be faster, but at the same time we don't want to be too ad-hoc. Another reason is that parse errors will probably be worse with this (OK for Attoparsec, not OK for Megaparsec). That said, I haven't looked at your change yet. (It's best to open issues and discuss such things first, but if you needed this yourself did all the work anyway, then it's OK.)

@typedrat
Copy link
Author

typedrat commented Mar 7, 2018

@mrkkrp, the reason it's against megaparsec-6 (I'd fix it if there was actual interest in merging this) is the same reason I did this without asking first: I needed this functionality yesterday for something I was working on, and then decided that it'd be worth at least checking if you were interested. Plus, there's a default implementation for scan_, as well.

I can't think of any inherent reason the error reporting should be worse with it than takeWhileP, since the only real difference is that it's plumbing around the state value internally.

With some rekajiggering, I think there's a way to get the improvements available without bloating the class further, if that is a big concern... Although it is a somewhat major change, there are only a few things that'd need to be either just exported in general or put in some Internal module that would allow something like this to be implemented with some sort of StreamExtras and MonadParsecExtras classes outside of the core library. Don't know if that's a great idea, though.

@mrkkrp
Copy link
Owner

mrkkrp commented Mar 7, 2018

since the only real difference is that it's plumbing around the state value internally

But the state influences the logic, right? So we now have something that is "hidden", something that is not reflected in error messages properly.

Internal module exposing the ParsecT data constructor will be available in version 7.

@typedrat
Copy link
Author

typedrat commented Mar 7, 2018

scanP as presented cannot produce errors any more than takeWhileP. If the first Token we meet fails the test with the initial state, then it just provides an empty result, just like takeWhileP.

@mrkkrp
Copy link
Owner

mrkkrp commented Mar 8, 2018

OK, scanP is similar to takeWhileP in that it fetches a prefix of input stream. However where takeWhileP does that using a predicate, i.e. it accepts a certain group of tokens, those that satisfy that predicate, scanP can have arbitrary behavior that won't be properly reflected in the error messages.

An example should help. Imagine you have something like this in scanP:

If after a digit comes a digit which is greater than the previous digit by one, we finish. So "abc12ccc" results in "abc12" being fetched from the input stream.

Now how to meaningfully express this with Maybe String? If we used "normal" combinators we would end up with something good e.g. first it grabs letters and digits and after the second digit we don't expect anything at all, because we're done, it's different from takeWhileP where we could always continue consuming more tokens that satisfy our predicate, subtle details! With scanP, this complex behavior will be represented as Just "characters with fancy thing at the end" at best, i.e. error reporting will be subpar.

In short, behavior of takeWhileP is easy to integrate with other stuff without losing quality of parse errors (because it's just a block of uniform tokens, tokens that satisfy a predicate, and there is no additional conditions on length of such block or something), while with scanP when it's a bit more complex (and it will be more complex, otherwise there is no point in using it) we get something rough compared to error messages we'd get if we composed our parser from smaller combinators.

That said, I don't say that scanP is not a useful addition. After all, users are free to choose when they want better error reporting vs speed. I even have a use for scanP myself and I know it would make my code faster. So I'll think about this, in the meantime you could rebase against master.

Thanks for your contribution!

@typedrat
Copy link
Author

typedrat commented Mar 8, 2018

Will rebase in the morning.

I have a thought on improving the error reporting, but i’d need to fool around with it for a bit to know how/if it’d work. It wouldn’t be bad to have the surface parser predicate return Either instead of Maybe, with Right as Just but using the Left branch to provide some sort of error handling info.

@mrkkrp
Copy link
Owner

mrkkrp commented Mar 8, 2018

My guess is that we won't be able to get the same quality of parse errors in scanP anyway, so I'd actually drop the Maybe String argument too (labeling could be enough?). Other thought is that in those situations where you want speed so badly, the new internal module can be used directly to create very fast custom helpers, that could make scanP not so necessary.

@typedrat typedrat changed the base branch from megaparsec-6 to master March 9, 2018 01:10
@typedrat
Copy link
Author

typedrat commented Mar 9, 2018

I definitely see what you're saying with the custom helpers, although I'd say that this is a pretty common operation, at least to the point where it might be worth including even if there is the capability to do it yourself.

If nothing else, it'd allow this to be shipped but in a different module instead of adding more bulk to the main classes/files.

In addition, I think that it might be worth considering keeping the concept and speed without copying Attoparsec exactly. Perhaps something like:

data ScanResult st = Continue st | Done | Error String
scanP :: MonadParsec e s m => st -> (st -> Token s -> ScanResult st) -> m (Tokens s)

That'd allow richer semantics and better error handling, but is a bit complicated.

data PState = Letters | Num Char

scanImpl :: Parser T.Text
scanImpl = scanP Letters pred
    where
        difference a b = (fromEnum a) - (fromEnum b)

        pred Letters tok
            | isAlpha tok = Continue Letters
            | isNumber tok = Continue (Num tok)
            -- would likely produce a TrivialError with correct SourcePos, tok, and an
            -- expected label from the String given
            | otherwise = Error "alphanumeric character" 
        pred (Num last) tok
            | isNumber tok && difference tok last == 1 = Done
            | isNumber tok = Continue (Num tok)
            | otherwise = Error "digit"

@typedrat
Copy link
Author

typedrat commented Mar 9, 2018

I've done a test implementation of it (ec7ae85) but it needs some cleanup before I'd add it to this pull request.

It comes out to something like this for the actual parser, and I have the full code and the (identical) output in the repo.

data PState = Letters | Num Char

scanImpl :: Parser T.Text
scanImpl = scanP (Just "characters with fancy thing at the end") Letters pred
    where
        difference a b = (fromEnum a) - (fromEnum b)
        pred Letters (Just tok)
            | isAlpha tok = Continue Letters
            | isNumber tok = Continue (Num tok)
            | otherwise = Expected tok "alphanumeric character"
        pred Letters Nothing = OutOfInput "alphanumeric character"
        pred (Num last) (Just tok)
            | isNumber tok && difference tok last == 1 = Done
            | isNumber tok = Continue (Num tok)
            | otherwise = Expected tok "digit"
        pred (Num _) Nothing = OutOfInput "digit"

@cocreature
Copy link

I almost ended up making pretty much the same PR, however in the end I realized that scan doesn’t quite fit my usecase pattern (I need to drop trailing characters in some cases which I can do more efficiently if I implement it directly). So I suppose that’s one data point for implementing custom helpers directly rather than adding more primitives. It is certainly a data point for exposing internal modules so I’m very much looking forward to that (if there’s anything specific that one can help you with to accelerate that, say so!). That said while scan didn’t quite fit my usecase, I think it is still sufficiently general to be useful and would be a nice to have.

@mrkkrp
Copy link
Owner

mrkkrp commented Mar 26, 2018

Exposing internals is not hard per se, but there is also other stuff that needs to be done before we can release Megaparsec 7 (unfortunately Megaparsec 7 roadmap only exists as a card in my private Trello board). I think closer to summer Megaparsec 7 will be mostly ready.

@cocreature
Copy link

Is there any chance of getting a 6.4.2 release that just exposes internals? Sorry for being impatient, summer just seems pretty far away right now :) Either way, keep up the good work! Looking forward to what’s to come 👍

@mrkkrp
Copy link
Owner

mrkkrp commented Mar 26, 2018

If you need this to get stuff done, why not. I'll try to cut a release this weekend.

@cocreature
Copy link

Great thanks a ton!

@mrkkrp
Copy link
Owner

mrkkrp commented Mar 27, 2018

Done. Version 6.5.0 is on Hackage. Text.Megaparsec.Internal.

@cocreature
Copy link

Thank you for addressing this so quickly!

@mrkkrp mrkkrp closed this Mar 27, 2018
@mrkkrp
Copy link
Owner

mrkkrp commented Aug 20, 2018

For further reference, opened #314.

tomjaguarpaw pushed a commit to tomjaguarpaw/megaparsec that referenced this pull request Sep 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants