Fast scan primitive, like attoparsec's #280
This PR should be against The reason I omitted this when I worked on similar stuff for version 6 is that |
@mrkkrp, the reason it's against megaparsec-6 (I'd fix it if there was actual interest in merging this) is the same reason I did this without asking first: I needed this functionality yesterday for something I was working on, and then decided that it'd be worth at least checking if you were interested. Plus, there's a default implementation for I can't think of any inherent reason the error reporting should be worse with it than With some rekajiggering, I think there's a way to get the improvements available without bloating the class further, if that is a big concern... Although it is a somewhat major change, there are only a few things that'd need to be either just exported in general or put in some |
But the state influences the logic, right? So we now have something that is "hidden", something that is not reflected in error messages properly. Internal module exposing the |
OK, an example should help. Imagine you have something like this in
Now how to meaningfully express this with In short, behavior of That said, I don't say that Thanks for your contribution! |
Will rebase in the morning. I have a thought on improving the error reporting, but I’d need to fool around with it for a bit to know whether and how it’d work. It wouldn’t be bad to have the surface parser predicate return Either instead of Maybe, with Right playing the role of Just and the Left branch carrying some sort of error handling info. |
My guess is that we won't be able to get the same quality of parse errors in |
I definitely see what you're saying with the custom helpers, although I'd say that this is a pretty common operation, at least to the point where it might be worth including even if there is the capability to do it yourself. If nothing else, it'd allow this to be shipped, but in a different module instead of adding more bulk to the main classes/files. In addition, I think that it might be worth considering keeping the concept and speed without copying Attoparsec exactly. Perhaps something like:

data ScanResult st = Continue st | Done | Error String
scanP :: MonadParsec e s m => st -> (st -> Token s -> ScanResult st) -> m (Tokens s)

That'd allow richer semantics and better error handling, but is a bit complicated.

data PState = Letters | Num Char
scanImpl :: Parser T.Text
scanImpl = scanP Letters pred
  where
    difference a b = (fromEnum a) - (fromEnum b)
    pred Letters tok
      | isAlpha tok = Continue Letters
      | isNumber tok = Continue (Num tok)
      -- would likely produce a TrivialError with correct SourcePos, tok, and an
      -- expected label from the String given
      | otherwise = Error "alphanumeric character"
    pred (Num last) tok
      | isNumber tok && difference tok last == 1 = Done
      | isNumber tok = Continue (Num tok)
      | otherwise = Error "digit"
|
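To make the proposed semantics concrete, here is a minimal, library-free sketch of how a `ScanResult`-driven scan could behave. It is not Megaparsec code: `runScan` is a hypothetical stand-in for `scanP` that works on a plain `String`, and two behaviours are assumptions the comment does not spell out, namely that `Done` consumes the token that triggered it and that running out of input ends the scan successfully.

```haskell
import Data.Char (isAlpha, isNumber)

-- Outcome of inspecting one token, as proposed in the comment above.
data ScanResult st = Continue st | Done | Error String

data PState = Letters | Num Char

-- Hypothetical model of scanP over a plain String (not Megaparsec's API).
-- Returns an error message, or the consumed prefix and the remaining input.
runScan :: st -> (st -> Char -> ScanResult st) -> String -> Either String (String, String)
runScan st0 f = go st0 []
  where
    -- Assumption: end of input finishes the scan successfully.
    go _ acc [] = Right (reverse acc, [])
    go st acc (c : cs) = case f st c of
      Continue st' -> go st' (c : acc) cs
      -- Assumption: Done consumes the token that triggered it.
      Done -> Right (reverse (c : acc), cs)
      Error lbl -> Left ("unexpected " ++ show c ++ ", expecting " ++ lbl)

-- The step function from the comment: letters, then digits, stopping at the
-- first digit that is exactly one greater than the digit before it.
step :: PState -> Char -> ScanResult PState
step Letters tok
  | isAlpha tok = Continue Letters
  | isNumber tok = Continue (Num tok)
  | otherwise = Error "alphanumeric character"
step (Num prev) tok
  | isNumber tok && fromEnum tok - fromEnum prev == 1 = Done
  | isNumber tok = Continue (Num tok)
  | otherwise = Error "digit"
```

Under these assumptions, `runScan Letters step "abc12xyz"` consumes `"abc12"` and leaves `"xyz"`, while `runScan Letters step "ab!"` fails with a message that names the offending character and the expected label.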
I've done a test implementation of it (ec7ae85) but it needs some cleanup before I'd add it to this pull request. It comes out to something like this for the actual parser, and I have the full code and the (identical) output in the repo.

data PState = Letters | Num Char
scanImpl :: Parser T.Text
scanImpl = scanP (Just "characters with fancy thing at the end") Letters pred
  where
    difference a b = (fromEnum a) - (fromEnum b)
    pred Letters (Just tok)
      | isAlpha tok = Continue Letters
      | isNumber tok = Continue (Num tok)
      | otherwise = Expected tok "alphanumeric character"
    pred Letters Nothing = OutOfInput "alphanumeric character"
    pred (Num last) (Just tok)
      | isNumber tok && difference tok last == 1 = Done
      | isNumber tok = Continue (Num tok)
      | otherwise = Expected tok "digit"
    pred (Num _) Nothing = OutOfInput "digit"
|
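The main change in this revision is that the step function also sees the end of input, as `Nothing`, and can turn it into an error. The sketch below models that on a plain `String`. It is an illustration only: `runScan` is a hypothetical stand-in for `scanP`, the optional label argument is left out, the field layout of `Expected` and `OutOfInput` is inferred from the usage above, and treating `Continue` at end of input as a successful stop is an assumption.

```haskell
import Data.Char (isAlpha, isNumber)

-- Constructor names follow the comment above; the field layout is assumed.
data ScanResult st
  = Continue st
  | Done
  | Expected Char String -- offending token and a label for what was expected
  | OutOfInput String    -- input ended; label for what was expected

data PState = Letters | Num Char

-- Hypothetical model of the revised scanP over a plain String. The step
-- function receives Just a token, or Nothing once the input is exhausted.
runScan :: st -> (st -> Maybe Char -> ScanResult st) -> String -> Either String (String, String)
runScan st0 f = go st0 []
  where
    go st acc input =
      let (tok, rest) = case input of
            []     -> (Nothing, [])
            c : cs -> (Just c, cs)
          consumed = maybe acc (: acc) tok
       in case f st tok of
            Expected t lbl -> Left ("unexpected " ++ show t ++ ", expecting " ++ lbl)
            OutOfInput lbl -> Left ("unexpected end of input, expecting " ++ lbl)
            Done -> Right (reverse consumed, rest)
            Continue st'
              -- Assumption: Continue at end of input stops successfully.
              | null input -> Right (reverse acc, [])
              | otherwise -> go st' consumed rest

-- The step function from the comment, renamed to avoid shadowing Prelude.
step :: PState -> Maybe Char -> ScanResult PState
step Letters (Just tok)
  | isAlpha tok = Continue Letters
  | isNumber tok = Continue (Num tok)
  | otherwise = Expected tok "alphanumeric character"
step Letters Nothing = OutOfInput "alphanumeric character"
step (Num prev) (Just tok)
  | isNumber tok && fromEnum tok - fromEnum prev == 1 = Done
  | isNumber tok = Continue (Num tok)
  | otherwise = Expected tok "digit"
step (Num _) Nothing = OutOfInput "digit"
```

In this model an input such as `"abc"`, which never reaches the terminating digit pair, is reported as an unexpected end of input instead of being accepted silently.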
I almost ended up making pretty much the same PR, however in the end I realized that |
Exposing internals is not hard per se, but there is also other stuff that needs to be done before we can release Megaparsec 7 (unfortunately Megaparsec 7 roadmap only exists as a card in my private Trello board). I think closer to summer Megaparsec 7 will be mostly ready. |
Is there any chance of getting a |
If you need this to get stuff done, why not. I'll try to cut a release this weekend. |
Great thanks a ton! |
Done. Version 6.5.0 is on Hackage. |
Thank you for addressing this so quickly! |
For further reference, opened #314. |
Attoparsec has the
scan :: s -> (s -> Word8 -> Maybe s) -> Parser ByteString
function to efficiently do a stateful scan through the input text. I implemented an equivalent scanP parser for Megaparsec, and also brought in some dirty tricks to help it be faster than just writing the parser in a more contorted way.
megaparsec-scan-bench is the benchmark I made to show how much of a benefit it provides. It's a standard criterion benchmark: build and run the executable to generate your own report, or look at the results from my desktop here.
The executive summary is that it's effectively even with string when written in terms of the existing interface, but has consistent 2x to 10x speedups for both lazy and strict Text and ByteStrings, even beating attoparsec in some cases.
I have done enough that it's suiting my needs, but I'd be happy to develop it further/improve it if that means getting it upstream.
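For readers unfamiliar with `scan`, the following is a small base-only model of its documented behaviour: the state is threaded through the predicate token by token, the scan stops at the first `Nothing` or at the end of input, and it never fails. This is a sketch, not attoparsec's implementation; `scanModel` and `stringBody` are hypothetical names, and it works on `String` and `Char` instead of `ByteString` and `Word8`.

```haskell
-- Base-only model of a stateful scan: returns the consumed prefix and the
-- remaining input. The token on which the predicate returns Nothing is
-- left unconsumed.
scanModel :: s -> (s -> Char -> Maybe s) -> String -> (String, String)
scanModel s0 f = go s0 []
  where
    go _ acc [] = (reverse acc, [])
    go s acc (c : cs) = case f s c of
      Just s' -> go s' (c : acc) cs
      Nothing -> (reverse acc, c : cs)

-- Example predicate: scan the body of a double-quoted string literal.
-- The Bool state records whether the previous character was a backslash,
-- so an escaped quote does not end the scan.
stringBody :: Bool -> Char -> Maybe Bool
stringBody escaped c
  | escaped = Just False
  | c == '\\' = Just True
  | c == '"' = Nothing
  | otherwise = Just False
```

For example, `scanModel False stringBody "ab\\\"cd\" rest"` consumes `ab\"cd` and stops at the closing quote. This is the kind of parser that is awkward to write with a stateless `takeWhile`-style predicate, which is the motivation for a dedicated primitive.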