Character combinators returning Text rather than [Char]? #106
Comments
So, I think you should use …
Hm. The only reason I'd have a Text in the first place is if I'd loaded it from a ByteString or a file and decoded it. Meanwhile, pulling tokens from the stream one by one (a character at a time), sure, but ultimately, as you pick pieces out of the underlying source, you're identifying ranges and just copying those out. It seems a shame to go via String after all these years of avoiding String like the plague.
At work, @thsutton pointed out that maybe we need specialised range functions, like …
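A minimal sketch of what such a range function might look like, against the Megaparsec 5-era API (the name `takeWhileText` is hypothetical, and for brevity this version does not advance the source position, which a real implementation would have to do):

```haskell
{-# LANGUAGE FlexibleContexts #-}
module RangeCombinator where

import           Data.Text       (Text)
import qualified Data.Text       as T
import           Text.Megaparsec

-- Hypothetical range combinator: consume a run of characters
-- satisfying a predicate and return it as a Text slice (sharing
-- the underlying buffer via T.span) instead of a cons-list [Char].
takeWhileText :: MonadParsec e Text m => (Char -> Bool) -> m Text
takeWhileText p = do
  s <- getInput
  let (chunk, rest) = T.span p s  -- O(n) scan, no per-Char allocation
  setInput rest
  return chunk
```

The key design point is that `T.span` returns slices of the original `Text`, so the matched range is copied out wholesale rather than rebuilt character by character.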
It's ugly :)

I'm here because attoparsec's error messages are bloody awful, but it's consistently at the top for performance, in no small part because it's not allocating. I was hoping all the work you'd put into Megaparsec would have a way to be competitive with that. And it looked at least as nice as trifecta. Building up Strings for every character I go through as I extract text from a file I'm parsing is going to be hideously expensive at any kind of scale, though.

I was about to suggest to myself "ok, use Text.Megaparsec.ByteString instead", but that doesn't seem to get me any further ahead in terms of avoiding a monstrous amount of allocating.

Maybe I'm missing the point; maybe allocating a Char for every single character in a large input file and making cons-list Strings out of them isn't a problem anymore. It sure used to be. Obviously I have to write a benchmark that uses all three major libraries testing my own use case, yada yada. I can't see how this is going to end well, but I'll try it.

Feel free to close if the answer remains "keep building up [Char] from ByteString or Text input".

Thanks for all your hard work, and congrats on 5.0!

AfC
Megaparsec is built for flexibility; sure, it's not as fast as attoparsec, where AFAIK things are coded differently for every type of stream. With the current design we can use the same code with every type of stream, and the tokens in a stream need not even be characters but can be something else. Sure, it would be interesting to make things nicer with respect to performance, but I think that means "write a different library". If you have concrete, practical ideas about what to change, I'm happy to discuss them.
Actually, I'm thinking now that it's quite possible to fork Megaparsec, remove a bit of functionality (the part related to parsing arbitrary streams of tokens), and tune the rest to perform very fast and consume only …
The quality of error messages will surely stay the same.
@afcowie I think rather than functions specialised to some particular type, … If you're really keen to avoid the …
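One common workaround in this situation (a sketch, assuming the Megaparsec 5 `Text.Megaparsec.Text` backend the original poster mentions) is simply to convert at the parser's boundary:

```haskell
module PackWrapper where

import           Data.Text            (Text)
import qualified Data.Text            as T
import           Text.Megaparsec
import           Text.Megaparsec.Text (Parser)  -- Text-specialised Parser

-- Collect characters as [Char] and pack once at the end.
-- This still allocates the intermediate cons list, but it keeps
-- Text at the parser's API boundary, so the rest of the
-- application never sees String.
identifier :: Parser Text
identifier = T.pack <$> some alphaNumChar
```

This does not remove the allocation the thread is complaining about; it only confines the `String` to the inside of each combinator.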
Or you could get the current parsed stream with `getParserState` and update the state directly. For example, here is a rough way to embed an attoparsec parser in Megaparsec:

```haskell
import           Text.Megaparsec.Prim
import           Text.Megaparsec.Pos
import qualified Data.Attoparsec.Text as AT
import qualified Data.Text            as T
import           Data.Text            (Text)
import           Data.List.NonEmpty   (NonEmpty (..))

attoToMega :: MonadParsec e Text m => AT.Parser a -> m a
attoToMega atto = do
  State s (p :| z) tw <- getParserState
  case AT.parse (AT.match atto) s of
    AT.Done s' (t, a) ->
      -- advance the source position over the matched chunk
      let p' = T.foldl' (\pc c -> snd $ defaultUpdatePos tw pc c) p t
      in updateParserState (const $ State s' (p' :| z) tw) >> return a
    AT.Fail _ _ctx str -> fail str
    AT.Partial {}      -> fail "end of input"
```

That's just a demonstration of what you can do by updating the state directly, rather than with …
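As a usage sketch of the `attoToMega` bridge above: attoparsec's `AT.double` (which does exist in `Data.Attoparsec.Text`) can be borrowed to get a fast floating-point parser inside a Megaparsec parser; the name `double` on the Megaparsec side is just illustrative.

```haskell
-- Hypothetical usage of the attoToMega bridge: reuse attoparsec's
-- heavily optimised floating-point parser within Megaparsec,
-- keeping Megaparsec's position tracking intact.
double :: MonadParsec e Text m => m Double
double = attoToMega AT.double
```

The trade-off is that any failure inside the embedded attoparsec parser surfaces via `fail`, so you lose Megaparsec's structured error messages for that span.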
As a side note, I've used Megaparsec to parse a stream of custom typed tokens. It worked beautifully, and I would hate to see that go just to improve Text-specific performance.
Like @recursion-ninja, I'm using Megaparsec to parse a stream of tokens (produced by haskell-lexer in my case). Definitely do not want to see token parsing disappear. |
It looks like we've settled on what Megaparsec is and that it should stay flexible, albeit not the most efficient parsing library compared to attoparsec. Making a text-specialised version of Megaparsec would definitely be an interesting project, but unfortunately I don't have the time/need to write it (I already maintain 20+ packages). Closing this.
Can this be moved into the documentation? |
Where do you suggest putting it? It's …
I think we should not go into all possible details everywhere. It's documentation, not a tutorial. If anyone starts to wonder why it works this way, it's easy to figure it out. |
@afcowie, this feature is coming in Megaparsec 6: #206. I'll start by making … If this change is successful, we can talk about expanding the … Megaparsec 6 will be released this summer if everything goes as planned.
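For context on how this landed in released Megaparsec 6 (this is a fact about the released library, not a quote from the comment above): chunk-based combinators such as `string` return a piece of the underlying stream, so for a `Text` stream they produce `Text` directly instead of `String`.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Mega6String where

import           Data.Void            (Void)
import           Data.Text            (Text)
import           Text.Megaparsec
import           Text.Megaparsec.Char

type Parser = Parsec Void Text

-- In Megaparsec >= 6, `string` has type
--   string :: MonadParsec e s m => Tokens s -> m (Tokens s)
-- so for a Text stream it returns Text, sharing the input
-- buffer rather than building a cons list of Chars.
kwLet :: Parser Text
kwLet = string "let"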
What's the status on this? I would love to have a …
Use |
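In Megaparsec 6 and later, `takeWhileP` and `takeWhile1P` return a chunk of the stream (`Tokens s`, i.e. `Text` for a `Text` stream) without per-character allocation, which answers the original question. A minimal example:

```haskell
module TakeWhileExample where

import           Data.Char       (isAlphaNum)
import           Data.Void       (Void)
import           Data.Text       (Text)
import           Text.Megaparsec

type Parser = Parsec Void Text

-- takeWhile1P :: MonadParsec e s m
--             => Maybe String -> (Token s -> Bool) -> m (Tokens s)
-- The Maybe String is a label used in error messages.
-- For a Text stream this slices the input directly, so no
-- intermediate [Char] is ever built.
identifier :: Parser Text
identifier = takeWhile1P (Just "alphanumeric character") isAlphaNum
```

This is the drop-in replacement for the `some alphaNumChar` pattern from the issue body, with `Parser Text` as its type.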
A few months ago, I updated our parsers to use the primitives found in the … These primitives are, in my experience, very well designed and optimized.
When working with `Text`, I wrote a parser that did something simple, like:

```haskell
identifier = some alphaNumChar
```

No big deal. But the type of that is `Parser String`, which I was a little surprised by, since the specific backend module I've brought in is the Text one. Even though `some` is collecting a list of `Char`, I somehow naively assumed that since the stream was over Text it would magically reuse the buffer underlying the Text object (in the way that `substr`ing a ByteString or Text does) and, more magically, take a parser of many Chars and give me `Parser Text`.

Is there some stream-aware mode of combining character combinators that is going to reuse existing storage and return Text objects, rather than requiring one to call `T.pack` a lot? Having to do that seemed clunky; my application continues on using Text internally, so returning Text objects from the parser would be preferable.
Maybe I'm missing something obvious or idiomatic when it comes to Haskell parsing; apologies if this is noise.
AfC