Character combinators returning Text rather than [Char]? #106

Closed
istathar opened this issue May 16, 2016 · 18 comments

@istathar

When working with

import Text.Megaparsec
import Text.Megaparsec.Text

I wrote a parser that did something simple, like:

identifier = some alphaNumChar

No big deal. But the type of that is Parser String, which surprised me a little, since the specific backend module I've brought in is the Text one. Even though some is collecting a list of Char, I naively assumed that, because the stream is over Text, it would magically reuse the buffer underlying the Text object (the way taking a substring of a ByteString or Text does) and, more magically still, take a parser of many Chars and give me a Parser Text.

Is there some stream-aware way of combining character combinators that will reuse the existing storage and return Text objects, rather than requiring one to call T.pack a lot? Having to do

identifier = T.pack <$> some alphaNumChar

seemed clunky; my application carries on using Text internally, so returning Text objects from the parser would be preferable.

Maybe I'm missing something obvious or idiomatic when it comes to Haskell parsing; apologies if this is noise.

AfC

@mrkkrp
Owner

mrkkrp commented May 16, 2016

tokens and the parsers written on top of it return lists of tokens, and since the token of Text is Char, you get [Char] ≡ String, which you then convert to Text afterwards. “Autopacking” is possible, but I wonder whether it's really worth it:

  • You may want to get String anyway: for short sequences of characters it's fine, and String is still widely used. If you get Text, you need to convert back.
  • If you need Text or ByteString, the conversion is trivial to perform.
  • Automatic transformation would require a special version of some, for example. Currently we use some from Control.Applicative and I think we should continue doing that. Many functions would need to be changed, and not all of them are part of the library.
  • You don't get a performance boost this way because Megaparsec extracts tokens from the stream one by one anyway; we don't code every primitive separately for each type of stream (as Attoparsec does). The only thing that changes between streams is how we uncons them.

So I think you should use T.pack if you want to get Text from your parser. That's not a big problem, in my experience at least.

@istathar
Author

You don't get a performance boost

Hm. The only reason I'd have a Text in the first place is if I'd loaded it from a ByteString or file and called decodeUtf8 or similar on it.

Meanwhile, extracting tokens from the stream one by one (a character at a time), sure, but ultimately as you pick pieces out of the underlying source you're identifying ranges and just copying those out. It seems a shame to go via String after all these years of avoiding String like the plague.

using Control.Applicative's some

At work @thsutton pointed out that maybe we need specialized range functions, like someText say.

Not a problem to use T.pack

It's ugly :)

++

I'm here because attoparsec's error messages are bloody awful, but it's consistently at the top for performance, in no small part because it's not allocating. I was hoping all the work you'd put into megaparsec would offer a way to be competitive with that, and it looked at least as nice as trifecta. Building up Strings for every character I go through as I extract text from a file I'm parsing is going to be hideously expensive at any kind of scale, though.

I was about to suggest to myself "ok, use Text.Megaparsec.ByteString instead", but that doesn't seem to get me any further ahead in terms of avoiding a monstrous amount of allocating.

Maybe I'm missing the point; maybe allocating a Char for every single character in a large input file and making cons-list Strings out of them isn't a problem anymore. It sure used to be. Obviously I have to write a benchmark that tests my own use case against all three major libraries, yada yada. I can't see how this is going to end well, but I'll try it.

Feel free to close if the answer remains "keep building up [Char] from ByteString or Text input".

Thanks for all your hard work, and congrats on 5.0!

AfC

@mrkkrp
Owner

mrkkrp commented May 17, 2016

Megaparsec is built for flexibility; sure, it's not as fast as attoparsec, where AFAIK things are coded differently for every type of stream. With the current design we can use the same code with every type of stream, and the tokens in a stream don't even have to be characters. Sure, it would be interesting to make things nicer with respect to performance, but I think that means "write a different library". If you have concrete, practical ideas about what to change, I'm happy to discuss them.

@mrkkrp
Owner

mrkkrp commented May 18, 2016

Actually, I'm now thinking that it's quite possible to fork Megaparsec, remove a bit of functionality (the part related to parsing arbitrary streams of tokens), and tune the rest to perform very fast while consuming only Text. Are you interested in something like this?

@mrkkrp
Owner

mrkkrp commented May 18, 2016

The quality of the error messages will of course stay the same.

@thsutton

@afcowie I think rather than functions specialised to some particular Stream s, I would expect to see a primitive with a type like takeWhile :: MonadParsec e s m => (Token s -> Bool) -> m s, or a version of tokens that uses the Stream type rather than [Token s], like (Token s -> Token s -> Bool) -> s -> m s. But as @mrkkrp mentions, this isn't possible with the existing Stream interface.

If you're really keen to avoid the T.pack calls, you could probably add a small lexer helper pretty easily.
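
Along those lines, a minimal sketch of the someText helper mentioned above (Megaparsec 5 era imports, names purely illustrative); note that it still builds the intermediate [Char] and only then packs it, so it tidies call sites rather than avoiding the allocation:

import qualified Data.Text as T
import Text.Megaparsec
import Text.Megaparsec.Text (Parser)

-- Wrap a Char parser so that call sites get Text back; the intermediate
-- [Char] is still built before packing, so this is convenience, not speed.
someText :: Parser Char -> Parser T.Text
someText p = T.pack <$> some p

identifier :: Parser T.Text
identifier = someText alphaNumChar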

@Zemyla

Zemyla commented Jun 26, 2016

Or you could grab the current input stream with getParserState, parse the Text manually, and put what's left back with updateParserState. You can even do it with an Attoparsec parser:

import Text.Megaparsec.Prim
import Text.Megaparsec.Pos
import qualified Data.Attoparsec.Text as AT
import qualified Data.Text as T
import Data.Text (Text)
import Data.List.NonEmpty (NonEmpty(..))

-- Run an Attoparsec parser over the remaining Text input of a Megaparsec
-- parser (Megaparsec 5 internals), then splice the leftover input and the
-- advanced source position back in with updateParserState.
attoToMega :: MonadParsec e Text m => AT.Parser a -> m a
attoToMega atto = do
  State s (p :| z) tw <- getParserState
  case AT.parse (AT.match atto) s of
    -- AT.match also gives us the matched Text, which we fold over to
    -- advance the source position one character at a time.
    AT.Done s' (t, a) -> do
      let p' = T.foldl' (\pc c -> snd $ defaultUpdatePos tw pc c) p t
      updateParserState (const $ State s' (p' :| z) tw)
      return a
    AT.Fail _ _ctx err -> fail err
    -- Partial means Attoparsec wants more input; there is none to give here.
    AT.Partial {} -> fail "End of input"

That's just a demonstration of what you can do by updating the state directly, rather than with uncons.
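
For instance (a hypothetical usage building on the snippet above, with identifier chosen purely for illustration), an Attoparsec combinator that slices Text directly can be embedded like this:

import Data.Char (isAlphaNum)

-- Hypothetical usage of attoToMega: AT.takeWhile1 slices the Text input
-- directly, so the result comes back as Text with no packing step.
identifier :: MonadParsec e Text m => m Text
identifier = attoToMega (AT.takeWhile1 isAlphaNum)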

@recursion-ninja

As a side note, I've used Megaparsec to parse a stream of custom typed tokens. It worked beautifully, and I would hate to see that go away in order to improve Text-specific performance.

@erikd
Contributor

erikd commented Oct 10, 2016

Like @recursion-ninja, I'm using Megaparsec to parse a stream of tokens (produced by haskell-lexer in my case). Definitely do not want to see token parsing disappear.

@mrkkrp
Owner

mrkkrp commented Jan 12, 2017

It looks like we've settled on what Megaparsec is and that it should stay flexible, even if that makes it less efficient than attoparsec. Making a Text-specialized version of Megaparsec would definitely be an interesting project, but unfortunately I don't have the time or the need to write it (I already maintain 20+ packages). Closing this.

@flip111

flip111 commented Mar 30, 2017

tokens and the parsers written on top of it return lists of tokens, and since the token of Text is Char, you get [Char] ≡ String

Can this be moved into the documentation?

@mrkkrp
Owner

mrkkrp commented Mar 30, 2017

Where do you suggest putting it? It's Text-specific, or at least stream-type-specific, so I'm not sure it belongs with general functions like tokens. We could put it into the docs for string and string', but it's obvious just from their type signatures that they return a list of tokens, not a stream.

@flip111

flip111 commented Mar 30, 2017

@mrkkrp
Owner

mrkkrp commented Mar 30, 2017

I think we should not go into all possible details everywhere. It's documentation, not a tutorial. If anyone starts to wonder why it works this way, it's easy to figure it out.

@mrkkrp
Owner

mrkkrp commented May 5, 2017

@afcowie, this feature is coming in Megaparsec 6: #206. I'll start by making string, string', match, and other parsers return Text for a Text input stream, ByteString for ByteString, etc. This won't have any effect on the generality of the library or its usefulness with custom streams, and it should also improve performance.

If this change is successful, we can talk about expanding the Stream type class further to equip it with something like takeTill/dropTill, which could eventually lead to further performance improvements for streams like Text.

Megaparsec 6 will be released this summer if everything goes as planned.
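
A rough sketch of what that change would mean for a Text stream (assuming OverloadedStrings and the usual Parsec Void Text alias; moduleKeyword is just an illustrative name):

{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (string)

type Parser = Parsec Void Text

-- With a Text input stream, string returns the matched chunk as Text
-- (Tokens Text = Text), so no T.pack is needed any more.
moduleKeyword :: Parser Text
moduleKeyword = string "module"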

@jfaure

jfaure commented Nov 30, 2019

What's the status on this? I would love to have a takeTill :: (Char -> Bool) -> Parser T.Text

@mrkkrp
Owner

mrkkrp commented Nov 30, 2019

Use takeWhileP.
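
For example, a small sketch assuming the usual type Parser = Parsec Void Text alias: the requested takeTill is a one-liner on top of takeWhileP, and it returns Text because takeWhileP returns the stream's chunk type (Tokens Text = Text):

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec

type Parser = Parsec Void Text

-- takeWhileP slices the input rather than building a [Char]; the
-- Maybe String argument is only a label used in error messages.
takeTill :: (Char -> Bool) -> Parser Text
takeTill p = takeWhileP (Just "till") (not . p)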

@recursion-ninja

A few months ago, I updated our parsers to use the primitives found in the MonadParsec typeclass wherever possible. takeWhileP was one of the main functions we switched to, over things like some (satisfy ...). There were dramatic improvements in speed and memory footprint; often multiple orders of magnitude less memory was used.

These primitives are, in my experience, very well designed and optimized.
