Character combinators returning Text rather than [Char]? #106

Closed
istathar opened this issue May 16, 2016 · 18 comments

@istathar

When working with

import Text.Megaparsec
import Text.Megaparsec.Text

I wrote a parser that did something simple, like:

identifier = some alphaNumChar

No big deal. But the type of that is Parser String, which surprised me a little, since the specific backend module I've brought in is the Text one. Even though some is collecting a list of Char, I naively assumed that, because the stream is over Text, it would magically reuse the buffer underlying the Text object (the way taking a substring of a ByteString or Text does) and, more magically still, take a parser of many Chars and give me a Parser Text.

Is there some stream-aware way of combining character combinators that will reuse the existing storage and return Text objects, rather than requiring one to call T.pack a lot? Having to do

identifier = T.pack <$> some alphaNumChar

seemed clunky; my application carries on using Text internally, so returning Text objects from the parser would be preferable.

Maybe I'm missing something obvious or idiomatic when it comes to Haskell parsing; apologies if this is noise.

AfC

@mrkkrp
Owner

mrkkrp commented May 16, 2016

tokens and the parsers written on top of it return lists of tokens, and since the token of Text is Char, you get [Char] ≡ String, which you then convert to Text afterwards. “Autopacking” is possible, but I wonder whether it's really worth it:

  • You may want to get String anyway: for short sequences of characters it's fine, and String is still widely used. If you get Text, you need to convert back.
  • If you need Text or ByteString, the conversion is trivial to perform.
  • Automatic transformation would require a special version of some, for example. Currently we use some from Control.Applicative and I think we should continue doing that. Many functions would need to be changed, and not all of them are part of the library.
  • You don't get a performance boost this way because Megaparsec extracts tokens from the stream one by one anyway; we don't code every primitive separately for each type of stream (as Attoparsec does). The only thing that changes between streams is how we uncons them.

So I think you should use T.pack if you want to get Text from your parser. That's not a big problem, in my experience at least.

@istathar
Author

You don't get a performance boost

Hm. The only reason I'd have a Text in the first place is if I'd loaded it from a ByteString or file and called decodeUtf8 or similar on it.

Meanwhile, extracting tokens from the stream one by one (a character at a time), sure, but ultimately as you pick pieces out of the underlying source you're identifying ranges and just copying those out. It seems a shame to go via String after all these years of avoiding String like the plague.

using Control.Applicative's some

At work @thsutton pointed out that maybe we need specialized range functions, like someText say.

Not a problem to use T.pack

It's ugly :)

++

I'm here because attoparsec's error messages are bloody awful, but it's consistently at the top for performance, in no small part because it's not allocating. I was hoping all the work you'd put into megaparsec would offer a way to be competitive with that, and it looked at least as nice as trifecta. Building up Strings for every character I go through as I extract text from a file I'm parsing is going to be hideously expensive at any kind of scale, though.

I was about to suggest to myself "ok, use Text.Megaparsec.ByteString instead", but that doesn't seem to get me any further ahead in terms of avoiding a monstrous amount of allocating.

Maybe I'm missing the point; maybe allocating a Char for every single character in a large input file and making cons-list Strings out of them isn't a problem anymore. It sure used to be. Obviously I have to write a benchmark that tests my own use case against all three major libraries, yada yada. I can't see how this is going to end well, but I'll try it.

Feel free to close if the answer remains "keep building up [Char] from ByteString or Text input".

Thanks for all your hard work, and congrats on 5.0!

AfC

@mrkkrp
Owner

mrkkrp commented May 17, 2016

Megaparsec is built for flexibility; sure, it's not as fast as attoparsec, where AFAIK things are coded differently for every type of stream. With the current design we can use the same code with every type of stream, and the tokens in a stream don't even have to be characters. Sure, it would be interesting to make things nicer with respect to performance, but I think that means "write a different library". If you have concrete, practical ideas about what to change, I'm happy to discuss them.

@mrkkrp
Owner

mrkkrp commented May 18, 2016

Actually, I'm now thinking that it's quite possible to fork Megaparsec, remove a bit of functionality (the part related to parsing arbitrary streams of tokens), and tune the rest to perform very fast while consuming only Text. Are you interested in something like this?

@mrkkrp
Owner

mrkkrp commented May 18, 2016

The quality of the error messages will of course stay the same.

@thsutton

@afcowie I think rather than functions specialised to some particular Stream s, I would expect to see a primitive with a type like takeWhile :: MonadParsec e s m => (Token s -> Bool) -> m s, or a version of tokens that uses the Stream type rather than [Token s], like (Token s -> Token s -> Bool) -> s -> m s. But as @mrkkrp mentions, this isn't possible with the existing Stream interface.

If you're really keen to avoid the T.pack calls, you could probably add a small lexer helper pretty easily.
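
Along those lines, a minimal sketch of the someText helper mentioned above (Megaparsec 5 era imports, names purely illustrative); note that it still builds the intermediate [Char] and only then packs it, so it tidies call sites rather than avoiding the allocation:

import qualified Data.Text as T
import Text.Megaparsec
import Text.Megaparsec.Text (Parser)

-- Wrap a Char parser so that call sites get Text back; the intermediate
-- [Char] is still built before packing, so this is convenience, not speed.
someText :: Parser Char -> Parser T.Text
someText p = T.pack <$> some p

identifier :: Parser T.Text
identifier = someText alphaNumChar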

@Zemyla

Zemyla commented Jun 26, 2016

Or you could grab the current input stream with getParserState, parse the Text manually, and put what's left back with updateParserState. You can even do it with an Attoparsec parser:

import Text.Megaparsec.Prim
import Text.Megaparsec.Pos
import qualified Data.Attoparsec.Text as AT
import qualified Data.Text as T
import Data.Text (Text)
import Data.List.NonEmpty (NonEmpty(..))

-- Run an Attoparsec parser over the remaining Text input of a Megaparsec
-- parser (Megaparsec 5 internals), then splice the leftover input and the
-- advanced source position back in with updateParserState.
attoToMega :: MonadParsec e Text m => AT.Parser a -> m a
attoToMega atto = do
  State s (p :| z) tw <- getParserState
  case AT.parse (AT.match atto) s of
    -- AT.match also gives us the matched Text, which we fold over to
    -- advance the source position one character at a time.
    AT.Done s' (t, a) -> do
      let p' = T.foldl' (\pc c -> snd $ defaultUpdatePos tw pc c) p t
      updateParserState (const $ State s' (p' :| z) tw)
      return a
    AT.Fail _ _ctx err -> fail err
    -- Partial means Attoparsec wants more input; there is none to give here.
    AT.Partial {} -> fail "End of input"

That's just a demonstration of what you can do by updating the state directly, rather than with uncons.
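
For instance (a hypothetical usage building on the snippet above, with identifier chosen purely for illustration), an Attoparsec combinator that slices Text directly can be embedded like this:

import Data.Char (isAlphaNum)

-- Hypothetical usage of attoToMega: AT.takeWhile1 slices the Text input
-- directly, so the result comes back as Text with no packing step.
identifier :: MonadParsec e Text m => m Text
identifier = attoToMega (AT.takeWhile1 isAlphaNum)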

@recursion-ninja

As a side note, I've used Megaparsec to parse a stream of custom typed tokens. It worked beautifully, and I would hate to see that go away in order to improve Text-specific performance.

@erikd
Contributor

erikd commented Oct 10, 2016

Like @recursion-ninja, I'm using Megaparsec to parse a stream of tokens (produced by haskell-lexer in my case). Definitely do not want to see token parsing disappear.

@mrkkrp
Owner

mrkkrp commented Jan 12, 2017

It looks like we've settled on what Megaparsec is and that it should stay flexible, even if that makes it less efficient than attoparsec. Making a Text-specialized version of Megaparsec would definitely be an interesting project, but unfortunately I don't have the time or the need to write it (I already maintain 20+ packages). Closing this.

@flip111

flip111 commented Mar 30, 2017

tokens and the parsers written on top of it return lists of tokens, and since the token of Text is Char, you get [Char] ≡ String

Can this be moved into the documentation?

@mrkkrp
Owner

mrkkrp commented Mar 30, 2017

Where do you suggest putting it? It's Text-specific, or at least stream-type-specific, so I'm not sure it belongs with general functions like tokens. We could put it into the docs for string and string', but it's obvious just from their type signatures that they return a list of tokens, not a stream.

@flip111

flip111 commented Mar 30, 2017

@mrkkrp
Owner

mrkkrp commented Mar 30, 2017

I think we should not go into all possible details everywhere. It's documentation, not a tutorial. If anyone starts to wonder why it works this way, it's easy to figure it out.

@mrkkrp
Owner

mrkkrp commented May 5, 2017

@afcowie, this feature is coming in Megaparsec 6: #206. I'll start by making string, string', match, and other parsers return Text for a Text input stream, ByteString for ByteString, etc. This won't have any effect on the generality of the library or its usefulness with custom streams, and it should also improve performance.

If this change is successful, we can talk about expanding the Stream type class further to equip it with something like takeTill/dropTill, which could eventually lead to further performance improvements for streams like Text.

Megaparsec 6 will be released this summer if everything goes as planned.
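
A rough sketch of what that change would mean for a Text stream (assuming OverloadedStrings and the usual Parsec Void Text alias; moduleKeyword is just an illustrative name):

{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (string)

type Parser = Parsec Void Text

-- With a Text input stream, string returns the matched chunk as Text
-- (Tokens Text = Text), so no T.pack is needed any more.
moduleKeyword :: Parser Text
moduleKeyword = string "module"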

@jfaure

jfaure commented Nov 30, 2019

What's the status on this? I would love to have a takeTill :: (Char -> Bool) -> Parser T.Text

@mrkkrp
Owner

mrkkrp commented Nov 30, 2019

Use takeWhileP.
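
For example, a small sketch assuming the usual type Parser = Parsec Void Text alias: the requested takeTill is a one-liner on top of takeWhileP, and it returns Text because takeWhileP returns the stream's chunk type (Tokens Text = Text):

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec

type Parser = Parsec Void Text

-- takeWhileP slices the input rather than building a [Char]; the
-- Maybe String argument is only a label used in error messages.
takeTill :: (Char -> Bool) -> Parser Text
takeTill p = takeWhileP (Just "till") (not . p)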

@recursion-ninja

A few months ago, I updated our parsers to use the primitives found in the MonadParsec typeclass wherever possible. takeWhileP was one of the main functions we switched to, over things like some (satisfy ...). There were dramatic improvements in speed and memory footprint; often multiple orders of magnitude less memory was used.

These primitives are, in my experience, very well designed and optimized.
