This repository was archived by the owner on Aug 3, 2024. It is now read-only.

Conversation

@harpocrates (Collaborator) commented Apr 10, 2018

This is following the discussion in #784.

Potential unresolved issues:

  • What version bounds should be put on parsec and text?
  • I'd like to do more testing (and add more test cases)
  • Checking for noticeable performance regressions

@Fuuzetsu I think you wrote most (all?) of the code I'm touching - would you mind taking a look at this when (if) you have the time?

The Haddock parser no longer needs to worry about bytestrings. All
the internal parsing work in haddock-library happens over 'Text'.
* hyperlinks
* codeblocks
* examples

Pretty much all issues are due to attoparsec's backtracking failure
behaviour vs. parsec's non-backtracking failure behaviour.
@harpocrates changed the title from "Replace 'attoparsec with parsec" to "Replace 'attoparsec' with 'parsec'" Apr 10, 2018
@harpocrates force-pushed the replace-attoparsec-with-parsec branch from ea008de to f63a174 on April 10, 2018 21:32
@Fuuzetsu Fuuzetsu self-assigned this Apr 11, 2018
@Fuuzetsu (Member) left a comment

What version bounds should be put on parsec and text?

No lower than is available in GHC 8.4.x. I'd probably just leave upper bound open.

I'd like to do more testing (and add more test cases)

Please do. Just eyeballing it, I think I spotted a bug, so I'm marking this as "request changes". When writing the parser I explicitly relied on the backtracking behaviour. I saw you sprinkled some try around, but I can't tell if it's enough. Ideally, now that we do not have backtracking by default, maybe some parsers can be written in a better way? I'm not requesting this, though; I just want to be confident that the semantics haven't changed.

Checking for noticeable performance regressions

Please look and tell us about space too. Haddock tends to use a lot of memory so ideally we're not adding to that burden in the parser.

When convenient, you should also squash your changes a bit; some commits are unnecessary.


notChar :: Char -> Parser Char
notChar = lift . Attoparsec.notChar
char = Parsec.char
Member:

I think it's okay to not do this. It was done before because we had the whole StateT wrapper and it was just exported for our own convenience. I'd just import parsec in the module I need these functions in and use it directly. It's not like we have to maintain the existing API either as the type is changing too. Same comment for all the other trivial re-exports.
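If these re-exports were kept anyway, the parsec counterpart of notChar is a one-liner over satisfy. This is a sketch, not code from the PR, and it is shown over String for brevity (the PR's parser runs over Text):

```haskell
import qualified Text.Parsec as Parsec
import Text.Parsec.String (Parser)  -- Parser over String, for the sketch

-- Match any single character except the given one;
-- the parsec counterpart of attoparsec's notChar.
notChar :: Char -> Parser Char
notChar c = Parsec.satisfy (/= c)
```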


decimal :: Integral a => Parser a
decimal = lift Attoparsec.decimal
decimal = foldl' step 0 `fmap` Parsec.many1 (satisfy isDigit)
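For context, the step function elided from the diff is presumably a digit-accumulating fold. A self-contained sketch of the parsec-based version (over String for brevity; step is my reconstruction, not quoted from the PR) might be:

```haskell
import Data.Char (digitToInt, isDigit)
import Data.List (foldl')
import qualified Text.Parsec as Parsec
import Text.Parsec.String (Parser)

-- Parse one or more ASCII digits and fold them into a number,
-- most significant digit first.
decimal :: Integral a => Parser a
decimal = foldl' step 0 <$> Parsec.many1 (Parsec.satisfy isDigit)
  where
    step a c = a * 10 + fromIntegral (digitToInt c)
```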

import Data.Maybe (isJust)
import Data.Char (isSpace)

-- | Like 'T.uncons', but for the last character instead of the first.
Member:

Precisely this function already exists in text and is omgoptimized https://hackage.haskell.org/package/text-1.2.3.0/docs/Data-Text.html#v:unsnoc
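For reference, T.unsnoc (available in text >= 1.1) is the exact mirror of T.uncons; a minimal illustration:

```haskell
import qualified Data.Text as T

-- T.unsnoc splits off the *last* character:
--   T.unsnoc "abc" == Just ("ab", 'c')
--   T.unsnoc ""    == Nothing
lastChar :: T.Text -> Maybe Char
lastChar t = snd <$> T.unsnoc t
```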

Collaborator (Author):

I swear I looked for this, but didn't find it. *facepalm*

strip :: String -> String
strip = (\f -> f . f) $ dropWhile isSpace . reverse
strip :: Text -> Text
strip = T.strip
Member:

As in the other module, we don't really care about this function; we can just use T.strip at the call site instead.

char = Parsec.char

many' :: Parser a -> Parser [a]
many' = Parsec.manyAccum (\x xs -> x `seq` x : xs)
Member:

Do we care about forcing the values?

Collaborator (Author):

I was blindly replicating the behaviour of attoparsec. Looking at the use site, however, it looks like we don't care about forcing values (they are already forced).

skipHorizontalSpace :: Parser ()
skipHorizontalSpace = skipWhile isHorizontalSpace
skipHorizontalSpace = skipWhile (`inClass` horizontalSpace)
Member:

This and takeHorizontalSpace are the only functions using inClass, and they both use it to check for horizontal space. I would remove inClass and horizontalSpace and replace them with

isHorizontalSpace :: Char -> Bool
isHorizontalSpace ' ' = True
isHorizontalSpace '\t' = True
...

or similar implementation and just use that.
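A complete version of the suggested predicate could look like the following. The exact character set is an assumption on my part; attoparsec's own isHorizontalSpace covers just space and tab in the ASCII range:

```haskell
-- Horizontal whitespace only: space and tab, matching
-- attoparsec's isHorizontalSpace for the ASCII range.
isHorizontalSpace :: Char -> Bool
isHorizontalSpace ' '  = True
isHorizontalSpace '\t' = True
isHorizontalSpace _    = False
```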

makeLabeled f input = case break isSpace $ removeEscapes $ strip input of
(uri, "") -> f uri Nothing
(uri, label) -> f uri (Just $ dropWhile isSpace label)
makeLabeled :: (String -> Maybe String -> a) -> Text -> a
Member:

It might be OK to just change the function to Text -> Maybe Text -> a here; the inputs are always UTF-8.

Collaborator (Author):

Hmm. I disagree here. This sounds like a change we should make once (if?) DocH moves to using Text.

takeUntil :: ByteString -> Parser ByteString
takeUntil end_ = dropEnd <$> requireEnd (scan (False, end) p) >>= gotSome
removeEscapes :: Text -> Text
removeEscapes = T.filter (/= '\\') . T.replace "\\\\" "\\"
Member:

Didn't you change semantics? The previous version would leave a single slash where two were present, while this version first replaces double slashes with single ones and then removes all slashes anyway. I thought we had a test for this...
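The suspected semantic change can be checked directly. This standalone reproduction uses the definition from the diff; the expected-behaviour comment reflects the reviewer's reading, not the PR's tests:

```haskell
import qualified Data.Text as T

-- The definition under review: collapse double backslashes,
-- then strip every remaining backslash.
removeEscapes :: T.Text -> T.Text
removeEscapes = T.filter (/= '\\') . T.replace (T.pack "\\\\") (T.pack "\\")

-- An input containing a literal double backslash should come out
-- with a single backslash, but the trailing filter removes the
-- surviving backslash too, so the backslash disappears entirely.
```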

@hvr (Member) commented Apr 11, 2018

I really want to see a benchmark, as if there's a serious regression it needs to be fixed before this can be merged.

@gbaz (Contributor) commented Apr 11, 2018

What would be really nice is a sort of comprehensive-ish "golden" test suite for regression here. Like, parse a representative chunk of hackage with one parser and the other and diff the results...

@hvr (Member) commented Apr 11, 2018

lens, Cabal, and base would be good benchmarks to see if there are regressions in time or space.

@harpocrates (Collaborator, Author):

I've added bounds on parsec/text based on those in Cabal.

I like the idea of testing for differences in output/time/space via a handful of large projects. lens, Cabal, and base sound good to me. I am prepared for a regression in time/space. The question is really just how much.


A random data point before I check out for the night: I just built docs for lens using Haddock before and after this PR. Since I don't have installed haddock interface files with the right version numbers (that got bumped in anticipation of ghc-8.4.2) and since I built only the HTML, I would expect a sizeable chunk of runtime to go to doc parsing. Yet, the two runtimes look about the same. 😃

@harpocrates (Collaborator, Author):

Alright, I've run some preliminary tests and I'm pretty sure the output of haddock is the same before and after this PR for both lens and Cabal (haven't gotten around to base yet). I'm going to continue to run this sort of correctness test (and maybe even semi-automate it) for more packages.

I'd appreciate some help figuring out how to measure useful things in the space/time dimensions. Here's what I've tried (using Cabal as the library I'm testing Haddock on).

  1. Build Haddock with profiling and optimizations

    haddock$ cabal install --enable-profiling --ghc-options="-O2"
    
  2. Build documentation for Cabal using this Haddock, and generate an .hp file

    cabal$ cabal new-haddock --haddock-options="+RTS -p -hy -RTS" \
                             --with-haddock=<path-to-haddock-from-step1> Cabal
    
  3. Plot the generated haddock.hp file:

    cabal$ hp2ps Cabal/haddock.hp
    
    

I've done this for Haddock before and after this PR; the outputs were attached as heap-profile graphs (images not reproduced here).

I've noticed that runtime can fluctuate by up to 2 seconds (for both the parsec and attoparsec programs), but the memory profile stays the same. At a glance the graphs seem to be telling me something obvious: that the GHC API part of Haddock dominates its memory use. That seems somewhat not useful.

@Fuuzetsu (Member):

This is what I said is the case on the relevant ticket. You could try to measure parser vs parser more directly but I don't see much point. As long as there is no change in semantics and perf isn't so bad that it's actually visible in a run, it's fine.

@alexbiehl (Member) commented Apr 12, 2018

Alec, this looks great. You can take a look at d270aee. You could add cost centres and timers in LexParseRn or somewhere around there.

@hvr (Member) commented Apr 14, 2018

I'm still skeptical about this change, which doesn't fix any actual issue other than possibly some (IMO questionable) aesthetic goal, as this basically means that haddock-library will gain more non-private(!) dependencies (parsec, mtl, and text); this increases the cost of the Hi Haddock GSoC and its follow-up projects, which require moving haddock-library into GHC in order to get access to the Haddock AST :-(

@harpocrates (Collaborator, Author) commented Apr 14, 2018

I've been looking at profiling/timing information for the past couple of days and I'm increasingly convinced that there really is no (visible) regression. So far, I've investigated on cabal, idris, pandoc, text, and containers.

I'm still skeptical about this change which doesn't fix any actual issue other than possibly some imo questionable aesthetic goal, as this basically means that haddock-library will gain more non-private(!) dependencies (parsec, mtl and text);

I dream that one day Haddock will use text from end to end. :) More seriously:

  • we get rid of issues that come from pretending that UTF-8 is Latin-1 (stuff like this is gross: see "Support unicode operators, proper modules" #747)
  • vendored code is not something to be proud of: this is why we have package managers.
  • Haddock's primary (long-term) purpose is, IMO, to parse documentation. A dependency on parsec makes sense. Especially if Haddock is to someday support other formats too: it will have to be able to do even more parsing!
  • parsec is already in the boot libs!

this increases the cost of the Hi Haddock GSOC and its followup projects which requires to move haddock-library into GHC in order to get access to the Haddock AST :-(

I'm confused now.

I thought the plan was to move the raw docstrings into the interface files, not fully parsed DocH. In which case there should be no reason for a dependency on haddock-library - haddock-library provides a way to interpret those strings into something structured. That way GHC can remain agnostic to the actual Haddock format.

  • the barrier to contribution stays lower
  • we open up the possibility to create tools similar to Haddock (but which interpret docstrings differently)
  • we don't force GHC to parse Haddocks before writing out an interface file

Does this mean that any sort of markdown or reST backend would have to be integrated with GHC?

Is there a mailing list or IRC channel where this sort of thing gets discussed? 😕

@gbaz (Contributor) commented Apr 14, 2018

Is there a mailing list or IRC channel where this sort of thing gets discussed?

@harpocrates I agree we need some structured discussion on the Hi Haddock plans. It's cross cutting and there are a few different takes on what is to be done.

Here is the state of the discussion: indeed, the plan is only to put raw docstrings into interface files. But we also need a way to put in the information about link dependencies so that we can rename and resolve links to other modules. To extract those links, one path of attack is to use haddock-library to do the parsing. We could also not extract those until haddock build time, but then the project fails its major goal of letting haddock move away from directly needing to muck with the GHC API.

Here's what I've been suggesting instead (though herbert is wary of it)

  • Put in a lexer that is simple but can yield false positives -- i.e. it finds every link, but also possibly a few things that aren't links. Then it puts all the information into the hi file. We have to be sure that it actually catches all links, but this shouldn't be hard. Also, this means that we can't warn on unlinkable identifiers at compile time, just at haddock build time, but I think that should be OK. Finally, this still means that new format-type backends would need to be integrated with GHC, but only to the extent they increased the sorts of link indicators we'd need to lex for. (markdown, as has been proposed, wouldn't, I think, but reST would).
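Purely as an illustration (none of this is from the proposal; the function name and character set are invented), an over-approximating link lexer could be as simple as splitting on non-identifier characters and keeping every candidate, false positives included:

```haskell
import Data.Char (isAlphaNum)

-- Collect every identifier-shaped token; anything that is not a
-- real in-scope name is a harmless false positive to be filtered
-- out later at haddock build time.
candidateLinks :: String -> [String]
candidateLinks = filter (not . null) . splitNonIdent
  where
    isIdentChar c = isAlphaNum c || c `elem` "._'"
    splitNonIdent s = case break (not . isIdentChar) s of
      (w, [])       -> [w]
      (w, _ : rest) -> w : splitNonIdent rest
```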

Some semi-related discussion here: #796

@harpocrates (Collaborator, Author):

I support the lexer idea. Even if it has false positives, the chances those map onto actual identifiers are pretty low, so you won't even be polluting interface files. At worst, GHC ends up looking up more names than it needs to.

@hvr Is there a place where we could discuss this? Accessible to the public? Even a GitHub issue would be fine by me...

@harpocrates (Collaborator, Author):

@Fuuzetsu I think the behaviour is correct now. I've diffed the output and compared runtimes for

  • Cabal
  • Idris
  • Pandoc
  • lens

I found exactly one difference: the new parser notices that non-breaking spaces are still spaces (the space right after -- | on this line). I think the new behaviour is desirable.

I've additionally spent a day profiling time/allocations. Using parsec instead of attoparsec has roughly doubled allocations/time of parsing (but the difference is not visible when running Haddock as a whole). I can recover some of that (and bring the difference down to 1.5x) by optimizing peekChar, scan, and takeWhile into less readable text-specific parsers. For example:

-- | Take characters satisfying the predicate zero or more times and
-- collect the result in a 'Text'.
takeWhile :: (Char -> Bool) -> Parser Text
takeWhile f = do
    inp <- Parsec.getInput
    pos <- Parsec.getPosition
    let (i, pos', inp') = go 0 pos inp
    Parsec.setInput inp'
    Parsec.setPosition pos'
    pure (T.take i inp)
  where
  go !i !pos txt = case T.uncons txt of
                     Just (c, txt') | f c -> go (i+1) (Parsec.updatePosChar pos c) txt'
                     _ -> (i, pos, txt)

I've opted not to do this because I don't think the extra performance is worth the readability. I'm willing to be told otherwise.
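For comparison, the more readable combinator version he presumably kept (my reconstruction, not code quoted from the PR) would be a one-liner that packs the characters instead of slicing the input:

```haskell
import qualified Data.Text as T
import qualified Text.Parsec as Parsec
import Text.Parsec.Text (Parser)  -- parsec's Parser over strict Text
import Prelude hiding (takeWhile)

-- The straightforward version: run 'satisfy' repeatedly and pack the
-- characters, rather than slicing the input Text directly.
takeWhile :: (Char -> Bool) -> Parser T.Text
takeWhile f = T.pack <$> Parsec.many (Parsec.satisfy f)
```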

@alexbiehl (Member):

Good stuff. One thing to solve is the friction with the hi haddock proposal.

  • Personally I think since parsec is itself a boot library this change is ok to include with ghc
  • The other option is going with Gershom's partial lexer idea, which would avoid making haddock-library a dependency

@harpocrates (Collaborator, Author):

+1 for the partial lexer idea. I'm against the idea of making haddock-library a dependency (regardless of whether or not that impacts this PR getting merged).

@Fuuzetsu (Member):

As far as this PR by itself goes, it looks OK to me. I don't know about any of the partial lexer stuff, so it's up to you (plural) to decide what to do with it.

@hvr (Member) commented Apr 24, 2018

I'm against the idea of making haddock-library a dependency

Unfortunately I see no way to avoid this if Hi Haddock is to achieve the goal that motivated me to write up the proposal in the first place :-(

If it can't become a dependency/boot-lib of GHC then Hi Haddock is basically dead on arrival for me

@harpocrates (Collaborator, Author):

Unfortunately I see no way to avoid this if Hi Haddock is to achieve the goal that motivated me to write up the proposal in the first place :-(

If it can't become a dependency of GHC Hi Haddock is basically dead on arrival for me

I'm really not trying to be obtuse here, but I still don't understand exactly what is being planned: why must haddock-library be a dependency of GHC (apart from finding links)? I've read both the proposal (https://summer.haskell.org/ideas.html#hi-haddock) and the GHC dev thread (https://mail.haskell.org/pipermail/ghc-devs/2014-March/004355.html) - is there a more detailed proposal somewhere?

The idea that haddock-library (and consequently the Haddock format) is going to be baked into GHC is what bugs me and coworkers with whom I've discussed this. We like the idea of putting just raw docstrings (possibly along with some LinkEnv-like information) into interface files because that means GHC remains markup-agnostic. Ideally, Haddock becomes less coupled with GHC, not more.

On that note, I think I'm going to step back as I feel I'm starting to tread on other people's plans, and I'm starting to sound like a broken record 😄.

@sjakobi (Member) commented Apr 24, 2018

Is there a mailing list or IRC channel where this sort of thing gets discussed?

@harpocrates I agree we need some structured discussion on the Hi Haddock plans.

@harpocrates, @gbaz: I have created #805 as a place for discussing the question of the format.

@sjakobi (Member) commented Apr 24, 2018

@harpocrates

Unfortunately I see no way to avoid this if Hi Haddock is to achieve the goal that motivated me to write up the proposal in the first place :-(

If it can't become a dependency of GHC Hi Haddock is basically dead on arrival for me

I'm really not trying to be obtuse here, but I still don't understand exactly what is being planned: why must haddock-library be a dependency of GHC (apart from finding links)?

Even if ghc wouldn't directly depend on haddock-library to parse the documentation, ghci would need haddock-library so it can parse and pretty-print haddocks as part of the new :doc command.

I've read both the proposal (https://summer.haskell.org/ideas.html#hi-haddock) and the GHC dev thread (https://mail.haskell.org/pipermail/ghc-devs/2014-March/004355.html) - is there a more detailed proposal somewhere?

The proposal I submitted is here. Although much of what I proposed will probably come out quite differently, so take it with a grain of salt! :)

The idea that haddock-library (and consequently the Haddock format) is going to be baked into GHC is what bugs me and coworkers with whom I've discussed this.

Are there other reasons for that, apart from your apparent preference for a different markup format?

@harpocrates (Collaborator, Author):

Even if ghc wouldn't directly depend on haddock-library to parse the documentation, ghci would need haddock-library so it can parse and pretty-print haddocks as part of the new :doc command.

I'm not sure I'd want documentation to be pretty-printed if it is going to be printed to a terminal. That also seems like something that could be a GHCi option for those who want it (something like the -interactive-print option). One such implementation could then depend on haddock-library, but I wouldn't package that implementation with GHCi.

Are there other reasons for that, apart from your apparent preference for a different markup format?

I'd like to see Haddock become less tightly coupled with GHC for ease of development (both GHC's and Haddock's).

Also, congratulations on starting this project! 🎉

@hvr (Member) commented Apr 24, 2018

I'd like to see Haddock become less tightly coupled with GHC

Sorry, that's unrealistic. That's not going to happen.

@gbaz (Contributor) commented Apr 24, 2018

Sorry, that's unrealistic. That's not going to happen.

Herbert, this is not helpful. We're having a discussion on whether it is realistic or not. We know your position. You need to motivate it. Further, this is not just your call, or my call. Bringing a bunch more code into the dependencies for GHC is something which the core GHC team needs to sign off on, and which should be discussed on the ghc-devs list. A great deal of effort in the past has gone into removing external deps from GHC as much as possible to simplify development. I know this can't always happen, and I know that expanding the set of core libs can also be to the good -- but there is an innate tradeoff there, and it is worth hashing out the consequences in detail with the full set of core GHC devs.

I'll follow up more on @sjakobi 's new thread (thanks!) and then as a first step in the community-bonding period of the process, the discussion should be brought to the attention of the ghc-devs list so we can get some broader input.

@hvr (Member) commented Apr 24, 2018

Bringing a bunch more code into the dependencies for GHC is something which the core GHC team needs to sign off on,

This is what people are missing or ignoring; we're not bringing more code in. It was always there, concealed.

@hvr (Member) commented Apr 25, 2018

Since everyone's been arguing so strongly for the principle of avoiding vendored code inlined from a different package, can we get this over with? What's missing for this PR to be merged?

@gbaz (Contributor) commented Apr 25, 2018

My only question is the timing vis-a-vis pending release plans. Outside of that, I think your concerns were the main outstanding thing. While you don't seem happy about this still, you do seem resigned to it, and as such I vote we merge (barring, of course, release plans which may dictate otherwise?).

@alexbiehl (Member):

Gershom, I am ready to merge. Waiting for Alec to confirm.

@gbaz @harpocrates

@harpocrates (Collaborator, Author):

@alexbiehl Go ahead and merge.

@alexbiehl alexbiehl merged commit 79c7159 into haskell:ghc-8.4 Apr 25, 2018
@alexbiehl (Member):

Thanks Alec! Looking forward to having a Text-based haddock-library. (Hold back on that for a few days; I wanted to offer this as a project for BayHack in the hopes of increasing contributors.)

@harpocrates (Collaborator, Author):

@alexbiehl Thanks for mentioning that! I was about to jump into that 😃. More contributors is always a good thing.

@alexbiehl (Member):

@harpocrates I know there was some work on the String-to-Text replacement at BayHack and a promise to open a PR soon(tm). I am not sure it will happen - I haven't heard back from the authors. So if you have some spare time, do it.

sjakobi pushed a commit to sjakobi/haddock that referenced this pull request Jun 10, 2018
* Remove attoparsec with parsec and start fixing failed parses

* Make tests pass

* Fix encoding issues

The Haddock parser no longer needs to worry about bytestrings. All
the internal parsing work in haddock-library happens over 'Text'.

* Remove attoparsec vendor

* Fix stuff broken in 'attoparsec' -> 'parsec'

* hyperlinks
* codeblocks
* examples

Pretty much all issues are due to attoparsec's backtracking failure
behaviour vs. parsec's non-backtracking failure behaviour.

* Fix small TODOs

* Missing quote + Haddocks

* Better handle spaces before/after paragraphs

* Address review comments

(cherry picked from commit 79c7159)
6 participants