This repository was archived by the owner on Aug 3, 2024. It is now read-only.

Conversation

@harpocrates (Collaborator) commented Apr 10, 2018

This is following the discussion in #784.

Potential unresolved issues:

  • What version bounds should be put on parsec and text?
  • I'd like to do more testing (and add more test cases)
  • Checking for noticeable performance regressions

@Fuuzetsu I think you wrote most (all?) of the code I'm touching - would you mind taking a look at this when (if) you have the time?

The Haddock parser no longer needs to worry about bytestrings. All
the internal parsing work in haddock-library happens over 'Text'.
* hyperlinks
* codeblocks
* examples

Pretty much all issues are due to attoparsec's backtracking failure
behaviour vs. parsec's non-backtracking failure behaviour.
@harpocrates changed the title from "Replace 'attoparsec with parsec" to "Replace 'attoparsec' with 'parsec'" Apr 10, 2018
@harpocrates force-pushed the replace-attoparsec-with-parsec branch from ea008de to f63a174 on April 10, 2018 21:32
@Fuuzetsu Fuuzetsu self-assigned this Apr 11, 2018
@Fuuzetsu (Member) left a comment

What version bounds should be put on parsec and text?

No lower than is available in GHC 8.4.x. I'd probably just leave upper bound open.

I'd like to do more testing (and add more test cases)

Please do. Just eyeballing it, I think I spotted a bug, so I'm marking this as "request changes". When writing the parser I explicitly relied on the backtracking behaviour. I saw you sprinkled some try around, but I can't tell if it's enough. Ideally, now that we do not have backtracking by default, maybe some parsers can be written in a better way? I'm not requesting this, though; I just want to be confident that the semantics haven't changed.

Checking for noticeable performance regressions

Please look and tell us about space too. Haddock tends to use a lot of memory so ideally we're not adding to that burden in the parser.

When convenient, you should also squash your changes a bit; some commits are unnecessary.


notChar :: Char -> Parser Char
notChar = lift . Attoparsec.notChar
char = Parsec.char
Member:

I think it's okay to not do this. It was done before because we had the whole StateT wrapper and it was just exported for our own convenience. I'd just import parsec in the module I need these functions in and use it directly. It's not like we have to maintain the existing API either as the type is changing too. Same comment for all the other trivial re-exports.
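If these re-exports were kept anyway, the parsec counterpart of notChar is a one-liner over satisfy. This is a sketch, not code from the PR, and it is shown over String for brevity (the PR's parser runs over Text):

```haskell
import qualified Text.Parsec as Parsec
import Text.Parsec.String (Parser)  -- Parser over String, for the sketch

-- Match any single character except the given one;
-- the parsec counterpart of attoparsec's notChar.
notChar :: Char -> Parser Char
notChar c = Parsec.satisfy (/= c)
```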


decimal :: Integral a => Parser a
decimal = lift Attoparsec.decimal
decimal = foldl' step 0 `fmap` Parsec.many1 (satisfy isDigit)
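For context, the step function elided from the diff is presumably a digit-accumulating fold. A self-contained sketch of the parsec-based version (over String for brevity; step is my reconstruction, not quoted from the PR) might be:

```haskell
import Data.Char (digitToInt, isDigit)
import Data.List (foldl')
import qualified Text.Parsec as Parsec
import Text.Parsec.String (Parser)

-- Parse one or more ASCII digits and fold them into a number,
-- most significant digit first.
decimal :: Integral a => Parser a
decimal = foldl' step 0 <$> Parsec.many1 (Parsec.satisfy isDigit)
  where
    step a c = a * 10 + fromIntegral (digitToInt c)
```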

import Data.Maybe (isJust)
import Data.Char (isSpace)

-- | Like 'T.uncons', but for the last character instead of the first.
Member:

Precisely this function already exists in text and is omgoptimized https://hackage.haskell.org/package/text-1.2.3.0/docs/Data-Text.html#v:unsnoc
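For reference, T.unsnoc (available in text >= 1.1) is the exact mirror of T.uncons; a minimal illustration:

```haskell
import qualified Data.Text as T

-- T.unsnoc splits off the *last* character:
--   T.unsnoc "abc" == Just ("ab", 'c')
--   T.unsnoc ""    == Nothing
lastChar :: T.Text -> Maybe Char
lastChar t = snd <$> T.unsnoc t
```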

Collaborator (Author):

I swear I looked for this, but didn't find it. *facepalm*

strip :: String -> String
strip = (\f -> f . f) $ dropWhile isSpace . reverse
strip :: Text -> Text
strip = T.strip
Member:

As in the other module, we don't really care about this function; we can just use T.strip at the call site instead.

char = Parsec.char

many' :: Parser a -> Parser [a]
many' = Parsec.manyAccum (\x xs -> x `seq` x : xs)
Member:

Do we care about forcing the values?

Collaborator (Author):

I was blindly replicating the behaviour of attoparsec. Looking at the use site, however, it looks like we don't care about forcing values (they are already forced).

skipHorizontalSpace :: Parser ()
skipHorizontalSpace = skipWhile isHorizontalSpace
skipHorizontalSpace = skipWhile (`inClass` horizontalSpace)
Member:

This and takeHorizontalSpace are the only functions using inClass, and they both use it to check for horizontal space. I would remove inClass and horizontalSpace and replace them with

isHorizontalSpace :: Char -> Bool
isHorizontalSpace ' ' = True
isHorizontalSpace '\t' = True
...

or similar implementation and just use that.
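A complete version of the suggested predicate could look like the following. The exact character set is an assumption on my part; attoparsec's own isHorizontalSpace covers just space and tab in the ASCII range:

```haskell
-- Horizontal whitespace only: space and tab, matching
-- attoparsec's isHorizontalSpace for the ASCII range.
isHorizontalSpace :: Char -> Bool
isHorizontalSpace ' '  = True
isHorizontalSpace '\t' = True
isHorizontalSpace _    = False
```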

makeLabeled f input = case break isSpace $ removeEscapes $ strip input of
(uri, "") -> f uri Nothing
(uri, label) -> f uri (Just $ dropWhile isSpace label)
makeLabeled :: (String -> Maybe String -> a) -> Text -> a
Member:

It might be OK to just change the function to Text -> Maybe Text -> a here; the inputs are always UTF-8.

Collaborator (Author):

Hmm. I disagree here. This sounds like a change we should make once (if?) DocH moves to using Text.

takeUntil :: ByteString -> Parser ByteString
takeUntil end_ = dropEnd <$> requireEnd (scan (False, end) p) >>= gotSome
removeEscapes :: Text -> Text
removeEscapes = T.filter (/= '\\') . T.replace "\\\\" "\\"
Member:

Didn't you change semantics? The previous version would leave a single slash where two were present, while this version first replaces double slashes with single ones and then removes all slashes anyway. I thought we had a test for this...
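The suspected semantic change can be checked directly. This standalone reproduction uses the definition from the diff; the expected-behaviour comment reflects the reviewer's reading, not the PR's tests:

```haskell
import qualified Data.Text as T

-- The definition under review: collapse double backslashes,
-- then strip every remaining backslash.
removeEscapes :: T.Text -> T.Text
removeEscapes = T.filter (/= '\\') . T.replace (T.pack "\\\\") (T.pack "\\")

-- An input containing a literal double backslash should come out
-- with a single backslash, but the trailing filter removes the
-- surviving backslash too, so the backslash disappears entirely.
```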

@hvr (Member) commented Apr 11, 2018

I really want to see a benchmark, as if there's a serious regression it needs to be fixed before this can be merged.

@gbaz (Contributor) commented Apr 11, 2018

What would be really nice is a sort of comprehensive-ish "golden" test suite for regression here. Like, parse a representative chunk of hackage with one parser and the other and diff the results...

@hvr (Member) commented Apr 11, 2018

lens, Cabal, and base would be good benchmarks to see if there are regressions in time or space.

@harpocrates (Collaborator, Author):

I've added bounds on parsec/text based on those in Cabal.

I like the idea of testing for differences in output/time/space via a handful of large projects. lens, Cabal, and base sound good to me. I am prepared for a regression in time/space. The question is really just how much.


A random data point before I check out for the night: I just built docs for lens using Haddock before and after this PR. Since I don't have installed haddock interface files with the right version numbers (that got bumped in anticipation of ghc-8.4.2) and since I built only the HTML, I would expect a sizeable chunk of runtime to go to doc parsing. Yet, the two runtimes look about the same. 😃

@harpocrates (Collaborator, Author):

Alright, I've run some preliminary tests and I'm pretty sure the output of haddock is the same before and after this PR for both lens and Cabal (haven't gotten around to base yet). I'm going to continue to run this sort of correctness test (and maybe even semi-automate it) for more packages.

I'd appreciate some help figuring out how to measure useful things in the space/time dimensions. Here's what I've tried (using Cabal as the library I'm testing Haddock on).

  1. Build Haddock with profiling and optimizations

    haddock$ cabal install --enable-profiling --ghc-options="-O2"
    
  2. Build documentation for Cabal using this Haddock, and generate an .hp file

    cabal$ cabal new-haddock --haddock-options="+RTS -p -hy -RTS" \
                             --with-haddock=<path-to-haddock-from-step1> Cabal
    
  3. Plot the generated haddock.hp file:

    cabal$ hp2ps Cabal/haddock.hp
    
    

I've done this for Haddock before and after this PR; the outputs were attached as heap-profile graphs (images not reproduced here).

I've noticed that runtime can fluctuate by up to 2 seconds (for both the parsec and attoparsec programs), but the memory profile stays the same. At a glance the graphs seem to be telling me something obvious: that the GHC API part of Haddock dominates its memory use. That seems somewhat not useful.

@Fuuzetsu (Member):

This is what I said is the case on the relevant ticket. You could try to measure parser vs parser more directly but I don't see much point. As long as there is no change in semantics and perf isn't so bad that it's actually visible in a run, it's fine.

@alexbiehl (Member) commented Apr 12, 2018

Alec, this looks great. You can take a look at d270aee. You could add cost centres and timers in LexParseRn or somewhere around there.

@hvr (Member) commented Apr 14, 2018

I'm still skeptical about this change, which doesn't fix any actual issue other than possibly some (IMO questionable) aesthetic goal, as this basically means that haddock-library will gain more non-private(!) dependencies (parsec, mtl, and text); this increases the cost of the Hi Haddock GSoC and its follow-up projects, which require moving haddock-library into GHC in order to get access to the Haddock AST :-(

@harpocrates (Collaborator, Author) commented Apr 14, 2018

I've been looking at profiling/timing information for the past couple of days and I'm increasingly convinced that there really is no (visible) regression. So far, I've investigated on cabal, idris, pandoc, text, and containers.

I'm still skeptical about this change which doesn't fix any actual issue other than possibly some imo questionable aesthetic goal, as this basically means that haddock-library will gain more non-private(!) dependencies (parsec, mtl and text);

I dream that one day Haddock will use text from end to end. :) More seriously:

  • we get rid of issues that come from pretending that UTF-8 is Latin-1 (stuff like this is gross: see "Support unicode operators, proper modules" #747)
  • vendored code is not something to be proud of: this is why we have package managers.
  • Haddock's primary (long-term) purpose is, IMO, to parse documentation. A dependency on parsec makes sense. Especially if Haddock is to someday support other formats too: it will have to be able to do even more parsing!
  • parsec is already in the boot libs!

this increases the cost of the Hi Haddock GSOC and its followup projects which requires to move haddock-library into GHC in order to get access to the Haddock AST :-(

I'm confused now.

I thought the plan was to move the raw docstrings into the interface files, not fully parsed DocH. In which case there should be no reason for a dependency on haddock-library - haddock-library provides a way to interpret those strings into something structured. That way GHC can remain agnostic to the actual Haddock format.

  • the barrier to contribution stays lower
  • we open up the possibility to create tools similar to Haddock (but which interpret docstrings differently)
  • we don't force GHC to parse Haddocks before writing out an interface file

Does this mean that any sort of markdown or reST backend would have to be integrated with GHC?

Is there a mailing list or IRC channel where this sort of thing gets discussed? 😕

@gbaz (Contributor) commented Apr 14, 2018

Is there a mailing list or IRC channel where this sort of thing gets discussed?

@harpocrates I agree we need some structured discussion on the Hi Haddock plans. It's cross cutting and there are a few different takes on what is to be done.

Here is the state of the discussion: indeed, the plan is only to put raw docstrings into interface files. But we also need a way to put in the information about link dependencies so that we can rename and resolve links to other modules. To extract those links, one path of attack is to use haddock-library to do the parsing. We could also not extract those until haddock build time, but then the project fails its major goal of letting haddock move away from directly needing to muck with the GHC API.

Here's what I've been suggesting instead (though herbert is wary of it)

  • Put in a lexer that is simple but can yield false positives -- i.e. it finds every link, but also possibly a few things that aren't links. Then it puts all the information into the hi file. We have to be sure that it actually catches all links, but this shouldn't be hard. Also, this means that we can't warn on unlinkable identifiers at compile time, just at haddock build time, but I think that should be OK. Finally, this still means that new format-type backends would need to be integrated with GHC, but only to the extent they increased the sorts of link indicators we'd need to lex for. (markdown, as has been proposed, wouldn't, I think, but reST would).
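Purely as an illustration (none of this is from the proposal; the function name and character set are invented), an over-approximating link lexer could be as simple as splitting on non-identifier characters and keeping every candidate, false positives included:

```haskell
import Data.Char (isAlphaNum)

-- Collect every identifier-shaped token; anything that is not a
-- real in-scope name is a harmless false positive to be filtered
-- out later at haddock build time.
candidateLinks :: String -> [String]
candidateLinks = filter (not . null) . splitNonIdent
  where
    isIdentChar c = isAlphaNum c || c `elem` "._'"
    splitNonIdent s = case break (not . isIdentChar) s of
      (w, [])       -> [w]
      (w, _ : rest) -> w : splitNonIdent rest
```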

Some semi-related discussion here: #796

@harpocrates (Collaborator, Author):

I support the lexer idea. Even if it has false positives, the chances those map onto actual identifiers are pretty low, so you won't even be polluting interface files. At worst, GHC ends up looking up more names than it needs to.

@hvr Is there a place where we could discuss this? Accessible to the public? Even a GitHub issue would be fine by me...

@harpocrates (Collaborator, Author):

@Fuuzetsu I think the behaviour is correct now. I've diffed the output and compared runtimes for

  • Cabal
  • Idris
  • Pandoc
  • lens

I found exactly one difference: the new parser notices that non-breaking spaces are still spaces (the space right after -- | on this line). I think the new behaviour is desirable.

I've additionally spent a day profiling time/allocations. Using parsec instead of attoparsec has roughly doubled allocations/time of parsing (but the difference is not visible when running Haddock as a whole). I can recover some of that (and bring the difference down to 1.5x) by optimizing peekChar, scan, and takeWhile into less readable text-specific parsers. For example:

-- | Take characters satisfying the predicate zero or more times and
-- collect the result in a 'Text'.
takeWhile :: (Char -> Bool) -> Parser Text
takeWhile f = do
    inp <- Parsec.getInput
    pos <- Parsec.getPosition
    let (i, pos', inp') = go 0 pos inp
    Parsec.setInput inp'
    Parsec.setPosition pos'
    pure (T.take i inp)
  where
  go !i !pos txt = case T.uncons txt of
                     Just (c, txt') | f c -> go (i+1) (Parsec.updatePosChar pos c) txt'
                     _ -> (i, pos, txt)

I've opted not to do this because I don't think the extra performance is worth the readability. I'm willing to be told otherwise.
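For comparison, the more readable combinator version he presumably kept (my reconstruction, not code quoted from the PR) would be a one-liner that packs the characters instead of slicing the input:

```haskell
import qualified Data.Text as T
import qualified Text.Parsec as Parsec
import Text.Parsec.Text (Parser)  -- parsec's Parser over strict Text
import Prelude hiding (takeWhile)

-- The straightforward version: run 'satisfy' repeatedly and pack the
-- characters, rather than slicing the input Text directly.
takeWhile :: (Char -> Bool) -> Parser T.Text
takeWhile f = T.pack <$> Parsec.many (Parsec.satisfy f)
```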

@alexbiehl (Member):

Good stuff. One thing to solve is the friction with the hi haddock proposal.

  • Personally I think since parsec is itself a boot library this change is ok to include with ghc
  • The other option is going with Gershom's partial lexer idea, which would avoid making haddock-library a dependency

@harpocrates (Collaborator, Author):

+1 for the partial lexer idea. I'm against the idea of making haddock-library a dependency (regardless of whether or not that impacts this PR getting merged).

@Fuuzetsu (Member):

As far as this PR by itself goes, it looks OK to me. I don't know about any of the partial lexer stuff, so it's up to you (plural) to decide what to do with it.

@hvr (Member) commented Apr 24, 2018

I'm against the idea of making haddock-library a dependency

Unfortunately I see no way to avoid this if Hi Haddock is to achieve the goal that motivated me to write up the proposal in the first place :-(

If it can't become a dependency/boot-lib of GHC then Hi Haddock is basically dead on arrival for me

@harpocrates (Collaborator, Author):

Unfortunately I see no way to avoid this if Hi Haddock is to achieve the goal that motivated me to write up the proposal in the first place :-(

If it can't become a dependency of GHC Hi Haddock is basically dead on arrival for me

I'm really not trying to be obtuse here, but I still don't understand exactly what is being planned: why must haddock-library be a dependency of GHC (apart from finding links)? I've read both the proposal (https://summer.haskell.org/ideas.html#hi-haddock) and the GHC dev thread (https://mail.haskell.org/pipermail/ghc-devs/2014-March/004355.html) - is there a more detailed proposal somewhere?

The idea that haddock-library (and consequently the Haddock format) is going to be baked into GHC is what bugs me and coworkers with whom I've discussed this. We like the idea of putting just raw docstrings (possibly along with some LinkEnv-like information) into interface files because that means GHC remains markup-agnostic. Ideally, Haddock becomes less coupled with GHC, not more.

On that note, I think I'm going to step back as I feel I'm starting to tread on other people's plans, and I'm starting to sound like a broken record 😄.

@sjakobi (Member) commented Apr 24, 2018

Is there a mailing list or IRC channel where this sort of thing gets discussed?

@harpocrates I agree we need some structured discussion on the Hi Haddock plans.

@harpocrates, @gbaz: I have created #805 as a place for discussing the question of the format.

@sjakobi (Member) commented Apr 24, 2018

@harpocrates

Unfortunately I see no way to avoid this if Hi Haddock is to achieve the goal that motivated me to write up the proposal in the first place :-(

If it can't become a dependency of GHC Hi Haddock is basically dead on arrival for me

I'm really not trying to be obtuse here, but I still don't understand exactly what is being planned: why must haddock-library be a dependency of GHC (apart from finding links)?

Even if ghc wouldn't directly depend on haddock-library to parse the documentation, ghci would need haddock-library so it can parse and pretty-print haddocks as part of the new :doc command.

I've read both the proposal (https://summer.haskell.org/ideas.html#hi-haddock) and the GHC dev thread (https://mail.haskell.org/pipermail/ghc-devs/2014-March/004355.html) - is there a more detailed proposal somewhere?

The proposal I submitted is here. Although much of what I proposed will probably come out quite differently, so take it with a grain of salt! :)

The idea that haddock-library (and consequently the Haddock format) is going to be baked into GHC is what bugs me and coworkers with whom I've discussed this.

Are there other reasons for that, apart from your apparent preference for a different markup format?

@harpocrates (Collaborator, Author):

Even if ghc wouldn't directly depend on haddock-library to parse the documentation, ghci would need haddock-library so it can parse and pretty-print haddocks as part of the new :doc command.

I'm not sure I'd want documentation to be pretty-printed if it is going to be printed to a terminal. That also seems like something that could be a GHCi option for those who want it (something like the -interactive-print option). One such implementation could then depend on haddock-library, but I wouldn't package that implementation with GHCi.

Are there other reasons for that, apart from your apparent preference for a different markup format?

I'd like to see Haddock become less tightly coupled with GHC for ease of development (both GHC's and Haddock's).

Also, congratulations on starting this project! 🎉

@hvr (Member) commented Apr 24, 2018

I'd like to see Haddock become less tightly coupled with GHC

Sorry, that's unrealistic. That's not going to happen.

@gbaz (Contributor) commented Apr 24, 2018

Sorry, that's unrealistic. That's not going to happen.

Herbert, this is not helpful. We're having a discussion on whether it is realistic or not. We know your position. You need to motivate it. Further, this is not just your call, or my call. Bringing a bunch more code into the dependencies for GHC is something which the core GHC team needs to sign off on, and which should be discussed on the ghc-devs list. A great deal of effort in the past has gone into removing external deps from GHC as much as possible to simplify development. I know this can't always happen, and I know that expanding the set of core libs can also be to the good -- but there is an innate tradeoff there, and it is worth hashing out the consequences in detail with the full set of core GHC devs.

I'll follow up more on @sjakobi 's new thread (thanks!) and then as a first step in the community-bonding period of the process, the discussion should be brought to the attention of the ghc-devs list so we can get some broader input.

@hvr (Member) commented Apr 24, 2018

Bringing a bunch more code into the dependencies for GHC is something which the core GHC team needs to sign off on,

This is what people are missing or ignoring; we're not bringing more code in. It was always there, concealed.

@hvr (Member) commented Apr 25, 2018

Since everyone's been arguing so strongly for the principle of avoiding vendored code inlined from a different package, can we get this over with? What's missing for this PR to be merged?

@gbaz (Contributor) commented Apr 25, 2018

My only question is the timing vis-a-vis pending release plans. Outside of that, I think your concerns were the main outstanding thing. While you don't seem happy about this still, you do seem resigned to it, and as such I vote we merge (barring, of course, release plans which may dictate otherwise?).

@alexbiehl (Member):

Gershom, I am ready to merge. Waiting for Alec to confirm.

@gbaz @harpocrates

@harpocrates (Collaborator, Author):

@alexbiehl Go ahead and merge.

@alexbiehl alexbiehl merged commit 79c7159 into haskell:ghc-8.4 Apr 25, 2018
@alexbiehl (Member):

Thanks Alec! Looking forward to having a Text-based haddock-library. (Hold back on that for a few days; I wanted to offer this as a project for BayHack in the hopes of increasing contributors.)

@harpocrates (Collaborator, Author):

@alexbiehl Thanks for mentioning that! I was about to jump into that 😃. More contributors is always a good thing.

@alexbiehl (Member):

@harpocrates I know there was some work on the String-to-Text replacement at BayHack and a promise to open a PR soon(tm). I am not sure it will happen - I haven't heard back from the authors. So if you have some spare time, do it.

sjakobi pushed a commit to sjakobi/haddock that referenced this pull request Jun 10, 2018
* Remove attoparsec with parsec and start fixing failed parses

* Make tests pass

* Fix encoding issues

The Haddock parser no longer needs to worry about bytestrings. All
the internal parsing work in haddock-library happens over 'Text'.

* Remove attoparsec vendor

* Fix stuff broken in 'attoparsec' -> 'parsec'

* hyperlinks
* codeblocks
* examples

Pretty much all issues are due to attoparsec's backtracking failure
behaviour vs. parsec's non-backtracking failure behaviour.

* Fix small TODOs

* Missing quote + Haddocks

* Better handle spaces before/after paragraphs

* Address review comments

(cherry picked from commit 79c7159)
6 participants