Replace 'attoparsec' with 'parsec' #799
Conversation
The Haddock parser no longer needs to worry about bytestrings. All the internal parsing work in haddock-library happens over 'Text'.
* hyperlinks
* codeblocks
* examples

Pretty much all issues are due to attoparsec's backtracking failure behaviour vs. parsec's non-backtracking failure behaviour.
> What version bounds should be put on `parsec` and `text`?
No lower than what is available in GHC 8.4.x. I'd probably just leave the upper bound open.
> I'd like to do more testing (and add more test cases)
Please do. I think just from eyeing it I spotted a bug, so I'm marking this as "request changes". When writing the parser I explicitly relied on the backtracking behaviour. I saw you sprinkled some `try` around, but I can't tell if it's enough. Ideally, now that we do not have backtracking by default, maybe some parsers can be written in a better way? I'm not requesting this, though; I just want to be confident that the semantics haven't changed.
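For illustration of the semantic difference being discussed (a generic sketch, not code from the PR, and the `openMarker` names are hypothetical): parsec does not fall back to an alternative once a branch has consumed input unless that branch is wrapped in `try`, whereas attoparsec always backtracks on failure.

```haskell
import Control.Applicative ((<|>))
import qualified Text.Parsec as Parsec
import Text.Parsec.Text (Parser)  -- assumed stand-in for haddock-library's Parser type

-- Fails on the input "@x": string "@{" consumes the '@' before failing,
-- so the alternative on the right is never attempted.
openMarker :: Parser String
openMarker = Parsec.string "@{" <|> Parsec.string "@"

-- Succeeds on "@x": 'try' rewinds the consumed input on failure,
-- which is roughly what attoparsec's default behaviour provided.
openMarker' :: Parser String
openMarker' = Parsec.try (Parsec.string "@{") <|> Parsec.string "@"
```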
> Checking for noticeable performance regressions
Please look and tell us about space too. Haddock tends to use a lot of memory, so ideally we're not adding to that burden in the parser.
When convenient, you should also squash your changes a bit; some commits are unnecessary.
```haskell
notChar :: Char -> Parser Char
notChar = lift . Attoparsec.notChar
char = Parsec.char
```
I think it's okay to not do this. It was done before because we had the whole StateT wrapper and it was just exported for our own convenience. I'd just import parsec in the module I need these functions in and use it directly. It's not like we have to maintain the existing API either as the type is changing too. Same comment for all the other trivial re-exports.
```haskell
decimal :: Integral a => Parser a
decimal = lift Attoparsec.decimal
decimal = foldl' step 0 `fmap` Parsec.many1 (satisfy isDigit)
```
https://hackage.haskell.org/package/parsec-3.1.13.0/docs/Text-Parsec-Char.html#v:digit ; same for hexadecimal
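For illustration, a `decimal` built on Parsec's `digit` might look like the sketch below; the `Parser` alias over `Text` and the `step` helper are assumptions here, not code from the PR.

```haskell
import Data.Char (digitToInt)
import Data.List (foldl')
import qualified Text.Parsec as Parsec
import Text.Parsec.Text (Parser)  -- assumed stand-in for haddock-library's Parser type

-- Sketch: accept one or more digits and fold them into an integral value.
decimal :: Integral a => Parser a
decimal = foldl' step 0 <$> Parsec.many1 Parsec.digit
  where
    step acc c = acc * 10 + fromIntegral (digitToInt c)
```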
```haskell
import Data.Maybe (isJust)
import Data.Char (isSpace)

-- | Like 'T.uncons', but for the last character instead of the first.
```
Precisely this function already exists in `text` and is omgoptimized: https://hackage.haskell.org/package/text-1.2.3.0/docs/Data-Text.html#v:unsnoc
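For reference, a small usage sketch of the library function; `dropTrailingNewline` is hypothetical and not from the PR, it just shows the shape of `T.unsnoc`:

```haskell
import qualified Data.Text as T

-- 'T.unsnoc' splits off the last character, mirroring 'T.uncons' at the front.
dropTrailingNewline :: T.Text -> T.Text
dropTrailingNewline t = case T.unsnoc t of
  Just (t', '\n') -> t'
  _               -> t
```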
I swear I looked for this, but didn't find it. *facepalm*
```haskell
strip :: String -> String
strip = (\f -> f . f) $ dropWhile isSpace . reverse
strip :: Text -> Text
strip = T.strip
```
As in the other module, we don't really need this function; we can just use `T.strip` at the call site instead.
```haskell
char = Parsec.char

many' :: Parser a -> Parser [a]
many' = Parsec.manyAccum (\x xs -> x `seq` x : xs)
```
Do we care about forcing the values?
I was blindly replicating the behaviour of attoparsec. Looking at the use-site, however, it looks like we don't care about forcing values (they are already forced).
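If forcing ever did turn out to matter, one way to keep it on top of plain `Parsec.many` would be the following sketch (assuming the module's `Parser` type; not code from the PR):

```haskell
import qualified Text.Parsec as Parsec
import Text.Parsec.Text (Parser)  -- assumed stand-in for haddock-library's Parser type

-- Sketch: parse with plain 'many', then force each element before returning.
many' :: Parser a -> Parser [a]
many' p = do
  xs <- Parsec.many p
  foldr seq (pure xs) xs
```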
```haskell
skipHorizontalSpace :: Parser ()
skipHorizontalSpace = skipWhile isHorizontalSpace
skipHorizontalSpace = skipWhile (`inClass` horizontalSpace)
```
This and `takeHorizontalSpace` are the only functions using `inClass`, and they both use it to check for horizontal space. I would remove `inClass` and `horizontalSpace` and replace them with

```haskell
isHorizontalSpace :: Char -> Bool
isHorizontalSpace ' '  = True
isHorizontalSpace '\t' = True
...
```

or a similar implementation, and just use that.
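A spelled-out version along those lines might look like this; the exact character set is an assumption and should match whatever the existing `horizontalSpace` contains:

```haskell
-- Assumed character set: space and tab only.
isHorizontalSpace :: Char -> Bool
isHorizontalSpace ' '  = True
isHorizontalSpace '\t' = True
isHorizontalSpace _    = False
```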
```haskell
makeLabeled f input = case break isSpace $ removeEscapes $ strip input of
  (uri, "")    -> f uri Nothing
  (uri, label) -> f uri (Just $ dropWhile isSpace label)
makeLabeled :: (String -> Maybe String -> a) -> Text -> a
```
It might be OK to just change the function to `Text -> Maybe Text -> a` here; the inputs are always UTF-8.
Hmm. I disagree here. This sounds like a change we should make once (if?) `DocH` moves to using `Text`.
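For context, a sketch of how the Text-driven `makeLabeled` can keep the String-based callback by unpacking at the boundary; this is an illustration under that assumption, not necessarily the PR's exact code, and `removeEscapes` is elided here:

```haskell
import Data.Char (isSpace)
import Data.Text (Text)
import qualified Data.Text as T

-- Sketch: split the stripped input at the first whitespace; hand the URI and
-- the optional label to the callback as Strings.
makeLabeled :: (String -> Maybe String -> a) -> Text -> a
makeLabeled f input =
  let (uri, label) = T.break isSpace (T.strip input)
  in if T.null label
       then f (T.unpack uri) Nothing
       else f (T.unpack uri) (Just (T.unpack (T.stripStart label)))
```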
```haskell
takeUntil :: ByteString -> Parser ByteString
takeUntil end_ = dropEnd <$> requireEnd (scan (False, end) p) >>= gotSome
removeEscapes :: Text -> Text
removeEscapes = T.filter (/= '\\') . T.replace "\\\\" "\\"
```
Didn't you change the semantics? The previous version would leave a single slash where two were present, while this one first replaces double slashes with single ones and then removes all slashes anyway. I thought we had a test for this...
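For illustration, a single-pass Text version that keeps the old behaviour (a doubled backslash collapses to one, a lone backslash is dropped) could look like this sketch; it is not code from the PR:

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Sketch: unfold over the input, handling escape pairs explicitly so that a
-- doubled backslash yields a single backslash and a lone backslash is dropped.
removeEscapes :: Text -> Text
removeEscapes = T.unfoldr step
  where
    step t = case T.uncons t of
      Nothing           -> Nothing
      Just ('\\', rest) -> case T.uncons rest of
        Just ('\\', rest') -> Just ('\\', rest')  -- escaped backslash: keep one
        _                  -> step rest           -- lone backslash: drop it
      Just (c, rest)    -> Just (c, rest)
```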
I really want to see a benchmark, as a serious regression would need to be fixed before this can be merged.
What would be really nice is a sort of comprehensive-ish "golden" test suite for regressions here. Like, parse a representative chunk of Hackage with one parser and the other and diff the results...
I've added bounds on `parsec` and `text`. I like the idea of testing for differences in output/time/space via a handful of large projects. A random data point before I check out for the night: I just built docs for
Alright, I've run some preliminary tests and I'm pretty sure the output of

I'd appreciate some help figuring out how to measure useful things in the space/time dimensions. Here's what I've tried (using
I've done this for Haddock before and after this PR. Here are the outputs:

I've noticed that runtime can fluctuate by up to 2 seconds (for both the
This is what I said is the case on the relevant ticket. You could try to measure parser vs. parser more directly, but I don't see much point. As long as there is no change in semantics and performance isn't so bad that it's actually visible in a run, it's fine.
Alec, this looks great. You can take a look at d270aee. You could add cost centres and timers in LexParseRn or somewhere around there.
I'm still skeptical about this change, which doesn't fix any actual issue other than possibly some (IMO questionable) aesthetic goal, as this basically means that
I've been looking at profiling/timing information for the past couple of days and I'm increasingly convinced that there really is no (visible) regression. So far, I've investigated cabal, idris, pandoc, text, and containers.
I dream that one day Haddock will use
I'm confused now. I thought the plan was to move the raw docstrings into the interface files, not fully parsed
Does this mean that any sort of markdown or reST backend would have to be integrated with GHC? Is there a mailing list or IRC channel where this sort of thing gets discussed? 😕
@harpocrates I agree we need some structured discussion on the Hi Haddock plans. It's cross-cutting and there are a few different takes on what is to be done. Here is the state of the discussion: indeed, the plan is only to put raw docstrings into interface files, but we also need a way to put in the information about link dependencies so that we can rename and resolve links to other modules. To extract those links, one path of attack is to use

Here's what I've been suggesting instead (though Herbert is wary of it)
Some semi-related discussion here: #796
I support the lexer idea. Even if it has false positives, the chances that those map onto actual identifiers are pretty low, so you won't even be polluting interface files. At worst, GHC ends up looking up more names than it needs to. @hvr Is there a place where we could discuss this? Accessible to the public? Even a GitHub issue would be fine by me...
@Fuuzetsu I think the behaviour is correct now. I've diffed the output and compared runtimes for
I found exactly one difference: the new parser notices that non-breaking spaces are still spaces (the space right after

I've additionally spent a day profiling time/allocations. Using

```haskell
-- | Apply a parser for a character zero or more times and collect the result in
-- a string.
takeWhile :: (Char -> Bool) -> Parser Text
takeWhile f = do
  inp <- Parsec.getInput
  pos <- Parsec.getPosition
  let (i, pos', inp') = go 0 pos inp
  Parsec.setInput inp'
  Parsec.setPosition pos'
  pure (T.take i inp)
  where
    go !i !pos txt = case T.uncons txt of
      Just (c, txt') | f c -> go (i+1) (Parsec.updatePosChar pos c) txt'
      _                    -> (i, pos, txt)
```

I've opted not to do this because I don't think the extra performance is worth the readability. I'm willing to be told otherwise.
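For contrast, the straightforward character-at-a-time formulation of `takeWhile` on top of stock combinators would be roughly the following (a sketch only; not necessarily what the PR ends up using):

```haskell
import qualified Data.Text as T
import qualified Text.Parsec as Parsec
import Text.Parsec.Text (Parser)  -- assumed stand-in for haddock-library's Parser type

-- Sketch: collect characters one at a time with 'satisfy' and pack them.
takeWhileSimple :: (Char -> Bool) -> Parser T.Text
takeWhileSimple f = T.pack <$> Parsec.many (Parsec.satisfy f)
```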
Good stuff. One thing to solve is the friction with the Hi Haddock proposal.
+1 for the partial lexer idea. I'm against the idea of making
As far as this PR by itself goes, it looks OK to me. I don't know about any of the partial lexer stuff, so it's up to you (plural) to decide what to do with it.
Unfortunately, I see no way to avoid this if Hi Haddock is to achieve the goal that motivated me to write up the proposal in the first place :-( If it can't become a dependency/boot-lib of GHC, then Hi Haddock is basically dead on arrival for me.
I'm really not trying to be obtuse here, but I still don't understand exactly what is being planned: why must

The idea that

On that note, I think I'm going to step back, as I feel I'm starting to tread on other people's plans and I'm starting to sound like a broken record 😄.
@harpocrates, @gbaz: I have created #805 as a place for discussing the question of the format.
Even if GHC wouldn't directly depend on haddock-library to parse the documentation, GHCi would need haddock-library so it can parse and pretty-print haddocks as part of the new
The proposal I submitted is here, although much of what I proposed will probably come out quite differently, so take it with a grain of salt! :)
Are there other reasons for that, apart from your apparent preference for a different markup format?
I'm not sure I'd want documentation to be pretty-printed if it is going to be printed to a terminal. That also seems like something that could be a GHCi option for those who want it (something like the
I'd like to see Haddock become less tightly coupled with GHC for ease of development (both GHC's and Haddock's). Also, congratulations on starting this project! 🎉
Sorry, that's unrealistic. That's not going to happen.
Herbert, this is not helpful. We're having a discussion on whether it is realistic or not. We know your position. You need to motivate it. Further, this is not just your call, or my call. Bringing a bunch more code into the dependencies for GHC is something which the core GHC team needs to sign off on, and which should be discussed on the ghc-devs list. A big amount of effort in the past has gone into removing external deps from GHC as much as possible to simplify development. I know this can't always happen, and I know that expanding the set of core libs can also be to the good -- but there is an innate tradeoff there, and it is worth hashing out in detail the consequences with the full set of core GHC devs. I'll follow up more on @sjakobi's new thread (thanks!), and then, as a first step in the community-bonding period of the process, the discussion should be brought to the attention of the ghc-devs list so we can get some broader input.
This is what people are missing or ignoring; we're not bringing more code in. It was always there, concealed.
Since everyone's been arguing so strongly for the principle of avoiding vendored code inlined from a different package, can we get this over with? What's missing for this PR to be merged?
My only question is the timing vis-à-vis pending release plans. Outside of that, I think your concerns were the main outstanding thing. While you don't seem happy about this still, you do seem resigned to it, and as such I vote we merge (barring, of course, release plans which may dictate otherwise?).
Gershom, I am ready to merge. Waiting for Alec to confirm.
@alexbiehl Go ahead and merge.
Thanks Alec! Looking forward to having
@alexbiehl Thanks for mentioning that! I was about to jump into that 😃. More contributors are always a good thing.
@harpocrates I know there was some work regarding String-to-Text replacement at BayHack and a promise to open a PR soon(tm). I am not sure it will happen - I haven't heard back from the authors. So if you have some spare time, do it.
* Remove attoparsec with parsec and start fixing failed parses
* Make tests pass
* Fix encoding issues

  The Haddock parser no longer needs to worry about bytestrings. All the internal parsing work in haddock-library happens over 'Text'.

* Remove attoparsec vendor
* Fix stuff broken in 'attoparsec' -> 'parsec'

  * hyperlinks
  * codeblocks
  * examples

  Pretty much all issues are due to attoparsec's backtracking failure behaviour vs. parsec's non-backtracking failure behaviour.

* Fix small TODOs
* Missing quote + Haddocks
* Better handle spaces before/after paragraphs
* Address review comments

(cherry picked from commit 79c7159)
This is following the discussion in #784.

Potential unresolved issues:

* What version bounds should be put on `parsec` and `text`?
* I'd like to do more testing (and add more test cases)

@Fuuzetsu I think you wrote most (all?) of the code I'm touching - would you mind taking a look at this when (if) you have the time?