Maximally decoupling Haddock and GHC #44
We expect the following tasks to be performed in order:

1. Refactor Haddock such that rendering backends are decoupled from GHC internals and moved to a new package `haddock-backends`, GHC-specific components are kept in `haddock-api`, and all other GHC-agnostic components are moved to `haddock-library`;
2. Design the IR file format, which is expressive enough to represent Haddock documentation and be backwards-compatible;
3. Implement serialization/deserialization of the Haddock AST in `haddock-library`, allowing for embedding IR support into tooling;
4. Update `haddock` executable to interact with the IR file format as both input and output.

## People

The people involved in executing this work are:

* Laurent P. René de Cotret (@LaurentRDC)

## Budget

No funding from the Haskell Foundation is required to perform the work outlined in this proposal.
If part 1 goes really well, I think just you, no funding, makes sense. However, I fear part 1 will be quite tricky, and we'll have to think quite hard about what to do.
I don't want to saddle us with bureaucracy, but I almost want two separate proposals. The first is an unbudgeted "Laurent et al. try to split out the backends, see what goes wrong". Then, based on that information, we can write a second proposal, which quite likely is enough work that it will need funding and extra personnel.
We could do it with one proposal, but I suspect that will lead to a ton of errata. We could also just start step 1 right now and keep this PR open until we know more details, doing phase 1 informally and writing just one proposal for phase 2.
A 3rd option is to just keep this as the "things go well, we at least want to try it" plan, but give it some sort of "effort box". Like if this proves to be quite hard, @LaurentRDC, it shouldn't be considered a failure to come back and say "I've spent all the free time I can, we need to do these X specific things to unblock Y, that requires more time". Merely uncovering that information is still successful!
The errata process is only really there to correct severe factual errors in already-accepted proposals. Changes during execution are normal, expected, and don't need updates through this committee, unless the one doing the work feels like our feedback would be valuable. I think it's totally fine to work based on our best guess about the future here, and then if things end up differently, we'll have a conversation.
I'm pretty far away from the details of this, and I had trouble understanding the proposal. But let me see if I get the core idea. You want to split the current Haddock into two: This way (B) is completely decoupled from GHC. I think the expectation is that (A) is relatively small, and (B) is relatively big. (E.g. if (B) was tiny there wouldn't be much point.) Right so far? Please could the proposal say this more clearly? One question:
The more such files we have, the greater the danger of them getting out of sync. One alternative might be to stuff it all into I'm not arguing for this way forward, just framing a question. Probably what I say here is half-baked. But perhaps the proposal would be stronger if it laid out the current state of play and argued for a particular choice, rather than making a choice silently.
That's essentially right. The proposal also suggests splitting (B) further, into:
The proposal may be less clear than ideal because it started as a proposal regarding the Haddock IR format only, and then evolved into a Haddock refactoring following discussions with the TWG. I'll edit the proposal to be more coherent.
My understanding is that there were discussions on the GHC side regarding stuffing documentation into One piece of feedback might be: Instead of doing the work on the Haddock side, let's get GHC to generate documentation IR in |
This slightly came up in our discussions, and I argued against it being part of this proposal -- it's a much less incremental refactor. The main advantage of emitting something entirely new is that this change can be made entirely independent of touching the source code of GHC -- it is purely a change to haddock. Then, as the GHC team sees fit, if the GHC-specific, IR-emitting component is sufficiently small, it could be possible to migrate it to the GHC repo as a future step -- not to be built as part of GHC per se, but just so it is in CI-sync with GHC. Another advantage as noted is that the version-forward-compat/stable stuff is much easier to enforce. Reading through the above, I realize we also didn't discuss the file layout of the intermediate format, where I would imagine that it might differ from *.hi files, etc. In particular, having an *.hi or *.hie per module makes perfect sense. But conceptually I might imagine a single *.haddock per package (or per logical collection of source files if they're not yet in a package). In fact, where precisely these files live as build output (be it in one file or many) hasn't yet been clearly specified here either -- I sort of imagined they would be output wherever haddock normally outputs its html output, etc -- which is, afaik, not side-by-side with *.hi files, *.o files, etc.
@simonpj Very simply, the existing There indeed is feature work that could be done that would change the division of labor between Haddock and GHC, such as the various haddock hi-file thing, but as @gbaz says we wanted to leave that out on purpose. Bottom line is this proposal should only be about internal reworkings of haddock which, only coincidentally, have to do with what parts of haddock use the GHC API. It should be possible for GHC devs to completely ignore this proposal and not have it affect them, though of course you can take an interest if you want. Just want to be clear we are not trying to create any headaches for GHC devs in secret behind GHC devs' backs! :)
For what it's worth, |
Oh! I am not sure any of us were aware of that, though I don't want to presume when I might be just misremembering. The major difference then would be trying to make that file readable without code depending on the GHC API, so we can render afresh haddocks of old code. (I suppose making a textual format might be the goal too, but IMO that is a sort of secondary detail that comes cheaply and easily. Decoupling the IR and back-end from GHC is the big higher-risk, higher-value goal that may or may not be possible.)
As far as I can tell nothing prevents someone from reading Haddock interfaces without depending (directly) on the GHC API. All of All of the necessary interfaces should be available from |
Yes, I did mean indirectly. Until the indirect dependency is removed, I have little confidence any format (binary or textual) is actually GHC-version-agnostic even if it intends to be.
Fair enough. I'm not sure how much work it will be to maintain a copy of the Haskell AST relative to the benefit doing so would provide, but I can nevertheless see the concern. I should also point out that Haddock also currently has a (write-only) JSON encoding of its interface file format. See |
I would strongly advise against attempting to maintain a Haddock copy of the Haskell AST. It's a huge AST and it changes regularly -- indeed that's the main reason GHC and Haddock are coupled. I had previously understood that the proposed Haddock IR was for the documentation only, needing no Haskell syntax tree at all -- or at least only enough to capture type signatures and data type declarations for rendering.
That is exactly the intention right now. @bgamari Is there a reason that the Haddock IR format would be coupled to the Haskell AST beyond some basics? |
At the moment the Haddock interface file format contains the entire |
@bgamari's analysis makes sense to me. Indeed, It's a good question whether or not we can carve off something small enough to both be a less costly duplication and have significant stability benefits over GHC's Also, I think there might be some nomenclature confusion that
Regarding *.haddock files -- if these are indeed interface files, my understanding was that they serve a different purpose, as described in the haddock documentation: "An interface file contains information Haddock needs to produce more documentation that refers to the modules currently being processed - see the [--read-interface] option for more details. The interface file is in a binary format; don’t try to read it." I was not aware that haddock could use them as a standalone intermediate format -- certainly nowhere in the docs is that made clear. In my understanding, the --read-interface option does not generate docs for the interface files -- rather it lets you generate other docs that refer to them. Regarding how much of a decl for an exported declaration is necessary, I would be curious if we can get away with even less than HsType and HsDecl. Could we get away with just a list of tokens, some of which are annotated to refer to other declarations, full stop? My thoughts on this intermediate representation are that it should be as far from haskell AST and as close to what is emitted by the various emitters (backends) as possible. But that's very high level. Concretely, here is the file containing both Interface (the thing we want serializable) and InstalledInterface (the subset of that which currently is exported in *.haddock files): https://github.com/haskell/haddock/blob/main/haddock-api/src/Haddock/Types.hs So perhaps we simply try to decouple |
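To make the "list of tokens, some of which are annotated to refer to other declarations" idea concrete, here is a minimal sketch. Every name in it (`Target`, `Token`, `Decl`, `exampleSig`, `renderText`, `renderHtml`) and the link scheme are hypothetical and chosen for illustration only; none of them comes from Haddock's actual `Interface` type. It only shows that a backend could be written against such a representation without importing anything from GHC.

```haskell
-- Hypothetical sketch: not Haddock's real data model.

-- | A reference to another documented entity.
data Target = Target
  { targetModule :: String
  , targetName   :: String
  } deriving (Eq, Show, Read)

-- | One piece of an already pretty-printed declaration.
data Token
  = Plain String        -- keywords, punctuation, whitespace
  | Link String Target  -- displayed text plus the entity it points at
  deriving (Eq, Show, Read)

type Decl = [Token]

-- | A made-up signature: runIt :: FilePath -> IO String
exampleSig :: Decl
exampleSig =
  [ Plain "runIt :: "
  , Link "FilePath" (Target "System.IO" "FilePath")
  , Plain " -> "
  , Link "IO" (Target "System.IO" "IO")
  , Plain " "
  , Link "String" (Target "Data.String" "String")
  ]

-- | Toy plain-text backend: drop the annotations.
renderText :: Decl -> String
renderText = concatMap tokenText
  where
    tokenText (Plain s)  = s
    tokenText (Link s _) = s

-- | Toy HTML backend: turn annotations into anchors.
-- The URL scheme below is invented for this example.
renderHtml :: Decl -> String
renderHtml = concatMap tokenHtml
  where
    tokenHtml (Plain s)  = escape s
    tokenHtml (Link s t) =
      "<a href=\"" ++ href t ++ "\">" ++ escape s ++ "</a>"
    href t = map dotToDash (targetModule t) ++ ".html#" ++ targetName t
    dotToDash c = if c == '.' then '-' else c
    escape = concatMap escapeChar
    escapeChar '<' = "&lt;"
    escapeChar '>' = "&gt;"
    escapeChar '&' = "&amp;"
    escapeChar c   = [c]

main :: IO ()
main = do
  putStrLn (renderText exampleSig)
  putStrLn (renderHtml exampleSig)
  -- Derived Show/Read stand in for a real serialization format here.
  print (read (show exampleSig) == exampleSig)
```

The trade-off of a representation this flat is that the GHC-specific side has to do all pretty-printing up front, so backends that want to lay out a signature differently would have less structure to work with.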
As a user, I think the problems in the problem statement could be clarified! In the first problem, the final "Therefore," seems to come from outer space. :) Can you clarify how documentation from a package generated years ago appears "frozen in time"? Do you mean that the only way to generate docs for old versions of software is to also use an old version of haddock, which may format things differently? Furthermore, does this proposal actually fix that? In the second problem, can you specify explicitly which tools might benefit? Is it haskell-language-server? Would it benefit right away, or would future work be needed? In general, I <3 refactoring, but I think it needs to be tied to actual user-facing changes to be effective and well-scoped. If this proposal is intended to be a first step, I think the rest of the steps up to a real user-facing change should be described, even if it ends up being a very small change!
I should highlight my comment is "as a user" --- if you find my feedback useful, great! If not, no action needed. :P |
I've updated the Problem statement section to make it more clear. Please check it out and don't hesitate to use the 'Review changes' functionality to help me make the proposal easier to read.
I personally interact with Haddock in two ways: In practice, the idea behind this proposal came from the point-of-view of Hackage maintenance, so I expect that this would be the first place where we'll see user-facing changes enabled by this proposal. I've mentioned this expectation in passing in the proposal. |
Ahh yes, this is my misunderstanding. That being said, I think in principle it should be possible to produce documentation from Haddock interface files. Good catch! |
After considering the discussions here and within the technical working group, I think I need to dive into Haddock's source code in order to determine if it is possible to represent the Haddock documentation AST with very little overlap with the Haskell AST. @Kleidukos and I have a plan to coordinate on some Haddock refactoring in the new year. This will be a good opportunity for me to think about how this proposal can be practically achieved. I will report back when I have something tangible.
Just wanted to record here that we discovered that the Process haddocks (https://hackage.haskell.org/package/process-1.6.16.0/docs/System-Process.html) give most of the "annotate different parts of the AST" features of haddock a workout, and so are useful to refer to in this discussion as examples. |
I have studied the Haddock codebase and discussed with @Kleidukos. We are of the opinion that while the results of the proposal are desirable, it would require a lot more work and complexity than simply 'mirroring parts of the Haskell AST' into Haddock. Kleidukos has identified other work which will make this proposal easier to implement in the future. I will use my limited volunteering time towards this end. Until then, I suggest closing this thread soon unless someone else wants to push this forward.
Thanks for looking into this. If you could summarize the overall difficulties a bit for posterity, that might be useful the next time this kind of idea comes up. It seems to occur on a somewhat regular basis, and it would be great if we could save the next explorers some time by providing even a sketch of a map. |
Good idea. Let's take a look at a single backend (HTML) as an example. First off, the Haskell AST is referenced explicitly (for example, in In addition, there is a lot of GHC-specific data, apart from the Haskell AST, which is wired explicitly throughout the HTML backend. The backend depends explicitly on many things from the
Finally, there is the data packaged into the Haddock The latter problem appears to be the most problematic to me. The |
After discussion with the Technical Working Group, we've agreed to put this proposal in a 'stasis' stage until the work can either:
But also, to submit this idea as a GSoC proposal!
As suggested, this proposal has been submitted as a Google Summer of Code project idea. |
@LaurentRDC When I write a proposal for this, where do I need to submit it, given that this organization is not in GSoC?
We're in the process of organizing a Haskell alternative to GSoC, in line with earlier years. Stay tuned! |
This proposal has become dormant for now; feel free to re-open when new things come up. |
I worked on this proposal for Summer of Haskell. I explored the possibility of serializing the Haskell AST to a JSON-based representation, allowing for de-serialization to be done in a forward-compatible manner. The final result of this project is a demo of what this could look like and an overview of difficulties faced by this project. |
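As an illustration of what "forward-compatible" de-serialization could mean in practice, here is a minimal sketch. It is not the Summer of Haskell project's actual design. The `Value` type is a hand-rolled stand-in for a parsed JSON document (a real implementation would presumably use a JSON library), and the constructor tags, field names, and the `TyFuture` tag are all invented. The idea shown is that an older reader keeps unknown constructors as opaque payloads instead of failing, so newer files can still be partially rendered.

```haskell
-- Hypothetical sketch: all tags and field names are invented.

-- | Stand-in for a parsed JSON value: a string, or an object that
-- carries a constructor tag and named fields.
data Value
  = VString String
  | VObject String [(String, Value)]
  deriving (Eq, Show)

-- | The subset of types that this (older) reader understands.
data Ty
  = TyCon String
  | TyFun Ty Ty
  | TyUnknown String Value  -- unrecognised tag, raw payload preserved
  deriving (Eq, Show)

-- | Decode what we know; never fail on what we do not.
decodeTy :: Value -> Ty
decodeTy v@(VObject tag fields) = case tag of
  "TyCon"
    | Just (VString n) <- lookup "name" fields -> TyCon n
  "TyFun"
    | Just a <- lookup "arg" fields
    , Just r <- lookup "res" fields -> TyFun (decodeTy a) (decodeTy r)
  _ -> TyUnknown tag v
decodeTy v = TyUnknown "<not an object>" v

-- | Render known parts normally and mark unknown parts explicitly.
renderTy :: Ty -> String
renderTy (TyCon n)         = n
renderTy (TyFun a r)       = renderArg a ++ " -> " ++ renderTy r
renderTy (TyUnknown tag _) = "<unsupported: " ++ tag ++ ">"

-- | Parenthesise function types in argument position.
renderArg :: Ty -> String
renderArg t@(TyFun _ _) = "(" ++ renderTy t ++ ")"
renderArg t             = renderTy t

-- | A document written by a newer producer that knows "TyFuture".
sample :: Value
sample = VObject "TyFun"
  [ ("arg", VObject "TyCon" [("name", VString "Int")])
  , ("res", VObject "TyFuture" [("payload", VString "opaque")])
  ]

main :: IO ()
main = putStrLn (renderTy (decodeTy sample))
```

Keeping the raw payload in `TyUnknown` means a tool can round-trip or report the unsupported part rather than silently dropping it.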
Haddock is the standard Haskell documentation tool which parses source files and generates documentation in a few formats, most notably HTML. In practice, development and usage of Haddock is strongly coupled to GHC internals, which slows development of Haddock and makes it more difficult for tooling to integrate Haddock. This proposal seeks to decouple Haddock from GHC as much as possible.