LaurentRDC
Contributor

Haddock is the standard Haskell documentation tool which parses source files and generates documentation in a few formats, most notably HTML. In practice, development and usage of Haddock is strongly coupled to GHC internals, which slows development of Haddock and makes it more difficult for tooling to integrate Haddock. This proposal seeks to decouple Haddock from GHC as much as possible.

Comment on lines +99 to +114
We expect the following tasks to be performed in order:

1. Refactor Haddock such that rendering backends are decoupled from GHC internals and moved to a new package `haddock-backends`, GHC-specific components are kept in `haddock-api`, and all other GHC-agnostic components are moved to `haddock-library`;
2. Design the IR file format, which is expressive enough to represent Haddock documentation and be backwards-compatible;
3. Implement serialization/deserialization of the Haddock AST in `haddock-library`, allowing for embedding IR support into tooling;
4. Update `haddock` executable to interact with the IR file format as both input and output.
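To make steps 2 and 3 concrete, here is a minimal sketch of what a GHC-independent IR with a serialization round trip could look like. Everything in it is an assumption made for illustration: `DocItem`, `ModuleDoc`, `encodeIR`, and `decodeIR` are hypothetical names, not types or functions from `haddock-library`, and the derived `Show`/`Read` instances merely stand in for whatever file format the design step would settle on.

```haskell
import Text.Read (readMaybe)

-- | Hypothetical, GHC-independent description of one documented item.
data DocItem = DocItem
  { itemName      :: String  -- ^ exported identifier, e.g. "readProcess"
  , itemSignature :: String  -- ^ pre-rendered type signature
  , itemDoc       :: String  -- ^ documentation text for the item
  } deriving (Eq, Show, Read)

-- | Hypothetical per-module IR, tagged with a format version so that
-- later readers can decide how to interpret older files.
data ModuleDoc = ModuleDoc
  { irVersion  :: Int
  , moduleName :: String
  , items      :: [DocItem]
  } deriving (Eq, Show, Read)

-- | Serialize using the derived 'Show' instance (a stand-in format).
encodeIR :: ModuleDoc -> String
encodeIR = show

-- | Deserialize; returns 'Nothing' on malformed input.
decodeIR :: String -> Maybe ModuleDoc
decodeIR = readMaybe

-- | Example value used to exercise the round trip.
example :: ModuleDoc
example = ModuleDoc
  { irVersion  = 1
  , moduleName = "System.Process"
  , items      =
      [ DocItem "readProcess"
                "FilePath -> [String] -> String -> IO String"
                "Runs a process and returns its standard output."
      ]
  }
```

Nothing here imports the `ghc` package, which is the property the task list is after: a tool could call `decodeIR` without depending on GHC internals.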

## People

The people involved in executing this work are:

* Laurent P. René de Cotret (@LaurentRDC)

## Budget

No funding from the Haskell Foundation is required to perform the work outlined in this proposal.
Contributor

@Ericson2314 Ericson2314 Nov 12, 2022

If part 1 goes really well, I think just you, no funding, makes sense. However I fear part 1 will be quite tricky, and we'll have to think quite hard at what to do.

I don't want to saddle us with bureaucracy but I almost want two separate proposals. The first is an unbudgeted "Laurent et al. try to split out the backends, see what goes wrong". Then based on that information we can write a second proposal which quite likely is enough work that it will need funding and extra personnel.

We could do it with one proposal, but I suspect that will lead to a ton of errata. We could also just start step 1 right now and keep this PR open until we know more details, doing phase 1 informally and writing just one proposal for phase 2.

Contributor

A 3rd option is to just keep this as the "things go well, we at least want to try it" plan, but give it some sort of "effort box". Like if this proves to be quite hard, @LaurentRDC, it shouldn't be considered a failure to come back and say "I've spent all the free time I can, we need to do these X specific things to unblock Y, that requires more time". Merely uncovering that information is still successful!

Contributor

The errata process is only really there to correct severe factual errors in already-accepted proposals. Changes during execution are normal, expected, and don't need updates through this committee, unless the one doing the work feels like our feedback would be valuable. I think it's totally fine to work based on our best guess about the future here, and then if things end up differently, we'll have a conversation.

@simonpj
Contributor

simonpj commented Nov 14, 2022

I'm pretty far away from the details of this, and I had trouble understanding the proposal. But let me see if I get the core idea. You want to split the current Haddock into two:
A. A GHC-specific bit, which parses Haskell source files and spits out some Haddock-specific IR
B. A GHC-independent bit, which parses the Haddock IR and generates lots of documentation stuff in different formats

This way (B) is completely decoupled from GHC. I think the expectation is that (A) is relatively small, and (B) is relatively big. (E.g. if (B) was tiny there wouldn't be much point.)

Right so far? Please could the proposal say this more clearly?

One question:

  • There is a danger of a proliferation of derived files from one Haskell source file M.hs:
    • M.o the compiled code
    • M.hi the GHC interface file
    • M.hie, the HLS stuff (I think)
    • M.haddock, proposed here (I think)

The more such files we have, the greater the danger of them getting out of sync. One alternative might be to stuff it all into M.hi but in such a way that (say) the Haddock parts can be read by (B) in a way that is guaranteed stable across GHC versions, just as if it was in a separate file M.haddock.

I'm not arguing for this way forward, just framing a question. Probably what I say here is half-baked. But perhaps the proposal would be stronger if it laid out the current state of play and argued for a particular choice, rather than making a choice silently.

@LaurentRDC
Contributor Author

> I'm pretty far away from the details of this, and I had trouble understanding the proposal. But let me see if I get the core idea. You want to split the current Haddock into two:
> A. A GHC-specific bit, which parses Haskell source files and spits out some Haddock-specific IR
> B. A GHC-independent bit, which parses the Haddock IR and generates lots of documentation stuff in different formats
>
> This way (B) is completely decoupled from GHC. I think the expectation is that (A) is relatively small, and (B) is relatively big. (E.g. if (B) was tiny there wouldn't be much point.)

That's essentially right. The proposal also suggests splitting (B) further, into:

  • the Haddock AST (and serializing/deserializing Haddock IR);
  • the rendering backends.

The proposal may be less clear than ideal because it started as a proposal regarding the Haddock IR format only, and then evolved into a Haddock refactoring following discussions with the TWG. I'll edit the proposal to be more coherent.

> One question:
>
> There is a danger of a proliferation of derived files from one Haskell source file M.hs:
>
>   • M.o the compiled code
>   • M.hi the GHC interface file
>   • M.hie, the HLS stuff (I think)
>   • M.haddock, proposed here (I think)
>
> The more such files we have, the greater the danger of them getting out of sync. One alternative might be to stuff it all into M.hi but in such a way that (say) the Haddock parts can be read by (B) in a way that is guaranteed stable across GHC versions, just as if it was in a separate file M.haddock.

My understanding is that there were discussions on the GHC side regarding stuffing documentation into *.hi files. But as @Kleidukos has told us, this is a project with a long horizon.

One piece of feedback might be: Instead of doing the work on the Haddock side, let's get GHC to generate documentation IR in *.hi files. This would also solve the GHC coupling problem that this proposal identifies.

@gbaz
Collaborator

gbaz commented Nov 14, 2022

> One piece of feedback might be: Instead of doing the work on the Haddock side, let's get GHC to generate documentation IR in *.hi files. This would also solve the GHC coupling problem that this proposal identifies.

This slightly came up in our discussions, and I argued against it being part of this proposal -- it's a much less incremental refactor.

The main advantage of emitting something entirely new is that this change can be made entirely independent of touching the source code of GHC -- it is purely a change to haddock. Then, as the GHC team sees fit, if the GHC-specific, IR-emitting component is sufficiently small, it could be possible to migrate it to the GHC repo as a future step -- not to be built as part of GHC per se, but just so it is in CI-sync with GHC.

Another advantage as noted is that the version-forward-compat/stable stuff is much easier to enforce.

Reading through the above, I realize we also didn't discuss the file layout of the intermediate format, where I would imagine that it might differ from *.hi files, etc. In particular, having an *.hi or *.hie per module makes perfect sense. But conceptually I might imagine a single *.haddock per package. (or per logical collection of source files if they're not yet in a package). In fact, where precisely these files live as build output (be it in one file or many) hasn't yet been clearly specified here either -- I sort of imagined they would be output wherever haddock normally outputs its html output, etc -- which is, afaik, not side-by-side with *.hi files, *.o files, etc.

@Ericson2314
Contributor

Ericson2314 commented Nov 14, 2022

@simonpj Very simply, the existing haddock executable containing (a) and (b) should continue to exist, and any M.haddock would not be GHC's problem in the slightest.

There indeed is feature work that could be done that would change the division of labor between Haddock and GHC, such as the various haddock hi-file things, but as @gbaz says we wanted to leave that out on purpose.

Bottom line is this proposal should only be about internal reworkings of haddock which, only coincidentally, have to do with what parts of haddock use the GHC API.

It should be possible for GHC devs to completely ignore this proposal and not have it affect them, though of course you can take an interest if you want. Just want to be clear we are not trying to create any headaches for GHC devs in secret behind GHC devs' backs! :)

@bgamari
Contributor

bgamari commented Nov 14, 2022

For what it's worth, haddock already has an intermediate representation which can be serialised to a file. These are typically named <package-name>.haddock. You will see these are already available for packages built with Cabal and can be found via the haddock-interfaces: field in the package database (e.g. see ghc-pkg describe base). Documentation can be built from this interface with haddock's --read-interface flag. The format is binary, not JSON, but otherwise it's not clear to me how the proposed intermediate representation differs from this. Is the proposal to change the encoding to JSON and more precisely define its structure?
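As a small illustration of where these files are recorded, the sketch below pulls the `haddock-interfaces:` field out of text shaped like `ghc-pkg describe` output. It is only a sketch under assumptions: `haddockInterfaces` and `sampleDescribe` are hypothetical names, the sample paths are made up, and the parser assumes the field name and its value sit on a single line with space-separated paths.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (mapMaybe)

-- | Extract the paths listed in lines of the form
-- "haddock-interfaces: <path> <path> ...".
-- Assumption: field and value share one line, and paths contain no spaces.
haddockInterfaces :: String -> [FilePath]
haddockInterfaces =
  concatMap words . mapMaybe (stripPrefix "haddock-interfaces:") . lines

-- | Made-up sample input in the shape of a package description.
sampleDescribe :: String
sampleDescribe = unlines
  [ "name: base"
  , "haddock-interfaces: /opt/ghc/doc/base.haddock"
  , "haddock-html: /opt/ghc/doc/base"
  ]
```

Applied to `sampleDescribe`, `haddockInterfaces` yields the single `.haddock` path and ignores the other fields.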

@Ericson2314
Contributor

Oh! I am not sure any of us were aware of that, though I don't want to presume when I might be just misremembering.

The major difference then would be trying to make that file readable without code depending on the GHC API, so we can render afresh haddocks of old code.

(I suppose making a textual format might be the goal too, but IMO that is a sort of secondary detail that comes cheaply and easily. Decoupling the IR and back-end from GHC is the big, higher-risk, higher-value goal that may or may not be possible.)

@bgamari
Contributor

bgamari commented Nov 14, 2022

> The major difference then would be trying to make that file readable without code depending on the GHC API, so we can render afresh haddocks of old code.

As far as I can tell nothing prevents someone from reading Haddock interfaces without depending (directly) on the GHC API. All of the necessary interfaces should be available from haddock-api (which admittedly does depend upon ghc, although this dependency should be invisible to the user).

@Ericson2314
Contributor

Ericson2314 commented Nov 14, 2022

Yes I did mean indirectly. Until the indirect dependency is removed, I have little confidence any format (binary or textual) is actually GHC-version-agnostic even if it intends to be.

@bgamari
Contributor

bgamari commented Nov 14, 2022

> Yes I did mean indirectly. Until the indirect dependency is removed, I have little confidence any format (binary or textual) is actually GHC-version-agnostic even if it intends to be.

Fair enough. I'm not sure how much work it will be to maintain a copy of the Haskell AST relative to the benefit doing so would provide, but I can nevertheless see the concern.

I should also point out that Haddock currently has a (write-only) JSON encoding of its interface file format. See Haddock.Interface.Json.

@simonpj
Contributor

simonpj commented Nov 14, 2022

> Fair enough. I'm not sure how much work it will be to maintain a copy of the Haskell AST relative to the benefit doing so would provide, but I can nevertheless see the concern.

I would strongly advise against attempting to maintain a Haddock copy of the Haskell AST. It's a huge AST and it changes regularly -- indeed that's the main reason GHC and Haddock are coupled. I had previously understood that the proposed Haddock IR was for the documentation only, needing no Haskell syntax tree at all -- or at least only enough to capture type signatures and data type declarations for rendering.

@LaurentRDC
Contributor Author

> I had previously understood that the proposed Haddock IR was for the documentation only, needing no Haskell syntax tree at all -- or at least only enough to capture type signatures and data type declarations for rendering.

That is exactly the intention right now. @bgamari Is there a reason that the Haddock IR format would be coupled to the Haskell AST beyond some basics?

@bgamari
Contributor

bgamari commented Nov 15, 2022

> That is exactly the intention right now. @bgamari Is there a reason that the Haddock IR format would be coupled to the Haskell AST beyond some basics?

At the moment the Haddock interface file format contains the entire HsDecl for each of the exported declarations. This essentially pulls in the entire AST. It may be possible to refactor much of this dependence away given that we merely need to produce the basic shape of the declaration in the documentation. I would guess you would at very least need HsType and much of HsDecl itself without HsExpr. In light of this, perhaps the duplication wouldn't be too bad. The only potential sticking point may be the hyperlinked sources feature.
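To give a feel for what "the basic shape of the declaration" might mean, here is a rough sketch of a trimmed-down mirror that keeps types and declaration heads but no expressions. `TypeIR`, `DeclIR`, and the render functions are hypothetical names invented for this example; the real `HsType` and `HsDecl` in GHC are far larger, and this sketch makes no claim about which of their constructors could actually be dropped.

```haskell
import Data.List (intercalate)

-- | Hypothetical mirror of the fragment of types needed for rendering.
data TypeIR
  = TyCon String          -- ^ type constructor or variable, e.g. "Int" or "a"
  | TyApp TypeIR TypeIR   -- ^ type application, e.g. Maybe a
  | TyFun TypeIR TypeIR   -- ^ function arrow
  deriving (Eq, Show)

-- | Hypothetical mirror of declaration shapes; no expressions are stored.
data DeclIR
  = SigIR String TypeIR                          -- ^ name :: type
  | DataIR String [String] [(String, [TypeIR])]  -- ^ data T a = C t1 t2 | ...
  deriving (Eq, Show)

renderType :: TypeIR -> String
renderType (TyCon n)   = n
renderType (TyApp f x) = headOf f ++ " " ++ atom x
renderType (TyFun a b) = argOf a ++ " -> " ++ renderType b

-- Application heads and arrow arguments need parentheses only when they
-- are arrows themselves; application arguments need them unless atomic.
headOf, argOf, atom :: TypeIR -> String
headOf t@(TyFun _ _) = parens t
headOf t             = renderType t
argOf = headOf
atom t@(TyCon _) = renderType t
atom t           = parens t

parens :: TypeIR -> String
parens t = "(" ++ renderType t ++ ")"

renderDecl :: DeclIR -> String
renderDecl (SigIR n t) = n ++ " :: " ++ renderType t
renderDecl (DataIR n vars cons) =
  unwords (["data", n] ++ vars) ++ " = " ++ intercalate " | " (map con cons)
  where
    con (c, fields) = unwords (c : map atom fields)
```

Even this toy version has to make decisions about parenthesization, which hints at how quickly a mirrored AST accumulates rendering logic of its own.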

@Ericson2314
Contributor

@bgamari's analysis makes sense to me. Indeed, it's a good question whether or not we can carve off something small enough to both be a less costly duplication and have significant stability benefits over GHC's HsDecl. Indeed, I am not at all confident in the answer!

Also, I think there might be some nomenclature confusion: the AST that haddock-library has for the individual bits of documentation is quite simple, but the AST for the docs of a library as a whole is much more complicated, as it must "hang" those bits of documentation on the interface AST. It seems there are no accepted terms to contrast "single item documentation" vs "entire docs / interface with docs", and that led to the confusion.

@gbaz
Collaborator

gbaz commented Nov 15, 2022

Regarding *.haddock files -- if these are indeed interface files, my understanding was that they serve a different purpose, as described in the haddock documentation: "An interface file contains information Haddock needs to produce more documentation that refers to the modules currently being processed - see the [--read-interface] option for more details. The interface file is in a binary format; don’t try to read it."

I was not aware that haddock could use them as a standalone intermediate format -- certainly nowhere in the docs is that made clear. In my understanding, the --read-interface option does not generate docs for the interface files -- rather it lets you generate other docs that refer to them.

Regarding how much of a decl for an exported declaration is necessary, I would be curious if we can get away with even less than HsType and HsDecl. Could we get away with just a list of tokens, some of which are annotated to refer to other declarations, full stop?
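The "list of tokens" idea can be sketched in a few lines. This is an illustration under assumptions only: `Token`, `renderHtml`, and `renderPlain` are hypothetical names, and the link targets in `readProcessSig` are invented placeholders, not real Haddock URLs.

```haskell
-- | Hypothetical token of an already-laid-out declaration.
data Token
  = Plain String       -- ^ text with no link, e.g. " -> "
  | Ref String String  -- ^ displayed text and the target it refers to
  deriving (Eq, Show)

-- | Escape the characters that are special in HTML text and attributes.
escapeHtml :: String -> String
escapeHtml = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc '"' = "&quot;"
    esc c   = [c]

-- | An HTML backend only needs to know how to emit text and links.
renderHtml :: [Token] -> String
renderHtml = concatMap go
  where
    go (Plain s)      = escapeHtml s
    go (Ref s target) =
      "<a href=\"" ++ escapeHtml target ++ "\">" ++ escapeHtml s ++ "</a>"

-- | A plain-text backend drops the link targets.
renderPlain :: [Token] -> String
renderPlain = concatMap go
  where
    go (Plain s) = s
    go (Ref s _) = s

-- | Example signature; the targets are placeholders.
readProcessSig :: [Token]
readProcessSig =
  [ Plain "readProcess :: "
  , Ref "FilePath" "placeholder.html#t:FilePath"
  , Plain " -> "
  , Ref "IO" "placeholder.html#t:IO"
  , Plain " "
  , Ref "String" "placeholder.html#t:String"
  ]
```

The trade-off is that all layout decisions (parentheses, spacing, operator fixity) would have to be made on the GHC side before the tokens are written out.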

My thoughts on this intermediate representation are that it should be as far from haskell AST and as close to what is emitted by the various emitters (backends) as possible. But that's very high level.

Concretely, here is the file containing both Interface (the thing we want serializable) and InstalledInterface (the subset of that which currently is exported in *.haddock files): https://github.com/haskell/haddock/blob/main/haddock-api/src/Haddock/Types.hs

So perhaps we simply try to decouple Interface from using the ghc package. However, we may also want to simplify interface as we go, and "partially process" things in it which make use of GHC's AST directly to instead make use of an intermediate product derived from further processing GHC's AST into something closer to the end-state used by the emitters. In doing so, this might make reducing the coupling significantly less painful and duplicative.

@chreekat
Member

As a user, I think the problems in the problem statement could be clarified!

In the first problem, the final "Therefore," seems to come from outer space. :) Can you clarify how documentation from a package generated years ago appears "frozen in time"? Do you mean that the only way to generate docs for old versions of software is to also use an old version of haddock, which may format things differently? Furthermore, does this proposal actually fix that?

In the second problem, can you specify explicitly which tools might benefit? Is it haskell-language-server? Would it benefit right away, or would future work be needed?

In general, I <3 refactoring, but I think it needs to be tied to actual user-facing changes to be effective and well-scoped. If this proposal is intended to be a first step, I think the rest of the steps up to a real user-facing change should be described, even if it ends up being a very small change!

@chreekat
Member

I should highlight my comment is "as a user" --- if you find my feedback useful, great! If not, no action needed. :P

@LaurentRDC
Contributor Author

> In the first problem, the final "Therefore," seems to come from outer space. :) Can you clarify how documentation from a package generated years ago appears "frozen in time"? Do you mean that the only way to generate docs for old versions of software is to also use an old version of haddock, which may format things differently? Furthermore, does this proposal actually fix that?
>
> In the second problem, can you specify explicitly which tools might benefit? Is it haskell-language-server? Would it benefit right away, or would future work be needed?

I've updated the Problem statement section to make it more clear. Please check it out and don't hesitate to use the 'Review changes' functionality to help me make the proposal easier to read.

> In general, I <3 refactoring, but I think it needs to be tied to actual user-facing changes to be effective and well-scoped. If this proposal is intended to be a first step, I think the rest of the steps up to a real user-facing change should be described, even if it ends up being a very small change!

I personally interact with Haddock in two ways: cabal haddock and Hackage. The cabal haddock command won't be impacted by this proposal. This is more of a backend improvement; it'll make it easier for other tools (Hackage, HLS, Hoogle, spotlight, non-Haskell tools...) to integrate documentation.

In practice, the idea behind this proposal came from the point-of-view of Hackage maintenance, so I expect that this would be the first place where we'll see user-facing changes enabled by this proposal. I've mentioned this expectation in passing in the proposal.

@bgamari
Contributor

bgamari commented Nov 15, 2022

> In my understanding, the --read-interface option does not generate docs for the interface files -- rather it lets you generate other docs that refer to them.

Ahh yes, this is my misunderstanding. That being said, I think in principle it should be possible to produce documentation from Haddock interface files. Good catch!

@LaurentRDC
Contributor Author

After considering the discussions here and within the technical working group, I think I need to dive into Haddock's source code in order to determine if it is possible to represent the Haddock documentation AST with very little overlap with the Haskell AST.

@Kleidukos and I have a plan to coordinate on some Haddock refactoring in the new year. This will be a good opportunity for me to think about how this proposal can be practically achieved. I will report back when I have something tangible.

@gbaz
Collaborator

gbaz commented Dec 23, 2022

Just wanted to record here that we discovered that the Process haddocks (https://hackage.haskell.org/package/process-1.6.16.0/docs/System-Process.html) give most of the "annotate different parts of the AST" features of haddock a workout, and so are useful to refer to in this discussion as examples.

@LaurentRDC
Contributor Author

LaurentRDC commented Jan 16, 2023

I have studied the Haddock codebase and discussed with @Kleidukos. We are of the opinion that while the results of the proposal are desirable, it would require a lot more work and complexity than simply 'mirroring parts of the Haskell AST' into Haddock.

Kleidukos has identified other work which will make this proposal easier to implement in the future. I will use my limited volunteering time towards this end. Until then, I suggest closing this thread soon unless someone else wants to push this forward.

@david-christiansen
Contributor

Thanks for looking into this.

If you could summarize the overall difficulties a bit for posterity, that might be useful the next time this kind of idea comes up. It seems to occur on a somewhat regular basis, and it would be great if we could save the next explorers some time by providing even a sketch of a map.

@LaurentRDC
Contributor Author

Good idea. Let's take a look at a single backend (HTML) as an example.

First off, the Haskell AST is referenced explicitly (for example, in ppDecl). Therefore, decoupling Haddock from GHC would require a mirroring of the AST for at least type and class declarations (TyClD), type signatures (SigD), foreign declarations (ForD), and more.

In addition, there is a lot of GHC-specific data, apart from the Haskell AST, which is wired explicitly throughout the HTML backend. The backend depends explicitly on many things from the ghc package, such as:

  • the Located, SrcSpan, and other related datatypes;
  • the Module, ModuleName, Name, RdrName, and NamedThing types;
  • the UnitState datatype.

Finally, there is the data packaged into the Haddock Interface type, which is pervasive throughout backend code, and includes many other GHC-specific things such as DynFlags, ClsInst, FamInst, and many more.

The last problem appears to be the most difficult to me. The Interface represents the interface of a module, effectively holding all information needed to render documentation for a module. Changing the Interface type to not depend on GHC would be a large effort at this time, resulting in many Haddock types being mirrored versions of GHC types.
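A toy sketch of what "mirrored versions of GHC types" would involve is shown below. All names are hypothetical: `GhcLikeName` and `GhcLikeSpan` are stand-ins written for this example (the real `Name`, `SrcSpan`, and `Located` live in the ghc package and carry much more), and `NameIR`, `SpanIR`, and `LocatedIR` are invented mirrors. The anchor format is only loosely modeled on Haddock-style page names.

```haskell
-- Stand-ins for GHC-side data (hypothetical; not the real ghc types).
data GhcLikeName = GhcLikeName
  { ghcOcc    :: String  -- ^ occurrence name, e.g. "readProcess"
  , ghcModule :: String  -- ^ defining module
  , ghcUnique :: Int     -- ^ compiler-internal identity, dropped in the mirror
  }

data GhcLikeSpan = GhcLikeSpan
  { file :: FilePath
  , startLine, startCol, endLine, endCol :: Int
  }

-- GHC-free mirrors that a rendering backend could depend on instead.
data NameIR = NameIR { nameText :: String, nameModule :: String }
  deriving (Eq, Show)

data SpanIR = SpanIR { spanFile :: FilePath, spanLine :: Int }
  deriving (Eq, Show)

data LocatedIR a = LocatedIR SpanIR a
  deriving (Eq, Show)

-- Conversions live on the GHC-specific side of the boundary.
toNameIR :: GhcLikeName -> NameIR
toNameIR n = NameIR (ghcOcc n) (ghcModule n)

toSpanIR :: GhcLikeSpan -> SpanIR
toSpanIR s = SpanIR (file s) (startLine s)

toLocatedIR :: GhcLikeSpan -> GhcLikeName -> LocatedIR NameIR
toLocatedIR s n = LocatedIR (toSpanIR s) (toNameIR n)

-- | A backend helper that only sees the mirrored types.
anchor :: LocatedIR NameIR -> String
anchor (LocatedIR _ n) = map dash (nameModule n) ++ ".html#v:" ++ nameText n
  where
    dash '.' = '-'
    dash c   = c
```

Each mirrored type needs a conversion that must be revisited whenever the GHC-side type changes, which is the maintenance cost described above, multiplied across everything the Interface type references.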

@LaurentRDC
Contributor Author

After discussion with the Technical Working Group, we've agreed to put this proposal in a 'stasis' stage until the work can either:

  • be done by someone with more time;
  • be made easier because of underlying decoupling between GHC and e.g. the Haskell syntax tree.

@gbaz
Collaborator

gbaz commented Feb 9, 2023

But also, to submit this idea as a gsoc proposal!

@LaurentRDC
Contributor Author

As suggested, this proposal has been submitted as a Google Summer of Code project idea.

@Aman123lug

@LaurentRDC when I write a proposal for this, where do I need to submit it, since this organization is not in GSoC?

@david-christiansen
Contributor

We're in the process of organizing a Haskell alternative to GSoC, in line with earlier years. Stay tuned!

@LaurentRDC
Contributor Author

This proposal has become dormant for now; feel free to re-open when new things come up.

@LaurentRDC LaurentRDC closed this Sep 15, 2023
@Greg-Bm

Greg-Bm commented Nov 3, 2023

I worked on this proposal for Summer of Haskell. I explored the possibility of serializing the Haskell AST to a JSON-based representation, allowing for de-serialization to be done in a forward-compatible manner. The final result of this project is a demo of what this could look like and an overview of difficulties faced by this project.
A demo is available at https://github.com/Greg8128/proto-docser-hs . A write-up on the project is available at https://docs.google.com/document/d/1Uw4rIf0EvFguLNDBcgaADbhwN0OqIwdw63Eq7_spLjE , and a report on the difficulties faced by this project is available at https://docs.google.com/document/d/1nykZgSi9k_jP1N4ZVZhSdce2jFpRdwNNT3X2YQLF9vo . I hope that this is helpful for future development of this proposal.
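For readers unfamiliar with the term, the sketch below shows one common meaning of forward-compatible deserialization: an older reader ignores fields it does not know and falls back to defaults for optional fields it does know. This is not code from the linked demo. `Value` is a minimal stand-in for a JSON library's value type, and `ItemDoc`, `optStr`, and `decodeItem` are hypothetical names chosen for the example.

```haskell
import Data.Maybe (fromMaybe)

-- | Minimal JSON-like value; a stand-in for a real JSON library's type.
data Value
  = Str String
  | Num Int
  | Arr [Value]
  | Obj [(String, Value)]
  deriving (Eq, Show)

-- | Hypothetical documentation record understood by an older reader.
data ItemDoc = ItemDoc
  { itemName  :: String        -- ^ required
  , itemDoc   :: String        -- ^ optional, defaults to ""
  , itemSince :: Maybe String  -- ^ optional
  } deriving (Eq, Show)

-- | Look up a string-valued field; absent or non-string yields 'Nothing'.
optStr :: String -> [(String, Value)] -> Maybe String
optStr key fields = case lookup key fields of
  Just (Str s) -> Just s
  _            -> Nothing

-- | Decode an item. Unknown fields are ignored rather than rejected, so
-- files written by a newer producer can still be read.
decodeItem :: Value -> Maybe ItemDoc
decodeItem (Obj fields) = do
  n <- optStr "name" fields
  pure (ItemDoc n (fromMaybe "" (optStr "doc" fields)) (optStr "since" fields))
decodeItem _ = Nothing
```

An object that carries an extra field such as `"deprecated"` still decodes, while one that lacks the required `"name"` is rejected.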
