[RFC] Simpler AST #355

Julow · 2019-05-10T10:08:14Z

Hi,

First of all, thanks for this great library !

I'm trying to use Odoc in OCamlformat to format documentation comments.
It's been great so far but I ran into a few problems with the representation of comments (Model.Comment):

The parser requires passing a containing_definition argument that is hard to understand.
A lot of comments that were valid with Octavius now trigger warnings
Some part of the AST are very hard to understand and pretty print
References (see Reference.deconstruct #354)

I proposed in #351 to expose Parser_.Ast.
This AST has some advantages:

Parser is easier to call (no need to build dummy arguments)
Less strict (no non_link_inline_element, etc..)
Heading label is not auto generated yet

It could then be simplified by moving some code from the parser to the semantics pass that build Model.Comment:

Represent references as strings
List syntax (- or + vs ul or ol)
This information could also be in Model.Comment.
Keep whitespaces
Allow any language or none in blocks (eg. Raw_markup of string option * string)

Alternatively, most of these could be fixed without exposing Parser_.Ast and only modifying Model.Comment.
But I think this AST will be much easier to change in the future to allow more use cases.
We will probably receive issues about documentation comments on OCamlformat and Odoc will be used by a lot of projects.

I'll be happy to hear your opinions on this.

The text was updated successfully, but these errors were encountered:

jonludlam · 2019-05-10T10:29:11Z

There are several things here to comment on.

Firstly, there's no question that the parser should be easier to call, so nobody has to make up a dummy Root.t and so on.

In general, I think the AST from Model.Comment should contain enough information to recover the original text - that's fundamentally what ocamlformat needs. It's not missing much

the whitespace info
the list syntax used originally
a way to recover the original reference text
a way to know whether the heading was autogenerated or not
unrecognised tags need to be labeled as such along with the original tag

I'm actually quite happy that it currently generates more warnings than octavius - providing that we can produce an AST in all cases octavius could. It'd be handy to feed those warnings back to ocamlformat in a useful way to be reported - it'd be neat if merlin could also do this! This is also why I'm not keen on leaving the references as strings, since warnings in parsing them couldn't be passed back to ocamlformat - essentially we keep all of the input validation happening at the same point.

The raw markup is an issue where we're currently not doing what the ocamldoc manual says which is that when the backend isn't specified it gets treated as latex. That aside, we still need to have a way to distinguish between the backend being specified and odoc picking the default.

There's also an issue with unrecognised tags, which again should contain enough info to recover the original text.

lpw25 · 2019-05-10T12:48:41Z

As a general rule it is always better to keep the semantic representation of something separate from its syntactic representation. Then the syntactic representation can be kept as close to the actual syntax as possible without adding unnecessary details to the semantic representation.

In this case, that means that Parser_.Ast should be independent of the Model types and that is the type which the exposed parser should return. I suspect that would solve all your problems.

This requires making syntactic versions of Model.Paths.Path.Module.t, Model.Paths.Reference.t, Model.Paths.Reference.Module.t, Model.Comment.raw_markup_target and Model.Comment.style.

lpw25 · 2019-05-10T12:51:51Z

Oh, and the transformations that @jonludlam suggests should be applied to the Parser_.Ast types as well, so that they are sufficient to recover the original text, which I assume they are not currently.

jonludlam · 2019-05-13T14:06:01Z

I quite agree that it would be nice to separate the semantic and syntactic types - ocamlformat really is only interested in the syntax (I think!) but merlin will want the semantic tree. However, from a pragmatic point of view the semantic and syntactic representations are currently extremely similar, to the point that if you don't look carefully to spot the differences you might overlook them entirely. I'm worried that exposing both of these right now will be quite confusing.

It's conceivable that they will diverge further though - they're are certainly more different now than when the split was first introduced as restrictions have moved from Ast to Model.Comment. @aantron might be able to comment on whether there are further plans for this. In fact, as @Julow points out above, Ast.docs might be more complex than ocamlformat needs and we could use these needs to drive the definition of the types, so we effectively define the syntactic representation as 'whatever ocamldoc needs and no more'. For example, we might be able to reduce the tag representation if ocamlformat doesn't need so much detail.

@lpw25 your point about syntactic types for the paths is quite interesting. That would indeed allow ocamlformat to reflow long references, which would be pretty useful. Right now it has essentially no choice but to treat them atomically.

lpw25 · 2019-05-13T16:38:55Z

In my experience this happens with every tool that looks vaguely like a compiler: at first some of the semantic types are so close to the syntactic ones that people don't keep them separate, then over time the demands on the type diverge until eventually someone has to bite the bullet and disentangle the two use cases into separate types. Since separating a type into two tends to be a bit of a pain, these days I advocate for always keeping them separate from the beginning as a matter of principle.

aantron · 2019-06-09T02:42:47Z

I agree that we may have to separate the types in the long term (and we already almost do, but the syntactic type is not public). However, I suggest to avoid this for now, until it is either needed by odoc itself (and not ocamlformat), or the parser becomes very stable. I think it's best to continue thinking of the parser and its data types as internal, even if they are used in ocamlformat. This is to speed odoc development. So, we will "bite the bullet" when odoc needs us to. That can break ocamlformat later, but I would recommend to account for that by communicating closely with the ocamlformat developers. In the worst case, if we cause major breakage, ocamlformat can vendor an older version of the odoc parser, until there is time to adapt ocamlformat to a newer one.

Given that, I agree with @jonludlam's suggestion about modifying Model.Comment for ocamlformat's needs.

Julow · 2019-06-18T09:48:43Z

#351 and #364 are merged. This can be closed.

aantron mentioned this issue Jun 11, 2019

Expose parser internals and add Parser.parse_comment_raw #351

Merged

Julow mentioned this issue Jun 13, 2019

Simpler parser ast #364

Merged

Julow closed this as completed Jun 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Simpler AST #355

[RFC] Simpler AST #355

Julow commented May 10, 2019

jonludlam commented May 10, 2019

lpw25 commented May 10, 2019

lpw25 commented May 10, 2019

jonludlam commented May 13, 2019

lpw25 commented May 13, 2019

aantron commented Jun 9, 2019

Julow commented Jun 18, 2019

[RFC] Simpler AST #355

[RFC] Simpler AST #355

Comments

Julow commented May 10, 2019

jonludlam commented May 10, 2019

lpw25 commented May 10, 2019

lpw25 commented May 10, 2019

jonludlam commented May 13, 2019

lpw25 commented May 13, 2019

aantron commented Jun 9, 2019

Julow commented Jun 18, 2019