Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser improvements, with a possible bug fix along the way. #2151

Merged
merged 177 commits into from Nov 22, 2018

Conversation

Projects
None yet
4 participants
@fpottier
Copy link
Contributor

commented Nov 16, 2018

I have been working on and off on a cleanup of the parser, with the aim of preparing the ground before working on better syntax error messages.

I have tried to improve sharing and readability where possible by using parameterized definitions, reduce the size of the automaton by using %inline in selected places, etc.

This cleanup work is not complete (much more could be done!) but I think it would make sense to merge it now, without waiting for trunk to move too far away. (It already has moved away some. I haven't attempted rebasing, yet.)

I have tested the modified parser on 121715 source files (using the latest version of every opam package). I can see two differences in the behavior of the parser:

First, an intentional difference: I have removed the function not_expecting and its use sites, so some syntax error messages have changed. This can be seen as a small regression, although not a serious one in my mind.

Second, an unintentional difference: in a small number of files, a few documentation strings that were lost by the previous parser are now correctly picked up and recorded in the AST; this seems to happen when a class type begins with two docstrings. Also, a few documentation strings have moved in the AST; this seems to happen when a class type ends with a docstring.

As an instance of the first phenomenon, in ocamlnet-4.1.6/src/netclient/nethttp_client.mli, at line 387, the original parser drops the docstring (** [http_call] is the runtime container..., while the modified parser picks it up.

As an instance of the second phenomenon, in ocamlnet-4.1.6/src/netstring/netmime.mli, line 94, the docstring (** Since OCamlnet-4... has moved in the AST. In both cases, it is recognized as a class_type_field/Pctf_attribute; however, in the original parser it appears in the AST as the first field of the class type, whereas in the modified parser it appears as the last field of the class type (which seems to better reflect the structure of the source file).

I am not even sure why the behavior of the parser has changed; I conjecture that this has to do with the fact that in the modified parser, the calls to mark_rhs_docs are not performed exactly at the same time as in the original parser. I could investigate further, perhaps using git bisect, but thought it would be better to first discuss the whole thing.

@XVilka

This comment has been minimized.

Copy link
Contributor

commented Nov 16, 2018

There are conflicts though.

@gasche

This comment has been minimized.

Copy link
Member

commented Nov 17, 2018

We have already discussed this, but in general I think it goes in the right direction.

One thing I don't personally like is the micro-optimization style you use to decide whether to place an %inline or not (I don't think it is easy for others to reproduce and maintain), commit-messages like "inline foo to remove one state" are self-conscious about the prematureness of it. That said, the other changes coming with that part of the PR have many nice things, so on the balance I'm in favor of taking everything in. (I hope that future PRs will be more mangeable in size and better separated in concerns.)

One thing that I still disagree with is the removal of not_expecting before we have a proper error-message counter-proposal. Yes, they hurt your sensibility, but they are artifact of the fact that people took the pain to improve error messages in situations they found important. It would be very easy to add them back in, and I think you should do it. If you insist, I would be willing to do it on my own and send you a patch/commit.

It may be helpful to have @lpw25's opinion on the docstring move business, but it actually looks like a positive change. (Bonus question: is there a way to engineer this so that the semantics is not sensitive to grammar side-effect orders?)

I would like to get this PR merge, sooner rather than later. Could you rebase it against the current trunk? (I could give it a go in case you don't have time.)

@fpottier

This comment has been minimized.

Copy link
Contributor Author

commented Nov 17, 2018

Sorry about size and separation of concerns, I am not sure how to best manage these aspects.

About the placement of %inline, it is true that it is hard to tell in advance where they are useful or counterproductive, but generally speaking, I expect decreasing the size of the automaton and the number of symbols to be helpful when working on syntax error messages.

Considering not_expecting, some changes would be easy to undo (e.g. the production expr: UNDERSCORE, which I have removed, would be easy to add back in) but other changes less so, as they have enabled a cascade of simplifications.

I can try to rebase. The main concern should be to resolve the conflict with @trefis's recent additions to the parser.

@gasche

This comment has been minimized.

Copy link
Member

commented Nov 17, 2018

other changes less so, as they have enabled a cascade of simplifications.

The two other not_expecting I can find in the trunk grammar are on type extensions with +=, which must really not be given a rec (or nonrec) flag. From a distance, it seems easy to follow the spirit of this change by changing the /* NONREC not allowed */ comment in the type_extension rule into something that optionally accepts nonrec (or even maybe either rec or nonrec), and fails in the semantic action if a keyword was actually used.

@gasche

This comment has been minimized.

Copy link
Member

commented Nov 17, 2018

Re. separation of concerns: it's okay, it's easy to see how cascading edits easily lend themselves to entangled changes in the context of a complete refactoring -- and we have a fairly robust methodology for testing that the changes are parser-preserving, so this is not so much of a concern. I think that things will be easier to manage in the future, working on smaller, more focused changes than "the first cleanup since years of tears".

@fpottier fpottier force-pushed the fpottier:improvements branch from 3964f42 to da80e2a Nov 20, 2018

@fpottier

This comment has been minimized.

Copy link
Contributor Author

commented Nov 20, 2018

I have updated this PR to keep all uses of not_expecting. They are now cleanly isolated, so should be easy to remove in the future.

@lpw25

This comment has been minimized.

Copy link
Contributor

commented Nov 20, 2018

It may be helpful to have @lpw25's opinion on the docstring move business, but it actually looks like a positive change.

Yeah, that looks like a bug fix. I spent quite a while getting the cases around struct .... end right, and I don't think I put the same effort for the equivalent object .... end issues.

is there a way to engineer this so that the semantics is not sensitive to grammar side-effect orders?

Possibly, but it is very tricky. There are two issues.

  1. You need to be able to attach documentation strings to all lexical tokens in a way similar to how locations are handled. Currently that happens via the side-channel of a table indexed by locations.
  2. In some cases you want to attach documentation to the token that came before the documentation. Ideally we would do this by having the lexing on a one token delay, however that interacts badly with how the toplevel works. Currently we instead rely on the fact that all cases where we need information about documentation after the token are in states where the parser must look ahead one token before reducing. This isn't just coincidence -- the cases where we need such information are likely (but not guaranteed) to be in such states.

I think that there is probably a way to get the one-token delay to play nicely with the toplevel and that should fix any remaining issues. Having some support from menhir for attaching arbitrary "lexical information" beyond just positions to tokens would also make the code much cleaner, but I don't know whether there is a reasonable API for doing that.

fpottier added some commits Sep 24, 2018

Make the grammar more precise by explicitly disallowing `NONREC`
inside `str_type_extension` and `sig_type_extension`. This removes
the need for post-hoc checks, and isolates two calls to `not_expecting`.
Parser: A simpler definition of `ext_attributes`.
(Not sure why a complicated definition was used.)
Parser: parameterize `let_binding` and `let_bindings` with `EXT`.
`EXT` can be instantiated with either `ext` of `no_ext`.
This allows getting rid of one use of `not_expecting`
in the auxiliary function `class_of_let_bindings`.
This use of `not_expecting` is now clearly isolated
in the definition of `no_ext`.
Parser: remove several uses of `syntax_error()` by explicitly using
`no_override_flag` instead of `override_flag` when this flag is disallowed.
A conflict is avoided by decorating `override_flag` with `%inline`.
Parser: introduce `virtual_with_mutable_flag` and `virtual_with_priva…
…te_flag`

in order to eliminate a redundancy between productions in two places.
This removes two TODO markers.
Exclude `boot/menhir/parser.{ml,mli}` from the list of source files
used by `make build-all-asts` and related commands.

Indeed, the idea is to check that the ASTs do not change when the parser is
modified -- the parser's source itself, if included, would always produce a
false positive.
Update the definitions of `core_type_comma_list` and
`inline_core_type_comma_list` so as to remove the need
for explicit calls to `List.rev`.
Introduce the nonterminal symbol `actual_type_parameters` so as
to improve sharing in the definition of `simple_core_type2_`.
Turn `simple_core_type2_` into an anonymous symbol.
This is equivalent to marking it `%inline`, but is more readable.
Add a test in `tests/parsing` which exercises the resolution of
(some of) the shift/reduce conflicts involving the token HASH.
Merged `simple_core_type`and `simple_core_type2`,
and renamed the resulting symbol `atomic_type`.
This is made possible by using a list of *two* or more
core types in the definition of `actual_type_parameters`.
As a result, `inline_core_type_comma_list` is no longer
useful and disappears.
Some directives of the form `%prec below_HASH` must be
moved as a result of this change.
Reformulate the definition of `actual_type_parameters`
to use `separated_nontrivial_llist`. Although this makes
the automaton slightly larger, it is a higher-level
definition. Added a comment.
Changed the definition of `atomic_type_or_tuple`
so as to use `separated_nontrivial_llist`.
As a result, `atomic_type_star_list` is used only in one place
and can be manually inlined into `constructor_arguments`.

@fpottier fpottier force-pushed the fpottier:improvements branch from 666c4d1 to c518d31 Nov 20, 2018

@fpottier fpottier force-pushed the fpottier:improvements branch from 595457b to 108d8c6 Nov 20, 2018

@fpottier

This comment has been minimized.

Copy link
Contributor Author

commented Nov 21, 2018

I have run the updated parser again on 121715 source files.

The differences involving syntax error messages are gone. (Only 3615 files cause errors; there is certainly not good coverage of OCaml's syntax errors in opam!)

The (positive) difference involving docstrings remains. By the way, although I have not done so, it would probably be a good idea to extend OCaml's test suite with an example that involves docstrings at the beginning and end of a class type. Even if the test itself does not verify that the docstrings are properly parsed and attached in the AST, a regression in the parser would be caught when we compare ASTs (as described in parsing/HACKING.adoc).

As far as I am concerned, this PR seems ready.

@gasche

This comment has been minimized.

Copy link
Member

commented Nov 21, 2018

By the way, although I have not done so, it would probably be a good idea to extend OCaml's test suite with an example that involves docstrings at the beginning and end of a class type.

Why, that sounds like an excellent idea. Would you add a commit to add such a testcase to testsuite/tests/parsing/docstrings.ml?

@fpottier

This comment has been minimized.

Copy link
Contributor Author

commented Nov 21, 2018

Done.

@gasche

This comment has been minimized.

Copy link
Member

commented Nov 21, 2018

@fpottier: it looks like you forgot to change the reference output in your testsuite change. From the testsuite/ repository, make promote DIR=tests/parsing will perform the necessary change for you to review and commit.

@fpottier fpottier force-pushed the fpottier:improvements branch from 07b0556 to 561c5eb Nov 21, 2018

Enrich `testsuite/tests/parsing/docstrings.ml`.
Add docstrings that the current parser drops or misplaces.
This is fixed by GPR#2151.
@fpottier

This comment has been minimized.

Copy link
Contributor Author

commented Nov 21, 2018

Sorry, I had not noticed that the expected output was in the same file.

@gasche gasche merged commit f6837be into ocaml:trunk Nov 22, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@gasche

This comment has been minimized.

Copy link
Member

commented Nov 22, 2018

Merged, thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.