New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Menhir-generated parser #292

Merged
merged 92 commits into from Sep 2, 2018

Conversation

Projects
None yet
@gasche
Member

gasche commented Nov 15, 2015

This WIP is meant to present work ongoing to switch the OCaml parser from using ocamlyacc as a parser generator to Menhir. This work is joint work with Frédéric Bour ( @def-lkb ), who started this work inside Merlin, François Pottier and Yann Régis-Gianas who have been very reactive at improving Menhir as necessary¹, and Jacques-Henri Jourdan who provided various comments, explanations of the LR stuff, and suggestions along the way.

¹: we rely on several Menhir features introduced for this work, so this patch only works with the very last (or in-development) version of Menhir.

Ocamlyacc is a stable tool (that hasn't required much maintenance effort in the past ten years) and gives good parsing performance. Some reasons to replace it by Menhir are the following, in decreasing order of perceived strength:

  • Menhir lets us write better grammars. Menhir has abstraction features that allow to maintain grammars more conveniently (we can use variable names instead of $2, parametrized rules can remove a lot of redundancy); I expect that most of the grammar-related woes of the recent docstring parsing effort would have been removed by Menhir. Its conflict explanation features should also make it much easier to refactor the grammar while preserving input programs, or evaluate proposed syntax changes
  • Menhir will give us better error messages. Menhir is getting solid support for providing better error messages than would be ever possible with OCamlYacc, thanks to work initiated by Frédéric in Merlin (that resulted in particular in Menhir's new incremental parsing interface, and several suggestion of error explanation and recovery) and additional work by François (see his recent draft Reachability and error diagnosis in LR(1) automata).
  • Menhir is a living project (rather than an esteemed fossil) with a pure OCaml codebase. This makes it easier to maintain and evolve, and much of the above benefits come directly from this flexibility.

Note that we have not yet started applying any work on syntax error messages on the OCaml grammar (there is good work in Merlin and François' experiment on Compcert's C grammar are very promising). This is strictly future work, but getting a Menhir grammar that we would be ready to integrate in the compiler (hopefully a future evolution of the present PR) is a necessary first step.

Performance aspects

Menhir has a --table backend, that generates OCaml code traversing the automated represented as a tabular data-structure (just as OcamlYacc does), and a --code backend that compiles the automaton traversal into pure OCaml code (removing the interpretation overhead). We plan to use the --table backend in the OCaml parser for two reasons:

  • the coming features for syntax error support are only available for the --table backend for now (and may never come to the --code backend)
  • the --code backend generates a huge nest of recursive functions that is hard to handle for the OCaml compiler (both in the type-checking part and later code-generation parts); it currently takes around a minute to compile (which would add a lot to the total compiler build time, which is around a few minutes) and this could even worsen with future compiler changes, so it seems to fragile to use inside the compiler.

While the --code backend is more than competitive with Ocamlyacc performance-wise, using the --table backend would mean a degradation in parsing time from the current parser. The PR includes a benchmark script that you can run on your machine (bash menhir-bench.bash after make world.opt and make install).

On my machine, using Menhir --table adds a 20% overhead to the parsing pass of ocamlc.opt, and a 50% overhead to the parsing pass of ocamlc (we pay a lot when we replace C code into bytecode-interpreted OCaml code). However, the overhead is much more acceptable when considering the whole compilation chain: for bytecode compilation (noticeably faster than native compilation), the overhead I measure becomes 6% in native code and 5% in bytecode.

Is the OCaml community ready to accept a 5% performance degradation for bytecode compilation? I think that it is worth it, as it comes with better error messages and (for maintainers) a grammar that is easier to maintain and evolve.

Bootstrap story

After a few experiments, the bootstrap story that I ended with is the following:

  • the code of the Menhir runtime library and the generated OCaml parser are stored in boot/menhir/ (and copied in parsing/ at parser-compilation time); the grammar is stored in parsing/parser.mly as usual (parser_menhir.mly in the current patch, that maintains the ocamlyacc parser in parallel)
  • a target promote-menhir is added to the Makefile; it generates the OCaml parser from the .mly and copies it, along with a matching copy of the Menhir runtime library, to boot/menhir

This means that Menhir is not a build dependency (nor runtime dependency) of the OCaml compiler distribution, as the source of the parser is kept. The only time at which Menhir is necessary is when running the promote-menhir target to update the parser, that is, for the people that want to change the OCaml grammar.

Status of the proposed migration

The first step of the migration was to get a working Menhir-generated parser that could be compared to the OcamlYacc parser, with as little changes to the grammar as possible. As Frédéric predicted, the main difficulty was the handling of symbol locations, which uses an imperative interface (the Parsing module) in OCamlyacc, and relies on a pure data-passing interface in Menhir. This required extensive changes to the auxiliary functions called by the parser (and the Docstrings module supporting the parsing of documentation comments). On the side of grammar, I use the very ugly hack of relying on cpp to make #define hiding the location-passing from the semantic actions. This allows to write semantic actions that are extremely close to the yacc ones, which simplifies review and correctness checking.

Once this point is done, we have a parser that can easily be verified to be correct: I set up the OCaml frontend so that it parses each input file with both the yacc-generated and menhir-generated parsers, and fail if the resulting ASTs do not exactly match (including locations, etc.). The current parser passes this test for all the files touched by make world.opt (we could test it further on all OPAM-accessible sources, but this would require extensive data collection work; I am already confident that the parser has very few regressions, if any). (Getting the exact same locations as yacc require some changes in Menhir's location handling by François.)

The next step is to use Menhir's abstraction features to remove all the #define used in the semantic actions, and thus get a proper .mly -- the "cpp war". This patch is still in RFC stage because this step is not finished, and I would not propose to include a cpp-preprocessed grammar in trunk. I have done two first patches in this direction, one for mkrhs and one for extra_{str,sig,...}. It is easy to verify that these changes, that get us further and further from the yacc grammar, are correctness-preserving, thanks to the AST comparison machinery.

Requirements for a final replacement?

My personal opinion is that, as soon as this last step is finished, the resulting grammar could be considered for inclusion, replacing the current ocamlyacc grammar. This requires accepting the parsing performance overhead. Finally, this would not be the end of the road, as more work is required to get better syntax error messages.

@gasche gasche changed the title from [RFC] Menhir-generated parser to [WIP] Menhir-generated parser Nov 15, 2015

@gasche gasche referenced this pull request Nov 15, 2015

Closed

Shave the yacc #33

@gasche gasche added the enhancement label Nov 15, 2015

@gasche

This comment has been minimized.

Member

gasche commented Nov 15, 2015

I should point out that what initially motivated to start this work was a comment by Daniel Bünzli ( @dbuenzli ) on a blog post about improving typing error messages. Daniel wrote:

The parser would certainly benefit of being rewritten by hand to provide good error messages and error recovery.

Which made me realize that the message "Error: Syntax error" may be a first low-hanging fruit for improving the usability of OCaml errors. Daniel then added,

(yes, hand written parsers are the only way to achieve that)

and we are working to prove him wrong on this part.

@dbuenzli

This comment has been minimized.

Contributor

dbuenzli commented Nov 16, 2015

and we are working to prove him wrong on this part.

I'd love to be.

@Chris00

This comment has been minimized.

Member

Chris00 commented Nov 16, 2015

Maybe we can use this change to parse -. x**n rightly as -. (x**n) and not as (-. x)**n?

@lefessan

This comment has been minimized.

Contributor

lefessan commented Nov 16, 2015

I like the fact not to have menhir as a build dependency, but I think it should not be the default : most users just use OPAM now to compile OCaml, so the OPAM build steps for OCaml could be make world-without-menhir, whereas the OCaml coreteam would probably want make world to regenerate the parser if it has been modified.

Maybe it would be possible to include Menhir's runtime library into OCaml sources, so that we would only depend on the external executable (that could be compiled with another version of OCaml), while using the bootstrap compiler for Menhir's library ?

@gasche

This comment has been minimized.

Member

gasche commented Nov 16, 2015

@Chris00 noted, but I'd like to avoid any discussion of specific syntactic change, an keep this PR focused on the discussion of the parser change itself.

@lefessan What I would like to add in the short term is a test, when compiling the parser OCaml source, that the .mly file is not more recent -- and fail if that test does not pass. This is not as automatic as what you suggest (I could even test whether Menhir is available and only compile the grammar in that case), but it would already provide a clear workflow (if make world fails, just run make promote-menhir and start again). The dependencies information are good (I use menhir --depend for this), so there should be no conflict issues when restarting the build from a non-clean state. (I did indeed make the mistake during my testing of forgetting to update the OCaml parser after modifying the .mly(p).)

Right now, Menhir's runtime library is included in the OCaml distribution, in boot/menhirLib.{ml,mli}, and this is versioned. The promote-menhir target takes care to refresh these files at the same time the parser_menhir.ml is produced from the .mly, so no mismatch should happen (I was careful to ensure this was possible and abundantly whined to François until he implemented the features necessary to support this workflow). Furthermore, François added to recent versions of Menhir a cute trick that makes sure that, if you try to compile a generated parser against a mismatched runtime library, compilation will fail with a typing error (the parser requires a version_$VERSION identifier that the library provides), so there should be no silent error in this scenario.

(Indeed, the menhir executable can be compiled with another version of OCaml, typically 4.02.3 while you compile for trunk. Then the menhirLib.ml file included in boot/ will be those released by François for that other version of OCaml, so in theory this could fail if you use a very recent version of Menhir to compile a very old version of OCaml. The generated grammar also imposes version constraints, as the type-safe stack introspection interface uses GADTs.)

@whitequark

This comment has been minimized.

Contributor

whitequark commented Nov 18, 2015

@gasche amazing work.

@gasche

This comment has been minimized.

Member

gasche commented Nov 18, 2015

We discussed this at the developer meeting today, and it was decided (in a rather predictable way) that it is too soon to consider for integration the next release -- so in particular I won't try to get it mergeable in trunk before 4.03.

@agarwal

This comment has been minimized.

Member

agarwal commented Nov 24, 2015

Out of scope for this PR, but I wanted to mention that in the long term it would be useful to have lexer/parser that is completely lossless, i.e. every input character is retained. It would then be possible to implement a good syntax highlighter. A highlighter needs to output the exact same content that was input, but also annotate tokens based on their identity in the grammar.

@let-def

This comment has been minimized.

Contributor

let-def commented Nov 24, 2015

@agarwal Not sure that this is really necessary.
A lossless roundtrip through lexing/parsing is useful for all kind of automated source to source transformation, e.g. refactoring.
For syntactic (or semantic) highlighting, current frontend was precise enough (based on my experiments with merlin).

(Although thanks to my fine taste the resulting buffers looked like christmas trees)

@damiendoligez damiendoligez added this to the 4.04-or-later milestone Dec 17, 2015

@mshinwell mshinwell changed the title from [WIP] Menhir-generated parser to Menhir-generated parser Jun 10, 2016

@keleshev

This comment has been minimized.

Contributor

keleshev commented Jun 15, 2016

On --code vs. --table: consider—why not both? The compiler could start by using --code generated parser (which gives speed in case of successful parse), then if syntax error is encountered it can switch to the --table parser and re-parse the file to get a better error message.

This approach might change a 6% performance regression into a comparable performance improvement.

Sexplib follows this approach, although with two different parsers: a hand-written one and an ocamlyacc one.

@gasche

This comment has been minimized.

Member

gasche commented Jun 15, 2016

We thought about this as well, but it does complicate the setup a bit; for example the toplevel is running the lexer+parser directly from the interactive user input, so preparing for a re-parse would require memoizing the token stream.

(Right now the priority would be to update the prototype to the trunk grammar -- the grammar keeps evolving -- and move away from cpp macros for location handling. I personally have no time to commit to this in the short term.)

@bluddy

This comment has been minimized.

bluddy commented Jul 25, 2016

Would it be possible to move Menhir to github?

@gasche

This comment has been minimized.

Member

gasche commented Jul 25, 2016

I'm not the author of Menhir, so it wouldn't be my decision; also, I'm not sure how it relates to this particular work. I generally try to work along with each project's choice of tooling and development model -- as long as it is free software, of course.

@bluddy

This comment has been minimized.

bluddy commented Jul 25, 2016

Obviously not directly related.

I was directed to this PR after asking a question on IRC regarding a compiler PR I'm working on (so-called 'safe-syntax'). Menhir would definitely make adding syntax improvements easier and involve less duplication than the current method.

@bluddy bluddy referenced this pull request Jul 25, 2016

Closed

-safe-syntax option #715

@damiendoligez damiendoligez removed this from the 4.04 milestone Aug 2, 2016

@DemiMarie

This comment has been minimized.

Contributor

DemiMarie commented Dec 1, 2016

One issue is that Menhir is under the QPL, whereas OCaml is now fully under the LGPL2.1 plus exceptions. Furthermore, the OCaml compiler distribution is fully self-contained right now.

@gasche

This comment has been minimized.

Member

gasche commented Dec 1, 2016

One issue is that Menhir is under the QPL, whereas OCaml is now fully under the LGPL2.1 plus exceptions.

I think I can convince François to use an OCaml-compatible license.

Furthermore, the OCaml compiler distribution is fully self-contained right now.

Sure, but the bytecode binaries present in the boot folder are part of this "fully self-contained" set. I propose to add a menhir-generated grammar and the corresponding runtime (both being simple OCaml source files) to this boot folder -- preserving the property that the distribution is self-contained.

@dra27

This comment has been minimized.

Contributor

dra27 commented Dec 9, 2016

For the bootstrap, it would be good (and I would be very happy to help!) with being able to pull in Menhir as a bootstrapped library in the same way FlexDLL is via a submodule on the repository.

@gasche

This comment has been minimized.

Member

gasche commented Dec 9, 2016

I'm a bit wary of submodules. What are your reasons to think that it would be better than just including the parser and its runtime as OCaml source files in boot/?

Do I correctly understand that you are proposing to pull Menhir as whole, not just the bits necessary to run the parser? Then the development step for people willing to modify the parser would be to pull and build the embedded-menhir rather than their own install of Menhir (the latter is the workflow in the proposed patch). I find it a bit heavy, but one advantage is that it makes it easy for the compiler distribution to have its own patches to Menhir (for example if trunk breaks something that breaks Menhir we can fix it locally).

One thing is that @fpottier still lives in the 19th century of romantic software development: there is no publicly available Menhir repository -- only release tarballs. I think it would be very nice to have him change that process, but I expect it to be a more difficult than getting a version released under an OCaml-compatible license.

@gasche gasche merged commit c303d7b into ocaml:trunk Sep 2, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@pmetzger

This comment has been minimized.

Member

pmetzger commented Sep 2, 2018

I assume that pull requests to make use of Menhir's facilities to improve the compiler's error messages are soon going to be accepted?

@gasche

This comment has been minimized.

Member

gasche commented Sep 2, 2018

This is the plan, but "soon" is too optimistic. @let-def has worked on this and is going to keep working on this, we've discussed it, and we are not convinced that the current error-message facilities are scalable/maintenable enough for the OCaml grammar. @let-def has plan for a different approach (if it works, it can be made available to all Menhir users), but we will need at least a few months to have something to show. This first PR took about three years, so maybe that's "soon" in comparison.

@gasche

This comment has been minimized.

Member

gasche commented Sep 2, 2018

Three other things that could be done and are easier tasks:

  1. We could include Menhir as a subrepo and try to hook it within the build system as a mode of use of Menhir. I'm not sure precisely how that would work (would any user of the parser be asked to use this versioned inclusion instead of an external installation?), and to be honest I'm not very interested in working on it, but @dra27 mentioned it and I trust him that it could be an interesting solution.

  2. We could think about using Menhir for other parsers included in the OCaml distribution. In particular, this would be required if we go with a long-term plan to deprecate ocamlyacc and replace it with Menhir globally within the OCaml community. This also sounds a bit tedious: I don't think we want to bootstrap all those other parsers, so it may be that the submodule (task (1)) is a required prerequisite.

  3. I have vague ideas of small things to play with in the Menhir code generation to get some small performance improvements. Not a priority (performance is already fine), but maybe something to do for fun someday.

  4. We could do a writeup of how to do a transition from ocamlyacc to menhir. The way we went at it in this PR is not a reference, because in the meantime @fpottier added new Menhir features to make the transition easier (in fact, quite easy). Our final result is representative of the best way we know to do migrations, but we want to tell a simpler story than our git history.

  5. There are various ways we could factorize the parsing rules to improve sharing because similar syntactic constructions. For example, many of the signature items are also valid, as is, as structure items, and this could be represented by a shared intermediary-AST type with a single set of parsing rule, whose result get injected into both signature and structure items. We discussed those during the Menhir migration, but decided to keep a very non-invasive approach of preserving the parser structure as much as possible, and left those for future work.

@Octachron

This comment has been minimized.

Contributor

Octachron commented Sep 2, 2018

For 5, updating #1726 reminded me that the extended operator grammar rules could be reworked to avoid a lot of duplications. I might propose a PR to do just that.

@smuenzel-js

This comment has been minimized.

Contributor

smuenzel-js commented Sep 3, 2018

It looks like this can cause a hang on syntax errors, see MPR#7847

@shindere

This comment has been minimized.

Contributor

shindere commented Sep 3, 2018

@let-def

This comment has been minimized.

Contributor

let-def commented Sep 3, 2018

@smuenzel-js I am looking at it, thanks.

@pmetzger

This comment has been minimized.

Member

pmetzger commented Sep 3, 2018

@gasche:

We could do a writeup of how to do a transition from ocamlyacc to menhir. The way we went at it in this PR is not a reference, because in the meantime @fpottier added new Menhir features to make the transition easier (in fact, quite easy). Our final result is representative of the best way we know to do migrations, but we want to tell a simpler story than our git history.

This seems like an excellent idea.

@gasche

This comment has been minimized.

Member

gasche commented Sep 4, 2018

@shindere I just checked that make clean; make coreall; make bootstrap; make bootstrap works as expected.

@shindere

This comment has been minimized.

Contributor

shindere commented Sep 4, 2018

@shindere

This comment has been minimized.

Contributor

shindere commented Oct 3, 2018

Late question, sorry. Is it on purpose that the configure script still
looks for ocamlyacc when cross-compiling? See for instance lines 595 and
below in the configure scrit from commit
1f25d35

@gasche

This comment has been minimized.

Member

gasche commented Oct 3, 2018

I am not sure, but note that the repository still contains some ocamlyacc grammars (notably the .mly files in ocamldoc and ocamltest), so ocamlyacc is still required to build some part of the compiler distribution.

I thought about moving these files to Menhir, but that means either depending on Menhir to build them or adding them to the bootstrap-parsers-from-source logic, and I decided to do neither of these.

@shindere

This comment has been minimized.

Contributor

shindere commented Oct 3, 2018

@fpottier

This comment has been minimized.

Contributor

fpottier commented Oct 3, 2018

As an anecdote, I note that when I run make -j world.opt, the build sometimes fails with an error because ocamlyacc was used to compile parsing/parser.mly (which obviously cannot succeed). If I then run make -j world.opt, the build succeeds! This was on my MacOS machine, I think, where the clock is very coarse (so make problems are more likely to be visible).

@gasche

This comment has been minimized.

Member

gasche commented Oct 3, 2018

if ocamlyacc is still built then it must be there when those not yet migrated grammars are processed. Does that look correct?

Yes, that looks correct. Note that initially some of those files were built usig boot/ocamlyacc, and I moved them to using yacc/ocamlyacc instead (or wherever the fresh-built is) in cbb92d2. I may have done something wrong at this point, but the testing seemed fine.

Conversely, do we need to check for a system-wide Menhir, or do we have the menhir generated files in the repo and thus do not necessarily need to compile them?

The generated files are comitted in the repo. Nothing in the compiler build depend on Menhir, only hacking on parsing/parser.mly does.

@shindere

This comment has been minimized.

Contributor

shindere commented Oct 3, 2018

@gasche

This comment has been minimized.

Member

gasche commented Oct 3, 2018

@fdopen, do you use the cross-compilation support in the configure script of the compiler? Can you help @shindere test configure changes in cross-compilation settings?

@shindere

This comment has been minimized.

Contributor

shindere commented Oct 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment