Menhir-generated parser #292
gasche changed the title from [RFC] Menhir-generated parser to [WIP] Menhir-generated parser on Nov 15, 2015
gasche added the enhancement label on Nov 15, 2015
gasche
Nov 15, 2015
Member
I should point out that what initially motivated me to start this work was a comment by Daniel Bünzli (@dbuenzli) on a blog post about improving typing error messages. Daniel wrote:
> The parser would certainly benefit of being rewritten by hand to provide good error messages and error recovery.
Which made me realize that the message "Error: Syntax error" may be a first low-hanging fruit for improving the usability of OCaml errors. Daniel then added,
> (yes, hand written parsers are the only way to achieve that)
and we are working to prove him wrong on this part.
dbuenzli
Nov 16, 2015
Contributor
> and we are working to prove him wrong on this part.
I'd love to be.
Chris00
Nov 16, 2015
Member
Maybe we can use this change to parse -. x**n rightly as -. (x**n) and not as (-. x)**n?
lefessan
Nov 16, 2015
Contributor
I like not having Menhir as a build dependency, but I think it should not be the default: most users just use OPAM now to compile OCaml, so the OPAM build steps for OCaml could be make world-without-menhir, whereas the OCaml core team would probably want make world to regenerate the parser if it has been modified.
Maybe it would be possible to include Menhir's runtime library in the OCaml sources, so that we would only depend on the external executable (which could be compiled with another version of OCaml), while using the bootstrap compiler for Menhir's library?
gasche
Nov 16, 2015
Member
@Chris00 noted, but I'd like to avoid any discussion of specific syntactic changes, and keep this PR focused on the discussion of the parser change itself.
@lefessan What I would like to add in the short term is a test, when compiling the parser OCaml source, that the .mly file is not more recent than the generated parser -- and fail if that test does not pass. This is not as automatic as what you suggest (I could even test whether Menhir is available and only compile the grammar in that case), but it would already provide a clear workflow (if make world fails, just run make promote-menhir and start again). The dependency information is good (I use menhir --depend for this), so there should be no conflict issues when restarting the build from a non-clean state. (I did indeed make the mistake during my testing of forgetting to update the OCaml parser after modifying the .mly(p).)
Right now, Menhir's runtime library is included in the OCaml distribution, in boot/menhirLib.{ml,mli}, and this is versioned. The promote-menhir target takes care to refresh these files at the same time the parser_menhir.ml is produced from the .mly, so no mismatch should happen (I was careful to ensure this was possible and abundantly whined to François until he implemented the features necessary to support this workflow). Furthermore, François added to recent versions of Menhir a cute trick that makes sure that, if you try to compile a generated parser against a mismatched runtime library, compilation will fail with a typing error (the parser requires a version_$VERSION identifier that the library provides), so there should be no silent error in this scenario.
(Indeed, the menhir executable can be compiled with another version of OCaml, typically 4.02.3 while you compile for trunk. Then the menhirLib.ml file included in boot/ will be the one released by François for that other version of OCaml, so in theory this could fail if you use a very recent version of Menhir to compile a very old version of OCaml. The generated grammar also imposes version constraints, as the type-safe stack introspection interface uses GADTs.)
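The staleness test described in this comment could look roughly like the sketch below. It assumes bash (the -nt file test is a widespread extension rather than strict POSIX); the function name check_parser_fresh and its arguments are hypothetical, and only the make promote-menhir target name comes from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the PR.
# $1 = grammar source (.mly), $2 = generated parser (.ml)
# Returns non-zero when the grammar is newer than the generated parser,
# so that the build can stop instead of silently using a stale parser.
check_parser_fresh() {
  local grammar=$1 generated=$2
  if [ "$grammar" -nt "$generated" ]; then
    echo "Error: '$grammar' is more recent than '$generated'." >&2
    echo "Run 'make promote-menhir' and start the build again." >&2
    return 1
  fi
  return 0
}
```

A Makefile recipe could call such a helper before compiling the parser, for example check_parser_fresh parsing/parser.mly boot/menhir/parser.ml, and let a non-zero status abort the build.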
@gasche amazing work.
gasche
Nov 18, 2015
Member
We discussed this at the developer meeting today, and it was decided (in a rather predictable way) that it is too soon to consider this for integration in the next release -- so in particular I won't try to get it mergeable in trunk before 4.03.
agarwal
Nov 24, 2015
Member
Out of scope for this PR, but I wanted to mention that in the long term it would be useful to have a lexer/parser that is completely lossless, i.e. every input character is retained. It would then be possible to implement a good syntax highlighter. A highlighter needs to output the exact same content that was input, but also annotate tokens based on their identity in the grammar.
let-def
Nov 24, 2015
Contributor
@agarwal Not sure that this is really necessary.
A lossless roundtrip through lexing/parsing is useful for all kinds of automated source-to-source transformations, e.g. refactoring.
For syntactic (or semantic) highlighting, the current frontend was precise enough (based on my experiments with Merlin).
(Although thanks to my fine taste the resulting buffers looked like christmas trees)
gasche added the after-next-release label on Dec 6, 2015
damiendoligez added this to the 4.04-or-later milestone on Dec 17, 2015
gasche referenced this pull request on Feb 25, 2016: fix printing of operator applications with labeled arguments #483 (Closed)
damiendoligez removed the after-next-release label on Apr 26, 2016
mshinwell changed the title from [WIP] Menhir-generated parser to Menhir-generated parser on Jun 10, 2016
mshinwell added the work-in-progress label on Jun 10, 2016
keleshev
Jun 15, 2016
Contributor
On --code vs. --table, consider: why not both? The compiler could start by using the --code generated parser (which gives speed in the case of a successful parse); then, if a syntax error is encountered, it can switch to the --table parser and re-parse the file to get a better error message.
This approach might change a 6% performance regression into a comparable performance improvement.
Sexplib follows this approach, although with two different parsers: a hand-written one and an ocamlyacc one.
gasche
Jun 15, 2016
Member
We thought about this as well, but it does complicate the setup a bit; for example the toplevel is running the lexer+parser directly from the interactive user input, so preparing for a re-parse would require memoizing the token stream.
(Right now the priority would be to update the prototype to the trunk grammar -- the grammar keeps evolving -- and move away from cpp macros for location handling. I personally have no time to commit to this in the short term.)
bluddy
Jul 25, 2016
Would it be possible to move Menhir to github?
gasche
Jul 25, 2016
Member
I'm not the author of Menhir, so it wouldn't be my decision; also, I'm not sure how it relates to this particular work. I generally try to work along with each project's choice of tooling and development model -- as long as it is free software, of course.
bluddy
Jul 25, 2016
Obviously not directly related.
I was directed to this PR after asking a question on IRC regarding a compiler PR I'm working on (so-called 'safe-syntax'). Menhir would definitely make adding syntax improvements easier and involve less duplication than the current method.
damiendoligez removed this from the 4.04 milestone on Aug 2, 2016
DemiMarie
Dec 1, 2016
Contributor
One issue is that Menhir is under the QPL, whereas OCaml is now fully under the LGPL2.1 plus exceptions. Furthermore, the OCaml compiler distribution is fully self-contained right now.
gasche
Dec 1, 2016
Member
> One issue is that Menhir is under the QPL, whereas OCaml is now fully under the LGPL2.1 plus exceptions.
I think I can convince François to use an OCaml-compatible license.
> Furthermore, the OCaml compiler distribution is fully self-contained right now.
Sure, but the bytecode binaries present in the boot folder are part of this "fully self-contained" set. I propose to add a menhir-generated grammar and the corresponding runtime (both being simple OCaml source files) to this boot folder -- preserving the property that the distribution is self-contained.
dra27
Dec 9, 2016
Contributor
For the bootstrap, it would be good (and I would be very happy to help!) to be able to pull in Menhir as a bootstrapped library, in the same way FlexDLL is, via a submodule on the repository.
gasche
Dec 9, 2016
Member
I'm a bit wary of submodules. What are your reasons to think that it would be better than just including the parser and its runtime as OCaml source files in boot/?
Do I correctly understand that you are proposing to pull Menhir as a whole, not just the bits necessary to run the parser? Then the development step for people willing to modify the parser would be to pull and build the embedded Menhir rather than use their own install of Menhir (the latter is the workflow in the proposed patch). I find it a bit heavy, but one advantage is that it makes it easy for the compiler distribution to have its own patches to Menhir (for example, if a change in trunk breaks Menhir, we can fix it locally).
One thing is that @fpottier still lives in the 19th century of romantic software development: there is no publicly available Menhir repository -- only release tarballs. I think it would be very nice to have him change that process, but I expect it to be more difficult than getting a version released under an OCaml-compatible license.
dra27
Dec 9, 2016
Contributor
Ah, the 19th Century is certainly a problem! Can't have a submodule without a repo...
My main reason - if it becomes feasible - is that as with FlexDLL, there is a mutual dependency. But it's largely philosophical!
agarwal
Dec 9, 2016
Member
> One thing is that @fpottier still lives in the 19th century of romantic software development: there is no publicly available Menhir repository -- only release tarballs. I think it would be very nice to have him change that process, but I expect it to be a more difficult than getting a version released under an OCaml-compatible license.
Consider petitioning him via Change.org.
fpottier
Dec 9, 2016
Love the 19th century bit. Actually, as you (@gasche) know, menhirs date back to the pre-Christian era. That said, I am considering moving Menhir's repo to the publicly-visible gitlab.inria.fr. Would that help?
gasche
Dec 9, 2016
Member
I don't want to say that it is necessary for this particular project -- it's probably not -- but yes, I believe that having a development repository publicly accessible would help people feel confident about Menhir. It's kind of irrational, but "having a public repository" is what everyone expects now, and not doing that gives the feeling of an old, closed project (same thing with SVN versioning or... forge hosting; Gitlab is fine, it's the modern underground alternative). It would also have been helpful, I believe, for the people that forked Menhir and built things on top of it (Frédéric, Pippijn), and I am sure it will be helpful to them or others in the future.
DemiMarie
Dec 13, 2016
Contributor
I personally think that one obstacle might be licensing. @fpottier, would you be opposed to relicensing Menhir under the same terms as OCaml?
fpottier
Dec 13, 2016
I just checked with Yann Régis-Gianas. We have no objection in principle to changing Menhir's license so as to use the same license as OCaml. We just need a little time to understand exactly what the OCaml license says.
gasche
May 25, 2017
Member
I just pushed a partially-rebased version of the branch that is in line with the current trunk. No other progress in terms of removing the use of cpp macros has been made (in fact I didn't rebase some of the work I previously did removing cpp macros).
gasche
May 25, 2017
Member
Interestingly the performance numbers, while they remain in the same ballpark, have improved a bit: on a "best out of 5" test, bytecode ocamlc's parsing pass becomes 56% slower, but native ocamlc.opt is only 13% slower at parsing than with ocamlyacc (it was previously a 20% overhead). I suspect that Menhir improved its efficiency since the last time I ran benchmarks -- November 2015.
For a full bytecode compilation, ocamlc has a 5% slowdown and the ocamlc.opt difference is in the noise. This is good: it means that there is no performance difference when people use .opt compilers. Note that since the well-numbered #512 included in 4.04, the ocamlc executable installed is actually ocamlc.opt, so I expect most people to use the native versions now.
whitequark
May 26, 2017
Contributor
> Note that since the well-numbered #512 included in 4.04, the ocamlc executable installed is actually ocamlc.opt, so I expect most people to use the native versions now.
This regression will hit the cross-compiler users as those only get usable bytecode compilers. Arguably it is a bug in the cross-compiler build process.
damiendoligez added this to the 4.07-or-later milestone on Sep 27, 2017
gasche and others added some commits on Jul 4, 2018
gasche merged commit c303d7b into ocaml:trunk on Sep 2, 2018
pmetzger
Sep 2, 2018
Member
I assume that pull requests to make use of Menhir's facilities to improve the compiler's error messages are soon going to be accepted?
gasche
Sep 2, 2018
Member
This is the plan, but "soon" is too optimistic. @let-def has worked on this and is going to keep working on this, we've discussed it, and we are not convinced that the current error-message facilities are scalable/maintainable enough for the OCaml grammar. @let-def has a plan for a different approach (if it works, it can be made available to all Menhir users), but we will need at least a few months to have something to show. This first PR took about three years, so maybe that's "soon" in comparison.
gasche
Sep 2, 2018
Member
Five other things that could be done and are easier tasks:
1. We could include Menhir as a subrepo and try to hook it within the build system as a mode of use of Menhir. I'm not sure precisely how that would work (would any user of the parser be asked to use this versioned inclusion instead of an external installation?), and to be honest I'm not very interested in working on it, but @dra27 mentioned it and I trust him that it could be an interesting solution.
2. We could think about using Menhir for other parsers included in the OCaml distribution. In particular, this would be required if we go with a long-term plan to deprecate ocamlyacc and replace it with Menhir globally within the OCaml community. This also sounds a bit tedious: I don't think we want to bootstrap all those other parsers, so it may be that the submodule (task (1)) is a prerequisite.
3. I have vague ideas of small things to play with in the Menhir code generation to get some small performance improvements. Not a priority (performance is already fine), but maybe something to do for fun someday.
4. We could do a writeup of how to do a transition from ocamlyacc to menhir. The way we went at it in this PR is not a reference, because in the meantime @fpottier added new Menhir features to make the transition easier (in fact, quite easy). Our final result is representative of the best way we know to do migrations, but we want to tell a simpler story than our git history.
5. There are various ways we could factorize the parsing rules to improve sharing between similar syntactic constructions. For example, many of the signature items are also valid, as is, as structure items, and this could be represented by a shared intermediary-AST type with a single set of parsing rules, whose results get injected into both signature and structure items. We discussed those during the Menhir migration, but decided to keep a very non-invasive approach of preserving the parser structure as much as possible, and left those for future work.
Octachron
Sep 2, 2018
Contributor
For 5, updating #1726 reminded me that the extended operator grammar rules could be reworked to avoid a lot of duplications. I might propose a PR to do just that.
smuenzel-js
Sep 3, 2018
Contributor
It looks like this can cause a hang on syntax errors, see MPR#7847
shindere
Sep 3, 2018
Contributor
David Allsopp (2018/08/31 01:43 -0700):
dra27 commented on this pull request.
> +# In order to avoid a build-time dependency on Menhir,
+# we store the result of the parser generator (which
+# are OCaml source files) and Menhir's runtime libraries
+# (that the parser files rely on) in boot/
+
+parsing/parser.ml: \
+ boot/menhir/parser.ml parsing/parser.mly
+ @if [ parsing/parser.mly -nt boot/menhir/parser.ml ]; \
+ then \
+ echo; \
+ tput setaf 3; tput bold; printf "Warning: "; tput sgr0; \
+ echo "Your 'parser.mly' file is more recent \
+ than the parser in 'boot/'."; \
+ echo "Its changes will be ignored unless you run:"; \
+ echo " make promote-menhir"; \
+ echo; \
I just saw this warning fly by on my Git clone. I think the logic needs to be improved - there's no way of guaranteeing when you git clone (or untar a release tarball) that `boot/parser/parser.ml` will be newer than `parsing/parser.mly`
Would it be better to test for the presence of Menhir first?
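One way to sidestep the timestamp problem raised here would be to compare file contents rather than modification times, and to stay quiet when Menhir is not installed. The sketch below is an assumption, not what the PR does: the stamp file and all three function names are hypothetical, and it relies only on the POSIX cksum and command -v utilities (written for bash because of local).

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not taken from the PR.

# Record a checksum of the grammar at promotion time.
# $1 = grammar file, $2 = stamp file to write
record_grammar_stamp() {
  cksum < "$1" > "$2"
}

# Succeed only if the grammar content still matches the recorded checksum.
# A fresh clone or untarred release changes mtimes but not content.
# $1 = grammar file, $2 = stamp file
grammar_unchanged() {
  [ -f "$2" ] && [ "$(cksum < "$1")" = "$(cat "$2")" ]
}

# Print a warning only when the grammar content changed AND the parser
# generator is on PATH. $3 names the generator command (default: menhir).
warn_if_stale() {
  local tool=${3:-menhir}
  if ! grammar_unchanged "$1" "$2" && command -v "$tool" >/dev/null 2>&1; then
    echo "Warning: '$1' changed since the last promotion; run 'make promote-menhir'."
  fi
}
```

With this design the stamp file would have to be versioned next to the promoted parser, so that a clone starts in a consistent state whatever order the files are written in.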
Has it been verified that the compiler's bootstrap procedure
still works?
@smuenzel-js I am looking at it, thanks.
pmetzger
Sep 3, 2018
Member
> We could do a writeup of how to do a transition from ocamlyacc to menhir. The way we went at it in this PR is not a reference, because in the meantime @fpottier added new Menhir features to make the transition easier (in fact, quite easy). Our final result is representative of the best way we know to do migrations, but we want to tell a simpler story than our git history.
This seems like an excellent idea.
gasche
Sep 4, 2018
Member
@shindere I just checked that make clean; make coreall; make bootstrap; make bootstrap works as expected.
shindere
Sep 4, 2018
Contributor
Gabriel Scherer (2018/09/04 01:50 -0700):
@shindere I just checked that `make clean; make coreall; make
bootstrap; make bootstrap` works as expected.
Thanks. The bootstrap job on CI has not reported any issue so it must be
okay. Keep in mind that your check is not really complete, though,
because what matters is the ability to bootstrap the compiler after a
change.
gasche commented Nov 15, 2015
This WIP is meant to present work ongoing to switch the OCaml parser from using ocamlyacc as a parser generator to Menhir. This work is joint work with Frédéric Bour ( @def-lkb ), who started this work inside Merlin, François Pottier and Yann Régis-Gianas who have been very reactive at improving Menhir as necessary¹, and Jacques-Henri Jourdan who provided various comments, explanations of the LR stuff, and suggestions along the way.
¹: we rely on several Menhir features introduced for this work, so this patch only works with the very last (or in-development) version of Menhir.
Ocamlyacc is a stable tool (that hasn't required much maintenance effort in the past ten years) and gives good parsing performance. Some reasons to replace it by Menhir are the following, in decreasing order of perceived strength:
$2, parametrized rules can remove a lot of redundancy); I expect that most of the grammar-related woes of the recent docstring parsing effort would have been removed by Menhir. Its conflict explanation features should also make it much easier to refactor the grammar while preserving input programs, or evaluate proposed syntax changes
Note that we have not yet started applying any work on syntax error messages on the OCaml grammar (there is good work in Merlin, and François' experiments on Compcert's C grammar are very promising). This is strictly future work, but getting a Menhir grammar that we would be ready to integrate in the compiler (hopefully a future evolution of the present PR) is a necessary first step.
Performance aspects
Menhir has a --table backend, which generates OCaml code traversing the automaton represented as a tabular data structure (just as ocamlyacc does), and a --code backend that compiles the automaton traversal into pure OCaml code (removing the interpretation overhead). We plan to use the --table backend in the OCaml parser for two reasons:
While the --code backend is more than competitive with ocamlyacc performance-wise, using the --table backend would mean a degradation in parsing time from the current parser. The PR includes a benchmark script that you can run on your machine (bash menhir-bench.bash after make world.opt and make install).
On my machine, using Menhir --table adds a 20% overhead to the parsing pass of ocamlc.opt, and a 50% overhead to the parsing pass of ocamlc (we pay a lot when we replace C code with bytecode-interpreted OCaml code). However, the overhead is much more acceptable when considering the whole compilation chain: for bytecode compilation (noticeably faster than native compilation), the overhead I measure becomes 6% in native code and 5% in bytecode.
Is the OCaml community ready to accept a 5% performance degradation for bytecode compilation? I think that it is worth it, as it comes with better error messages and (for maintainers) a grammar that is easier to maintain and evolve.
Bootstrap story
After a few experiments, the bootstrap story that I ended with is the following:
- boot/menhir/ (and copied in parsing/ at parser-compilation time); the grammar is stored in parsing/parser.mly as usual (parser_menhir.mly in the current patch, which maintains the ocamlyacc parser in parallel)
- promote-menhir is added to the Makefile; it generates the OCaml parser from the .mly and copies it, along with a matching copy of the Menhir runtime library, to boot/menhir
This means that Menhir is not a build dependency (nor a runtime dependency) of the OCaml compiler distribution, as the source of the parser is kept. The only time at which Menhir is necessary is when running the promote-menhir target to update the parser, that is, for the people that want to change the OCaml grammar.
Status of the proposed migration
The first step of the migration was to get a working Menhir-generated parser that could be compared to the ocamlyacc parser, with as few changes to the grammar as possible. As Frédéric predicted, the main difficulty was the handling of symbol locations, which uses an imperative interface (the Parsing module) in ocamlyacc, and relies on a pure data-passing interface in Menhir. This required extensive changes to the auxiliary functions called by the parser (and the Docstrings module supporting the parsing of documentation comments). On the grammar side, I use the very ugly hack of relying on cpp to make #define hiding the location-passing from the semantic actions. This allows writing semantic actions that are extremely close to the yacc ones, which simplifies review and correctness checking.
Once this point is done, we have a parser that can easily be verified to be correct: I set up the OCaml frontend so that it parses each input file with both the yacc-generated and menhir-generated parsers, and fails if the resulting ASTs do not exactly match (including locations, etc.). The current parser passes this test for all the files touched by make world.opt (we could test it further on all OPAM-accessible sources, but this would require extensive data collection work; I am already confident that the parser has very few regressions, if any). (Getting the exact same locations as yacc required some changes in Menhir's location handling by François.)
The next step is to use Menhir's abstraction features to remove all the #define used in the semantic actions, and thus get a proper .mly -- the "cpp war". This patch is still in RFC stage because this step is not finished, and I would not propose to include a cpp-preprocessed grammar in trunk. I have done two initial patches in this direction, one for mkrhs and one for extra_{str,sig,...}. It is easy to verify that these changes, which get us further and further from the yacc grammar, are correctness-preserving, thanks to the AST comparison machinery.
Requirements for a final replacement?
My personal opinion is that, as soon as this last step is finished, the resulting grammar could be considered for inclusion, replacing the current ocamlyacc grammar. This requires accepting the parsing performance overhead. Finally, this would not be the end of the road, as more work is required to get better syntax error messages.