Merr (Meta-ERRor) is a syntax error message generator for LR parsers. It passes samples of erroneous input to the parser and generates an error message function that returns the message written in the sample file when the same syntax error occurs again. Merr works with a modified version of the menhir parser generator. Adapting other parser generators such as ocamlyacc is probably not difficult. The modified menhir can be found at https://github.com/pippijn/menhir.
This tool is based on ideas from Clinton Jeffery's "merr" tool at http://unicon.sourceforge.net/merr/. It is recommended to read the technical paper introducing the concepts.
The merr program takes several inputs from which it produces its error message function:
-t <terminals> -e <errors.ml.in> -a <automaton> -p <parser command> -o <output.ml>
The terminals file contains an ocamlyacc grammar or tokens file amended with token names. An extract of this file could look like this:
    %token<string> TkIDENTIFIER "identifier"
    %token TkIF "if"
    %token TkTHEN "then"
    %token TkELSE "else"
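To make the shape of such a line concrete, here is a minimal OCaml sketch that splits one amended %token line into the constructor name and its string literal. The function name parse_token_line is hypothetical and this is not merr's own reader; it only assumes the layout shown in the extract above.

```ocaml
(* Split one amended %token line into (constructor name, display string).
   Returns None when the line carries no complete string literal. *)
let parse_token_line line =
  match String.index_opt line '"', String.rindex_opt line '"' with
  | Some first, Some last when first < last ->
    let description = String.sub line (first + 1) (last - first - 1) in
    let declaration = String.trim (String.sub line 0 first) in
    (* The constructor name is the last space-separated word before the literal. *)
    let name =
      match String.rindex_opt declaration ' ' with
      | Some i ->
        String.sub declaration (i + 1) (String.length declaration - i - 1)
      | None -> declaration
    in
    Some (name, description)
  | _ -> None
```

For the first line of the extract, this yields the pair of TkIDENTIFIER and identifier.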
Merr uses these strings when producing default error messages, so that the
user doesn't see the internal names like "TkIDENTIFIER". Since menhir doesn't
understand these extra string literals after the token name, the grammar file
needs to be preprocessed before passing it to the parser generator. A simple
sed -e 's/^\(%token[^"]*\w\+\)\s*".*"$/\1/' will remove the string literals
together with the whitespace preceding them. This can be done in a
The second input expected by merr is the sample file. Merr works by passing
each sample input to the parser, which should be instrumented to print a pair
(state, token) to its standard output. The state should be a number, and the
token should be the token type constructor name, e.g.
The parser should have a flag that disables its error messages and prints this
state/token pair instead. The parser command, including this flag, should be
passed to merr with the -p command line option.
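As an illustration of the instrumentation described above, the following OCaml sketch prints a state number and a token constructor name. The token type, the function names and the exact "(state, token)" layout are assumptions made for this example; check what your merr build actually expects before relying on the layout.

```ocaml
(* Hypothetical token type; a real parser would use its generated one. *)
type token =
  | TkIDENTIFIER of string
  | TkIF
  | EOF

(* Only the constructor name is printed, not the payload. *)
let constructor_name = function
  | TkIDENTIFIER _ -> "TkIDENTIFIER"
  | TkIF -> "TkIF"
  | EOF -> "EOF"

(* Render the pair; the "(state, token)" layout is an assumption. *)
let format_state_token state token =
  Printf.sprintf "(%d, %s)" state (constructor_name token)

(* What the parser could do on a syntax error when the flag is set. *)
let report_state_token state token =
  print_endline (format_state_token state token)
```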
The sample file has a syntax similar to that of OCaml. The following is an excerpt from merr's own error description.
    module Tokens = Etokens
    open Etoken (* provides string_of_token *)
    open Tokens (* provides the 'token' type *)

    let message = function
      | "open" -> function
        | EOF -> "expected module name after 'open'"
        | TkIDENTIFIER -> "unexpected identifier `%s' after 'open'"
                          "expected module name (capitalised)"
        | _ -> "unexpected token '%s' in handler definition"
      | "open Foo let message = function" -> function
        | EOF -> "expected '|'-separated code fragments"
        | _ -> "unexpected token '%s' where code fragments expected"
One or more open directives are always required, and the opened modules
should provide the function val string_of_token : token -> string and bring
the token type constructors into scope. Merr can generate a default function
that returns the strings in the terminals file, but usually you will want a
better description, using the token data. There can be any number of
In error messages, the %s part of the message is replaced by the result of
applying string_of_token to the erroneous token. So, for instance, if that
function returns the string argument of the TkIDENTIFIER constructor, the
identifier can be printed in the error message.
Error messages can consist of multiple consecutive format strings. In the
error message returned from the generated error function, these are joined
with the new-line character.
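The substitution and joining described in the last two paragraphs can be sketched in OCaml as follows. This is not the code merr generates; the token type, the "end of file" text and the helpers substitute and render are assumptions made for the example.

```ocaml
(* Hypothetical token type mirroring the terminals extract. *)
type token = TkIDENTIFIER of string | TkIF | TkTHEN | TkELSE | EOF

(* A user-supplied description that exposes the identifier's text. *)
let string_of_token = function
  | TkIDENTIFIER id -> id
  | TkIF -> "if"
  | TkTHEN -> "then"
  | TkELSE -> "else"
  | EOF -> "end of file"

(* Replace the first "%s" in [fmt] by [arg]; leave [fmt] alone if absent. *)
let substitute fmt arg =
  let n = String.length fmt in
  let rec find i =
    if i + 1 >= n then None
    else if fmt.[i] = '%' && fmt.[i + 1] = 's' then Some i
    else find (i + 1)
  in
  match find 0 with
  | None -> fmt
  | Some i -> String.sub fmt 0 i ^ arg ^ String.sub fmt (i + 2) (n - i - 2)

(* Substitute the token into every part, then join with new-lines. *)
let render parts token =
  let text = string_of_token token in
  String.concat "\n" (List.map (fun fmt -> substitute fmt text) parts)
```

Calling render on the two format strings of the TkIDENTIFIER case with the token TkIDENTIFIER "foo" gives a two-line message whose first line names the identifier foo.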
The top-level patterns contain erroneous sample input to be passed to the parser. The inner match dispatches over the current parser token, and a default catch-all case can be assigned that only regards the state, not the token. Instead of a nested match, one can also write the message directly after the code sample. Multi-matches are also possible, if you want the same error handling for multiple distinct code samples (and therefore states).
Merr will check that all code samples are unique in that they arrive at different states in the parser, so that there can be no ambiguities about which error message to display for a given error.
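A uniqueness check of this kind could look like the OCaml sketch below. It assumes that each sample has already been run through the parser and paired with the state number it reached; the function name find_conflicts and the result shape are hypothetical, not merr's own.

```ocaml
(* Given (sample, state) pairs, return every sample that reaches a state
   already claimed by an earlier sample, as (earlier, later, state). *)
let find_conflicts (samples : (string * int) list)
  : (string * string * int) list =
  let seen = Hashtbl.create 16 in
  List.fold_left (fun conflicts (sample, state) ->
    match Hashtbl.find_opt seen state with
    | Some earlier -> (earlier, sample, state) :: conflicts
    | None -> Hashtbl.add seen state sample; conflicts
  ) [] samples
  |> List.rev
```

An empty result means every sample maps to its own state, so each state has at most one candidate error message.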
As a secondary source of error messages, the merr program will try to find out
what tokens would allow a shift action in the parser. It will present the user
with a list of token names (from the terminals file) that the parser might
expect at the state where the error occurred. Menhir can produce an automaton
description parseable by merr, using the option
Using the function
After the parser has been run on the inputs and the error message function has
been generated, it can be called when an error occurs in the parser. The
modified menhir parser will raise an exception StateError of int * token,
where token is the token type used by the parser and the first argument is
the state. These two can be passed to the error function with the signature
val message : int -> token -> string. Further formatting can be done in the
client of the error module.
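A client might wire this together roughly as in the following OCaml sketch. The exception and the signature of message follow the description above; the token type, the state number 3, the stub parse function and the file-name prefix are all made up for illustration.

```ocaml
(* Hypothetical token type; a real client uses the parser's own. *)
type token = TkIDENTIFIER of string | TkIF | EOF

(* Exception shape as described for the modified menhir parser. *)
exception StateError of int * token

(* Stand-in for the generated error module; state numbers are made up. *)
let message (state : int) (token : token) : string =
  match state, token with
  | 3, EOF -> "expected module name after 'open'"
  | _ -> "syntax error"

(* Stand-in for the parser entry point: always fails in state 3 at EOF. *)
let parse (_input : string) : unit = raise (StateError (3, EOF))

(* Client-side formatting: prefix the message with a file name. *)
let parse_with_diagnostics file input =
  match parse input with
  | () -> Ok ()
  | exception StateError (state, token) ->
    Error (Printf.sprintf "%s: %s" file (message state token))
```

Keeping the formatting in the client means the generated module only has to map a state and a token to a message.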