Rules for well-formed parsers #638
Comments
What is conceptually wrong with discrimination based on atoms? That is, wouldn't it be more intuitive and straightforward to resolve this issue by just having quotations match atoms as well? This particular limitation seems very artificial and unnatural to me. It seems very likely that most end-users would write syntax like |
Because we want a quotation pattern in a semantic transformation, e.g. a macro, that used
That still wouldn't solve the efficiency part. We don't want arbitrary "lookahead" to be necessary to differentiate between the two sides of an |
In the end, the high-level |
Could tokens with multiple different spellings not just be parsed as a node with a unique kind rather than as different atoms? That doesn't seem like it would add much overhead. In fact, it could even lessen it, as nodes come with a precomputed hash for their kind while atoms don't. Also, as I'm sure you know, there already exist specialized parsers like To your second point, the fact that quotations don't currently verify atoms already causes many headaches when writing macros. While it may be more computationally efficient, ill-formed syntax will often generate panics rather than clean errors when some later part of the code acts in a manner that assumed the syntax was well-formed. As you might imagine, this makes such errors hard to diagnose. |
I can't make sense of this.
I'm not sure checking atoms would have caught the cases where I produced ill-formed syntax any earlier; the structure of the syntax is much easier to get wrong. The real solution here is a strongly-typed syntax type indexed by the syntax kind. |
If the syntax quotation elaborator knew the parser that generated each syntax node of the pattern, it could implement special matching behavior for specific parsers such as |
Yeah, that is closer to what I meant by that sentence: keywords with multiple different spellings (atoms) could be wrapped in a node with a distinct kind. I don't think this would incur significant overhead, but, then again, I am not an expert on the parser. |
As an aside:
I think this would be a wonderful long-term goal, and I wasn't aware that it was already being considered! I can already see many places where it would make macro writing considerably more straightforward. For example, one wouldn't have to keep re-verifying that a node is of an expected kind at each level of a macro. |
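To make the idea of a syntax type indexed by its kind more concrete, here is a minimal Lean 4 sketch. Everything in it (`Raw`, `Typed`, `Typed.ofRaw?`, `expandFoo`) is a hypothetical toy model invented for illustration, not Lean's actual `Syntax` API; it only shows how the kind check could happen once, at construction, instead of at every level of a macro.

```lean
/-- Untyped toy syntax tree (hypothetical; not Lean's `Syntax`). -/
inductive Raw where
  | atom (val : String)
  | node (kind : String) (args : List Raw)

/-- The kind of a tree, if its root is a node. -/
def Raw.kind? : Raw → Option String
  | .node k _ => some k
  | .atom _   => none

/-- A tree whose root kind is recorded in the type and proven at construction. -/
structure Typed (kind : String) where
  raw : Raw
  ok  : raw.kind? = some kind

/-- The single place where the kind is actually verified. -/
def Typed.ofRaw? (kind : String) (r : Raw) : Option (Typed kind) :=
  if h : r.kind? = some kind then some ⟨r, h⟩ else none

/-- A macro-like function that relies on the kind in its argument type. -/
def expandFoo (s : Typed "foo") : Nat :=
  match s.raw with
  | .node _ args => args.length
  | .atom _      => 0  -- excluded by `s.ok`; kept only so the sketch stays total

#eval (Typed.ofRaw? "foo" (Raw.node "foo" [Raw.atom "x"])).map expandFoo  -- some 1
#eval (Typed.ofRaw? "foo" (Raw.node "bar" [])).map expandFoo              -- none
```

Downstream functions such as `expandFoo` take a `Typed "foo"` and therefore cannot be called on a tree of the wrong kind; whether a real design would carry a proof field like `ok` or use some other encoding is left open here.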
Note that we only have 7 Unicode/ASCII symbol pairs; if those were the only issue, I think it would be defensible to just remove the duplicate notation. I agree it would be a good idea to add an A related gotcha is the pattern |
against `mathlib` fd47bdf09e90f553519c712378e651975fe8c829 Co-authored-by: Scott Morrison <scott.morrison@gmail.com>
With `Parser` and `ParserDescr`/`syntax`, we have two levels of abstraction for defining parsers. In both cases, however, it is possible to define parsers that break as-of-yet unwritten rules, confusing meta programs such as quotations and the pretty printer. We should define these rules, ideally enforce them at least for `syntax` by construction, and possibly lint them for `Parser`.

Abstract rule
The abstract rule is that the structure of the parser output should uniquely determine the parser call graph/grammar derivation tree that produced it. By "structure" (of a `Syntax` tree) we mean the tree modulo atoms, i.e. exactly what is considered when matching against quotations. An example of a parser breaking this rule is `"foo" >> ident <|> "bar" >> term`: ignoring atoms, we cannot know which alternative accepted the input `bar x`. Note that `ident <|> term` itself would be unambiguous because `<|>` is left biased. In theory, `term <|> ident` would also be acceptable, but we would need to know whether `ident` is part of the `term` category (or rather, whether the produced kind is so) to decide this case in practice.

Another counter example is `many` and other repetition combinators. In `many p`, if `p` is of unknown "arity" (# of produced nodes), we don't know which syntax node child belongs to which "sequence element". This was "fixed", but that fix is no good either: if we encounter a `null` node inside of a `many p` output, we don't know in general whether `many` introduced it because of arity > 1 or whether it was produced by `p` itself. We either have to wrap every sequence element in a node, which would be wasteful, or demand that `p` is of constant arity 1.

In practice, we should strengthen this abstract rule: it should be possible to efficiently determine the grammar derivation based on a reasonable amount of static information. For example, we might not want to add new metadata that lets us decide whether `ident` is in `term` like we would need to above. And ideally we would like to decide `<|>` by looking at the root kind of the output alone instead of having to dive further into the syntax tree.

Implementation for `syntax`
Based on the above rule, here is a proposal for a conservative approximation for `syntax`, to be implemented in the translation to `ParserDescr`:

- For each `stx` subterm, we compute the arity and, for arity 1, the produced kind, if known and unique.
  - For references to other parsers (`ParserDescr` or `Parser`), we assume the arity is 1 and the kind is the declaration name. This is correct for `syntax ... :=` and `... := leading/trailing_parser ...` declarations, but obviously not in general. We could inspect the definition to be sure, except we can't if we want to make effective use of the module system. Alternatively, we could store the information in an environment extension.
- For `many p` and other repetition combinators, we check that `p` is of arity 1.
- For `p <|> q`, we conservatively check that `p` has a unique produced kind that is not `null` (since for `null` we should not assume that there is a unique parser producing it), and that the RHS is of arity 1.
  - We could allow `$strLit <|> $strLit` as a special case. However, this is not a great way to define e.g. Unicode alternatives since it ignores `pp.unicode`.
  - We could also allow `p` of kind `null` if `q` also has a kind, and it is not `null`. This would not work with `p <|> q <|> r`, however.

In `Notation.lean`, there should be only one syntax declaration that does not already fulfill these rules. With the additional `null` rule mentioned above, it should be acceptable with `group("at " locationWildcard)`.
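The checks proposed above can be sketched in Lean 4 over a toy description type. This is only an illustration under stated assumptions: `Stx`, `Shape`, `check`, and `accepted` are hypothetical names, not part of `ParserDescr`; an atom is assumed to have arity 1 and no kind; `many p` is assumed to produce a single `null` node; a reference is assumed to have arity 1 and its declaration name as kind, as in the proposal. The two optional special cases (`$strLit <|> $strLit` and a `null`-kinded LHS) are not modeled.

```lean
/-- Toy model of a `syntax` description (hypothetical; not Lean's `ParserDescr`). -/
inductive Stx where
  | atom   (val : String)   -- a token such as "foo"
  | ref    (decl : String)  -- reference to another parser declaration
  | seq    (p q : Stx)      -- p >> q
  | many   (p : Stx)        -- repetition
  | orelse (p q : Stx)      -- p <|> q

/-- Arity, plus the produced kind when it is known and unique (arity 1 only). -/
structure Shape where
  arity : Nat
  kind? : Option String := none

/-- Conservative check: returns the computed shape or the reason for rejection. -/
def check : Stx → Except String Shape
  | .atom _   => pure { arity := 1 }
  | .ref decl => pure { arity := 1, kind? := some decl }
  | .seq p q  => do
    let sp ← check p
    let sq ← check q
    pure { arity := sp.arity + sq.arity }
  | .many p   => do
    let sp ← check p
    unless sp.arity == 1 do
      throw "many: element parser must have arity 1"
    pure { arity := 1, kind? := some "null" }
  | .orelse p q => do
    let sp ← check p
    let sq ← check q
    match sp.kind? with
    | none   => throw "<|>: LHS has no unique produced kind"
    | some k =>
      if k == "null" then
        throw "<|>: LHS kind must not be null"
      unless sq.arity == 1 do
        throw "<|>: RHS must have arity 1"
      -- the result kind is unique only if both sides agree on it
      pure { arity := 1, kind? := if sq.kind? == some k then some k else none }

def accepted (s : Stx) : Bool :=
  match check s with
  | .ok _    => true
  | .error _ => false

def fooIdent : Stx := .seq (.atom "foo") (.ref "ident")
def barTerm  : Stx := .seq (.atom "bar") (.ref "term")

#eval accepted (.orelse fooIdent barTerm)              -- false: "foo" >> ident <|> "bar" >> term
#eval accepted (.orelse (.ref "ident") (.ref "term"))  -- true:  ident <|> term
#eval accepted (.many fooIdent)                        -- false: element has arity 2
```

Because the result of `p <|> q` loses its kind when the two sides disagree, a left-nested `(p <|> q) <|> r` can be rejected in this model while the right-nested `p <|> (q <|> r)` passes, which is one way the check errs on the conservative side.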
Implementation for `Parser` & removal of backtracking in the pretty printer

TBD