Rules for well-formed parsers #638

Open
Kha opened this issue Aug 20, 2021 · 9 comments

Comments

@Kha (Member) commented Aug 20, 2021

With Parser and ParserDescr/syntax, we have two levels of abstraction for defining parsers. In both cases, however, it is possible to define parsers that break as-yet-unwritten rules, confusing metaprograms such as quotations and the pretty printer. We should define these rules, ideally enforce them at least for syntax by construction, and possibly lint them for Parser.

Abstract rule

The abstract rule is that the structure of the parser output should uniquely determine the parser call graph/grammar derivation tree that produced it. By "structure" (of a Syntax tree) we mean the tree modulo atoms, i.e. exactly what is considered when matching against quotations. An example of a parser breaking this rule is "foo" >> ident <|> "bar" >> term: ignoring atoms, we cannot know which alternative accepted the input bar x. Note that ident <|> term itself would be unambiguous because <|> is left biased. In theory, term <|> ident would also be acceptable, but we would need to know whether ident is part of the term category (or rather, whether the produced kind is so) to decide this case in practice.
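
A minimal sketch of such an ambiguous alternative at the syntax level (the category name demo is made up, and a toolchain that implements the checks proposed below might reject it):

declare_syntax_cat demo
syntax ("foo " ident <|> "bar " term) : demo
-- For the input `bar x`, the tree modulo atoms has a single identifier child,
-- which is exactly what the `foo` alternative produces as well, so a quotation
-- pattern cannot tell the two branches apart without inspecting the atom.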

Another counter example is many and other repetition combinators. In many p, if p is of unknown "arity" (# of produced nodes), we don't know which syntax node child belongs to which "sequence element". This was "fixed", but that fix is no good either: if we encounter a null node inside of a many p output, we don't know in general whether many introduced it because of arity > 1 or whether it was produced by p itself. We either have to wrap every sequence element in a node, which would be wasteful, or demand that p is of constant arity 1.
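
The same pitfall in the surface syntax, as a sketch (the names locs and locsGood are made up): the element parser "at " ident has arity 2, so the repetition must delimit each element with an extra null node, and that wrapper is indistinguishable from a null node the element parser could have produced itself (e.g. via an optional part).

syntax locs := ("at " ident)*
-- Requiring arity 1 avoids the ambiguity, e.g. by wrapping each element in an
-- explicit group node:
syntax locsGood := (group("at " ident))*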

In practice, we should strengthen this abstract rule: it should be possible to efficiently determine the grammar derivation based on a reasonable amount of static information. For example, we might not want to add new metadata that lets us decide whether ident is in term like we would need to above. And ideally we would like to decide <|> by looking at the root kind of the output alone instead of having to dive further into the syntax tree.

Implementation for syntax

Based on the above rule, here is a proposal for a conservative approximation for syntax, to be implemented in the translation to ParserDescr:

  • For each stx subterm, we compute the arity and, for arity 1, the produced kind, if known and unique
    • for parser aliases, we specify this metadata at registration time
    • categories have arity 1 and an unknown kind
    • for references to parser definitions (of type ParserDescr or Parser), we assume the arity is 1 and the kind is the declaration name. This is correct for syntax ... := and ... := leading/trailing_parser ... declarations, but obviously not in general. We could inspect the definition to be sure, except we can't if we want to make effective use of the module system. Alternatively, we could store the information in an environment extension.
  • In many p and other repetition combinators, we check that p is of arity 1
  • In p <|> q, we conservatively check that p has a unique produced kind that is not null (since for null we should not assume that there is a unique parser producing it), and that the RHS is of arity 1.
    • We might also want to allow $strLit <|> $strLit as a special case. However, this is not a great way to define e.g. Unicode alternatives since it ignores pp.unicode.
    • We might allow p of kind null if q also has a kind, and it is not null. This would not work with p <|> q <|> r, however.
  • In Notation.lean, the only syntax declaration that does not already fulfill these rules should be
syntax location := withPosition("at " locationWildcard <|> locationHyp)

With the additional null rule mentioned above, it should be acceptable with group("at " locationWildcard).
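
Concretely, the adjusted declaration could then read as follows (a sketch derived from the rule above, not necessarily the final form):

syntax location := withPosition(group("at " locationWildcard) <|> locationHyp)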

Implementation for Parser & removal of backtracking in the pretty printer

TBD

@tydeu (Member) commented Aug 23, 2021

@Kha

The abstract rule is that the structure of the parser output should uniquely determine the parser call graph/grammar derivation tree that produced it. By "structure" (of a Syntax tree) we mean the tree modulo atoms, i.e. exactly what is considered when matching against quotations. An example of a parser breaking this rule is "foo" >> ident <|> "bar" >> term: ignoring atoms, we cannot know which alternative accepted the input bar x.

What is conceptually wrong with discrimination based on atoms? That is, wouldn't it be more intuitive and straightforward to resolve this issue by just having quotations match atoms as well? This particular limitation seems very artificial and unnatural to me. It seems very likely that most end-users would write syntax like "foo" >> ident <|> "bar" >> term, expect it to work, and be very confused about why it doesn't / can't.

@Kha (Member, Author) commented Aug 30, 2021

What is conceptually wrong with discrimination based on atoms?

Because we want a quotation pattern in a semantic transformation, e.g. a macro, that used fun to still match if the user wrote λ. And, as a beneficial side effect, we avoid the code explosion from checking every single atom.

That is, wouldn't just be more intuitive and straightforward to resolve this issue by just having quotations match atoms as well?

That still wouldn't solve the efficiency part. We don't want arbitrary "lookahead" to be necessary to differentiate between the two sides of an <|>.
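
To illustrate the fun/λ point above, here is a small macro (the name my_id is made up for this sketch) whose pattern is written with fun but which, because matching ignores atoms, also fires when the user writes λ:

syntax "my_id " term : term

macro_rules
  | `(my_id fun $x => $b) => `(fun $x => $b)

-- The input below spells the lambda with λ, yet the pattern above still
-- matches, and the whole expression elaborates to fun x => x + 1.
#check my_id λ x => x + 1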

@Kha (Member, Author) commented Aug 30, 2021

In the end, the high-level syntax command is free to adhere to these rules in whatever way it wants. If an <|> doesn't fulfill the above rules, syntax could wrap the LHS with a unique kind automatically.
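
Written out by hand, such a wrapping could look like this for the earlier "foo"/"bar" example (all names are made up; this is only a sketch of what the automatic translation might produce):

syntax fooAlt := "foo " ident
syntax barAlt := "bar " term
syntax (fooAlt <|> barAlt) : term
-- Each reference now has arity 1 and a unique, non-null kind (its declaration
-- name), so the <|> rule above is satisfied and matching can dispatch on the
-- kind of the single child node.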

@tydeu (Member) commented Sep 3, 2021

Because we want a quotation pattern in a semantic transformation, e.g. a macro, that used fun to still match if the user wrote λ. And, as a beneficial side effect, we avoid the code explosion from checking every single atom.

Could not tokens with multiple different spellings just be parsed as a node with a unique kind rather than different atoms? That doesn't seem like it would add much overhead. In fact, it could even lessen it, as nodes come with a precomputed hash for their kind while atoms don't. Also, as I'm sure you know, there already exist specialized parsers like unicodeSymbol to parse tokens with different ASCII and Unicode spellings -- these could be retooled to produce nodes rather than atoms.
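
At the Parser level, that suggestion might look roughly like this (the name funTok is made up; node and unicodeSymbol are existing combinators):

import Lean
open Lean Parser in
def funTok : Parser :=
  -- accept either spelling as an atom, but wrap it in a node with its own
  -- kind so that matching could key on the kind rather than the atom text
  node `funTok (unicodeSymbol "λ" "fun")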

To your second point, the fact that quotations don't currently verify atoms already causes many headaches when writing macros. While it may be more computationally efficient, ill-formed syntax will often generate panics rather than clean errors when some later part of the code acts in a manner that assumed the syntax was well-formed. As you might imagine, this makes such errors hard to diagnose.

@Kha (Member, Author) commented Sep 4, 2021

Could not tokens with multiple different spellings just be parsed as a node with a unique kind rather than different atoms?

I can't make sense of this. Syntax is supposed to be a concrete syntax tree, so different inputs ultimately must lead to different atoms. Apart from identifiers, all tokens are stored in atoms, not nodes directly.

ill-formed syntax will often generate panics rather than clean errors when some later part of the code acts in a manner that assumed the syntax was well-formed

I'm not sure checking atoms would have caught the cases where I produced ill-formed syntax any earlier; the structure of the syntax is much easier to get wrong. The real solution here is a strongly-typed syntax type indexed by the syntax kind.
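
As a toy sketch of that idea (not an actual design; the names are made up), the kind could be carried in the type so that kind mismatches become type errors rather than runtime panics:

import Lean

structure KindedSyntax (kind : Lean.SyntaxNodeKind) where
  raw : Lean.Syntax

-- a consumer can now demand a specific kind in its signature
def identName (stx : KindedSyntax `ident) : Lean.Name :=
  stx.raw.getId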

@Kha (Member, Author) commented Sep 4, 2021

If the syntax quotation elaborator knew the parser that generated each syntax node of the pattern, it could implement special matching behavior for specific parsers such as unicodeSymbol. That would be a major refactoring, so I haven't considered it so far, though it might eventually be necessary for the typed syntax support (if we want syntax category types such as Syntax `term). Alternatively, we could wrap the output of unicodeSymbol in a special syntax kind that tells the quotation elaborator to ignore it during matching; perhaps that's even what you meant. It would introduce overhead in some very common syntax, but maybe not significantly so.

@tydeu (Member) commented Sep 5, 2021

Could not tokens with multiple different spellings just be parsed as a node with a unique kind rather than different atoms?

I can't make sense of this. Syntax is supposed to be a concrete syntax tree, so different inputs ultimately must lead to different atoms. Apart from identifiers, all tokens are stored in atoms, not nodes directly.

Alternatively, we could wrap the output of unicodeSymbol in a special syntax kind that tells the quotation elaborator to ignore it during matching; perhaps that's even what you meant. It would introduce overhead in some very common syntax, but maybe not significantly so.

Yeah, that is closer to what I meant by that sentence: keywords with multiple different spellings (atoms) could be wrapped in a node with a distinct kind. I don't think this would incur significant overhead, but, then again, I am not an expert on the parser.

@tydeu (Member) commented Sep 5, 2021

As an aside:

The real solution here is a strongly-typed syntax type indexed by the syntax kind.

I think this would be a wonderful long-term goal and wasn't aware that it was already being considered! I can already see many places where it would make macro writing considerably more straightforward. For example, one wouldn't have to keep re-verifying that a node is of an expected kind at each level of a macro.

@gebner (Member) commented Jan 27, 2022

Because we want a quotation pattern in a semantic transformation, e.g. a macro, that used fun to still match if the user wrote λ.

Note that we only have 7 Unicode/ASCII symbol pairs; if those were the only issue, I think it would be defensible to just remove the duplicate notation. I agree it would be a good idea to add an irrelevant node kind that instructs the pattern matcher to ignore that subtree.

A related gotcha is the pattern `(let $x := $v; $b); it would be nice if that worked as one might expect at first glance.
