Pre-RFC: Non-token-generating symbols in errors #327

robojeb · 2018-10-30T19:01:31Z

Currently to have a rule be provided as expected it must produce a token.
This means that for some important syntax you end up with extra tokens in your stream.

Consider the following grammar which parses ";" delimited lines.

line = { number ~ ";" }
number = { digit+ }
digit = _{ '0'..'9' }

for the input "19191" Pest produces the error

Expected "Line"

Which is unintuitive for end users.

If you make the ";" into a rule

line = { number ~ semicolon }
number = { digit+ }
digit = _{ '0'..'9' }
semicolon = { ";" }

The generated error is as expected but you end up dealing with extra tokens that aren't critical for building ASTs just for parsing the grammar. But when you silence the token that rule produces it goes back to the unintuitive error.

It would be nice if Pest provided a way to indicate "The following rule/token is critical for parsing but I don't actually need a token produced for it"

This will likely need an RFC at some point but I wanted to open the discussion.

The text was updated successfully, but these errors were encountered:

dragostis · 2018-10-31T19:58:54Z

Thanks for your thoughts and super excited to tackle this.

Some important questions that this would need to asnwer:

how does this relate to pest-ast?
would pest-ast make this issue irrelevant because it hides the extra token?
do we want the person reading the grammar to know that a specific rule is meant for error reporting only?
should we introduce a special kind of rule or expression in order to facilitate this?

These are just a starting point for an actual RFC and I'm definitely interested to see where the discussion leads.

CAD97 · 2018-10-31T21:42:49Z

Note that I've not added a way to actually ignore pairs in the pest-ast derive yet; you'd need to have a private member to eat it (though it can just be a unit value). But for the most part, pest-ast allows you to ignore the shape of the parse tree and just grab what you really want.

I'd be interested in what it'd look like to change all silent rules to contribute to error reporting. (I don't know!) Honestly, error recovery and reporting is one of the areas I've spent the least effort considering myself. (I've spent way too much time working on just wasting time with parsing tools on my project, it's just too much fun and a very concrete problem unlike all the soft ones for the rest of the stack.)

If we get this, though, EOI should adopt it when it can, though. The only reason it isn't silent in 2 is for error reporting.

robojeb · 2018-11-06T21:16:24Z

@dragostis My opinion is that this type of issue is likely better to be handled at the parser generator level.

I like the direction pest-ast is heading and I think I will look into using that in the future.
Though I could imagine some cases where a user of pest is going to want to build the AST by hand for validation or because they have more complex structures that can't be easily derived.

With regards to if we want to expose this information to someone examining the grammar, I am not sure how much different this is than what is exposed by suppressing the pairs in the grammar already.

In my example above I already "leak" the fact that I don't handle each digit individually, rather I group them, in this case it doesn't particularly matter but there may be other formats where it does.
I think potentially notating "critical" tokens could potentially assist in understanding the grammar.

GodTamIt · 2019-05-21T18:14:08Z

I know this is a bit older but just wanted to +1 on this.

do we want the person reading the grammar to know that a specific rule is meant for error reporting only?
should we introduce a special kind of rule or expression in order to facilitate this?

I am a proponent of having non-token-generating literals be included in error messages by default and allowing something in the grammar definition to specify exclusion from errors. At least for me, I've found that ignoring non-token-generating literals in error reporting generally leads to more unexpected error messages.

Edit: adding some examples

Take a simple struct declaration missing a semicolon on one of its members:

struct MyStruct {
    bool my_bool
};

No semicolon rule

--> test.txt:2:5
   |
 2 |     bool my_bool␊
   |     ^---
   |
   = expected type identifier

(but that is a type identifier?)

Separate semicolon rule

--> test.txt:3:1
   |
 3 | };␊
   | ^---
   |
   = expected identifier or semicolon

cc @cramertj

wirelyre · 2019-07-15T21:24:17Z

The OCaml parsing library angstrom has a primitive called commit, which could be useful inspiration.

commit indicates that, outside of the usual success–failure–backtrack strategy, if a parse ever reaches that point, no other parse will (should) ever hit that input position. In other words, it commits to the current parse path; if you'd need to backtrack past a commit, parsing fails immediately.

This is intended as an optimization for streaming parsers, but I have used it to provide clearer error messages. Maybe pest could adopt the primitive, perhaps inserting it automatically from static analysis.

dragostis · 2019-07-15T22:09:07Z

@wirelyre, this is something that I've had in mind for a while now and I love the idea; thanks for sharing it. It can definitely be done statically, but it requires quite a bit of work to get right.

wirelyre · 2019-07-15T23:18:56Z

Some more brainstorming:

There is also the option of making errors more expressive in general.

Parsing doesn't just fail at one spot. There are two parts: the context which led to the failure (our current path in the parse tree), and the actual missing/unexpected characters.

Maybe it makes sense to pair these together. Something like:

struct ParsingError<R> {
    context: (InputLocation, R), // or Vec<R> for full context
    failures: Vec<(InputLocation, Kind<R>)>,
}

enum Kind<R> {
    Literal(&'static str), // "expected ';'"
    Positive(R),           // "expected operator"
    Negative(R),           // "unexpected integer"
    // maybe also a negative literal
}

Then propagating errors upwards becomes a heuristic question. Is it more useful (in this parsing state) to keep the full list of failures, or to merge a whole ParsingError into a single Kind::Positive?

19191
^ error while parsing line
19191
     ^ expected ';'

versus

19191
^ expected line

I'd need more time trying this on paper but intuitively it feels more flexible and informative than the current types. Even without any extra syntax in grammars. Also I expect there are some pretty easy heuristics to avoid adding too many failures.

dragostis added the needs-rfc label Oct 31, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pre-RFC: Non-token-generating symbols in errors #327

Pre-RFC: Non-token-generating symbols in errors #327

robojeb commented Oct 30, 2018

dragostis commented Oct 31, 2018

CAD97 commented Oct 31, 2018

robojeb commented Nov 6, 2018

GodTamIt commented May 21, 2019 •

edited

wirelyre commented Jul 15, 2019

dragostis commented Jul 15, 2019

wirelyre commented Jul 15, 2019

Pre-RFC: Non-token-generating symbols in errors #327

Pre-RFC: Non-token-generating symbols in errors #327

Comments

robojeb commented Oct 30, 2018

dragostis commented Oct 31, 2018

CAD97 commented Oct 31, 2018

robojeb commented Nov 6, 2018

GodTamIt commented May 21, 2019 • edited

wirelyre commented Jul 15, 2019

dragostis commented Jul 15, 2019

wirelyre commented Jul 15, 2019

GodTamIt commented May 21, 2019 •

edited