Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-RFC: Non-token-generating symbols in errors #327

Open
robojeb opened this issue Oct 30, 2018 · 7 comments
Open

Pre-RFC: Non-token-generating symbols in errors #327

robojeb opened this issue Oct 30, 2018 · 7 comments

Comments

@robojeb
Copy link

robojeb commented Oct 30, 2018

Currently to have a rule be provided as expected it must produce a token.
This means that for some important syntax you end up with extra tokens in your stream.

Consider the following grammar which parses ";" delimited lines.

line = { number ~ ";" }
number = { digit+ }
digit = _{ '0'..'9' }

for the input "19191" Pest produces the error

Expected "Line"

Which is unintuitive for end users.

If you make the ";" into a rule

line = { number ~ semicolon }
number = { digit+ }
digit = _{ '0'..'9' }
semicolon = { ";" }

The generated error is as expected but you end up dealing with extra tokens that aren't critical for building ASTs just for parsing the grammar. But when you silence the token that rule produces it goes back to the unintuitive error.

It would be nice if Pest provided a way to indicate "The following rule/token is critical for parsing but I don't actually need a token produced for it"

This will likely need an RFC at some point but I wanted to open the discussion.

@dragostis
Copy link
Contributor

Thanks for your thoughts and super excited to tackle this.

Some important questions that this would need to asnwer:

  • how does this relate to pest-ast?
  • would pest-ast make this issue irrelevant because it hides the extra token?
  • do we want the person reading the grammar to know that a specific rule is meant for error reporting only?
  • should we introduce a special kind of rule or expression in order to facilitate this?

These are just a starting point for an actual RFC and I'm definitely interested to see where the discussion leads.

@CAD97
Copy link
Contributor

CAD97 commented Oct 31, 2018

Note that I've not added a way to actually ignore pairs in the pest-ast derive yet; you'd need to have a private member to eat it (though it can just be a unit value). But for the most part, pest-ast allows you to ignore the shape of the parse tree and just grab what you really want.

I'd be interested in what it'd look like to change all silent rules to contribute to error reporting. (I don't know!) Honestly, error recovery and reporting is one of the areas I've spent the least effort considering myself. (I've spent way too much time working on just wasting time with parsing tools on my project, it's just too much fun and a very concrete problem unlike all the soft ones for the rest of the stack.)

If we get this, though, EOI should adopt it when it can, though. The only reason it isn't silent in 2 is for error reporting.

@robojeb
Copy link
Author

robojeb commented Nov 6, 2018

@dragostis My opinion is that this type of issue is likely better to be handled at the parser generator level.

I like the direction pest-ast is heading and I think I will look into using that in the future.
Though I could imagine some cases where a user of pest is going to want to build the AST by hand for validation or because they have more complex structures that can't be easily derived.

With regards to if we want to expose this information to someone examining the grammar, I am not sure how much different this is than what is exposed by suppressing the pairs in the grammar already.

In my example above I already "leak" the fact that I don't handle each digit individually, rather I group them, in this case it doesn't particularly matter but there may be other formats where it does.
I think potentially notating "critical" tokens could potentially assist in understanding the grammar.

@GodTamIt
Copy link

GodTamIt commented May 21, 2019

I know this is a bit older but just wanted to +1 on this.

do we want the person reading the grammar to know that a specific rule is meant for error reporting only?
should we introduce a special kind of rule or expression in order to facilitate this?

I am a proponent of having non-token-generating literals be included in error messages by default and allowing something in the grammar definition to specify exclusion from errors. At least for me, I've found that ignoring non-token-generating literals in error reporting generally leads to more unexpected error messages.

Edit: adding some examples

Take a simple struct declaration missing a semicolon on one of its members:

struct MyStruct {
    bool my_bool
};

No semicolon rule

--> test.txt:2:5
   |
 2 |     bool my_bool␊
   |     ^---
   |
   = expected type identifier

(but that is a type identifier?)

Separate semicolon rule

--> test.txt:3:1
   |
 3 | };␊
   | ^---
   |
   = expected identifier or semicolon

cc @cramertj

@wirelyre
Copy link
Contributor

The OCaml parsing library angstrom has a primitive called commit, which could be useful inspiration.

commit indicates that, outside of the usual success–failure–backtrack strategy, if a parse ever reaches that point, no other parse will (should) ever hit that input position. In other words, it commits to the current parse path; if you'd need to backtrack past a commit, parsing fails immediately.

This is intended as an optimization for streaming parsers, but I have used it to provide clearer error messages. Maybe pest could adopt the primitive, perhaps inserting it automatically from static analysis.

@dragostis
Copy link
Contributor

@wirelyre, this is something that I've had in mind for a while now and I love the idea; thanks for sharing it. It can definitely be done statically, but it requires quite a bit of work to get right.

@wirelyre
Copy link
Contributor

Some more brainstorming:

There is also the option of making errors more expressive in general.

Parsing doesn't just fail at one spot. There are two parts: the context which led to the failure (our current path in the parse tree), and the actual missing/unexpected characters.

Maybe it makes sense to pair these together. Something like:

struct ParsingError<R> {
    context: (InputLocation, R), // or Vec<R> for full context
    failures: Vec<(InputLocation, Kind<R>)>,
}

enum Kind<R> {
    Literal(&'static str), // "expected ';'"
    Positive(R),           // "expected operator"
    Negative(R),           // "unexpected integer"
    // maybe also a negative literal
}

Then propagating errors upwards becomes a heuristic question. Is it more useful (in this parsing state) to keep the full list of failures, or to merge a whole ParsingError into a single Kind::Positive?

19191
^ error while parsing line
19191
     ^ expected ';'

versus

19191
^ expected line

I'd need more time trying this on paper but intuitively it feels more flexible and informative than the current types. Even without any extra syntax in grammars. Also I expect there are some pretty easy heuristics to avoid adding too many failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants