-
-
Notifications
You must be signed in to change notification settings - Fork 252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pre-RFC: Non-token-generating symbols in errors #327
Comments
Thanks for your thoughts and super excited to tackle this. Some important questions that this would need to asnwer:
These are just a starting point for an actual RFC and I'm definitely interested to see where the discussion leads. |
Note that I've not added a way to actually ignore pairs in the pest-ast derive yet; you'd need to have a private member to eat it (though it can just be a unit value). But for the most part, pest-ast allows you to ignore the shape of the parse tree and just grab what you really want. I'd be interested in what it'd look like to change all silent rules to contribute to error reporting. (I don't know!) Honestly, error recovery and reporting is one of the areas I've spent the least effort considering myself. (I've spent way too much time working on just wasting time with parsing tools on my project, it's just too much fun and a very concrete problem unlike all the soft ones for the rest of the stack.) If we get this, though, |
@dragostis My opinion is that this type of issue is likely better to be handled at the parser generator level. I like the direction With regards to if we want to expose this information to someone examining the grammar, I am not sure how much different this is than what is exposed by suppressing the pairs in the grammar already. In my example above I already "leak" the fact that I don't handle each digit individually, rather I group them, in this case it doesn't particularly matter but there may be other formats where it does. |
I know this is a bit older but just wanted to +1 on this.
I am a proponent of having non-token-generating literals be included in error messages by default and allowing something in the grammar definition to specify exclusion from errors. At least for me, I've found that ignoring non-token-generating literals in error reporting generally leads to more unexpected error messages. Edit: adding some examples Take a simple struct declaration missing a semicolon on one of its members:
No semicolon rule
(but that is a type identifier?) Separate semicolon rule
cc @cramertj |
The OCaml parsing library
This is intended as an optimization for streaming parsers, but I have used it to provide clearer error messages. Maybe |
@wirelyre, this is something that I've had in mind for a while now and I love the idea; thanks for sharing it. It can definitely be done statically, but it requires quite a bit of work to get right. |
Some more brainstorming: There is also the option of making errors more expressive in general. Parsing doesn't just fail at one spot. There are two parts: the context which led to the failure (our current path in the parse tree), and the actual missing/unexpected characters. Maybe it makes sense to pair these together. Something like: struct ParsingError<R> {
context: (InputLocation, R), // or Vec<R> for full context
failures: Vec<(InputLocation, Kind<R>)>,
}
enum Kind<R> {
Literal(&'static str), // "expected ';'"
Positive(R), // "expected operator"
Negative(R), // "unexpected integer"
// maybe also a negative literal
} Then propagating errors upwards becomes a heuristic question. Is it more useful (in this parsing state) to keep the full list of failures, or to merge a whole
I'd need more time trying this on paper but intuitively it feels more flexible and informative than the current types. Even without any extra syntax in grammars. Also I expect there are some pretty easy heuristics to avoid adding too many |
Currently to have a rule be provided as expected it must produce a token.
This means that for some important syntax you end up with extra tokens in your stream.
Consider the following grammar which parses ";" delimited lines.
for the input "19191" Pest produces the error
Which is unintuitive for end users.
If you make the ";" into a rule
The generated error is as expected but you end up dealing with extra tokens that aren't critical for building ASTs just for parsing the grammar. But when you silence the token that rule produces it goes back to the unintuitive error.
It would be nice if Pest provided a way to indicate "The following rule/token is critical for parsing but I don't actually need a token produced for it"
This will likely need an RFC at some point but I wanted to open the discussion.
The text was updated successfully, but these errors were encountered: