Invalid token doesn't consume input from match_ in non-initial rule #48

Closed
MiSawa opened this issue Mar 14, 2022 · 8 comments · Fixed by #49


MiSawa commented Mar 14, 2022

This works as I expected:

lexer! {
    Lexer -> &'input str;

    rule Init {
        's' = "s",
        "a" => |lexer| lexer.return_(lexer.match_()),
    }
}

let input = "sxa";
let mut lexer = Lexer::new(input);
assert_eq!(lexer.next(), Some(Ok((loc(0, 0, 0), "s", loc(0, 1, 1)))));
assert_eq!(lexer.next(), Some(Err(LexerError { location: loc(0, 1, 1), kind: crate::LexerErrorKind::InvalidToken })));
assert_eq!(lexer.next(), Some(Ok((loc(0, 2, 2), "a", loc(0, 3, 3)))));
assert_eq!(lexer.next(), None);

though if I move "a" to a different rule and have "s" switch to that rule, it behaves differently, while I expected exactly the same results as above:

lexer! {
    Lexer -> &'input str;

    rule Init {
        's' => |lexer| lexer.switch_and_return(LexerRule::InString, "s"),
    }
    rule InString {
        "a" => |lexer| lexer.return_(lexer.match_()),
    }
}

let input = "sxa";
let mut lexer = Lexer::new(input);
assert_eq!(lexer.next(), Some(Ok((loc(0, 0, 0), "s", loc(0, 1, 1)))));
assert_eq!(lexer.next(), Some(Err(LexerError { location: loc(0, 1, 1), kind: crate::LexerErrorKind::InvalidToken })));
// Why is "x" included here???
assert_eq!(lexer.next(), Some(Ok((loc(0, 1, 1), "xa", loc(0, 3, 3)))));
// Perhaps this is because we don't switch back to the Init rule?
assert_eq!(lexer.next(), Some(Err(LexerError { location: loc(0, 3, 3), kind: crate::LexerErrorKind::InvalidToken })));
assert_eq!(lexer.next(), None);

It'd be great if

  • InvalidToken always consumed the invalid part, and
  • if the last LexerError is expected behavior, it were documented in README.md (and perhaps returned with a different error kind than InvalidToken?)
MiSawa changed the title from "Invalid token doesn't consume input from match in non-initial rule" to "Invalid token doesn't consume input from match_ in non-initial rule" on Mar 14, 2022

osa1 commented Mar 15, 2022

Thanks for reporting this.

The fix is easy, but I'm afraid it won't work the way you might expect. Basically the rule is that when the lexer fails, it also consumes the character. Here's an example to demonstrate:

use lexgen::lexer;

lexer! {
    Lexer -> &'input str;

    rule Init {
        "aaa" = "a",
        "b" = "b",
    }
}

fn main() {
    let input = "aab";
    let mut lexer = Lexer::new(input);
    println!("{:?}", lexer.next());
    println!("{:?}", lexer.next());
}

Output:

Some(Err(LexerError { location: Loc { line: 0, col: 0, byte_idx: 0 }, kind: InvalidToken }))
None

Since the lexer only realizes that the input is invalid when it sees the b, the b is also consumed. So the next next() call returns None.

With the bug that causes the inconsistent behavior in your example fixed, this is the output for your program:

lexer! {
    Lexer -> &'input str;

    rule Init {
        's' => |lexer| lexer.switch_and_return(LexerRule::InString, "s"),
    }
    rule InString {
        "a" => |lexer| lexer.return_(lexer.match_()),
    }
}

fn main() {
    let input = "sxa";
    let mut lexer = Lexer::new(input);
    println!("{:?}", lexer.next());
    println!("{:?}", lexer.next());
    println!("{:?}", lexer.next());
    println!("{:?}", lexer.next());
}

Output:

Some(Ok((Loc { line: 0, col: 0, byte_idx: 0 }, "s", Loc { line: 0, col: 1, byte_idx: 1 })))
Some(Err(LexerError { location: Loc { line: 0, col: 1, byte_idx: 1 }, kind: InvalidToken }))
Some(Err(LexerError { location: Loc { line: 0, col: 2, byte_idx: 2 }, kind: InvalidToken }))
None

We could make the lexers not consume the character when they return an error, but I'm not sure it would be any more useful than the current behavior. In general, recovery from a lexing error will depend on the current lexer state (i.e. the current rule) and the lexer user state. I'm not sure what would be a good interface to provide to users so that they can recover from lexing errors.

In my use case, I'm trying not to fail in the lexer, but rather to mark returned tokens/lexemes with error information. For example, if I'm lexing character literals, an unterminated literal like 'a is lexed as CharLiteral { terminated: false, ... }. If the lexer fails, I skip characters until the next whitespace and restart lexing from there. This is of course specific to my application and won't work in general.
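To sketch the idea (a simplified, untested example rather than my actual code; the Token type and its fields are made up for illustration):

use lexgen::lexer;

#[derive(Debug, PartialEq, Eq)]
enum Token<'input> {
    CharLiteral { terminated: bool, contents: &'input str },
}

lexer! {
    Lexer -> Token<'input>;

    rule Init {
        // A terminated character literal: 'a'
        "'" _ "'" => |lexer| {
            let m = lexer.match_();
            lexer.return_(Token::CharLiteral { terminated: true, contents: &m[1..m.len() - 1] })
        },
        // An unterminated literal like 'a is still turned into a token,
        // just with terminated: false, instead of failing the lexer.
        "'" _ => |lexer| {
            let m = lexer.match_();
            lexer.return_(Token::CharLiteral { terminated: false, contents: &m[1..] })
        },
    }
}

Since the longest match wins, 'a' is lexed by the first rule, and a lone 'a (e.g. at the end of input) by the second.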

osa1 added a commit that referenced this issue Mar 15, 2022

MiSawa commented Mar 16, 2022

Thank you for the swift response and fix!

Basically the rule is that when the lexer fails, it also consumes the character.

Sorry that I didn't include this in the first place, but that'd work fine for me! Actually I'm not trying to recover from lexer errors (at least for now); my problem was that semantic actions were being invoked with a match_() that had an unexpected prefix, which caused a panic.

"\\u" $hex_digit $hex_digit $hex_digit $hex_digit =? |lexer| {
    // Ooops, unwrap cause a panic if there were characters left from InvalidToken!
    let value = u32::from_str_radix(&lexer.match_()[2..], 16).unwrap();
    // This works. (But what to do if the rule didn't have a fixed length like this example?)
    let value = u32::from_str_radix(&lexer.match_()[lexer.match_().len()-4..], 16).unwrap();
    // do something with value and return
    todo!()
}

As for error recovery, I have a question: I somehow assumed that the rule with the longest match is preferred, and that ties are broken by the order of appearance in the rule {} statement. Is this a correct assumption?
I think the current interface is fine as long as there's a fixed precedence over rules, so that we can write a "fallback case". If a user wants to recover from errors, they can write a fallback case explicitly like

rule Init {
    's' => |lexer| lexer.switch_and_return(LexerRule::InString, lexer.match_()),
}
rule InString {
    'a' => |lexer| lexer.return_(lexer.match_()),
    _ =? |lexer| lexer.return_(Err(lexer.match_())),
}

(or enter a new InFailure rule and consume until a whitespace or whatever), so users should already be able to achieve whatever they want (assuming the precedence between 'a' and _ in the above example).
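For example, something along these lines (untested, and assuming the built-in $$whitespace regex and the precedence described above):

rule InString {
    'a' => |lexer| lexer.return_(lexer.match_()),
    // Anything unexpected: switch to a recovery rule instead of letting
    // the match fail with InvalidToken.
    _ => |lexer| lexer.switch(LexerRule::InFailure),
}

rule InFailure {
    // On the next whitespace, report everything skipped so far as a
    // custom error and go back to Init. Listed first so it wins the
    // tie against the catch-all rule below.
    $$whitespace =? |lexer| lexer.switch_and_return(LexerRule::Init, Err(lexer.match_())),
    // Otherwise keep consuming the bad fragment.
    _ => |lexer| lexer.continue_(),
}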


osa1 commented Mar 16, 2022

Sorry that I didn't include this in the first place, but that'd work fine for me! Actually I'm not trying to recover from lexer errors (at least for now); my problem was that semantic actions were being invoked with a match_() that had an unexpected prefix, which caused a panic.

Have you seen the reset_match() method (documented in the README, added in #18)? Currently the current match (the string returned by match_()) is only reset when you switch to the Init rule. If that is not what you need, you can call reset_match() in a semantic action to reset the current match. In your case, if you call reset_match() before switching to the state with the "\\u" $hex_digit $hex_digit $hex_digit $hex_digit =? ... rule, you will only get the "\u..." part in match_().
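For example (a minimal sketch; the rule names here are just for illustration):

rule Init {
    // Reset the current match before switching, so that match_() in
    // InString only covers input consumed after this point.
    '"' => |lexer| {
        lexer.reset_match();
        lexer.switch(LexerRule::InString)
    },
}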

As for error recovery, I have a question: I somehow assumed that the rule with the longest match is preferred, and that ties are broken by the order of appearance in the rule {} statement. Is this a correct assumption?

Correct.

I think the current interface is fine as long as there's a fixed precedence over rules, so that we can write a "fallback case". If a user wants to recover from errors, they can write a fallback case explicitly like

I haven't experimented with this idea myself, but it looks like a good way to handle errors.


MiSawa commented Mar 17, 2022

Have you seen the reset_match() method (documented in the README, added in #18)? Currently the current match (the string returned by match_()) is only reset when you switch to the Init rule. If that is not what you need, you can call reset_match() in a semantic action to reset the current match. In your case, if you call reset_match() before switching to the state with the "\\u" $hex_digit $hex_digit $hex_digit $hex_digit =? ... rule, you will only get the "\u..." part in match_().

Yes, I knew the function existed, but in my case the leftover was coming from InvalidToken errors, so there was no semantic action in which to call reset_match(). An example input would be \uxyzw\u1234. Since xyzw is invalid as hex characters, there was no rule handling it, and thus I got an InvalidToken error. But that doesn't stop lexing, and when it came to \u1234, match_() was \uxyzw\u1234, which made the action call u32::from_str_radix("xyzw\u1234", 16).unwrap().

In my case, I was able to resolve the issue by adding '\\' _ =? |lexer| lexer.return_(Err(...)) (or _ =? |lexer| lexer.return_(Err(..)) would have worked too), so that there are no match failures causing InvalidToken; instead the string fragment is consumed by manually returning an error with return_. And since I use return_ in all the other rules, which I believe consumes the match, match_() in this unicode-escape rule always returns a string that matches the regex. But with your fix, I don't need this hack(?) anymore!
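For reference, the workaround looked roughly like this (simplified and untested; EscapeError is a stand-in for my actual error type, and I'm assuming lexgen's type Error = ...; declaration for fallible actions):

use lexgen::lexer;

#[derive(Debug, Clone, PartialEq, Eq)]
struct EscapeError;

lexer! {
    Lexer -> u32;
    type Error = EscapeError;

    let hex_digit = ['0'-'9' 'a'-'f' 'A'-'F'];

    rule Init {
        // Thanks to the fallback below (and now to the fix in #49),
        // match_() here always starts at "\u", so slicing off the
        // first two bytes is safe.
        "\\u" $hex_digit $hex_digit $hex_digit $hex_digit =? |lexer| {
            let value = u32::from_str_radix(&lexer.match_()[2..], 16).unwrap();
            lexer.return_(Ok(value))
        },
        // The fallback: consume any other "\X" ourselves and return a
        // custom error, so the match never fails with InvalidToken.
        '\\' _ =? |lexer| lexer.return_(Err(EscapeError)),
    }
}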


osa1 commented Mar 17, 2022

Since xyzw is invalid as hex characters, there was no rule handling it, and thus I got an InvalidToken error. But that doesn't stop lexing, and when it came to \u1234, match_() was \uxyzw\u1234

Maybe we should reset the match on InvalidToken. Any thoughts on this? Other kinds of errors already reset the match, as they're raised in a semantic action.


osa1 commented Mar 17, 2022

Maybe we should reset the match on InvalidToken

Sorry for the confusion; this is already what we do in #49.

osa1 closed this as completed in #49 Mar 17, 2022
osa1 added a commit that referenced this issue Mar 17, 2022

MiSawa commented Mar 17, 2022

Maybe we should reset the match on InvalidToken. Any thoughts on this? Other kinds of errors already reset the match, as they're raised in a semantic action.

Yeah, that would be ideal for me!


MiSawa commented Mar 17, 2022

Thank you for the swift fix!!
