switch_and_return does not consume the returned token? #11
Hi @cr1901. Thanks for reporting this, and I'm glad you find lexgen useful. I think this is a bug, yes. lexgen lexers internally maintain two indices: one for the start of the current match and one for its end. When switching states we don't reset the start index, so the character(s) you just matched in your current state will be included in the current match in the next state. You can see an example of this in my Lua lexer (line 262 in 700700d).
In this code, lexing a "long string" is done using 3 states. However, when we return a token, we should reset the current match, because the match is what gets returned as the token. So in your code, since you use `switch_and_return`, the match should have been reset before the next state starts matching. I think the reason why I didn't catch this bug so far is that there's another bug in the Lua lexer that hides this one. Here's a regression test:

```rust
#[test]
fn return_should_reset_match() {
    lexer! {
        Lexer -> &'input str;

        rule Init {
            "aaa" => |lexer| {
                let match_ = lexer.match_();
                lexer.switch_and_return(LexerRule::State1, match_)
            },
        }

        rule State1 {
            "bbb" => |lexer| {
                let match_ = lexer.match_();
                lexer.switch_and_return(LexerRule::Init, match_)
            },
        }
    }

    let mut lexer = Lexer::new("aaabbb");
    assert_eq!(ignore_pos(lexer.next()), Some(Ok("aaa")));
    assert_eq!(ignore_pos(lexer.next()), Some(Ok("bbb"))); // assertion currently fails
    assert_eq!(ignore_pos(lexer.next()), None);
}
```

I'll fix this.
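The regression test relies on an `ignore_pos` helper that is not shown in the thread. A plausible sketch, assuming the lexer yields `(start, token, end)` triples (the helper name and signature here are taken from the test, but the implementation is my reconstruction):

```rust
// Hypothetical helper assumed by the regression test above: drop the
// location information from a lexer item, keeping only the token or error.
fn ignore_pos<L, T, E>(next: Option<Result<(L, T, L), E>>) -> Option<Result<T, E>> {
    next.map(|res| res.map(|(_, tok, _)| tok))
}

fn main() {
    // A lexer item carrying (start, token, end) positions:
    let item: Option<Result<(usize, &str, usize), ()>> = Some(Ok((0, "aaa", 3)));
    assert_eq!(ignore_pos(item), Some(Ok("aaa")));
    assert_eq!(ignore_pos::<usize, &str, ()>(None), None);
}
```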
Well, using
I agree that your use of the workaround is correct. I didn't fix the issue with resetting the current match when we switch to the initial rule, though. That problem is a bit tricky, because apparently we rely on it to skip whitespace in the initial rule without including the skipped whitespace in the current match. I'll have to think about this more.
I also released 0.5.0 on crates.io, with the fix.
So to summarize your fix:
Do I understand this correctly? Perhaps this behavior could be documented.
Yes, you got it. I'm not sure whether we want to document this behavior of `switch_and_return`.
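The two-index scheme and the fix can be sketched with a minimal hand-rolled model. All names here are hypothetical and lexgen's actual internals may differ; this only illustrates why switching without a reset leaks the previous match, while resetting on return does not:

```rust
// Minimal model of a lexer keeping two indices into the input.
struct Matcher<'a> {
    input: &'a str,
    match_start: usize, // start of the current match
    pos: usize,         // end of the current match / next byte to read
}

impl<'a> Matcher<'a> {
    fn new(input: &'a str) -> Self {
        Matcher { input, match_start: 0, pos: 0 }
    }

    // Advance the end index as bytes are matched.
    fn eat(&mut self, n: usize) {
        self.pos += n;
    }

    fn match_(&self) -> &'a str {
        &self.input[self.match_start..self.pos]
    }

    // Switching rules leaves match_start untouched, so bytes matched in
    // the previous rule stay in the current match.
    fn switch(&mut self) {}

    // The fix: returning a token resets the start index, so the next
    // match starts fresh.
    fn return_token(&mut self) -> &'a str {
        let tok = self.match_();
        self.match_start = self.pos;
        tok
    }
}

fn main() {
    // Switch without returning: the "=" leaks into the next match.
    let mut m = Matcher::new("=abc");
    m.eat(1);   // match "=" in an Init-like rule
    m.switch(); // switch rules: start index kept
    m.eat(3);   // match "abc" in a NoteBody-like rule
    assert_eq!(m.match_(), "=abc");

    // Return-then-switch with the fix: the next match starts clean.
    let mut m = Matcher::new("=abc");
    m.eat(1);
    assert_eq!(m.return_token(), "=");
    m.eat(3);
    assert_eq!(m.match_(), "abc");
}
```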
First, I want to thank you for writing lexgen. I find it incredibly useful when paired with LALRPOP (and would at some point like to contribute some integration to it, though I'm not sure exactly what that should look like right now).

I have a lexer with two rules: `Init` and `NoteBody`. The purpose of `NoteBody` is to consume characters until a newline is found, and the change between rules happens when an `=` (`Tok::Equal`) is detected and certain conditions (`in_note_name` state) are met.

When I've detected a newline in the `NoteBody` rule by `peek()`ing, I want to return all the characters I've found after the equals sign and up to (not including) the newline. However, it turns out that this will return the `=` token prepended to the text I actually want to match. In order to return just the text after the `=` token, I need to strip the first character from the match.

My question is: Is this a bug, or intended behavior? If the latter, is my workaround to strip the first character from the match correct, or can you suggest a method for removing the `=` sign before the transition from the `Init` to `NoteBody` rule?
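The workaround described above (stripping the leading `=` from the match before returning it) can be sketched in isolation. The function name is hypothetical, not from the issue code:

```rust
// Hypothetical sketch of the workaround: strip a leading '=' from the
// matched text before returning it as the token.
fn strip_leading_equals(match_: &str) -> &str {
    match_.strip_prefix('=').unwrap_or(match_)
}

fn main() {
    assert_eq!(strip_leading_equals("=note body"), "note body");
    assert_eq!(strip_leading_equals("no equals"), "no equals");
}
```

With the fix released in 0.5.0 (mentioned above), this stripping should no longer be necessary, since `switch_and_return` now resets the match.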