-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement reset_match method #18
Conversation
343629d
to
93608ce
Compare
now I think about it, maybe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sahandevs for the PR!
This is not quite what we want to change/fix with #12, for a few reasons. First, when you do something like
rule ... {
$$whitespace,
...
}
In the semantic action for this match we don't call any of the LexerHandle
methods, we directly switch to the initial state of the current rule set:
Lines 564 to 566 in f69e94e
RuleRhs::None => quote!( | |
self.state = self.initial_state; | |
), |
Then the code for initial state of Init
rule set resets the match:
Lines 341 to 342 in f69e94e
self.current_match_start = char_idx; | |
self.current_match_end = char_idx; |
As long as this second code block persists the issue won't be fixed, because as described in #12, we want to remove the special case for resetting the current match in Init
rule set, which is implemented by the second code block above.
I think for this PR to work as expected you need a few more changes:
-
Common up
Lines 607 to 609 in f69e94e
#action_type_name::Continue => { self.state = self.initial_state; } Lines 564 to 566 in f69e94e
RuleRhs::None => quote!( self.state = self.initial_state; ),
So that they will call the same code generator for "continue" semantic action.
This is not strictly necessary, but it makes it clear that the rule
<regex>,
is the same as<regex> => |lexer| lexer.continue_()
.I think it's a good idea to do this in a separate PR as this is a good idea regardless of how to fix Confusing match reset behavior in initial state #12.
-
Remove the special case in
Init
rule set (second code snippet above). Just delete the code. -
We need to split "continue_" into two different methods, one for "continue and reset match", one for "continue extending the match". I feel like not resetting the match by default will be more intuitive. So I suggest keeping
continue_
method as-is, and adding a newcontinue_and_reset_match
or something like that. The seocnd method should updateself.state
ascontinue_
, but should also updateself.current_match_start
andself.current_match_end
as we do in the special case above.
Now, after all this, we need to decide how to desugar <regex>,
syntax. If we use continue_
then it will not skip the whitespace in Init
rule set. That's fine, but we may want to come up with a new syntax for resetting the match, to be used when skipping whitespace.
This requires some thought so I'd like to not rush that. For now, I think it makes sense to (after all of the above done) desugar <regex>,
in Init
as continue_and_reset_match
(or whatever we want to call that method), and desugar other <regex>,
s to continue_
. This won't fix #12 (the special case will be there), we will have everything in place to fix it once we decide how to fix (whether to invent a new syntax for "resetting skip", whether <regex>,
should reset the match or not).
How does this sound?
src/dfa/codegen.rs
Outdated
}, | ||
#action_type_name::SwitchAndContinue(rule_set) => { | ||
self.switch(rule_set); | ||
self.state = self.initial_state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switch
method already updates the current state
Line 240 in f69e94e
self.state = #state_idx |
switch
.
src/dfa/codegen.rs
Outdated
}, | ||
#action_type_name::SwitchAndContinue(rule_set) => { | ||
self.switch(rule_set); | ||
self.state = self.initial_state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
I've done this as a part of another PR, which is merged to |
93608ce
to
b6550ab
Compare
@osa1 Thank you for reviewing my PR. sorry I'm answering this late, I was trying out your idea by implementing the ModSecurity's lexer (here) I'll change my lexer based on your changes in main branch later (to test them) and close this PR if it's ok with you. |
No worries @sahandevs. Feel free to ping if you work on or use lexgen and need anything. |
b6550ab
to
ddbf89f
Compare
after re-reading your comment and #12, you are right this PR doesn't fix that issue, but this PR fixes my own issue. I updated the PR and added a test for it. This is my specific problem if you're curious. it doesn't add anything new:In my lexer (here I need to reset the match in rules that aren't one of the reasons that I need this feature is for string formats in modsec. something like this syntax in js: const a = `test ${1 +1}` in modsec is like this:
as you can see escaping rule differs in every context so before entering the rule FormatString we need to say what is escaping chars and which rule we should follow after these chars appear and the user didn't escape them (here). for example in I implemented this escaping logic here by waiting for a albeit this is not the only use case and as you can see in the full code I used it almost everywhere! |
OK, I think providing a way to reset the match makes sense. Instead of having two Any opinions on this? |
ddbf89f
to
3d07ee2
Compare
I also think something like |
Good point. I was thinking about eliminating the handle structs completely and passing a reference to the lexer struct to semantic actions to simplify things. This seems like another motivation to do that refactoring. If we do that then all of the lexer state will be available in methods like Let me try to do that and I'll ping you. |
This has a few benefits: - Not creating a handle struct in each accepting state reduces code size: -14.5% in Lua 5.1 lexer (8,953 to 7,649) - Having access to the lexer struct makes it possible to implement semantic action methods like `reset_match` (#18) - Semantic action functions no longer need `mut` modifier in the lexer handle, as the argument is now `&mut`. Lexer struct fields now have `__` suffix to avoid accidentally using lexer fields in semantic actions.
@sahandevs if you merge main (or rebase) you should be able to implement Btw, feel free to merge main instead of rebasing and squashing your branch for every update. I'm using GitHub's "squash and merge" feature so it will be merged as one commit. |
@osa1 Done! It's way better now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks good.
Interesting test case. I think it could be simplified further by removing Rule1
and moving rules in Rule1
to Init
(remove existing stuff in Init
first). WDYT?
tests/tests.rs
Outdated
if lexer.state().enable_reset_match { | ||
lexer.reset_match(); | ||
} | ||
lexer.switch(LexerRule::Rule1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not lexer.continue_()
here?
tests/tests.rs
Outdated
lexer.reset_match(); | ||
let enable_reset_match = &mut lexer.state().enable_reset_match; | ||
*enable_reset_match = !*enable_reset_match; | ||
lexer.switch(LexerRule::Rule1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
let s = lexer.match_(); | ||
lexer.return_(s) | ||
}, | ||
$ = "<>", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a space above this rule for consistency
We need a second rule because the Init rule resets match on every rule Init {
'c' => |lexer| {
if lexer.state().enable_reset_match {
lexer.reset_match();
}
lexer.continue_()
},
"!" => |lexer| {
lexer.reset_match();
let enable_reset_match = &mut lexer.state().enable_reset_match;
*enable_reset_match = !*enable_reset_match;
lexer.continue_()
},
['d' 'e']+ => |lexer| {
let s = lexer.match_();
lexer.return_(s)
},
$ = "<>",
} this fails: let mut lexer = Lexer::new("ccdeed!ccdeed");
assert_eq!(next(&mut lexer), Some(Ok("ccdeed"))); // ❌ left == "deed"
assert_eq!(next(&mut lexer), Some(Ok("deed"))); // ✔️
assert_eq!(next(&mut lexer), Some(Ok("<>"))); // ✔️ |
Makes sense, thanks for the explanation and the PR! |
No description provided.