Implement reset_match method #18

sahandevs · 2021-10-17T15:38:09Z

No description provided.

sahandevs · 2021-10-17T17:21:39Z

now I think about it, maybe switch_and_reset is a better name for it(?)

osa1

Thanks @sahandevs for the PR!

This is not quite what we want to change/fix with #12, for a few reasons. First, when you do something like

rule ... {
    $$whitespace,
    ...
}

In the semantic action for this match we don't call any of the LexerHandle methods, we directly switch to the initial state of the current rule set:

lexgen/src/dfa/codegen.rs

Lines 564 to 566 in f69e94e

    
           RuleRhs::None => quote!( 
        
               self.state = self.initial_state; 
        
           ),

Then the code for initial state of Init rule set resets the match:

lexgen/src/dfa/codegen.rs

Lines 341 to 342 in f69e94e

    
           self.current_match_start = char_idx; 
        
           self.current_match_end = char_idx;

As long as this second code block persists the issue won't be fixed, because as described in #12, we want to remove the special case for resetting the current match in Init rule set, which is implemented by the second code block above.

I think for this PR to work as expected you need a few more changes:

Common up

lexgen/src/dfa/codegen.rs

Lines 607 to 609 in f69e94e

    
           #action_type_name::Continue => { 
        
               self.state = self.initial_state; 
        
           }

lexgen/src/dfa/codegen.rs

Lines 564 to 566 in f69e94e

RuleRhs::None => quote!(

self.state = self.initial_state;

),

So that they will call the same code generator for "continue" semantic action.

This is not strictly necessary, but it makes it clear that the rule <regex>, is the same as <regex> => |lexer| lexer.continue_().

I think it's a good idea to do this in a separate PR as this is a good idea regardless of how to fix Confusing match reset behavior in initial state #12.

Remove the special case in Init rule set (second code snippet above). Just delete the code.
We need to split "continue_" into two different methods, one for "continue and reset match", one for "continue extending the match". I feel like not resetting the match by default will be more intuitive. So I suggest keeping continue_ method as-is, and adding a new continue_and_reset_match or something like that. The seocnd method should update self.state as continue_, but should also update self.current_match_start and self.current_match_end as we do in the special case above.

Now, after all this, we need to decide how to desugar <regex>, syntax. If we use continue_ then it will not skip the whitespace in Init rule set. That's fine, but we may want to come up with a new syntax for resetting the match, to be used when skipping whitespace.

This requires some thought so I'd like to not rush that. For now, I think it makes sense to (after all of the above done) desugar <regex>, in Init as continue_and_reset_match (or whatever we want to call that method), and desugar other <regex>,s to continue_. This won't fix #12 (the special case will be there), we will have everything in place to fix it once we decide how to fix (whether to invent a new syntax for "resetting skip", whether <regex>, should reset the match or not).

How does this sound?

osa1 · 2021-10-18T07:25:23Z

src/dfa/codegen.rs

+                    },
+                    #action_type_name::SwitchAndContinue(rule_set) => {
+                        self.switch(rule_set);
+                        self.state = self.initial_state;


switch method already updates the current state

lexgen/src/dfa/codegen.rs

Line 240 in f69e94e

self.state = #state_idx

so you shouldn't need to update it right after calling switch.

osa1 · 2021-10-18T07:25:32Z

src/dfa/codegen.rs

+                    },
+                    #action_type_name::SwitchAndContinue(rule_set) => {
+                        self.switch(rule_set);
+                        self.state = self.initial_state;


osa1 · 2021-10-19T13:16:23Z

Common up ... So that they will call the same code generator for "continue" semantic action.

I've done this as a part of another PR, which is merged to main.

sahandevs · 2021-10-19T14:13:34Z

@osa1 Thank you for reviewing my PR. sorry I'm answering this late, I was trying out your idea by implementing the ModSecurity's lexer (here)

I'll change my lexer based on your changes in main branch later (to test them) and close this PR if it's ok with you.

osa1 · 2021-10-19T14:16:54Z

No worries @sahandevs. Feel free to ping if you work on or use lexgen and need anything.

sahandevs · 2021-10-20T13:12:24Z

after re-reading your comment and #12, you are right this PR doesn't fix that issue, but this PR fixes my own issue. I updated the PR and added a test for it.

This is my specific problem if you're curious. it doesn't add anything new:

In my lexer (here I need to reset the match in rules that aren't Init. I took an approach to write my lexer by creating many mini-lexers (rules) to handle special cases (modsec is very weird! whitespace, escaping and even name of a variable affects the lexing (XML:/<xpath>, OTHERS:/<regex pattern>/. for XML we need one forward slash but for others we need two.)

one of the reasons that I need this feature is for string formats in modsec. something like this syntax in js:

const a = `test ${1 +1}`

in modsec is like this:

# var form:
VAR:/test %{a} test/ # escapes with `\/`
# operator form:
"@eq<SINGLE_SPACE> test %{a}" # escapes with `\"`
# action form with `'`
"setvar:'a=test %{a}' # escapes with `\"` and `\'`
# action form without `'`
"setvar:'a=test %{a}' # escapes with `\"` and `\,`

as you can see escaping rule differs in every context so before entering the rule FormatString we need to say what is escaping chars and which rule we should follow after these chars appear and the user didn't escape them (here). for example in setvar:'a=test %{a}' after we see a , we should follow rule ActionsInner and if we see a " we should follow rule Init

I implemented this escaping logic here by waiting for a \\ and peek the next char to see if it's in the escape list (if not we just throw an error) then we enable a flag and forget about the \\ match (by calling the switch_and_reset_match). after that, I have a _ pattern here to decide whether to exit the rule or just return a token based on the flag.

albeit this is not the only use case and as you can see in the full code I used it almost everywhere!

osa1 · 2021-10-20T13:51:38Z

OK, I think providing a way to reset the match makes sense. Instead of having two switch methods, I think I would find it more intuitive to have a reset_match method that I can call before switch to clear the current match. I could also use it to reset the match without switching. I could use switch_and_reset_match with the current rule set to reset the match and continue, but it seems less intuitive to me.

Any opinions on this?

tests/tests.rs

sahandevs · 2021-10-20T14:01:08Z

I also think something like reset_match is better, but I don't think we have access to the current_match_start in impl<'lexer, 'input> #handle_type_name<'lexer, 'input> {....

osa1 · 2021-10-20T14:05:44Z

Good point. I was thinking about eliminating the handle structs completely and passing a reference to the lexer struct to semantic actions to simplify things. This seems like another motivation to do that refactoring. If we do that then all of the lexer state will be available in methods like switch or reset_match.

Let me try to do that and I'll ping you.

This has a few benefits: - Not creating a handle struct in each accepting state reduces code size: -14.5% in Lua 5.1 lexer (8,953 to 7,649) - Having access to the lexer struct makes it possible to implement semantic action methods like `reset_match` (#18) - Semantic action functions no longer need `mut` modifier in the lexer handle, as the argument is now `&mut`. Lexer struct fields now have `__` suffix to avoid accidentally using lexer fields in semantic actions.

osa1 · 2021-10-21T07:18:49Z

@sahandevs if you merge main (or rebase) you should be able to implement reset_match method now.

Btw, feel free to merge main instead of rebasing and squashing your branch for every update. I'm using GitHub's "squash and merge" feature so it will be merged as one commit.

…inue`

sahandevs · 2021-10-22T07:11:30Z

@osa1 Done! It's way better now!

osa1

Thanks, this looks good.

Interesting test case. I think it could be simplified further by removing Rule1 and moving rules in Rule1 to Init (remove existing stuff in Init first). WDYT?

osa1 · 2021-10-22T08:02:46Z

tests/tests.rs

+                if lexer.state().enable_reset_match {
+                    lexer.reset_match();
+                }
+                lexer.switch(LexerRule::Rule1)


Why not lexer.continue_() here?

osa1 · 2021-10-22T08:02:51Z

tests/tests.rs

+                lexer.reset_match();
+                let enable_reset_match = &mut lexer.state().enable_reset_match;
+                *enable_reset_match = !*enable_reset_match;
+                lexer.switch(LexerRule::Rule1)


osa1 · 2021-10-22T08:04:31Z

tests/tests.rs

+                let s = lexer.match_();
+                lexer.return_(s)
+            },
+            $ = "<>",


Add a space above this rule for consistency

sahandevs · 2021-10-22T08:21:15Z

We need a second rule because the Init rule resets match on every switch or continue_ call. If I do this:

rule Init {
            'c' => |lexer| {
                if lexer.state().enable_reset_match {
                    lexer.reset_match();
                }
                lexer.continue_()
            },

            "!" => |lexer| {
                lexer.reset_match();
                let enable_reset_match = &mut lexer.state().enable_reset_match;
                *enable_reset_match = !*enable_reset_match;
                lexer.continue_()
            },

            ['d' 'e']+ => |lexer| {
                let s = lexer.match_();
                lexer.return_(s)
            },

            $ = "<>",
}

this fails:

let mut lexer = Lexer::new("ccdeed!ccdeed");
assert_eq!(next(&mut lexer), Some(Ok("ccdeed"))); // ❌ left == "deed"
assert_eq!(next(&mut lexer), Some(Ok("deed"))); // ✔️
assert_eq!(next(&mut lexer), Some(Ok("<>"))); // ✔️

osa1 · 2021-10-22T08:27:08Z

Makes sense, thanks for the explanation and the PR!

sahandevs force-pushed the impl-`SwitchAndContinue` branch from 343629d to 93608ce Compare October 17, 2021 15:48

osa1 reviewed Oct 18, 2021

View reviewed changes

sahandevs force-pushed the impl-`SwitchAndContinue` branch from 93608ce to b6550ab Compare October 19, 2021 13:33

sahandevs force-pushed the impl-`SwitchAndContinue` branch from b6550ab to ddbf89f Compare October 20, 2021 13:08

osa1 reviewed Oct 20, 2021

View reviewed changes

tests/tests.rs Outdated Show resolved Hide resolved

impl switch_and_reset_match

3d07ee2

sahandevs force-pushed the impl-`SwitchAndContinue` branch from ddbf89f to 3d07ee2 Compare October 20, 2021 13:58

sahandevs added 2 commits October 22, 2021 10:31

Merge remote-tracking branch 'upstream/main' into impl-`SwitchAndCont…

a9ebc36

…inue`

implement reset_match

4b2756e

osa1 changed the title ~~impl switch_and_continue~~ Implement reset_match method Oct 22, 2021

osa1 reviewed Oct 22, 2021

View reviewed changes

cleanup test

c5648da

osa1 merged commit f4e4e8b into osa1:main Oct 22, 2021

osa1 mentioned this pull request Mar 16, 2022

Invalid token doesn't consume input from match_ in non-initial rule #48

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement reset_match method #18

Implement reset_match method #18

sahandevs commented Oct 17, 2021 •

edited by osa1

Loading

sahandevs commented Oct 17, 2021

osa1 left a comment •

edited

Loading

osa1 Oct 18, 2021

osa1 Oct 18, 2021

osa1 commented Oct 19, 2021 •

edited

Loading

sahandevs commented Oct 19, 2021 •

edited

Loading

osa1 commented Oct 19, 2021

sahandevs commented Oct 20, 2021

osa1 commented Oct 20, 2021

sahandevs commented Oct 20, 2021

osa1 commented Oct 20, 2021

osa1 commented Oct 21, 2021

sahandevs commented Oct 22, 2021

osa1 left a comment

osa1 Oct 22, 2021

osa1 Oct 22, 2021

osa1 Oct 22, 2021

sahandevs commented Oct 22, 2021

osa1 commented Oct 22, 2021

	self.current_match_start = char_idx;
	self.current_match_end = char_idx;

	#action_type_name::Continue => {
	self.state = self.initial_state;
	}

Implement reset_match method #18

Implement reset_match method #18

Conversation

sahandevs commented Oct 17, 2021 • edited by osa1 Loading

sahandevs commented Oct 17, 2021

osa1 left a comment • edited Loading

Choose a reason for hiding this comment

osa1 Oct 18, 2021

Choose a reason for hiding this comment

osa1 Oct 18, 2021

Choose a reason for hiding this comment

osa1 commented Oct 19, 2021 • edited Loading

sahandevs commented Oct 19, 2021 • edited Loading

osa1 commented Oct 19, 2021

sahandevs commented Oct 20, 2021

This is my specific problem if you're curious. it doesn't add anything new:

osa1 commented Oct 20, 2021

sahandevs commented Oct 20, 2021

osa1 commented Oct 20, 2021

osa1 commented Oct 21, 2021

sahandevs commented Oct 22, 2021

osa1 left a comment

Choose a reason for hiding this comment

osa1 Oct 22, 2021

Choose a reason for hiding this comment

osa1 Oct 22, 2021

Choose a reason for hiding this comment

osa1 Oct 22, 2021

Choose a reason for hiding this comment

sahandevs commented Oct 22, 2021

osa1 commented Oct 22, 2021

sahandevs commented Oct 17, 2021 •

edited by osa1

Loading

osa1 left a comment •

edited

Loading

osa1 commented Oct 19, 2021 •

edited

Loading

sahandevs commented Oct 19, 2021 •

edited

Loading