Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement reset_match method #18

Merged
merged 4 commits into from
Oct 22, 2021
Merged

Conversation

sahandevs
Copy link
Contributor

@sahandevs sahandevs commented Oct 17, 2021

No description provided.

@sahandevs
Copy link
Contributor Author

now I think about it, maybe switch_and_reset is a better name for it(?)

Copy link
Owner

@osa1 osa1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sahandevs for the PR!

This is not quite what we want to change/fix with #12, for a few reasons. First, when you do something like

rule ... {
    $$whitespace,
    ...
}

In the semantic action for this match we don't call any of the LexerHandle methods, we directly switch to the initial state of the current rule set:

lexgen/src/dfa/codegen.rs

Lines 564 to 566 in f69e94e

RuleRhs::None => quote!(
self.state = self.initial_state;
),

Then the code for initial state of Init rule set resets the match:

lexgen/src/dfa/codegen.rs

Lines 341 to 342 in f69e94e

self.current_match_start = char_idx;
self.current_match_end = char_idx;

As long as this second code block persists the issue won't be fixed, because as described in #12, we want to remove the special case for resetting the current match in Init rule set, which is implemented by the second code block above.

I think for this PR to work as expected you need a few more changes:

  • Common up

    1. lexgen/src/dfa/codegen.rs

      Lines 607 to 609 in f69e94e

      #action_type_name::Continue => {
      self.state = self.initial_state;
      }
    2. lexgen/src/dfa/codegen.rs

      Lines 564 to 566 in f69e94e

      RuleRhs::None => quote!(
      self.state = self.initial_state;
      ),

    So that they will call the same code generator for "continue" semantic action.

    This is not strictly necessary, but it makes it clear that the rule <regex>, is the same as <regex> => |lexer| lexer.continue_().

    I think it's a good idea to do this in a separate PR as this is a good idea regardless of how to fix Confusing match reset behavior in initial state #12.

  • Remove the special case in Init rule set (second code snippet above). Just delete the code.

  • We need to split "continue_" into two different methods, one for "continue and reset match", one for "continue extending the match". I feel like not resetting the match by default will be more intuitive. So I suggest keeping continue_ method as-is, and adding a new continue_and_reset_match or something like that. The seocnd method should update self.state as continue_, but should also update self.current_match_start and self.current_match_end as we do in the special case above.

Now, after all this, we need to decide how to desugar <regex>, syntax. If we use continue_ then it will not skip the whitespace in Init rule set. That's fine, but we may want to come up with a new syntax for resetting the match, to be used when skipping whitespace.

This requires some thought so I'd like to not rush that. For now, I think it makes sense to (after all of the above done) desugar <regex>, in Init as continue_and_reset_match (or whatever we want to call that method), and desugar other <regex>,s to continue_. This won't fix #12 (the special case will be there), we will have everything in place to fix it once we decide how to fix (whether to invent a new syntax for "resetting skip", whether <regex>, should reset the match or not).

How does this sound?

},
#action_type_name::SwitchAndContinue(rule_set) => {
self.switch(rule_set);
self.state = self.initial_state;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch method already updates the current state

self.state = #state_idx
so you shouldn't need to update it right after calling switch.

},
#action_type_name::SwitchAndContinue(rule_set) => {
self.switch(rule_set);
self.state = self.initial_state;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@osa1
Copy link
Owner

osa1 commented Oct 19, 2021

Common up ... So that they will call the same code generator for "continue" semantic action.

I've done this as a part of another PR, which is merged to main.

@sahandevs
Copy link
Contributor Author

sahandevs commented Oct 19, 2021

@osa1 Thank you for reviewing my PR. sorry I'm answering this late, I was trying out your idea by implementing the ModSecurity's lexer (here)

I'll change my lexer based on your changes in main branch later (to test them) and close this PR if it's ok with you.

@osa1
Copy link
Owner

osa1 commented Oct 19, 2021

No worries @sahandevs. Feel free to ping if you work on or use lexgen and need anything.

@sahandevs
Copy link
Contributor Author

after re-reading your comment and #12, you are right this PR doesn't fix that issue, but this PR fixes my own issue. I updated the PR and added a test for it.

This is my specific problem if you're curious. it doesn't add anything new:

In my lexer (here I need to reset the match in rules that aren't Init. I took an approach to write my lexer by creating many mini-lexers (rules) to handle special cases (modsec is very weird! whitespace, escaping and even name of a variable affects the lexing (XML:/<xpath>, OTHERS:/<regex pattern>/. for XML we need one forward slash but for others we need two.)

one of the reasons that I need this feature is for string formats in modsec. something like this syntax in js:

const a = `test ${1 +1}`

in modsec is like this:

# var form:
VAR:/test %{a} test/ # escapes with `\/`
# operator form:
"@eq<SINGLE_SPACE> test %{a}" # escapes with `\"`
# action form with `'`
"setvar:'a=test %{a}' # escapes with `\"` and `\'`
# action form without `'`
"setvar:'a=test %{a}' # escapes with `\"` and `\,`

as you can see escaping rule differs in every context so before entering the rule FormatString we need to say what is escaping chars and which rule we should follow after these chars appear and the user didn't escape them (here). for example in setvar:'a=test %{a}' after we see a , we should follow rule ActionsInner and if we see a " we should follow rule Init

I implemented this escaping logic here by waiting for a \\ and peek the next char to see if it's in the escape list (if not we just throw an error) then we enable a flag and forget about the \\ match (by calling the switch_and_reset_match). after that, I have a _ pattern here to decide whether to exit the rule or just return a token based on the flag.

albeit this is not the only use case and as you can see in the full code I used it almost everywhere!

@osa1
Copy link
Owner

osa1 commented Oct 20, 2021

OK, I think providing a way to reset the match makes sense. Instead of having two switch methods, I think I would find it more intuitive to have a reset_match method that I can call before switch to clear the current match. I could also use it to reset the match without switching. I could use switch_and_reset_match with the current rule set to reset the match and continue, but it seems less intuitive to me.

Any opinions on this?

tests/tests.rs Outdated Show resolved Hide resolved
@sahandevs
Copy link
Contributor Author

I also think something like reset_match is better, but I don't think we have access to the current_match_start in impl<'lexer, 'input> #handle_type_name<'lexer, 'input> {....

@osa1
Copy link
Owner

osa1 commented Oct 20, 2021

Good point. I was thinking about eliminating the handle structs completely and passing a reference to the lexer struct to semantic actions to simplify things. This seems like another motivation to do that refactoring. If we do that then all of the lexer state will be available in methods like switch or reset_match.

Let me try to do that and I'll ping you.

osa1 added a commit that referenced this pull request Oct 21, 2021
This has a few benefits:

- Not creating a handle struct in each accepting state reduces code
  size: -14.5% in Lua 5.1 lexer (8,953 to 7,649)

- Having access to the lexer struct makes it possible to implement
  semantic action methods like `reset_match` (#18)

- Semantic action functions no longer need `mut` modifier in the lexer
  handle, as the argument is now `&mut`.

Lexer struct fields now have `__` suffix to avoid accidentally using
lexer fields in semantic actions.
@osa1
Copy link
Owner

osa1 commented Oct 21, 2021

@sahandevs if you merge main (or rebase) you should be able to implement reset_match method now.

Btw, feel free to merge main instead of rebasing and squashing your branch for every update. I'm using GitHub's "squash and merge" feature so it will be merged as one commit.

@sahandevs
Copy link
Contributor Author

@osa1 Done! It's way better now!

@osa1 osa1 changed the title impl switch_and_continue Implement reset_match method Oct 22, 2021
Copy link
Owner

@osa1 osa1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks good.

Interesting test case. I think it could be simplified further by removing Rule1 and moving rules in Rule1 to Init (remove existing stuff in Init first). WDYT?

tests/tests.rs Outdated
if lexer.state().enable_reset_match {
lexer.reset_match();
}
lexer.switch(LexerRule::Rule1)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not lexer.continue_() here?

tests/tests.rs Outdated
lexer.reset_match();
let enable_reset_match = &mut lexer.state().enable_reset_match;
*enable_reset_match = !*enable_reset_match;
lexer.switch(LexerRule::Rule1)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

let s = lexer.match_();
lexer.return_(s)
},
$ = "<>",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a space above this rule for consistency

@sahandevs
Copy link
Contributor Author

We need a second rule because the Init rule resets match on every switch or continue_ call. If I do this:

rule Init {
            'c' => |lexer| {
                if lexer.state().enable_reset_match {
                    lexer.reset_match();
                }
                lexer.continue_()
            },

            "!" => |lexer| {
                lexer.reset_match();
                let enable_reset_match = &mut lexer.state().enable_reset_match;
                *enable_reset_match = !*enable_reset_match;
                lexer.continue_()
            },

            ['d' 'e']+ => |lexer| {
                let s = lexer.match_();
                lexer.return_(s)
            },

            $ = "<>",
}

this fails:

let mut lexer = Lexer::new("ccdeed!ccdeed");
assert_eq!(next(&mut lexer), Some(Ok("ccdeed"))); // ❌ left == "deed"
assert_eq!(next(&mut lexer), Some(Ok("deed"))); // ✔️
assert_eq!(next(&mut lexer), Some(Ok("<>"))); // ✔️

@osa1 osa1 merged commit f4e4e8b into osa1:main Oct 22, 2021
@osa1
Copy link
Owner

osa1 commented Oct 22, 2021

Makes sense, thanks for the explanation and the PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants