
Allow initializing lexers with a char iterator #42

Merged: 6 commits merged into main from from_iter_2 on Feb 7, 2022

Conversation

@osa1 (Owner) commented Feb 7, 2022

This fixes #41 in an almost backwards compatible way. Generated lexers now have
an extra constructor:

impl<I: Iterator<Item = char> + Clone> Lexer<'static, I> {
    fn new_from_iter(iter: I) -> Self {
        Lexer(::lexgen_util::Lexer::new_from_iter(iter))
    }
}

The API of the generated lexers is exactly the same. However, if a lexer is
constructed with new_from_iter instead of new or new_with_state, the
match_ method will panic at runtime. This is because in lexers constructed
with new_from_iter we don't have the input string, so we cannot return a slice
of it. Use match_loc instead to get the start and end locations of the
current match.

The only breaking change is that the generated types now have one more generic
argument, for the iterator type. So for a lexer like:

lexer! {
    MyLexer -> MyToken;
    ...
}

Instead of

struct MyLexer<'input>(...);

we now generate

struct MyLexer<'input, I: Iterator<Item = char> + Clone>(...);

So any code that refers to the lexer type will break.

Other than this, the changes should be backwards compatible.

Fixes #41

This should allow adding generic parameters (not lifetimes) to the
semantic action functions
@osa1 (Owner, Author) commented Feb 7, 2022

Performance seems to regress a little bit. My Lua lexer benchmark reports +3% compared to the main branch. I'm guessing the reason is the cloning for __last_match, as that should be the only difference in the generated code.

@osa1 (Owner, Author) commented Feb 7, 2022

I think we should be able to optimize __last_match updates in code like this:

'>' => {
    self.0.set_accepting_state(Lexer_ACTION_13);          // 2
    match self.0.next() {
        None => {
            self.0.__done = true;
            match self.0.backtrack() {                    // 6
                ...
            }
        }
        Some(char) => match char {
            '>' => {
                self.0.reset_accepting_state();           // 12
                match Lexer_ACTION_31(self) {
                    ...
                }
            }
            '=' => {
                self.0.reset_accepting_state();           // 18
                match Lexer_ACTION_11(self) {
                    ...
                }
            }
            _ => match self.0.backtrack() {               // 23
                ...
            },
        },
    }
}

In the code above we set __last_match on line 2. However, in the continuation we either use the value we set directly, or reset it:

  • Lines 6 and 23 use Lexer_ACTION_13 indirectly, so there is no need to set __last_match (which clones the iterator).
  • Lines 12 and 18 ignore __last_match.
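Concretely, the transformation these two observations suggest might look like the sketch below. This is hand-written pseudocode, not actual lexgen output; in particular, the `_` branch would still need a way to un-consume the one lookahead character (e.g. a one-character peek buffer) rather than restoring a cloned iterator:

```rust
'>' => {
    // No set_accepting_state here (and so no iterator clone): every
    // continuation either resets __last_match or is statically known
    // to match Lexer_ACTION_13.
    match self.0.next() {
        None => {
            self.0.__done = true;
            match Lexer_ACTION_13(self) {                 // was: backtrack()
                ...
            }
        }
        Some(char) => match char {
            '>' => match Lexer_ACTION_31(self) { ... },   // reset elided too
            '=' => match Lexer_ACTION_11(self) { ... },
            _ => match Lexer_ACTION_13(self) { ... },     // was: backtrack()
        },
    }
}
```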

@osa1 (Owner, Author) commented Feb 7, 2022

The perf issue above is reported as #43.

@osa1 osa1 merged commit ce0c916 into main Feb 7, 2022
@osa1 osa1 deleted the from_iter_2 branch February 7, 2022 18:13