Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect regex #302

Closed
simensgreen opened this issue Apr 24, 2023 · 5 comments
Closed

Incorrect regex #302

simensgreen opened this issue Apr 24, 2023 · 5 comments

Comments

@simensgreen
Copy link

simensgreen commented Apr 24, 2023

#[derive(Debug, Logos)]
enum Token {
    #[regex(r"\(.*\)")]
    Unbrace
}

let mut lex = Token::lexer("(any text)"); // matches r"\(.*\)"

let result = lex.next();
assert!(matches!(result, Some(_)));     
assert!(matches!(result, Some(Ok(_)))); // fail

logos = "0.13.0"

@simensgreen
Copy link
Author

I have seen many issues about erroneous regular expressions. Why not use a ready-made, well-tested library such as regex?

@jeertmans
Copy link
Collaborator

Nice catch for this bug!

I think Logos does not directly use the regex crate for one important reason: performances.

If you look closely at how Logos is implemented, it uses regex-syntax (from which regex also depends on) and parses the regex into some intermediate representation. As Logos makes heavily use of optimized look-up tables, it has to "re-implement" a lot of the logic behind regex to make it fit into its model. That's at least how I understand Logos.

@maciejhirsz
Copy link
Owner

maciejhirsz commented Apr 25, 2023

This is an easy fix, just match for anything that's not a closing paren: r"\([^)]*\)"

This is also not a bug per se.

I have seen many issues about erroneous regular expressions. Why not use a ready-made, well-tested library such as regex?

So these are related questions. Logos never was, and never will be (see "Phasing out #[regex]" in last release notes) a fully featured regex engine.

Using the regex crate would absolutely kill the performance here. That's not throwing shade on regex, it's one of the fastest if not the fastest regex implementation on the planet. To use it as a backend for Logos however two things would happen:

  1. The whole compiling of the state machine would happen at runtime (it happens at compile time currently). There is some runtime cost to that, though it's amortized by only having to do it once per lexer, however...
  2. We would go from looking at each byte of input at most twice (and more often enough it manages with looking at any byte just once), to looking at least N times, where N is the number of token definitions, and at most N * X times, where X is some indefinite number of backtracks any given regex pattern can do. Even if the regex crate was so optimized that it only used one CPU cycle per byte when it looks at it, for a complex enough Lexer you are looking at an order of magnitude or higher performance loss.

(@jeertmans you replied just as I was typing this 🙃)

@maciejhirsz
Copy link
Owner

One last note I'll leave here: might be worth considering throwing a compile error on glob repeat (.*) pattern: it will always happily eat the entire input to the end, and I don't know a single scenario where that's actually a useful thing in a lexer.

@jeertmans
Copy link
Collaborator

I feel that it should also be documented somewhere @maciejhirsz, maybe in the future book or in the crate's docs, but I recognize that's not the first time someone falls into the .* trap ^^.

A compiler error might also help users!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants