Regex that works with the regex crate but not on a logos token #220

Open
jawadcode opened this issue Aug 17, 2021 · 3 comments

@jawadcode

Firstly, here is the output of cargo tree:

regextest v0.1.0 (<$HOME>/Desktop/regextest)
├── logos v0.12.0
│   └── logos-derive v0.12.0 (proc-macro)
│       ├── beef v0.5.1
│       ├── fnv v1.0.7
│       ├── proc-macro2 v1.0.28
│       │   └── unicode-xid v0.2.2
│       ├── quote v1.0.9
│       │   └── proc-macro2 v1.0.28 (*)
│       ├── regex-syntax v0.6.25
│       ├── syn v1.0.74
│       │   ├── proc-macro2 v1.0.28 (*)
│       │   ├── quote v1.0.9 (*)
│       │   └── unicode-xid v0.2.2
│       └── utf8-ranges v1.0.4
└── regex v1.5.4
    ├── aho-corasick v0.7.18
    │   └── memchr v2.4.0
    ├── memchr v2.4.0
    └── regex-syntax v0.6.25

And here is code that demonstrates the issue:

use logos::Logos;
use regex::Regex;

#[derive(Logos, Clone, Debug, PartialEq)]
pub enum LogosToken {
    #[regex(r"(?m)\(\*([^*]|\*+[^*)])*\*+\)")]
    Comment,

    #[error]
    Error,
}

fn main() {
    let test = "(* hello world *)";
    let r = Regex::new(r"(?m)\(\*([^*]|\*+[^*)])*\*+\)").unwrap();
    assert!(r.is_match(test));

    let mut lexer = LogosToken::lexer(test);
    assert!(matches!(lexer.next(), Some(LogosToken::Comment)));
}

As you can see, both checks use the same regex and the same test string, yet the first assert passes and the second panics. I'm not sure whether I'm missing something or this is a bug.

@maciejhirsz
Owner

Logos isn't using the regex crate, it's using regex-syntax for syntax only, and implements only a subset of regex.

@jawadcode
Author

> Logos isn't using the regex crate, it's using regex-syntax for syntax only, and implements only a subset of regex.

I see, could you suggest some changes I could make to the regex in order for it to fit within that subset?
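
One option that avoids the regex subset entirely is to find the end of the comment by hand. The sketch below uses only the standard library and is meant to accept the same `(* ... *)` comments as the regex above, ending at the first `*)`. The function name `comment_len` is hypothetical, and hooking it into Logos (for example from a token callback) is left out because that wiring is an assumption not covered in this thread.

```rust
/// Hypothetical helper: returns the byte length of a `(* ... *)` comment
/// at the start of `input`, or `None` if `input` does not start with `(*`
/// or the comment is never closed.
fn comment_len(input: &str) -> Option<usize> {
    // Require the opening delimiter and look only at what follows it.
    let body = input.strip_prefix("(*")?;
    // The comment ends at the first `*)` after the opener.
    let end = body.find("*)")?;
    // Opener (2 bytes) + body up to the closer + closer (2 bytes).
    Some(2 + end + 2)
}

fn main() {
    let test = "(* hello world *)";
    // The whole test string is a single comment.
    assert_eq!(comment_len(test), Some(test.len()));
    // `(*)` is not a complete comment: the `*` cannot be shared by both delimiters.
    assert_eq!(comment_len("(*)"), None);
    println!("comment length: {:?}", comment_len(test));
}
```

Because the closing delimiter is searched for only after the opener has been stripped, `(*)` is rejected just as it is by the original regex.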

@evbo

evbo commented Nov 21, 2022

@jawadcode see this really useful discussion on regexes for comments: #133

Also, I think (?m) enables multi-line mode. If that's the case, another way to solve this is to use separate tokens for (* and *), then use extras to keep track of when a comment begins and ends. Everything in between is part of the comment only if a comment has begun.

Here's a simple example of extras:
https://github.com/maciejhirsz/ramhorns/blob/9f2ed4d3d173cc835dc0a281662b0a8e87a348f2/ramhorns/src/template/parse.rs#L21-L26
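
The state-tracking idea can also be sketched without Logos at all. In the standard-library-only example below, `Piece`, `CommentState`, `pieces`, and `comment_bodies` are all hypothetical names: `Piece` stands in for the tokens a lexer would emit, and `CommentState` plays the role that extras would play. It illustrates the approach under those assumptions and is not Logos code.

```rust
/// Hypothetical stand-in for lexer tokens; these are not Logos types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Piece {
    Open,  // "(*"
    Close, // "*)"
    Other, // any other single character
}

/// Hypothetical state, playing the role of lexer extras.
#[derive(Default)]
struct CommentState {
    in_comment: bool,
}

/// Split the input into Open / Close / Other pieces, left to right.
fn pieces(src: &str) -> Vec<(Piece, &str)> {
    let mut out = Vec::new();
    let mut rest = src;
    while !rest.is_empty() {
        let (piece, len) = if rest.starts_with("(*") {
            (Piece::Open, 2)
        } else if rest.starts_with("*)") {
            (Piece::Close, 2)
        } else {
            // Advance by one whole character so slicing stays on a UTF-8 boundary.
            (Piece::Other, rest.chars().next().map_or(1, char::len_utf8))
        };
        out.push((piece, &rest[..len]));
        rest = &rest[len..];
    }
    out
}

/// Collect the body of every closed comment, using the state flag to decide
/// whether a piece belongs to a comment. Nested comments are not supported:
/// an opener seen inside a comment is treated as ordinary comment text.
fn comment_bodies(src: &str) -> Vec<String> {
    let mut state = CommentState::default();
    let mut bodies = Vec::new();
    let mut current = String::new();
    for (piece, text) in pieces(src) {
        match (piece, state.in_comment) {
            (Piece::Open, false) => {
                state.in_comment = true;
                current.clear();
            }
            (Piece::Close, true) => {
                state.in_comment = false;
                bodies.push(std::mem::take(&mut current));
            }
            // Inside a comment, everything else is part of its body.
            (_, true) => current.push_str(text),
            // Outside a comment, this sketch ignores the piece.
            (_, false) => {}
        }
    }
    bodies
}

fn main() {
    let bodies = comment_bodies("(* hello world *)");
    assert_eq!(bodies, vec![" hello world ".to_string()]);
    println!("{:?}", bodies);
}
```

An unterminated comment produces no body in this sketch; a real lexer would more likely report it as an error.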
