Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Token or regex not matched as expected when it's prefix of an another, larger regex #227

Open
osa1 opened this issue Oct 14, 2021 · 5 comments

Comments

@osa1
Copy link

osa1 commented Oct 14, 2021

I'm using current git master branch (925c49e).

Repro:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex("a+b")]
    APlusB,

    #[token("a")]
    A,

    #[error]
    Error,
}

fn main() {
    // Works
    let mut lex = Token::lexer("aaaaaaaaaaaaaaab");
    assert_eq!(lex.next(), Some(Token::APlusB));
    assert_eq!(lex.next(), None);

    // Works
    let mut lex = Token::lexer("a");
    assert_eq!(lex.next(), Some(Token::A));
    assert_eq!(lex.next(), None);

    // Fails
    let mut lex = Token::lexer("aa");
    assert_eq!(lex.next(), Some(Token::A));
}

The first two tests are fine. In the third case, I would expect the lexer to yield two Token::As, but instead it yields Error.

Is this expected?

@evbo
Copy link

evbo commented Nov 22, 2022

I think longer regexs are preferred, see similar issue here: #255

@osa1
Copy link
Author

osa1 commented Nov 23, 2022

This is not about longest match rule, logos just doesn't support backtracking. If you try my example with any other lexer generator with the longest match rule (alex, ocamllex, lexgen, ...) you'll see that they all accept the last case.

@evbo
Copy link

evbo commented Nov 23, 2022

@osa1 hmmm... I assumed longer regex matches took precedence over Token...

I wonder as a worked around if (presuming spaces get Skip) this would work:

#[regex("a+b")]
    APlusB,

// make it longer.... or add higher priority...
 #[regex("a *")]
    A,

These are the sorts of hacks I often employ. I understand it's adding cruft and stink...

@ExoticMatter
Copy link
Contributor

Just ran into this myself.

Regex priority is irrelevant here--if a regex fails to match, Logos backtracks to a lower priority regex/token anyway. Or rather, is supposed to--it seems Logos fails to backtrack correctly when encountering +, *, or ?. Indeed, raising the priority with priority = N fails to work around the issue.

The cargo-expand from @osa1's example
impl<'s> ::logos::Logos<'s> for Token {
    type Error = ();
    type Extras = ();
    type Source = str;
    fn lex(lex: &mut ::logos::Lexer<'s, Self>) {
        use ::logos::internal::{LexerInternal, CallbackResult};
        type Lexer<'s> = ::logos::Lexer<'s, Token>;
        fn _end<'s>(lex: &mut Lexer<'s>) {
            lex.end()
        }
        fn _error<'s>(lex: &mut Lexer<'s>) {
            lex.bump_unchecked(1);
            lex.error();
        }
        #[inline]
        fn goto5_ctx5_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(Token::A));
        }
        #[inline]
        fn goto1_ctx5_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(Token::APlusB));
        }
        #[inline]
        fn goto1_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(Token::APlusB));
        }
        #[inline]
        fn goto2_x<'s>(lex: &mut Lexer<'s>) {
            match lex.read::<&[u8; 1usize]>() {
                Some(b"b") => {
                    lex.bump_unchecked(1usize);
                    goto1_x(lex)
                }
                _ => lex.error(),
            }
        }
        #[inline]
        fn goto1_ctx2_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(Token::APlusB));
        }
        #[inline]
        fn goto2_ctx2_x<'s>(lex: &mut Lexer<'s>) {
            match lex.read::<&[u8; 1usize]>() {
                Some(b"b") => {
                    lex.bump_unchecked(1usize);
                    goto1_ctx2_x(lex)
                }
                _ => goto2_x(lex),
            }
        }
        #[inline]
        fn goto3_ctx2_x<'s>(lex: &mut Lexer<'s>) {
            match lex.read::<&[u8; 1usize]>() {
                Some(b"a") => {
                    lex.bump_unchecked(1usize);
                    goto3_ctx2_x(lex)
                }
                _ => goto2_ctx2_x(lex),
            }
        }
        #[inline]
        fn goto6_ctx5_x<'s>(lex: &mut Lexer<'s>) {
            let byte = match lex.read::<u8>() {
                Some(byte) => byte,
                None => return goto5_ctx5_x(lex),
            };
            match byte {
                b'b' => {
                    lex.bump_unchecked(1usize);
                    goto1_ctx5_x(lex)
                }
                b'a' => {
                    lex.bump_unchecked(1usize);
                    goto3_ctx2_x(lex)
                }
                _ => goto5_ctx5_x(lex),
            }
        }
        #[inline]
        fn goto8<'s>(lex: &mut Lexer<'s>) {
            let byte = match lex.read::<u8>() {
                Some(byte) => byte,
                None => return _end(lex),
            };
            match byte {
                b'a' => {
                    lex.bump_unchecked(1usize);
                    goto6_ctx5_x(lex)
                }
                _ => _error(lex),
            }
        }
        goto8(lex)
    }
}

Notice how the generated code always calls bump_unchecked. It should only call bump_unchecked when it has a slice it can accept as a token. It seems Logos assumes that if a second 'a' is read, that it must match a+b.

For comparison, replacing the regex a+b with ab|aab|aaab causes Logos to backtrack correctly.

Here is the cargo expand for this altered lexer.
impl<'s> ::logos::Logos<'s> for Token {
    type Error = ();
    type Extras = ();
    type Source = str;
    fn lex(lex: &mut ::logos::Lexer<'s, Self>) {
        use ::logos::internal::{LexerInternal, CallbackResult};
        type Lexer<'s> = ::logos::Lexer<'s, Token>;
        fn _end<'s>(lex: &mut Lexer<'s>) {
            lex.end()
        }
        fn _error<'s>(lex: &mut Lexer<'s>) {
            lex.bump_unchecked(1);
            lex.error();
        }
        #[inline]
        fn goto9_ctx9_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(Token::A));
        }
        #[inline]
        fn goto9_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(Token::A));
        }
        #[inline]
        fn goto1_ctx9_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(Token::AAB));
        }
        #[inline]
        fn goto3_at2_ctx9_x<'s>(lex: &mut Lexer<'s>) {
            match lex.read_at::<&[u8; 1usize]>(2usize) {
                Some(b"b") => {
                    lex.bump_unchecked(3usize);
                    goto1_ctx9_x(lex)
                }
                _ => goto9_x(lex),
            }
        }
        #[inline]
        fn goto5_at1_ctx9_x<'s>(lex: &mut Lexer<'s>) {
            let byte = match lex.read_at::<u8>(1usize) {
                Some(byte) => byte,
                None => return goto9_x(lex),
            };
            match byte {
                b'b' => {
                    lex.bump_unchecked(2usize);
                    goto1_ctx9_x(lex)
                }
                b'a' => goto3_at2_ctx9_x(lex),
                _ => goto9_x(lex),
            }
        }
        #[inline]
        fn goto10_ctx9_x<'s>(lex: &mut Lexer<'s>) {
            let byte = match lex.read::<u8>() {
                Some(byte) => byte,
                None => return goto9_ctx9_x(lex),
            };
            match byte {
                b'a' => goto5_at1_ctx9_x(lex),
                b'b' => {
                    lex.bump_unchecked(1usize);
                    goto1_ctx9_x(lex)
                }
                _ => goto9_ctx9_x(lex),
            }
        }
        #[inline]
        fn goto11<'s>(lex: &mut Lexer<'s>) {
            let byte = match lex.read::<u8>() {
                Some(byte) => byte,
                None => return _end(lex),
            };
            match byte {
                b'a' => {
                    lex.bump_unchecked(1usize);
                    goto10_ctx9_x(lex)
                }
                _ => _error(lex),
            }
        }
        goto11(lex)
    }
}

Notice how the generated code doesn't call bump_unchecked unless it has some slice it can accept as a token.

@maciejhirsz
Copy link
Owner

I believe this to be related to #279. I don't think this is fixable until the derive macro is rewritten (see 0.13 release notes).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants