Lexer produces wrong tokens when more input is provided #265

MalteJanz · 2022-10-10T12:59:32Z

My logos lexer implementation somehow does not match the TK_NOT token when there is more input (like a whitespace) after it. Instead it matches the TK_WORD token in that case, which should be wrong when it has a lower priority.

Reproducible example with tests:

use logos::Logos;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Logos)]
#[allow(non_camel_case_types)]
pub enum SyntaxKind {
    #[regex(r"[ \t]+", priority = 1)]
    TK_WHITESPACE = 0,
    #[regex(r"[a-zA-Z][a-zA-Z0-9]*", priority = 1)]
    TK_WORD,

    #[token("not", priority = 50)]
    TK_NOT,
    #[token("not in", priority = 60)]
    TK_NOT_IN,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn single_not_works() {
        let mut lexer = SyntaxKind::lexer("not");
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
    }

    #[test]
    fn word_then_not_works() {
        let mut lexer = SyntaxKind::lexer("word not");
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WORD)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WHITESPACE)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
    }

    #[test]
    fn but_this_does_not_work() {
        let mut lexer = SyntaxKind::lexer("not word");
        // FAILED because
        // Left:  Some(Ok(TK_WORD)
        // Right: Some(Ok(TK_NOT)
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WHITESPACE)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WORD)));
    }

    #[test]
    fn this_is_fine() {
        let mut lexer = SyntaxKind::lexer("not in ");
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT_IN)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WHITESPACE)));
    }
}

I know that the situation with TK_NOT and TK_NOT_IN is maybe not ideal (if I remove the latter it works again). But for my parser it would be way better to have these tokens rather than two separate TK_NOT and TK_IN tokens. I would be thankfully for any suggestions that don't require me to remove either of TK_WORD or TK_NOT_IN to make the test but_this_does_not_work run.

The text was updated successfully, but these errors were encountered:

MalteJanz · 2022-10-10T13:10:35Z

May also be related to #160

MalteJanz · 2024-02-08T16:37:40Z

@jeertmans Thank you for making this project active again 👍. it would be great if this bug gets some attention🙂, because it can be a real big surprise if encountered (did cost me a good amount of debugging and making a workaround in my parser for my bachelor thesis) 🤯 .

It was initially reported in logos 0.12 and I updated the reproduction code to 0.14 and verified it is still a problem. Also it doesn't seem to matter if there aren't any priorities set explicitly.

jeertmans · 2024-02-10T09:47:01Z

Indeed I think this is related to #160, and probably caused by two patterns matching the same things, but one pattern allowing a longer match. This issue should be handled by the priorities, but seems like it fails to do so.

I would be interesting to analyse the generated code by logos derive macro, and inspect it. Did you already do that?

MalteJanz · 2024-02-10T13:09:57Z

I guess you mean #160 right (maybe misspelled)?
Unfortunately I didn't had the time to investigate further back then and I'm not familiar with the logos codebase. But let me know if I can help in any other way here to move this forward 🙂

jeertmans · 2024-02-10T14:27:30Z

I guess you mean #160 right (maybe misspelled)?
Yes good catch! Fixing this!

I don't have much time right now, but I am posting the expanded macro code here,
so we can maybe take a first look?

#![feature(prelude_import)]
#[prelude_import]
use std::prelude::rust_2021::*;
#[macro_use]
extern crate std;
use logos::Logos;
pub enum SyntaxKind {
    #[regex(r"[ \t]+", priority = 1)]
    TK_WHITESPACE = 0,
    #[regex(r"[a-zA-Z][a-zA-Z0-9]*", priority = 1)]
    TK_WORD,
    #[token("not", priority = 50)]
    TK_NOT,
    #[token("not in", priority = 60)]
    TK_NOT_IN,
}
impl<'s> ::logos::Logos<'s> for SyntaxKind {
    type Error = ();
    type Extras = ();
    type Source = str;
    fn lex(lex: &mut ::logos::Lexer<'s, Self>) {
        use ::logos::internal::{LexerInternal, CallbackResult};
        type Lexer<'s> = ::logos::Lexer<'s, SyntaxKind>;
        fn _end<'s>(lex: &mut Lexer<'s>) {
            lex.end()
        }
        fn _error<'s>(lex: &mut Lexer<'s>) {
            lex.bump_unchecked(1);
            lex.error();
        }
        static COMPACT_TABLE_0: [u8; 256] = [
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        ];
        #[inline]
        fn goto4_ctx4_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(SyntaxKind::TK_WORD));
        }
        #[inline]
        fn pattern0(byte: u8) -> bool {
            COMPACT_TABLE_0[byte as usize] & 1 > 0
        }
        #[inline]
        fn goto5_ctx4_x<'s>(lex: &mut Lexer<'s>) {
            while let Some(arr) = lex.read::<&[u8; 16]>() {
                if pattern0(arr[0]) {
                    if pattern0(arr[1]) {
                        if pattern0(arr[2]) {
                            if pattern0(arr[3]) {
                                if pattern0(arr[4]) {
                                    if pattern0(arr[5]) {
                                        if pattern0(arr[6]) {
                                            if pattern0(arr[7]) {
                                                if pattern0(arr[8]) {
                                                    if pattern0(arr[9]) {
                                                        if pattern0(arr[10]) {
                                                            if pattern0(arr[11]) {
                                                                if pattern0(arr[12]) {
                                                                    if pattern0(arr[13]) {
                                                                        if pattern0(arr[14]) {
                                                                            if pattern0(arr[15]) {
                                                                                lex.bump_unchecked(
                                                                                    16,
                                                                                );
                                                                                continue;
                                                                            }
                                                                            lex.bump_unchecked(15);
                                                                            return goto4_ctx4_x(
                                                                                lex,
                                                                            );
                                                                        }
                                                                        lex.bump_unchecked(14);
                                                                        return goto4_ctx4_x(lex);
                                                                    }
                                                                    lex.bump_unchecked(13);
                                                                    return goto4_ctx4_x(lex);
                                                                }
                                                                lex.bump_unchecked(12);
                                                                return goto4_ctx4_x(lex);
                                                            }
                                                            lex.bump_unchecked(11);
                                                            return goto4_ctx4_x(lex);
                                                        }
                                                        lex.bump_unchecked(10);
                                                        return goto4_ctx4_x(lex);
                                                    }
                                                    lex.bump_unchecked(9);
                                                    return goto4_ctx4_x(lex);
                                                }
                                                lex.bump_unchecked(8);
                                                return goto4_ctx4_x(lex);
                                            }
                                            lex.bump_unchecked(7);
                                            return goto4_ctx4_x(lex);
                                        }
                                        lex.bump_unchecked(6);
                                        return goto4_ctx4_x(lex);
                                    }
                                    lex.bump_unchecked(5);
                                    return goto4_ctx4_x(lex);
                                }
                                lex.bump_unchecked(4);
                                return goto4_ctx4_x(lex);
                            }
                            lex.bump_unchecked(3);
                            return goto4_ctx4_x(lex);
                        }
                        lex.bump_unchecked(2);
                        return goto4_ctx4_x(lex);
                    }
                    lex.bump_unchecked(1);
                    return goto4_ctx4_x(lex);
                }
                return goto4_ctx4_x(lex);
            }
            while lex.test(pattern0) {
                lex.bump_unchecked(1);
            }
            goto4_ctx4_x(lex);
        }
        #[inline]
        fn goto7_ctx5_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(SyntaxKind::TK_NOT));
        }
        #[inline]
        fn goto8_ctx5_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(SyntaxKind::TK_NOT_IN));
        }
        #[inline]
        fn goto16_at1_ctx5_x<'s>(lex: &mut Lexer<'s>) {
            match lex.read_at::<&[u8; 2usize]>(1usize) {
                Some(b"in") => {
                    lex.bump_unchecked(3usize);
                    goto8_ctx5_x(lex)
                }
                _ => goto5_ctx4_x(lex),
            }
        }
        #[inline]
        fn goto15_ctx5_x<'s>(lex: &mut Lexer<'s>) {
            let byte = match lex.read::<u8>() {
                Some(byte) => byte,
                None => return goto7_ctx5_x(lex),
            };
            match byte {
                byte if pattern0(byte) => {
                    lex.bump_unchecked(1usize);
                    goto5_ctx4_x(lex)
                }
                32u8 => goto16_at1_ctx5_x(lex),
                _ => goto7_ctx5_x(lex),
            }
        }
        #[inline]
        fn goto13_ctx5_x<'s>(lex: &mut Lexer<'s>) {
            match lex.read::<&[u8; 2usize]>() {
                Some(b"ot") => {
                    lex.bump_unchecked(2usize);
                    goto15_ctx5_x(lex)
                }
                _ => goto5_ctx4_x(lex),
            }
        }
        #[inline]
        fn goto1_ctx1_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(SyntaxKind::TK_WHITESPACE));
        }
        #[inline]
        fn pattern1(byte: u8) -> bool {
            match byte {
                9u8 | 32u8 => true,
                _ => false,
            }
        }
        #[inline]
        fn goto2_ctx1_x<'s>(lex: &mut Lexer<'s>) {
            while let Some(arr) = lex.read::<&[u8; 16]>() {
                if pattern1(arr[0]) {
                    if pattern1(arr[1]) {
                        if pattern1(arr[2]) {
                            if pattern1(arr[3]) {
                                if pattern1(arr[4]) {
                                    if pattern1(arr[5]) {
                                        if pattern1(arr[6]) {
                                            if pattern1(arr[7]) {
                                                if pattern1(arr[8]) {
                                                    if pattern1(arr[9]) {
                                                        if pattern1(arr[10]) {
                                                            if pattern1(arr[11]) {
                                                                if pattern1(arr[12]) {
                                                                    if pattern1(arr[13]) {
                                                                        if pattern1(arr[14]) {
                                                                            if pattern1(arr[15]) {
                                                                                lex.bump_unchecked(
                                                                                    16,
                                                                                );
                                                                                continue;
                                                                            }
                                                                            lex.bump_unchecked(15);
                                                                            return goto1_ctx1_x(
                                                                                lex,
                                                                            );
                                                                        }
                                                                        lex.bump_unchecked(14);
                                                                        return goto1_ctx1_x(lex);
                                                                    }
                                                                    lex.bump_unchecked(13);
                                                                    return goto1_ctx1_x(lex);
                                                                }
                                                                lex.bump_unchecked(12);
                                                                return goto1_ctx1_x(lex);
                                                            }
                                                            lex.bump_unchecked(11);
                                                            return goto1_ctx1_x(lex);
                                                        }
                                                        lex.bump_unchecked(10);
                                                        return goto1_ctx1_x(lex);
                                                    }
                                                    lex.bump_unchecked(9);
                                                    return goto1_ctx1_x(lex);
                                                }
                                                lex.bump_unchecked(8);
                                                return goto1_ctx1_x(lex);
                                            }
                                            lex.bump_unchecked(7);
                                            return goto1_ctx1_x(lex);
                                        }
                                        lex.bump_unchecked(6);
                                        return goto1_ctx1_x(lex);
                                    }
                                    lex.bump_unchecked(5);
                                    return goto1_ctx1_x(lex);
                                }
                                lex.bump_unchecked(4);
                                return goto1_ctx1_x(lex);
                            }
                            lex.bump_unchecked(3);
                            return goto1_ctx1_x(lex);
                        }
                        lex.bump_unchecked(2);
                        return goto1_ctx1_x(lex);
                    }
                    lex.bump_unchecked(1);
                    return goto1_ctx1_x(lex);
                }
                return goto1_ctx1_x(lex);
            }
            while lex.test(pattern1) {
                lex.bump_unchecked(1);
            }
            goto1_ctx1_x(lex);
        }
        #[inline]
        fn goto17<'s>(lex: &mut Lexer<'s>) {
            enum Jump {
                __,
                J5,
                J13,
                J2,
            }
            const LUT: [Jump; 256] = {
                use Jump::*;
                [
                    __, __, __, __, __, __, __, __, __, J2, __, __, __, __, __, __, __, __, __, __,
                    __, __, __, __, __, __, __, __, __, __, __, __, J2, __, __, __, __, __, __, __,
                    __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
                    __, __, __, __, __, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5,
                    J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, __, __, __, __, __, __, J5, J5, J5,
                    J5, J5, J5, J5, J5, J5, J5, J5, J5, J5, J13, J5, J5, J5, J5, J5, J5, J5, J5,
                    J5, J5, J5, J5, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
                    __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
                    __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
                    __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
                    __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
                    __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
                    __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __, __,
                ]
            };
            let byte = match lex.read::<u8>() {
                Some(byte) => byte,
                None => return _end(lex),
            };
            match LUT[byte as usize] {
                Jump::J5 => {
                    lex.bump_unchecked(1usize);
                    goto5_ctx4_x(lex)
                }
                Jump::J13 => {
                    lex.bump_unchecked(1usize);
                    goto13_ctx5_x(lex)
                }
                Jump::J2 => {
                    lex.bump_unchecked(1usize);
                    goto2_ctx1_x(lex)
                }
                Jump::__ => _error(lex),
            }
        }
        goto17(lex)
    }
}

jeertmans · 2024-02-10T14:30:40Z

It might also be worth trying a debugger, like rust-gdb or https://www.gdbgui.com/. I haven't used them myself, but just pointing to.

MalteJanz · 2024-02-11T17:05:58Z

I haven't figured out debugging of the logos generated source yet (somehow RustRover doesn't seem to recognise the macro code). But I was able to reduce the reproduction code a bit further (maybe makes it easier to grasp the generated code):

use logos::Logos;

#[derive(Debug, Clone, Copy, PartialEq, Logos)]
#[allow(non_camel_case_types)]
pub enum SyntaxKind {
    #[regex(r"[a-zA-Z][a-zA-Z0-9]*", priority = 2)]
    TK_WORD,

    #[token("not", priority = 10)]
    TK_NOT,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn single_not_works() {
        let mut lexer = SyntaxKind::lexer("not");
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
    }

    #[test]
    fn single_word_works() {
        let mut lexer = SyntaxKind::lexer("word");
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WORD)));
    }

    #[test]
    fn but_this_does_not_work() {
        let mut lexer = SyntaxKind::lexer("notword");
        // FAILED because
        // Left:  Some(Ok(TK_WORD)
        // Right: Some(Ok(TK_NOT)
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_NOT)));
        assert_eq!(lexer.next(), Some(Ok(SyntaxKind::TK_WORD)));
    }
}

Logos macro expansion:

use logos::Logos;
#[allow(non_camel_case_types)]
pub enum SyntaxKind {
    #[regex(r"[a-zA-Z][a-zA-Z0-9]*", priority = 2)]
    TK_WORD,
    #[token("not", priority = 10)]
    TK_NOT,
}

impl<'s> ::logos::Logos<'s> for SyntaxKind {
    type Error = ();
    type Extras = ();
    type Source = str;
    fn lex(lex: &mut ::logos::Lexer<'s, Self>) {
        use ::logos::internal::{CallbackResult, LexerInternal};
        type Lexer<'s> = ::logos::Lexer<'s, SyntaxKind>;
        fn _end<'s>(lex: &mut Lexer<'s>) {
            lex.end()
        }
        fn _error<'s>(lex: &mut Lexer<'s>) {
            lex.bump_unchecked(1);
            lex.error();
        }
        static COMPACT_TABLE_0: [u8; 256] = [
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        ];
        #[inline]
        fn goto1_ctx1_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(SyntaxKind::TK_WORD));
        }
        #[inline]
        fn pattern0(byte: u8) -> bool {
            COMPACT_TABLE_0[byte as usize] & 1 > 0
        }
        #[inline]
        fn goto2_ctx1_x<'s>(lex: &mut Lexer<'s>) {
            while let Some(arr) = lex.read::<&[u8; 16]>() {
                if pattern0(arr[0]) {
                    if pattern0(arr[1]) {
                        if pattern0(arr[2]) {
                            if pattern0(arr[3]) {
                                if pattern0(arr[4]) {
                                    if pattern0(arr[5]) {
                                        if pattern0(arr[6]) {
                                            if pattern0(arr[7]) {
                                                if pattern0(arr[8]) {
                                                    if pattern0(arr[9]) {
                                                        if pattern0(arr[10]) {
                                                            if pattern0(arr[11]) {
                                                                if pattern0(arr[12]) {
                                                                    if pattern0(arr[13]) {
                                                                        if pattern0(arr[14]) {
                                                                            if pattern0(arr[15]) {
                                                                                lex.bump_unchecked(
                                                                                    16,
                                                                                );
                                                                                continue;
                                                                            }
                                                                            lex.bump_unchecked(15);
                                                                            return goto1_ctx1_x(
                                                                                lex,
                                                                            );
                                                                        }
                                                                        lex.bump_unchecked(14);
                                                                        return goto1_ctx1_x(lex);
                                                                    }
                                                                    lex.bump_unchecked(13);
                                                                    return goto1_ctx1_x(lex);
                                                                }
                                                                lex.bump_unchecked(12);
                                                                return goto1_ctx1_x(lex);
                                                            }
                                                            lex.bump_unchecked(11);
                                                            return goto1_ctx1_x(lex);
                                                        }
                                                        lex.bump_unchecked(10);
                                                        return goto1_ctx1_x(lex);
                                                    }
                                                    lex.bump_unchecked(9);
                                                    return goto1_ctx1_x(lex);
                                                }
                                                lex.bump_unchecked(8);
                                                return goto1_ctx1_x(lex);
                                            }
                                            lex.bump_unchecked(7);
                                            return goto1_ctx1_x(lex);
                                        }
                                        lex.bump_unchecked(6);
                                        return goto1_ctx1_x(lex);
                                    }
                                    lex.bump_unchecked(5);
                                    return goto1_ctx1_x(lex);
                                }
                                lex.bump_unchecked(4);
                                return goto1_ctx1_x(lex);
                            }
                            lex.bump_unchecked(3);
                            return goto1_ctx1_x(lex);
                        }
                        lex.bump_unchecked(2);
                        return goto1_ctx1_x(lex);
                    }
                    lex.bump_unchecked(1);
                    return goto1_ctx1_x(lex);
                }
                return goto1_ctx1_x(lex);
            }
            while lex.test(pattern0) {
                lex.bump_unchecked(1);
            }
            goto1_ctx1_x(lex);
        }
        #[inline]
        fn goto4_ctx2_x<'s>(lex: &mut Lexer<'s>) {
            lex.set(Ok(SyntaxKind::TK_NOT));
        }
        #[inline]
        fn goto7_ctx2_x<'s>(lex: &mut Lexer<'s>) {
            let byte = match lex.read::<u8>() {
                Some(byte) => byte,
                None => return goto4_ctx2_x(lex),
            };
            match byte {
                byte if pattern0(byte) => {
                    lex.bump_unchecked(1usize);
                    goto2_ctx1_x(lex)
                }
                _ => goto4_ctx2_x(lex),
            }
        }
        #[inline]
        fn goto6_ctx2_x<'s>(lex: &mut Lexer<'s>) {
            match lex.read::<&[u8; 2usize]>() {
                Some(b"ot") => {
                    lex.bump_unchecked(2usize);
                    goto7_ctx2_x(lex)
                }
                _ => goto2_ctx1_x(lex),
            }
        }
        #[inline]
        fn pattern1(byte: u8) -> bool {
            const LUT: u64 = 576390375103528958u64;
            match 1u64.checked_shl(byte.wrapping_sub(64u8) as u32) {
                Some(shift) => LUT & shift != 0,
                None => false,
            }
        }
        #[inline]
        fn goto8<'s>(lex: &mut Lexer<'s>) {
            let byte = match lex.read::<u8>() {
                Some(byte) => byte,
                None => return _end(lex),
            };
            match byte {
                b'n' => {
                    lex.bump_unchecked(1usize);
                    goto6_ctx2_x(lex)
                }
                byte if pattern1(byte) => {
                    lex.bump_unchecked(1usize);
                    goto2_ctx1_x(lex)
                }
                _ => _error(lex),
            }
        }
        goto8(lex)
    }
}

MalteJanz · 2024-02-11T17:12:31Z

Maybe it's somehow related to the greedy matching of the regex, but I wouldn't expect that if there is another token definition with higher priority 🤔.

And as the other tests run just fine it is only a problem when there are more bytes after the not 🤷

jameshurst · 2024-02-12T01:09:38Z

I had been running into what seemed like a similar issue, where a longer match would fail and it would backtrack to the wrong token. What was also strange was that the &str (ie. lex.slice()) contained within the mismatched token was actually the slice for what should have been matched. So to hack around the issue I had simply created a callback that took lex.slice() and created a new lexer to re-parse slice on its own and correct the token.

The code to reproduce my issue is a lot simpler as I was able to reduce it down to not actually perform any regex matching at all, although it still needed a #[regex(...)] token to trigger the bug seemingly due to differences in how the graph is generated for regex vs text tokens.

use logos::Logos;

#[derive(Debug, PartialEq, Eq, Logos)]
pub enum Token<'s> {
    #[token("Foo")]
    Foo(&'s str),
    #[token("FooBar")]
    FooBar(&'s str),
    #[regex("FooBarQux")]
    FooBarQux(&'s str),
}

#[cfg(test)]
mod tests {
    use itertools::Itertools;
    use logos::Logos;

    use super::*;

    #[test]
    fn test() {
        let lexer = Token::lexer("FooBarQ");
        let expected: Vec<Result<_, ()>> = vec![Ok(Token::FooBar("FooBar"))];
        let results = lexer
            .zip_longest(expected)
            .map(|x| x.left_and_right())
            .unzip::<_, _, Vec<_>, Vec<_>>();
        //  left: [Some(Ok(Foo("FooBar")))]
        // right: [Some(Ok(FooBar("FooBar")))]
        assert_eq!(results.0, results.1);
    }
}

Potential Fix?

I took some time today to look into it and the issue appeared to be in the generator, which would explain why priority had no effect (both for me and @MalteJanz), as that's only used in the steps before generator. The issue appears to be with context tracking and that the backtrack point wasn't being updated, even though the slice was getting bumped correctly.

I was able to resolve my issue by adding || meta.min_read == 0 to the if logic at https://github.com/maciejhirsz/logos/blob/v0.14/logos-codegen/src/generator/mod.rs#L115-L119, as that appears to be what causes the slice to be bumped correctly in the match block following the if. After that change I was able to remove my re-lexing hack and all my tests still pass.

The change to the if logic also appears to fix the issue in the code originally posted by @MalteJanz, but strangely not in the newly posted minimal reproduction case.

I'm not familiar enough with the logos internals to say if this is the correct fix though, as it doesn't fix the issue with @MalteJanz's minimal case, although perhaps that's actually a separate bug.

The case originally posted by @MalteJanz does seem very similar to what I was doing, where I was matching a separator token (TK_WHITESPACE) and a sequence followed by a separator ("not in ") would match incorrectly.

jameshurst · 2024-02-12T16:39:39Z

Maybe it's somehow related to the greedy matching of the regex, but I wouldn't expect that if there is another token definition with higher priority 🤔.

I looked into this one a bit, and I do think that it's a separate issue than the original issue.

The following code causes logos to generate the following graph.

pub enum SyntaxKind {
    #[regex(r"[a-z]+", priority = 0)]
    TK_WORD,
    #[token("not", priority = 100)]
    TK_NOT,
}

{
    1: ::TK_WORD,
    2: {
        [a-z] ⇒ 2,
        _ ⇒ 1,
    },
    4: ::TK_NOT,
    6: [
        ot ⇒ 7,
        _ ⇒ 2*,
    ],
    7: {
        [a-z] ⇒ 2,
        _ ⇒ 4,
    },
    8: {
        [a-m] ⇒ 2,
        n ⇒ 6,
        [o-z] ⇒ 2,
    },
}

Following the path for "not", starting from the bottom node, we can see that:

@8 matches "n" and jumps to @6
@6 matches "ot" and jumps to @7
@7 only returns TK_NOT when the next character is not [a-z].

Sidenote: For anyone else that wants to check out the graph that logos is generating for their code, it turns out logos already has a nice Debug implementation that produces the above output. You can just throw a dbg!(graph) in Generator::new and run cargo build on your crate to run the codegen and see the graph.

It does indeed seem that the regex is too greedy. I previously had an issue with a greedy regex where I had a token, like TK_WORD, that was meant to consume text that wasn't matched by any other token. I was able to work around the issue by removing the greedy regex and modifying my lexing loop to consume any text up to the next separator whenever a lexing error is encountered. Although in my case a separator is always required between tokens, and any matched token must be followed by a separator or EOF, so something like "notword" would result in a text token, like TK_WORD.

let mut tokens = Vec::<Token>::new();
let mut lexer = Token::lexer(input);
while let Some(token) = lexer.next() {
    if let Ok(token) = token {
        tokens.push(token);
    } else {
        let consumed = lexer
            .remainder()
            .find(|c| CHARS_TEXT_END.contains(c))
            .unwrap_or(lexer.remainder().len());
        tokens.push(Token::Text(
            &input[lexer.span().start..lexer.span().end + consumed],
        ));
        lexer.bump(consumed);
    }
}

For a case that does not require separators between tokens, the loop could probably be modified to collect the spans of errors, and then when the next matched token is found, those spans could be collapsed into a text token and inserted before the matched token.

It seems like using greedy regexes with logos can cause some unexpected matching behaviours. In the "notword" case, I would probably expect it to return TK_NOT and TK_WORD.

jameshurst · 2024-02-12T17:54:24Z

I was able to resolve my issue by adding || meta.min_read == 0 to the if logic at https://github.com/maciejhirsz/logos/blob/v0.14/logos-codegen/src/generator/mod.rs#L115-L119, as that appears to be what causes the slice to be bumped correctly in the match block following the if.

After staring at the code some more, I believe the proper fix may actually be to just remove the if logic alltogether and always perform ctx.switch(self.graph[id].miss()).

Sidenote: After removing the if statement, the ctx.switch function can be simplified to not perform ctx.bump and no longer return a token stream. The match logic below the if statement can then be simplified to only if enters_loop || meta.min_read == 0.

From what I can see Generator::goto will be called to generate:

the root node
the _ branch for a match statement
any other branches for a match statement

For each of these cases I would expect self.graph[id].miss() to indicate the correct backtrack position, or None which would cause ctx.switch to keep the existing backtrack.

I tested this change against my own library, the original code for this issue, and #160 and they are all passing.

I'd be interested to hear your thoughts on this @jeertmans (and @maciejhirsz's if available).

jameshurst · 2024-02-12T19:10:53Z

I browsed through some more of the GitHub issues and found that the fix mentioned above also fixes #279.

jeertmans · 2024-02-13T09:11:40Z

Hey @jameshurst, thank you for your very deep analysis on this issue, and this might well solve a very long-running issue!

A few comments / suggestions / questions.

Sidenote: For anyone else that wants to check out the graph that logos is generating for their code, it turns out logos already has a nice Debug implementation that produces the above output. You can just throw a dbg!(graph) in Generator::new and run cargo build on your crate to run the codegen and see the graph.

Aah, I didn't know myself! I really should document this in the book :-)
Maybe you could create a PR that adds a new page to the book, e.g., Debugging tool? If you do not have time, I might try to do it :-)

I'm not familiar enough with the logos internals to say if this is the correct fix though, as it doesn't fix the issue with @MalteJanz's minimal case, although perhaps that's actually a separate bug.

My neither actually (internals were mostly written by one person, which is not me haha), but if we write enough tests (e.g., based on previous issues), and they now pass, I think it's safe to call it a fix!

I was able to resolve my issue by adding || meta.min_read == 0 to the if logic at https://github.com/maciejhirsz/logos/blob/v0.14/logos-codegen/src/generator/mod.rs#L115-L119, as that appears to be what causes the slice to be bumped correctly in the match block following the if.

After staring at the code some more, I believe the proper fix may actually be to just remove the if logic alltogether and always perform ctx.switch(self.graph[id].miss()).

Sidenote: After removing the if statement, the ctx.switch function can be simplified to not perform ctx.bump and no longer return a token stream. The match logic below the if statement can then be simplified to only if enters_loop || meta.min_read == 0.

From what I can see Generator::goto will be called to generate:

the root node

the _ branch for a match statement

any other branches for a match statement

For each of these cases I would expect self.graph[id].miss() to indicate the correct backtrack position, or None which would cause ctx.switch to keep the existing backtrack.

Well, if that indeed fixes many issues, that is very great news!
Could you please:

Create a PR without the fix;
Add tests that should all fail (from previous issues) and commit/push;
Commit/push the fix (in a separate commit, so we will be able to see that now tests are passing);
Add a few comments in the code to explain (what you think) is right. The code contains hardly any comment, which makes is hard to understand, so any small comment is more than welcome!
Also, if you can already list the issue that are fixes by your fix, so we can link them to the PR and close them once merge, this would be nice.

I'd be more than happy to help you with this and review your PR(s).

Thanks!

maciejhirsz · 2024-03-29T07:17:28Z

I'd be interested to hear your thoughts on this @jeertmans (and @maciejhirsz's if available).

My thoughts are that you likely have better understanding of what's going on there than I do currently since it's been a long while. If it fixes even one bug and all tests are still passing, I say ship it.

jeertmans · 2024-03-29T08:12:54Z

@maciejhirsz I'll send @jameshurst an email, in the hope to have some better details about this fix :-)

elenakrittik · 2024-04-09T12:25:01Z

@jameshurst You're our savior! In my case with a #[token("not")] Not and a #[regex(r"(\p{XID_Start}|_)\p{XID_Continue}*")] Identifier(&'a str) your fix worked flawlessly: all logos tests pass, all tests of mine pass too, and not is properly recognized as Not and not Identifier("not")! Here's the commit so that every person stumbling upon this doesn't have to "convert" words to patches: elenakrittik@148b6a0

jameshurst · 2024-04-09T13:41:33Z

While attempting to prepare this fix for a PR I had noticed that the change actually breaks the css::test_letter_spacing test case. When I had previously run the Logos tests I hadn't realized the tests directory had actually contained its own crate rather than Cargo integration tests. It'll probably require a bit more digging to determine the cause of that breakage and figure out a fix that works for both cases.

jeertmans · 2024-04-09T14:03:14Z

@elenakrittik thanks for your commit link!

I tested locally (on #378) and it passed all tests, except one:

elenakrittik · 2024-04-09T14:37:18Z

Oh, my bad, i didn't even notice that the tests were a separate package. I only did cargo test 😅 Hopefully the issue isn't anything major (no issues so far for me, the parser i built atop logos works fine)

rockie · 2024-04-28T14:51:50Z

hi @jameshurst Thanks for your contribution! It has solved the issue that was blocking me for several days. However, I noticed that the PR still does not address the following case:

pub enum Token {   
    #[regex(r"-?0+", priority = 10)]
    A,

    #[regex(r"(0|#)+")]
    B,
}

The PR works well when the leading -? is removed. I think the root cause is Fork.merge fails to take the priority settings into consideration when merging two Leaf node. I try to fix it with some ugly code and it works fine in all my own cases. However, my fix will break colors::match_colors, crunch::crunch, priority_disambiguate_1::priority_abc as well as the css::test_letter_spacing mentioned above.

In my scenario, I assume that 'priority' should take precedence over 'longest match'. I'm curious about the design principle of this library: does 'longest match' or 'priority' prevail? If priority is supposed to prevail, I could find some time to refine the code and submit a PR.

jeertmans mentioned this issue Feb 19, 2024

Regex seems to always take priority #377

Open

jeertmans added bug Something isn't working help wanted Extra attention is needed labels Feb 19, 2024

jeertmans mentioned this issue Feb 19, 2024

fix(lib): correctly match expected tokens #378

Open

afreeland mentioned this issue Feb 19, 2024

chore(docs): Adds graph debug documentation to book #379

Merged

porkbrain mentioned this issue Mar 11, 2024

Another example of regex stumping on tokens #383

Open

jeertmans mentioned this issue Jun 10, 2024

Compose non-exclusive token with regex w/ diff priorities #397

Open

jeertmans pinned this issue Jun 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lexer produces wrong tokens when more input is provided #265

Lexer produces wrong tokens when more input is provided #265

MalteJanz commented Oct 10, 2022 •

edited

Loading

MalteJanz commented Oct 10, 2022

MalteJanz commented Feb 8, 2024

jeertmans commented Feb 10, 2024 •

edited

Loading

MalteJanz commented Feb 10, 2024

jeertmans commented Feb 10, 2024

jeertmans commented Feb 10, 2024

MalteJanz commented Feb 11, 2024

MalteJanz commented Feb 11, 2024

jameshurst commented Feb 12, 2024

jameshurst commented Feb 12, 2024

jameshurst commented Feb 12, 2024

jameshurst commented Feb 12, 2024

jeertmans commented Feb 13, 2024

maciejhirsz commented Mar 29, 2024

jeertmans commented Mar 29, 2024

elenakrittik commented Apr 9, 2024

jameshurst commented Apr 9, 2024

jeertmans commented Apr 9, 2024

elenakrittik commented Apr 9, 2024 •

edited

Loading

rockie commented Apr 28, 2024

Lexer produces wrong tokens when more input is provided #265

Lexer produces wrong tokens when more input is provided #265

Comments

MalteJanz commented Oct 10, 2022 • edited Loading

MalteJanz commented Oct 10, 2022

MalteJanz commented Feb 8, 2024

jeertmans commented Feb 10, 2024 • edited Loading

MalteJanz commented Feb 10, 2024

jeertmans commented Feb 10, 2024

jeertmans commented Feb 10, 2024

MalteJanz commented Feb 11, 2024

MalteJanz commented Feb 11, 2024

jameshurst commented Feb 12, 2024

Potential Fix?

jameshurst commented Feb 12, 2024

jameshurst commented Feb 12, 2024

jameshurst commented Feb 12, 2024

jeertmans commented Feb 13, 2024

maciejhirsz commented Mar 29, 2024

jeertmans commented Mar 29, 2024

elenakrittik commented Apr 9, 2024

jameshurst commented Apr 9, 2024

jeertmans commented Apr 9, 2024

elenakrittik commented Apr 9, 2024 • edited Loading

rockie commented Apr 28, 2024

MalteJanz commented Oct 10, 2022 •

edited

Loading

jeertmans commented Feb 10, 2024 •

edited

Loading

elenakrittik commented Apr 9, 2024 •

edited

Loading