Add support for case insensitivity #198

gymore-io · 2021-01-29T00:06:09Z

Issue: #196

Case insensitivity is opt-in using the ignore(...) attribute.

#[derive(Logos)]
struct Token {
    /* ... */
    #[token("été", ignore(case))]
    Ete,
    #[token("DRôLE", ignore(ascii_case))]
    Drole,
    #[regex("[abc]+", ignore(case))]
    Abc,
}

let mut lex = Token::lex("été ÉTÉ DrÔle Drôle AaabbCC");
assert_eq!((Some(Token::Ete), "été"), (lex.next(), lex.slice()));
assert_eq!((Some(Token::Ete), "ÉTÉ"), (lex.next(), lex.slice()));
// Then a bunch of errors because Ô is not an ascii-uppercase ô.
assert_eq!((Some(Token::Drole), "Drôle"),  (lex.next(), lex.slice()));
assert_eq!((Some(Token::Abc), "AaabbCC"),  (lex.next(), lex.slice()));
assert_eq!(None, lex.next());

For now, I prevented the user to use both ascii_case and case at the same time. This can be easily changed.

Internally, all this attribute does is converting the original literal (for tokens) or the MIR (for regular expression) into another MIR that accepts both uppercase and lowercase variants of the input.

// For example, this ...
#[token("hello", ignore(ascii_case))
Hello,

// ... is treated internally exactly like this
#[regex("[Hh][Ee][Ll][Oo]")]
Hello,

Unanswered questions

I created the IgnoreFlags struct to make the ignore attribute more easily expandable. It uses a basic bit field. I wasn't sure whether a new dependency was ok so I implemented myself what the bitflags crate do (in much much simpler). Do we want a new dependency to make the code cleaner or the little features I implemented are enough?
I am using MIR to handle case insensitivity. Is this ok or would directly creating forks be faster? If this is not acceptable/maintenable, I think it is possible to adapt the the whole thing trait to make it work. I might need a little help for that though; I found the generator module quite complicated to dig into and understand.
Ranges (eg. regex("[a-z]", ...)) are currently using ClassUnicode::case_fold_simple and normal literals (eg. regex("abc", ...)) are using Rust's to_lowercase and to_uppercase functions. This might lead to inconsistencies if those two methods do not convert characters the same way. Now everything uses simple folding.

PS: This is my first PR ever. If I've done anything wrong, please tell me! I'm both doing this because I needed this feature and to get reviews of my code. Feel free to say it needs some (or a lot) rework :)

* Add the `IgnoreFlags` structure * Parsing logic of `IgnoreFlags`

* Add `ApplieIgnoreFlags` trait * Implementation of that trait for a whole lot of structures * Updated `lib.rs` to take the ignore flags in account

gymore-io · 2021-01-29T03:49:24Z

I just noticed that ParserBuilder had this neat method case_insensitive which basically does what I implemented with ApplieIgnoreFlags. But more clearly and probably more efficiently. The problem is that it does not support ascii case folding so we still have to implement that part ourselves. Plus I'm not sure how to integrate it with the current Mir system. Would add a parameter to Mir::utf8 be that much of a good idea? Or maybe another constructor like Mir::utf8_ignore_case. I'm not exactly sure.

Moreover, using that method basically makes my ApplieIgnoreFlags very complicated just to turn regular expressions into ascii-case-insensitive ones and turn token literals into ascii-case-insensitive regular expressions.

I will rewrite a good portion of that part of the system, using ParserBuilder::case_insensitive this time. That will answer part of the third unanswered question though. After that, only token literals will have a different case folding algorithm. If we turn them into regular expression with ParserBuilder, that could fix the issue. Does this sound like a good idea?

* Remove the `ApplieIgnoreFlags` trait -> Replaced with with `MakeAsciiInsensitive` trait which serves the same purpose but only for the flag `IgnoreAsciiCase`.

* `Literal::to_mir` now takes ignore flags and parses mir following those flags. * Add `Mir::utf8_ignore_case` and `Mir::binary_ignore_case` * Add `Literal::escape_regex`

gymore-io · 2021-01-29T13:42:07Z

Those three last commits removed the ApplieIgnoreCase trait to make use of the ParserBuilder::case_insensitive function. This makes the code less modular (for future ignore flags for example) but more reliable to my opinion. As that function does not support ascii-case-folding, I still had to implement this part myself with a new MakeAsciiInsensitive trait which serves the same purpose.

maciejhirsz · 2021-02-01T08:55:22Z

logos-derive/src/lib.rs

+                    } else {
+                        let mir = definition
+                            .literal
+                            .escape_regex()


Is this necessary?

Well, as I'm calling .to_mir right after, we need to ensure the literal will actually be used as a string and not some regex semantics.

For example, #[token("[Some Header]", ignore(case))] needs to be properly escaped so that it does not turn into Mir::Alternation('S' ... 'r') but Mir::Concat('[', 'S', ... , 'r', ']').

Originally, I had written a function that directly converts any literal into a case insensitive MIR but I later found out about the case_insensitive function on HIR's ParserBuilder. I thought it would be clearer to use it instead of writing it myself (and it is also less error-prone) For this reason, I used the to_mir function along with a escape_regex.

maciejhirsz · 2021-02-01T19:55:44Z

Published in 0.12

gymore-io added 3 commits January 29, 2021 00:20

Create the IgnoreFlags structure

fe412d6

* Add the `IgnoreFlags` structure * Parsing logic of `IgnoreFlags`

Add ApplieIgnoreFlags trait

04effe1

* Add `ApplieIgnoreFlags` trait * Implementation of that trait for a whole lot of structures * Updated `lib.rs` to take the ignore flags in account

Add tests for ignore(case/ascii_case)

d75a960

gymore-io marked this pull request as draft January 29, 2021 01:18

gymore-io added 3 commits January 29, 2021 14:25

Remove the ApplieIgnoreFlags trait

af7fd8c

* Remove the `ApplieIgnoreFlags` trait -> Replaced with with `MakeAsciiInsensitive` trait which serves the same purpose but only for the flag `IgnoreAsciiCase`.

Add ignore flags support for to_mir

3fa0ef4

* `Literal::to_mir` now takes ignore flags and parses mir following those flags. * Add `Mir::utf8_ignore_case` and `Mir::binary_ignore_case` * Add `Literal::escape_regex`

Add a simple test for escaped regex

0a9b10b

gymore-io marked this pull request as ready for review January 29, 2021 14:29

maciejhirsz reviewed Feb 1, 2021

View reviewed changes

maciejhirsz mentioned this pull request Feb 1, 2021

logos-derive: fix RUSTSEC-2020-0122 by upgrading beef: 0.4.1 -> 0.5.0 #197

Merged

maciejhirsz approved these changes Feb 1, 2021

View reviewed changes

maciejhirsz merged commit 4181591 into maciejhirsz:master Feb 1, 2021

gymore-io mentioned this pull request Feb 2, 2021

Feature request: allowing ignoring case in tokens #196

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for case insensitivity #198

Add support for case insensitivity #198

gymore-io commented Jan 29, 2021 •

edited

gymore-io commented Jan 29, 2021 •

edited

gymore-io commented Jan 29, 2021

maciejhirsz Feb 1, 2021

gymore-io Feb 1, 2021 •

edited

maciejhirsz commented Feb 1, 2021

Add support for case insensitivity #198

Add support for case insensitivity #198

Conversation

gymore-io commented Jan 29, 2021 • edited

Unanswered questions

gymore-io commented Jan 29, 2021 • edited

gymore-io commented Jan 29, 2021

maciejhirsz Feb 1, 2021

Choose a reason for hiding this comment

gymore-io Feb 1, 2021 • edited

Choose a reason for hiding this comment

maciejhirsz commented Feb 1, 2021

gymore-io commented Jan 29, 2021 •

edited

gymore-io commented Jan 29, 2021 •

edited

gymore-io Feb 1, 2021 •

edited