Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for case insensitivity #198

Merged
merged 6 commits into from Feb 1, 2021
Merged

Add support for case insensitivity #198

merged 6 commits into from Feb 1, 2021

Conversation

gymore-io
Copy link
Contributor

@gymore-io gymore-io commented Jan 29, 2021

Issue: #196

Case insensitivity is opt-in using the ignore(...) attribute.

#[derive(Logos)]
struct Token {
    /* ... */
    #[token("été", ignore(case))]
    Ete,
    #[token("DRôLE", ignore(ascii_case))]
    Drole,
    #[regex("[abc]+", ignore(case))]
    Abc,
}

let mut lex = Token::lex("été ÉTÉ DrÔle Drôle AaabbCC");
assert_eq!((Some(Token::Ete), "été"), (lex.next(), lex.slice()));
assert_eq!((Some(Token::Ete), "ÉTÉ"), (lex.next(), lex.slice()));
// Then a bunch of errors because Ô is not an ascii-uppercase ô.
assert_eq!((Some(Token::Drole), "Drôle"),  (lex.next(), lex.slice()));
assert_eq!((Some(Token::Abc), "AaabbCC"),  (lex.next(), lex.slice()));
assert_eq!(None, lex.next());

For now, I prevented the user to use both ascii_case and case at the same time. This can be easily changed.


Internally, all this attribute does is converting the original literal (for tokens) or the MIR (for regular expression) into another MIR that accepts both uppercase and lowercase variants of the input.

// For example, this ...
#[token("hello", ignore(ascii_case))
Hello,

// ... is treated internally exactly like this
#[regex("[Hh][Ee][Ll][Oo]")]
Hello,

Unanswered questions

  • I created the IgnoreFlags struct to make the ignore attribute more easily expandable. It uses a basic bit field. I wasn't sure whether a new dependency was ok so I implemented myself what the bitflags crate do (in much much simpler). Do we want a new dependency to make the code cleaner or the little features I implemented are enough?
  • I am using MIR to handle case insensitivity. Is this ok or would directly creating forks be faster? If this is not acceptable/maintenable, I think it is possible to adapt the the whole thing trait to make it work. I might need a little help for that though; I found the generator module quite complicated to dig into and understand.
  • Ranges (eg. regex("[a-z]", ...)) are currently using ClassUnicode::case_fold_simple and normal literals (eg. regex("abc", ...)) are using Rust's to_lowercase and to_uppercase functions. This might lead to inconsistencies if those two methods do not convert characters the same way. Now everything uses simple folding.

PS: This is my first PR ever. If I've done anything wrong, please tell me! I'm both doing this because I needed this feature and to get reviews of my code. Feel free to say it needs some (or a lot) rework :)

* Add the `IgnoreFlags` structure
* Parsing logic of `IgnoreFlags`
* Add `ApplieIgnoreFlags` trait
* Implementation of that trait for a whole lot of structures
* Updated `lib.rs` to take the ignore flags in account
@gymore-io gymore-io marked this pull request as draft January 29, 2021 01:18
@gymore-io
Copy link
Contributor Author

gymore-io commented Jan 29, 2021

I just noticed that ParserBuilder had this neat method case_insensitive which basically does what I implemented with ApplieIgnoreFlags. But more clearly and probably more efficiently. The problem is that it does not support ascii case folding so we still have to implement that part ourselves. Plus I'm not sure how to integrate it with the current Mir system. Would add a parameter to Mir::utf8 be that much of a good idea? Or maybe another constructor like Mir::utf8_ignore_case. I'm not exactly sure.

Moreover, using that method basically makes my ApplieIgnoreFlags very complicated just to turn regular expressions into ascii-case-insensitive ones and turn token literals into ascii-case-insensitive regular expressions.

I will rewrite a good portion of that part of the system, using ParserBuilder::case_insensitive this time. That will answer part of the third unanswered question though. After that, only token literals will have a different case folding algorithm. If we turn them into regular expression with ParserBuilder, that could fix the issue. Does this sound like a good idea?

* Remove the `ApplieIgnoreFlags` trait
-> Replaced with with `MakeAsciiInsensitive` trait which serves the same purpose but only for the flag `IgnoreAsciiCase`.
* `Literal::to_mir` now takes ignore flags and parses mir following those flags.
* Add `Mir::utf8_ignore_case` and `Mir::binary_ignore_case`
* Add `Literal::escape_regex`
@gymore-io
Copy link
Contributor Author

Those three last commits removed the ApplieIgnoreCase trait to make use of the ParserBuilder::case_insensitive function. This makes the code less modular (for future ignore flags for example) but more reliable to my opinion. As that function does not support ascii-case-folding, I still had to implement this part myself with a new MakeAsciiInsensitive trait which serves the same purpose.

@gymore-io gymore-io marked this pull request as ready for review January 29, 2021 14:29
} else {
let mir = definition
.literal
.escape_regex()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?

Copy link
Contributor Author

@gymore-io gymore-io Feb 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, as I'm calling .to_mir right after, we need to ensure the literal will actually be used as a string and not some regex semantics.

For example, #[token("[Some Header]", ignore(case))] needs to be properly escaped so that it does not turn into Mir::Alternation('S' ... 'r') but Mir::Concat('[', 'S', ... , 'r', ']').

Originally, I had written a function that directly converts any literal into a case insensitive MIR but I later found out about the case_insensitive function on HIR's ParserBuilder. I thought it would be clearer to use it instead of writing it myself (and it is also less error-prone) For this reason, I used the to_mir function along with a escape_regex.

@maciejhirsz maciejhirsz merged commit 4181591 into maciejhirsz:master Feb 1, 2021
@maciejhirsz
Copy link
Owner

Published in 0.12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants