chore(deps): bumping regex-syntax #320

jeertmans · 2023-06-26T11:04:26Z

The regex-syntax crate introduces breaking changes between 0.6 and 0.8.

This PR tries to bump the regex-syntax dependency, to make Logos a bit more future-proof regarding that :-)

Major changes

Since regex-syntax now seems to perform some optimizations on the HIR, the priority of some patterns is affected.

E.g.: "a|b" (priority = 2) is optimized into "[a-b]" (priority = 1, since a class used to have priority 1).

Those optimizations also result in better performance for Logos, so I suggest not working against them :-) See benchmark below.

As a result, this PR proposes to change the priority of classes, so that equivalent regex patterns have the same priority. This "rule" is tested in logos-codegen/src/mir.rs.

This is a breaking change, but the rule will hopefully never need to change again, since a regex HIR can only be optimized into an equivalent pattern.
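To make the proposed rule concrete, here is a minimal sketch of the priority computation. The `Mir` enum below is a simplified, hypothetical stand-in for the one in logos-codegen/src/mir.rs: `Class` carries no data and `Literal` only stores an assumed character count. The match arms mirror the ones touched by this PR, with a class counting as 2.

```rust
// Simplified, hypothetical stand-in for the MIR in logos-codegen/src/mir.rs.
#[allow(dead_code)]
#[derive(Debug)]
enum Mir {
    Empty,
    Loop(Box<Mir>),
    Maybe(Box<Mir>),
    Concat(Vec<Mir>),
    Alternation(Vec<Mir>),
    Class,          // any non-empty character class (assumption: contents ignored)
    Literal(usize), // a literal, reduced to its number of characters
}

impl Mir {
    fn priority(&self) -> usize {
        match self {
            // Parts that may match nothing do not add any priority.
            Mir::Empty | Mir::Loop(_) | Mir::Maybe(_) => 0,
            // A concatenation sums the priorities of its parts.
            Mir::Concat(parts) => parts.iter().map(Mir::priority).sum(),
            // An alternation is only as specific as its weakest branch.
            Mir::Alternation(alts) => alts.iter().map(Mir::priority).min().unwrap_or(0),
            // Proposed rule: a class matches exactly one character.
            Mir::Class => 2,
            // Two points per character of the literal.
            Mir::Literal(chars) => 2 * chars,
        }
    }
}

/// Checks that "a|b" (as written) and "[a-b]" (as optimized) get the same priority.
fn equivalent_patterns_agree() -> bool {
    let as_written = Mir::Alternation(vec![Mir::Literal(1), Mir::Literal(1)]);
    let as_optimized = Mir::Class;
    as_written.priority() == as_optimized.priority()
}
```

With `Mir::Class` set to 1 instead of 2, `equivalent_patterns_agree` would return false, which is exactly the mismatch this PR wants to remove.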

NOTE: this should be documented in the release notes, as well as in the book. Maybe after #319 is merged?

logos-codegen/src/mir.rs

logos-codegen/src/parser/ignore_flags.rs

logos-codegen/src/mir.rs

hellow554 · 2023-06-26T13:25:15Z

Funnily enough, I also tried to update all dependencies in the logos ecosystem (not just regex-syntax).
Now only the tests fail for me, and for two reasons.

First, literals are not just one byte/char at a time, but instead a whole string/byte slice.

Therefore the following line should be changed as follows:

logos/logos-codegen/src/mir.rs

Line 75 in e2702f6

Mir::Literal(_) => 2,

Mir::Literal(lit) => 2 * lit.0.len()

Second, regex-syntax does some crazy optimization, e.g. a|b is not an alternation, but instead a class, e.g. a..=b, and this is correct. Even a|c is not an alternation, but a Class of a..=a, c..=c. This is all correct, but it means that some of the expected values in the test cases are now "wrong".

I'm not entirely sure whether they should be dropped or relaxed, but this works:

--- a/logos-codegen/src/graph/regex.rs
+++ b/logos-codegen/src/graph/regex.rs
@@ -207,7 +206,7 @@ mod tests {

         let mir = Mir::utf8("a|b").unwrap();

-        assert_eq!(mir.priority(), 2);
+        assert!((1..=2).contains(&mir.priority()));

diff --git a/logos-codegen/src/mir.rs b/logos-codegen/src/mir.rs
index 18a6307..5f258e8 100644
--- a/logos-codegen/src/mir.rs
+++ b/logos-codegen/src/mir.rs
@@ -207,17 +204,17 @@ mod tests {
     #[test]
     fn priorities() {
         let regexes = [
-            ("[a-z]+", 1),
-            ("a|b", 2),
-            ("a|[b-z]", 1),
-            ("(foo)+", 6),
-            ("foobar", 12),
-            ("(fooz|bar)+qux", 12),
+            ("[a-z]+", (1..=1)),
+            ("a|b", (1..=2)),
+            ("a|[b-z]", (1..=1)),
+            ("(foo)+", (6..=6)),
+            ("foobar", (12..=12)),
+            ("(fooz|bar)+qux", (12..=12)),
         ];

         for (regex, expected) in regexes.iter() {
             let mir = Mir::utf8(regex).unwrap();
-            assert_eq!(mir.priority(), *expected);
+            assert!(expected.contains(&mir.priority()));
         }
     }
 }
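The alternation-to-class conversion described above can be illustrated with a small sketch. This is not regex-syntax's implementation, only a hypothetical helper (`merge_into_ranges`) showing how single-character alternatives collapse into inclusive ranges, so that a|b becomes a..=b while a|c becomes a..=a, c..=c.

```rust
/// Merges single-character alternatives into sorted, non-overlapping
/// inclusive ranges. Hypothetical helper, for illustration only.
fn merge_into_ranges(mut chars: Vec<char>) -> Vec<(char, char)> {
    chars.sort_unstable();
    chars.dedup();

    let mut ranges: Vec<(char, char)> = Vec::new();
    for c in chars {
        if let Some(last) = ranges.last_mut() {
            // Extend the current range when `c` directly follows its end.
            if last.1 as u32 + 1 == c as u32 {
                last.1 = c;
                continue;
            }
        }
        // Otherwise start a new single-character range.
        ranges.push((c, c));
    }
    ranges
}
```

In both cases the result is a class rather than an alternation, which is why the expected priority in the tests can change.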

hellow554 · 2023-06-26T18:06:28Z

logos-codegen/src/mir.rs

+            Mir::Literal(lit) => {
+                match std::str::from_utf8(&lit.0) {
+                    Ok(s) => 2 * s.len(),
+                    Err(_) => 2 * lit.0.len(),
+                }


I guess you're doing this because of unicode strings?
Is it possible to mix string and byte pattern in logos?
If so, can you provide a testcase where this is covered?

If not, I think doing the simple solution, e.g. 2 * lit.0.len() is enough here. But I don't have the last word of course ;)

Yes, I should come with a test for that :-)

But I did that because Mir::Literal can be constructed either from a char or from a u8. A single char, which should have priority 2, can be made of up to 4 bytes, so using 2 * lit.0.len() for such a char could make the priority much higher. I will try to write some tests for that :-)

Oh, I made a silly mistake: the str's length should be s.chars().count(), not s.len() :'-)

But then are we talking about grapheme clusters? I mean ẽ could either be just ẽ or e followed by a combining ~.
Please take a look at: https://stackoverflow.com/questions/58770462/how-to-iterate-over-unicode-grapheme-clusters-in-rust

We are talking about Unicode chars, see regex#54.

I don't think regex-syntax or logos supports grapheme clusters, so I don't see why we should support them now?
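The difference between the variants discussed in this thread can be checked with a short sketch. The function names are hypothetical; `literal_priority` follows the snippet under review with the `s.chars().count()` correction applied, and falls back to the byte length when the literal is not valid UTF-8.

```rust
/// Priority based on the number of bytes (the "simple solution").
fn byte_priority(bytes: &[u8]) -> usize {
    2 * bytes.len()
}

/// Priority based on the number of Unicode scalar values (chars),
/// falling back to the byte length for non-UTF-8 literals.
fn literal_priority(bytes: &[u8]) -> usize {
    match std::str::from_utf8(bytes) {
        Ok(s) => 2 * s.chars().count(),
        Err(_) => 2 * bytes.len(),
    }
}

/// Returns (byte-based, char-based) priorities for a string literal.
fn compare(lit: &str) -> (usize, usize) {
    (byte_priority(lit.as_bytes()), literal_priority(lit.as_bytes()))
}
```

For the precomposed "\u{1EBD}" (ẽ), `compare` gives (6, 2): three UTF-8 bytes but a single char. For the decomposed form "e\u{303}" it gives (6, 4), because counting chars sees two scalar values; only a grapheme-cluster count would treat both spellings alike, which is the point raised above.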

jeertmans · 2023-06-26T19:07:29Z

Oh yeah... this optimization is quite annoying, and seems to be caused by the AST to HIR translator, see https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6e993bda9d1d375ea2dd0bd3927a38a6.

I think there are three possible solutions to this:

1. revert back to regex-syntax v0.6 and never upgrade;
2. update the class' priority to 2, which is a breaking change, but satisfies the rule that "equivalent regexes have the same priority". I worked a bit on that, but a lot of tests must be updated;
3. use regex-syntax's AST, which does not seem to be optimized, instead of the HIR, or use a custom AST to HIR translator (I am not sure that the optimization can be turned off).

It is also important to check whether regex-syntax's optimizations actually affect Logos' performance :'-)

jeertmans · 2023-06-26T20:39:53Z

OK, so I did a small benchmark to compare equivalent patterns, and it seems that optimizing alternations into classes (at least into a range) can improve performance, see this example:

hellow554 · 2023-06-27T07:09:29Z

can improve performance

Yeah I thought so ;) Why should they do that, if not for performance :D

I mean, honestly: there's no reason for us to stick to the "old" behavior. We just need to make sure that priorities do not conflict with each other. I think we can do this with more tests, but IMHO we shouldn't block the upgrade just because old tests are not working.

jeertmans · 2023-06-27T08:03:39Z

can improve performance

Yeah I thought so ;) Why should they do that, if not for performance :D

True! But optimizations in the regex-syntax crate do not necessarily translate into performance gains with Logos, so testing is always good ;-)

I mean, honestly: there's no reason for us to stick to the "old" behavior. We just need to make sure that priorities do not conflict with each other. I think we can do this with more tests, but IMHO we shouldn't block the upgrade just because old tests are not working.

I agree, but this is a major change, and hopefully we should be able to produce a stable result such that "equivalent regex patterns have the same priority". Maybe it is as simple as multiplying the class priority by 2 ^^'

Anyway, I'd like to have @maciejhirsz's opinion on that too.

hellow554 · 2023-06-27T10:57:51Z

logos/src/lib.rs

@@ -231,7 +231,7 @@ pub trait Logos<'source>: Sized {
 /// enum Token<'a> {
 ///     // We will treat "abc" as if it was whitespace.
 ///     // This is identical to using `logos::skip`.
-///     #[regex(" |abc", |_| Skip)]
+///     #[regex(" |abc", |_| Skip, priority = 3)]


I'm not sure why this is necessary. Is this because the alternation gets shortened to a class?

Nope, this is because the priority of the next pattern went from 1 to 2. So we increase this pattern's priority to keep the same behaviour as before.
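A minimal sketch of why the explicit priority is needed, assuming that when two patterns can match the same input the one with the highest priority wins and that equal priorities are ambiguous. The `resolve` function and the rule names are hypothetical and are not part of Logos' API.

```rust
#[derive(Debug, PartialEq)]
enum Resolution<'a> {
    Winner(&'a str),
    Ambiguous(Vec<&'a str>),
    NoCandidates,
}

/// Picks the rule with the highest priority among rules matching the same input.
/// Hypothetical helper: `candidates` holds (rule name, priority) pairs.
fn resolve<'a>(candidates: &[(&'a str, usize)]) -> Resolution<'a> {
    let max = match candidates.iter().map(|&(_, p)| p).max() {
        Some(max) => max,
        None => return Resolution::NoCandidates,
    };
    // Collect every rule that reaches the maximum priority.
    let top: Vec<&'a str> = candidates
        .iter()
        .filter(|&&(_, p)| p == max)
        .map(|&(name, _)| name)
        .collect();
    if top.len() == 1 {
        Resolution::Winner(top[0])
    } else {
        Resolution::Ambiguous(top)
    }
}
```

Before the change, `resolve(&[("skip_rule", 2), ("next_rule", 1)])` picks `skip_rule`. Once the next pattern's priority rises to 2 the two rules tie, and raising `skip_rule` to 3 restores the previous outcome.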

hellow554 · 2023-06-27T10:59:00Z

Your commits seem to be messed up. You have duplicate commits.

Could you rebase your commits, or maybe squash them? (I don't like squashing, but it's okay to use.)

hellow554 · 2023-06-27T11:02:29Z

logos-codegen/src/mir.rs

@@ -71,8 +68,11 @@ impl Mir {
            Mir::Empty | Mir::Loop(_) | Mir::Maybe(_) => 0,
            Mir::Concat(concat) => concat.iter().map(Mir::priority).sum(),
            Mir::Alternation(alt) => alt.iter().map(Mir::priority).min().unwrap_or(0),
-            Mir::Class(_) => 1,
-            Mir::Literal(_) => 2,
+            Mir::Class(_) => 2,


Maybe we need to adjust Class to something different.
There is the iter function on ClassUnicode and those items have a len method.
Maybe that is what we want instead of a hard coded class?! I'm not really sure tbh.

I don't think it will give a different result: a non-empty class means that it must match exactly one element from the class. All "special" characters are escaped in a class, and you can only match one byte or one char → priority is 2.

hellow554 · 2023-06-27T11:03:54Z

tests/tests/ignore_case.rs

Yeah, I'm not really sure why you need to adjust all the priorities.
Can you explain that please?

Because bumping Mir::Class's priority from 1 to 2 broke a lot of tests and regexes, and we now need to specify the priority explicitly in such cases.

jeertmans marked this pull request as draft June 26, 2023 11:04

jeertmans changed the title ~~chore(deps):~~ chore(deps): bumping regex-syntax Jun 26, 2023