Weird incorrect match for named group #96

rabeet · 2019-10-27T07:52:38Z

I am trying to grab two numbers followed by an optional space followed by another optional number or one optional letter. No other letter afterwards. This is the test code:

@Test
  public void test() {
    com.google.re2j.Pattern p1 = com.google.re2j.Pattern.compile("(?P<elem>\\d{2} ?(\\d|[a-z])?)(?P<end>$|[^a-zA-Z])");
    com.google.re2j.Pattern p2 = com.google.re2j.Pattern.compile("(?P<elem>\\d{2}(?U) ?(\\d|[a-z])?)(?P<end>$|[^a-zA-Z])");
    String input = "22 bored";
    com.google.re2j.Matcher m1 = p1.matcher(input);
    com.google.re2j.Matcher m2 = p2.matcher(input);
    while (m1.find()) {
      System.out.println(m1.group("elem")); // 22 b
    }
    while (m2.find()) {
      System.out.println(m2.group("elem")); // 22
    }
  }

Both should print 22, instead first one (without the non-greedy flag) prints 22 b
Works in go: https://play.golang.org/p/Q0gCyLD013V
Regex: https://regex101.com/r/lF9aG7/6

sjamesr · 2020-06-04T23:02:42Z

I suspect the problem is in here: https://github.com/google/re2j/blob/master/java/com/google/re2j/RE2.java#L283

When loading groups, RE2J only scans the portion of the input that was matched in a call to find(). This causes the b in the example above to be treated as the end of input, therefore it matches the first alternative and the $ in the second group also matches when it shouldn't.

This is just a guess, my digging continues. @adonovan, does this sound plausible? The TODO in the linked code looks like this problem was anticipated by Russ and Afroz when they were writing this.

sjamesr · 2020-06-05T06:14:31Z

The following patch fixes this particular issue, but I assume there are other patterns for which this wouldn't work.

I think the proper solution is to extend MachineInput's idea of the "end" of its input in order to account for two cases:

there is no more input because we have reached the end of the source data, or
while there is more data in the sequence, the user does not want to match on it

This would allow us to generate the correct empty-width assertions for each case.

@adonovan could this work? Is there something obvious that I'm missing here?

The Go regexp implementation on which RE2/J is based does not suffer from this issue, because in that API the user must declare in advance that they require matching groups. RE2/J's Matcher does not have this luxury, because we want to have a similar API to java.util.regex.Matcher.

diff --git a/java/com/google/re2j/Matcher.java b/java/com/google/re2j/Matcher.java
index 9efcad5..a203c23 100644
--- a/java/com/google/re2j/Matcher.java
+++ b/java/com/google/re2j/Matcher.java
@@ -251,7 +251,7 @@ public final class Matcher {
     // If we don't, they evaluate to new String[] {"ab", "a", "b", null}
     // We know it won't affect the total matched because the previous call
     // to match included the extra character, and it was not matched then.
-    int end = groups[1] + 1;
+    int end = groups[1] + 2;
     if (end > inputLength) {
       end = inputLength;
     }
diff --git a/javatests/com/google/re2j/MatcherTest.java b/javatests/com/google/re2j/MatcherTest.java
index 854fa46..560ace3 100644
--- a/javatests/com/google/re2j/MatcherTest.java
+++ b/javatests/com/google/re2j/MatcherTest.java
@@ -7,6 +7,7 @@ import static org.junit.Assert.assertFalse;
 import static org.junit.Assert.assertTrue;
 import static org.junit.Assert.fail;
 
+import com.google.common.truth.Truth;
 import org.junit.Test;
 import org.junit.runner.RunWith;
 import org.junit.runners.JUnit4;
@@ -484,4 +485,14 @@ public class MatcherTest {
       assertEquals("aaa bbb", text.substring(matcher.start(), matcher.end()));
     }
   }
+
+  @Test
+  public void testMatchingGroupIssue96() {
+    String regexp = "(\\d{2} ?(\\d|[a-z])?)($|[^a-zA-Z])";
+    String input = "22 bored";
+    Matcher m = Pattern.compile(regexp).matcher(input);
+    Truth.assertThat(m.find()).isTrue();
+    Truth.assertThat(m.group(0)).isEqualTo("22 ");
+    Truth.assertThat(m.group(1)).isEqualTo("22");
+  }
 }

This change makes `RE2.match(CharSequence input, int start, int end, ...)` treat `input` as extending from `(0, input.size()]` for the purpose of zero-width assertions. Previously, `input` was considered to extend from `(start, end]`. This is required when evaluating capturing groups. RE2/J re-matches the capturing group within a previous successful match, e.g. ``` Matcher m = Pattern.compile("(he)llo").matcher("hello world"); m.find() -> true m.group(0) -> "hello" m.group(1) -> "he" ``` During the evaluation of the last statement, RE2/J re-evaluates the pattern within group(0) (i.e. "hello"). Before this change, RE2/J would consider the position after 'o' and before 'w' to be both end-of-text and end-of-line. This would cause zero-width matchers (e.g. $) to incorrectly match, generating erroneous group matches in some cases. Fixes google#96, see that issue for more information.

This change makes `RE2.match(CharSequence input, int start, int end, ...)` treat `input` as extending from `(0, input.size()]` for the purpose of zero-width assertions. Previously, `input` was considered to extend from `(start, end]`. This is required when evaluating capturing groups. RE2/J re-matches the capturing group within a previous successful match, e.g. ``` Matcher m = Pattern.compile("(he)llo").matcher("hello world"); m.find() -> true m.group(0) -> "hello" m.group(1) -> "he" ``` During the evaluation of the last statement, RE2/J re-evaluates the pattern within group(0) (i.e. "hello"). Before this change, RE2/J would consider the position after 'o' and before 'w' to be both end-of-text and end-of-line. This would cause zero-width matchers (e.g. $) to incorrectly match, generating erroneous group matches in some cases. Fixes #96, see that issue for more information.

sjamesr mentioned this issue Jun 6, 2020

Fix zero-width assertions in capture group eval #111

Merged

sjamesr closed this as completed in #111 Jun 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Weird incorrect match for named group #96

Weird incorrect match for named group #96

rabeet commented Oct 27, 2019 •

edited

Loading

sjamesr commented Jun 4, 2020

sjamesr commented Jun 5, 2020 •

edited

Loading

Weird incorrect match for named group #96

Weird incorrect match for named group #96

Comments

rabeet commented Oct 27, 2019 • edited Loading

sjamesr commented Jun 4, 2020

sjamesr commented Jun 5, 2020 • edited Loading

rabeet commented Oct 27, 2019 •

edited

Loading

sjamesr commented Jun 5, 2020 •

edited

Loading