Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weird incorrect match for named group #96

Closed
rabeet opened this issue Oct 27, 2019 · 2 comments · Fixed by #111
Closed

Weird incorrect match for named group #96

rabeet opened this issue Oct 27, 2019 · 2 comments · Fixed by #111

Comments

@rabeet
Copy link

rabeet commented Oct 27, 2019

I am trying to grab two numbers followed by an optional space followed by another optional number or one optional letter. No other letter afterwards. This is the test code:

@Test
  public void test() {
    com.google.re2j.Pattern p1 = com.google.re2j.Pattern.compile("(?P<elem>\\d{2} ?(\\d|[a-z])?)(?P<end>$|[^a-zA-Z])");
    com.google.re2j.Pattern p2 = com.google.re2j.Pattern.compile("(?P<elem>\\d{2}(?U) ?(\\d|[a-z])?)(?P<end>$|[^a-zA-Z])");
    String input = "22 bored";
    com.google.re2j.Matcher m1 = p1.matcher(input);
    com.google.re2j.Matcher m2 = p2.matcher(input);
    while (m1.find()) {
      System.out.println(m1.group("elem")); // 22 b
    }
    while (m2.find()) {
      System.out.println(m2.group("elem")); // 22
    }
  }

Both should print 22, instead first one (without the non-greedy flag) prints 22 b
Works in go: https://play.golang.org/p/Q0gCyLD013V
Regex: https://regex101.com/r/lF9aG7/6

@sjamesr
Copy link
Contributor

sjamesr commented Jun 4, 2020

I suspect the problem is in here: https://github.com/google/re2j/blob/master/java/com/google/re2j/RE2.java#L283

When loading groups, RE2J only scans the portion of the input that was matched in a call to find(). This causes the b in the example above to be treated as the end of input, therefore it matches the first alternative and the $ in the second group also matches when it shouldn't.

This is just a guess, my digging continues. @adonovan, does this sound plausible? The TODO in the linked code looks like this problem was anticipated by Russ and Afroz when they were writing this.

@sjamesr
Copy link
Contributor

sjamesr commented Jun 5, 2020

The following patch fixes this particular issue, but I assume there are other patterns for which this wouldn't work.

I think the proper solution is to extend MachineInput's idea of the "end" of its input in order to account for two cases:

  • there is no more input because we have reached the end of the source data, or
  • while there is more data in the sequence, the user does not want to match on it

This would allow us to generate the correct empty-width assertions for each case.

@adonovan could this work? Is there something obvious that I'm missing here?

The Go regexp implementation on which RE2/J is based does not suffer from this issue, because in that API the user must declare in advance that they require matching groups. RE2/J's Matcher does not have this luxury, because we want to have a similar API to java.util.regex.Matcher.

diff --git a/java/com/google/re2j/Matcher.java b/java/com/google/re2j/Matcher.java
index 9efcad5..a203c23 100644
--- a/java/com/google/re2j/Matcher.java
+++ b/java/com/google/re2j/Matcher.java
@@ -251,7 +251,7 @@ public final class Matcher {
     // If we don't, they evaluate to new String[] {"ab", "a", "b", null}
     // We know it won't affect the total matched because the previous call
     // to match included the extra character, and it was not matched then.
-    int end = groups[1] + 1;
+    int end = groups[1] + 2;
     if (end > inputLength) {
       end = inputLength;
     }
diff --git a/javatests/com/google/re2j/MatcherTest.java b/javatests/com/google/re2j/MatcherTest.java
index 854fa46..560ace3 100644
--- a/javatests/com/google/re2j/MatcherTest.java
+++ b/javatests/com/google/re2j/MatcherTest.java
@@ -7,6 +7,7 @@ import static org.junit.Assert.assertFalse;
 import static org.junit.Assert.assertTrue;
 import static org.junit.Assert.fail;
 
+import com.google.common.truth.Truth;
 import org.junit.Test;
 import org.junit.runner.RunWith;
 import org.junit.runners.JUnit4;
@@ -484,4 +485,14 @@ public class MatcherTest {
       assertEquals("aaa bbb", text.substring(matcher.start(), matcher.end()));
     }
   }
+
+  @Test
+  public void testMatchingGroupIssue96() {
+    String regexp = "(\\d{2} ?(\\d|[a-z])?)($|[^a-zA-Z])";
+    String input = "22 bored";
+    Matcher m = Pattern.compile(regexp).matcher(input);
+    Truth.assertThat(m.find()).isTrue();
+    Truth.assertThat(m.group(0)).isEqualTo("22 ");
+    Truth.assertThat(m.group(1)).isEqualTo("22");
+  }
 }

sjamesr added a commit to sjamesr/re2j that referenced this issue Jun 6, 2020
This change makes `RE2.match(CharSequence input, int start, int end,
...)` treat `input` as extending from `(0, input.size()]` for the
purpose of zero-width assertions. Previously, `input` was considered to
extend from `(start, end]`.

This is required when evaluating capturing groups. RE2/J re-matches the
capturing group within a previous successful match, e.g.

```
Matcher m = Pattern.compile("(he)llo").matcher("hello world");
m.find() -> true
m.group(0) -> "hello"
m.group(1) -> "he"
```

During the evaluation of the last statement, RE2/J re-evaluates the
pattern within group(0) (i.e. "hello"). Before this change, RE2/J would
consider the position after 'o' and before 'w' to be both end-of-text
and end-of-line. This would cause zero-width matchers (e.g. $) to
incorrectly match, generating erroneous group matches in some cases.

Fixes google#96, see that issue for more
information.
sjamesr added a commit to sjamesr/re2j that referenced this issue Jun 6, 2020
This change makes `RE2.match(CharSequence input, int start, int end,
...)` treat `input` as extending from `(0, input.size()]` for the
purpose of zero-width assertions. Previously, `input` was considered to
extend from `(start, end]`.

This is required when evaluating capturing groups. RE2/J re-matches the
capturing group within a previous successful match, e.g.

```
Matcher m = Pattern.compile("(he)llo").matcher("hello world");
m.find() -> true
m.group(0) -> "hello"
m.group(1) -> "he"
```

During the evaluation of the last statement, RE2/J re-evaluates the
pattern within group(0) (i.e. "hello"). Before this change, RE2/J would
consider the position after 'o' and before 'w' to be both end-of-text
and end-of-line. This would cause zero-width matchers (e.g. $) to
incorrectly match, generating erroneous group matches in some cases.

Fixes google#96, see that issue for more
information.
sjamesr added a commit to sjamesr/re2j that referenced this issue Jun 6, 2020
This change makes `RE2.match(CharSequence input, int start, int end,
...)` treat `input` as extending from `(0, input.size()]` for the
purpose of zero-width assertions. Previously, `input` was considered to
extend from `(start, end]`.

This is required when evaluating capturing groups. RE2/J re-matches the
capturing group within a previous successful match, e.g.

```
Matcher m = Pattern.compile("(he)llo").matcher("hello world");
m.find() -> true
m.group(0) -> "hello"
m.group(1) -> "he"
```

During the evaluation of the last statement, RE2/J re-evaluates the
pattern within group(0) (i.e. "hello"). Before this change, RE2/J would
consider the position after 'o' and before 'w' to be both end-of-text
and end-of-line. This would cause zero-width matchers (e.g. $) to
incorrectly match, generating erroneous group matches in some cases.

Fixes google#96, see that issue for more
information.
sjamesr added a commit that referenced this issue Jun 9, 2020
This change makes `RE2.match(CharSequence input, int start, int end,
...)` treat `input` as extending from `(0, input.size()]` for the
purpose of zero-width assertions. Previously, `input` was considered to
extend from `(start, end]`.

This is required when evaluating capturing groups. RE2/J re-matches the
capturing group within a previous successful match, e.g.

```
Matcher m = Pattern.compile("(he)llo").matcher("hello world");
m.find() -> true
m.group(0) -> "hello"
m.group(1) -> "he"
```

During the evaluation of the last statement, RE2/J re-evaluates the
pattern within group(0) (i.e. "hello"). Before this change, RE2/J would
consider the position after 'o' and before 'w' to be both end-of-text
and end-of-line. This would cause zero-width matchers (e.g. $) to
incorrectly match, generating erroneous group matches in some cases.

Fixes #96, see that issue for more
information.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants