-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Weird incorrect match for named group #96
Comments
I suspect the problem is in here: https://github.com/google/re2j/blob/master/java/com/google/re2j/RE2.java#L283 When loading groups, RE2J only scans the portion of the input that was matched in a call to This is just a guess, my digging continues. @adonovan, does this sound plausible? The TODO in the linked code looks like this problem was anticipated by Russ and Afroz when they were writing this. |
The following patch fixes this particular issue, but I assume there are other patterns for which this wouldn't work. I think the proper solution is to extend
This would allow us to generate the correct empty-width assertions for each case. @adonovan could this work? Is there something obvious that I'm missing here? The Go diff --git a/java/com/google/re2j/Matcher.java b/java/com/google/re2j/Matcher.java
index 9efcad5..a203c23 100644
--- a/java/com/google/re2j/Matcher.java
+++ b/java/com/google/re2j/Matcher.java
@@ -251,7 +251,7 @@ public final class Matcher {
// If we don't, they evaluate to new String[] {"ab", "a", "b", null}
// We know it won't affect the total matched because the previous call
// to match included the extra character, and it was not matched then.
- int end = groups[1] + 1;
+ int end = groups[1] + 2;
if (end > inputLength) {
end = inputLength;
}
diff --git a/javatests/com/google/re2j/MatcherTest.java b/javatests/com/google/re2j/MatcherTest.java
index 854fa46..560ace3 100644
--- a/javatests/com/google/re2j/MatcherTest.java
+++ b/javatests/com/google/re2j/MatcherTest.java
@@ -7,6 +7,7 @@ import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
import static org.junit.Assert.fail;
+import com.google.common.truth.Truth;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
@@ -484,4 +485,14 @@ public class MatcherTest {
assertEquals("aaa bbb", text.substring(matcher.start(), matcher.end()));
}
}
+
+ @Test
+ public void testMatchingGroupIssue96() {
+ String regexp = "(\\d{2} ?(\\d|[a-z])?)($|[^a-zA-Z])";
+ String input = "22 bored";
+ Matcher m = Pattern.compile(regexp).matcher(input);
+ Truth.assertThat(m.find()).isTrue();
+ Truth.assertThat(m.group(0)).isEqualTo("22 ");
+ Truth.assertThat(m.group(1)).isEqualTo("22");
+ }
} |
This change makes `RE2.match(CharSequence input, int start, int end, ...)` treat `input` as extending from `(0, input.size()]` for the purpose of zero-width assertions. Previously, `input` was considered to extend from `(start, end]`. This is required when evaluating capturing groups. RE2/J re-matches the capturing group within a previous successful match, e.g. ``` Matcher m = Pattern.compile("(he)llo").matcher("hello world"); m.find() -> true m.group(0) -> "hello" m.group(1) -> "he" ``` During the evaluation of the last statement, RE2/J re-evaluates the pattern within group(0) (i.e. "hello"). Before this change, RE2/J would consider the position after 'o' and before 'w' to be both end-of-text and end-of-line. This would cause zero-width matchers (e.g. $) to incorrectly match, generating erroneous group matches in some cases. Fixes google#96, see that issue for more information.
This change makes `RE2.match(CharSequence input, int start, int end, ...)` treat `input` as extending from `(0, input.size()]` for the purpose of zero-width assertions. Previously, `input` was considered to extend from `(start, end]`. This is required when evaluating capturing groups. RE2/J re-matches the capturing group within a previous successful match, e.g. ``` Matcher m = Pattern.compile("(he)llo").matcher("hello world"); m.find() -> true m.group(0) -> "hello" m.group(1) -> "he" ``` During the evaluation of the last statement, RE2/J re-evaluates the pattern within group(0) (i.e. "hello"). Before this change, RE2/J would consider the position after 'o' and before 'w' to be both end-of-text and end-of-line. This would cause zero-width matchers (e.g. $) to incorrectly match, generating erroneous group matches in some cases. Fixes google#96, see that issue for more information.
This change makes `RE2.match(CharSequence input, int start, int end, ...)` treat `input` as extending from `(0, input.size()]` for the purpose of zero-width assertions. Previously, `input` was considered to extend from `(start, end]`. This is required when evaluating capturing groups. RE2/J re-matches the capturing group within a previous successful match, e.g. ``` Matcher m = Pattern.compile("(he)llo").matcher("hello world"); m.find() -> true m.group(0) -> "hello" m.group(1) -> "he" ``` During the evaluation of the last statement, RE2/J re-evaluates the pattern within group(0) (i.e. "hello"). Before this change, RE2/J would consider the position after 'o' and before 'w' to be both end-of-text and end-of-line. This would cause zero-width matchers (e.g. $) to incorrectly match, generating erroneous group matches in some cases. Fixes google#96, see that issue for more information.
This change makes `RE2.match(CharSequence input, int start, int end, ...)` treat `input` as extending from `(0, input.size()]` for the purpose of zero-width assertions. Previously, `input` was considered to extend from `(start, end]`. This is required when evaluating capturing groups. RE2/J re-matches the capturing group within a previous successful match, e.g. ``` Matcher m = Pattern.compile("(he)llo").matcher("hello world"); m.find() -> true m.group(0) -> "hello" m.group(1) -> "he" ``` During the evaluation of the last statement, RE2/J re-evaluates the pattern within group(0) (i.e. "hello"). Before this change, RE2/J would consider the position after 'o' and before 'w' to be both end-of-text and end-of-line. This would cause zero-width matchers (e.g. $) to incorrectly match, generating erroneous group matches in some cases. Fixes #96, see that issue for more information.
I am trying to grab two numbers followed by an optional space followed by another optional number or one optional letter. No other letter afterwards. This is the test code:
Both should print 22, instead first one (without the non-greedy flag) prints 22 b
Works in go: https://play.golang.org/p/Q0gCyLD013V
Regex: https://regex101.com/r/lF9aG7/6
The text was updated successfully, but these errors were encountered: