Quoted codepoint is not matched while unquoted is matched #123

Closed
akhomchenko opened this issue Nov 13, 2020 · 1 comment · Fixed by #143

akhomchenko commented Nov 13, 2020

I am using re2j (thanks for the library!) and test my patterns and logic against randomly generated strings. I recently found a strange case. I am not sure whether it is a bug, but it feels like one, because Go's regexp package (also RE2-based, right?) behaves differently: a pattern consisting of a single supplementary code point matches itself, but the same code point wrapped in \Q...\E does not.

Example

import com.google.re2j.Pattern;

public class Main {
    public static void main(String[] args) {
        String source = Character.toString(110781);
        System.out.println("unquoted: " + Pattern.matches(source, source));  // true
        System.out.println("quoted: " + Pattern.matches("\\Q" + source + "\\E", source));  // false
    }
}

The same check in Go:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	source := string([]rune{110781})
	matched, _ := regexp.MatchString(source, source)
	fmt.Printf("unquoted: %v\n", matched)  // true
	matched, _ = regexp.MatchString(`\Q` + source + `\E`, source)
	fmt.Printf("quoted: %v\n", matched)  // true
}

(I hope I did that rune-to-string conversion right.)

(link: https://play.golang.org/p/EPbFTmzsZm4)
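
For context, code point 110781 is U+1B0BD, which lies outside the Basic Multilingual Plane, so a Java String stores it as two char values (a surrogate pair). The following is a minimal sketch using only JDK calls; the class and helper names are made up for illustration and are not part of re2j.

```java
import java.util.ArrayList;
import java.util.List;

class SurrogatePairDemo {
    /** Returns each UTF-16 code unit (Java char) of s as a hex string. */
    static List<String> codeUnitsHex(String s) {
        List<String> units = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            units.add(Integer.toHexString(s.charAt(i)));
        }
        return units;
    }

    public static void main(String[] args) {
        String source = new StringBuilder().appendCodePoint(110781).toString();
        // One code point, but two chars in the String.
        System.out.println("chars: " + source.length());                                  // 2
        System.out.println("code points: " + source.codePointCount(0, source.length())); // 1
        System.out.println("code units: " + codeUnitsHex(source));                       // [d82c, dcbd]
        System.out.println("code point: " + Integer.toHexString(source.codePointAt(0))); // 1b0bd
    }
}
```

Any code that walks the pattern one char at a time will therefore see two separate units for this input.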

sjamesr added a commit to sjamesr/re2j that referenced this issue Jun 2, 2021
Previously, the parser would match each individual character within a
\Q...\E section. Runes requiring a surrogate pair would be incorrectly
treated as two individual characters.

E.g.

String source = new StringBuilder().appendCodePoint(110781).toString();

Before this change:
Parser.parse(source, ...) matches \x{1b0bd}
Parser.parse("\\Q" + source + "\\E", ...) matches \x{d82c}\x{dcbd}

After this change:
Parser.parse(source, ...) matches \x{1b0bd}
Parser.parse("\\Q" + source + "\\E", ...) matches \x{1b0bd}

Fixes google#123.
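
The difference described in this commit message can be sketched with plain JDK calls. This is not the re2j parser code; the class and method names below are hypothetical and only contrast per-char iteration with per-code-point iteration over the text of a quoted section.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedLiteralSketch {
    /** Behaviour described as "before": one literal per Java char. */
    static List<Integer> literalsByChar(String quoted) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < quoted.length(); i++) {
            out.add((int) quoted.charAt(i));
        }
        return out;
    }

    /** Behaviour described as "after": one literal per Unicode code point. */
    static List<Integer> literalsByCodePoint(String quoted) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < quoted.length(); ) {
            int cp = quoted.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp); // advance by 2 for a surrogate pair
        }
        return out;
    }

    public static void main(String[] args) {
        String source = new StringBuilder().appendCodePoint(110781).toString();
        System.out.println(literalsByChar(source));      // [55340, 56509], i.e. 0xd82c, 0xdcbd
        System.out.println(literalsByCodePoint(source)); // [110781], i.e. 0x1b0bd
    }
}
```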

sjamesr commented Jun 2, 2021

Thank you for the report. I captured the issue in a unit test and added a potential fix.

@adonovan, if you could cast a quick glance over the fix to see if it's right, that would be great.

sjamesr self-assigned this Jun 2, 2021
sjamesr added a commit that referenced this issue Jun 2, 2021
Fixes #123.