Quoted codepoint is not matched while unquoted is matched #123

akhomchenko · 2020-11-13T01:30:29Z

I am using re2j (thanks for the library!) and use randomly generated strings to test that the patterns and logic I wrote work correctly. I recently found a weird case. I am not sure whether it is a bug, but it feels like one because Go's regexp package (also RE2-based, right?) behaves differently.

Example

import com.google.re2j.Pattern;

public class Main {
    public static void main(String[] args) {
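        // Code point 110781 is U+1B0BD, which Java stores as a surrogate pair.
        // Character.toString(int) requires Java 11 or later.
        String source = Character.toString(110781);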
        System.out.println("unquoted: " + Pattern.matches(source, source));  // true
        System.out.println("quoted: " + Pattern.matches("\\Q" + source + "\\E", source));  // false
    }
}

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// 110781 is U+1B0BD, a code point outside the Basic Multilingual Plane.
	source := string([]rune{110781})
	matched, _ := regexp.MatchString(source, source)
	fmt.Printf("unquoted: %v\n", matched)  // true
	matched, _ = regexp.MatchString(`\Q` + source + `\E`, source)
	fmt.Printf("quoted: %v\n", matched)  // true
}

(I hope I did the rune-to-string conversion correctly.)

(link: https://play.golang.org/p/EPbFTmzsZm4)

sjamesr · 2021-06-02T05:38:26Z

Thank you for the report. I captured the issue in a unit test and added a potential fix.

@adonovan, if you could cast a quick glance over the fix to see if it's right, that would be great.

Previously, the parser would match each individual character within a \Q...\E section. Runes requiring a surrogate pair would be incorrectly treated as two individual characters.

E.g.

    String source = new StringBuilder().appendCodePoint(110781).toString();

Before this change:

    Parser.parse(source, ...) matches \x{1b0bd}
    Parser.parse("\\Q" + source + "\\E", ...) matches \x{d82c}\x{dcbd}

After this change:

    Parser.parse(source, ...) matches \x{1b0bd}
    Parser.parse("\\Q" + source + "\\E", ...) matches \x{1b0bd}

Fixes google#123.
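The difference between the two parse results can be reproduced with the JDK alone. What follows is a minimal sketch, not the re2j parser code: the class and method names (SurrogateDemo, byChar, byCodePoint) are made up for illustration. It only shows that walking a Java string one char at a time splits U+1B0BD into its two surrogates, while walking it by code point keeps it whole.

```java
public class SurrogateDemo {
    // Hypothetical helper: walks the string one UTF-16 code unit (char) at a time.
    static String byChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            out.append(String.format("\\x{%x}", (int) s.charAt(i)));
        }
        return out.toString();
    }

    // Hypothetical helper: walks the string one Unicode code point at a time.
    static String byCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.append(String.format("\\x{%x}", cp));
            // Advance by 2 chars for a surrogate pair, 1 otherwise.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = new StringBuilder().appendCodePoint(110781).toString();
        System.out.println("by char:       " + byChar(source));      // \x{d82c}\x{dcbd}
        System.out.println("by code point: " + byCodePoint(source)); // \x{1b0bd}
    }
}
```

The per-char output corresponds to the "before" behavior quoted above for the \Q...\E case, and the per-code-point output corresponds to the "after" behavior.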

sjamesr mentioned this issue Jun 2, 2021

Fix quoting of codepoints requiring surrogate pairs. #143

Merged

sjamesr self-assigned this Jun 2, 2021

sjamesr closed this as completed in #143 Jun 2, 2021
