regexp/syntax: canonicalization of letter classes breaks script selection #77698

@DanielMorsing

Program:

package main

import "regexp"

var r = regexp.MustCompile(`\p{Canadian_Aboriginal}`)

func main() {}

Output:

panic: regexp: Compile(`\p{Canadian_Aboriginal}`): error parsing regexp: invalid character class range: `\p{Canadian_Aboriginal}`

goroutine 1 [running]:
regexp.MustCompile({0x4b1f10, 0x17})
	/usr/local/go-faketime/src/regexp/regexp.go:313 +0xb4
main.init()
	/tmp/sandbox2354615464/prog.go:5 +0x1f

This program ran without error on Go 1.24 but panics at startup on every later version. The implementation of #70781 canonicalizes every name inside \p{} before lookup, so Unicode scripts whose names contain underscores, such as Canadian_Aboriginal, no longer match.
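To illustrate the failure mode, here is a minimal sketch of how a canonicalized name can miss the `unicode.Scripts` table. The `canonicalName`, `directLookup`, and `canonicalScripts` helpers are hypothetical, and the canonicalization rule (lowercase, drop underscores, hyphens, and spaces) is an assumption. It is not the code in regexp/syntax, which may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// canonicalName is a hypothetical stand-in for the canonicalization step.
// Assumption: names are lowercased and '_', '-', and ' ' are dropped.
func canonicalName(name string) string {
	var b strings.Builder
	for _, r := range name {
		switch r {
		case '_', '-', ' ':
			continue
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

// directLookup reports whether query is a key of unicode.Scripts as-is.
func directLookup(query string) bool {
	_, ok := unicode.Scripts[query]
	return ok
}

// canonicalScripts indexes unicode.Scripts by canonical name, so that a
// canonicalized query can still find scripts whose names contain underscores.
func canonicalScripts() map[string]*unicode.RangeTable {
	m := make(map[string]*unicode.RangeTable, len(unicode.Scripts))
	for name, table := range unicode.Scripts {
		m[canonicalName(name)] = table
	}
	return m
}

func main() {
	query := canonicalName("Canadian_Aboriginal")

	// The canonicalized name is not a key of unicode.Scripts, because the
	// real key keeps its underscore and capitalization.
	fmt.Println("direct lookup found:", directLookup(query))

	// A table keyed by canonical names resolves the same query.
	table, ok := canonicalScripts()[query]
	fmt.Println("canonical index found:", ok)

	// U+1401 CANADIAN SYLLABICS E belongs to the script.
	if ok {
		fmt.Println("contains U+1401:", unicode.Is(table, '\u1401'))
	}
}
```

Under these assumptions, one possible fix is to canonicalize both sides of the comparison instead of only the query. Code that cannot wait for a fix can call `unicode.Is(unicode.Canadian_Aboriginal, r)` directly instead of going through `\p{...}`.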

Labels: BugReport, NeedsInvestigation