x/text: Support UnicodeSet as per UTR35 #22920
Feature request: Support the UnicodeSet syntax as defined in Unicode Technical Report 35. This would be needed to implement CLDR transliteration rules, which use UnicodeSets for filtering and matching; to support CLDR exemplar characters, which are also defined in terms of UTR35 UnicodeSets; and for other Unicode specifications, such as UTR39 Unicode Security Mechanisms, that make use of UnicodeSets.
See Unicode’s list-unicodeset tool for an online demo (and its documentation); and the ICU documentation for the ICU API to UnicodeSets. For reference, you might want to have a look at the C++ implementation and the Java implementation inside the ICU sources.
Not sure if this could be implemented by rewriting the string syntax to Go regular expressions, or if this would need more work.
I'm pretty sure Go's RE2-style regexp package is incompatible with the exact definition of UnicodeSets in TR35. It is pretty close, though.
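To make the incompatibility concrete, here is a small sketch of one case where the two syntaxes diverge: set difference. The UnicodeSet `[[a-z]-[aeiou]]` means "lowercase ASCII letters minus vowels", but Go's regexp has no nested classes or set operators, so the same text is accepted and read as a sequence of four items. The names `naivePattern` and `matchesNaive` are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// naivePattern is the UnicodeSet "[[a-z]-[aeiou]]" handed to Go's regexp
// unchanged, anchored so that only full matches count.
const naivePattern = `^[[a-z]-[aeiou]]$`

// matchesNaive reports whether s fully matches the unchanged pattern.
func matchesNaive(s string) bool {
	return regexp.MustCompile(naivePattern).MatchString(s)
}

func main() {
	// "b" is a member of the UnicodeSet, yet the regexp rejects it.
	fmt.Println(matchesNaive("b")) // false
	// Go reads the pattern as: class {'[', a-z}, literal '-', class
	// {a,e,i,o,u}, literal ']', so this four-character string matches.
	fmt.Println(matchesNaive("b-a]")) // true
}
```

The pattern compiles without error, so a plain pass-through would fail silently rather than loudly for sets like this one.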
Enter package regexp/syntax. It exposes the parsing and compilation internals of the regexp package. It probably won't be too much effort to write an alternative regexp parser and bolt it onto the existing engine.
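As a rough sketch of what that bolting-on could look like, the code below resolves a single set difference into plain rune ranges, wraps them in a `syntax.Regexp` character-class node, and hands the result to the existing engine. The helpers `classRanges`, `subtract`, and `compileSet` are hypothetical names, not part of any existing package, and only difference is handled; a real UnicodeSet parser would also need intersection, properties, string members, and so on. As far as I can tell, the regexp package has no exported way to run a hand-built `syntax.Regexp`, so the sketch goes through `String()` and `regexp.Compile`.

```go
package main

import (
	"fmt"
	"regexp"
	"regexp/syntax"
)

// classRanges parses a simple bracket expression with the stock parser and
// returns its rune ranges as a flat list of inclusive [lo, hi] pairs.
func classRanges(expr string) ([]rune, error) {
	re, err := syntax.Parse(expr, syntax.Perl)
	if err != nil {
		return nil, err
	}
	if re.Op != syntax.OpCharClass {
		return nil, fmt.Errorf("%q did not parse as a character class", expr)
	}
	return re.Rune, nil
}

// subtract returns the ranges of a minus b. Both inputs must be sorted,
// non-overlapping [lo, hi] pairs, which is what the parser produces.
func subtract(a, b []rune) []rune {
	var out []rune
	for i := 0; i+1 < len(a); i += 2 {
		lo, hi := a[i], a[i+1]
		for j := 0; j+1 < len(b) && lo <= hi; j += 2 {
			blo, bhi := b[j], b[j+1]
			if bhi < lo || blo > hi {
				continue // no overlap with what is left of this range
			}
			if blo > lo {
				out = append(out, lo, blo-1) // keep the part before the hole
			}
			lo = bhi + 1 // resume after the hole
		}
		if lo <= hi {
			out = append(out, lo, hi)
		}
	}
	return out
}

// compileSet builds a character-class node from the ranges, renders it back
// to regexp syntax, and compiles it anchored for a full match.
func compileSet(ranges []rune) (*regexp.Regexp, error) {
	class := &syntax.Regexp{Op: syntax.OpCharClass, Rune: ranges}
	return regexp.Compile("^" + class.String() + "$")
}

func main() {
	letters, err := classRanges(`[a-z]`)
	if err != nil {
		panic(err)
	}
	vowels, err := classRanges(`[aeiou]`)
	if err != nil {
		panic(err)
	}
	re, err := compileSet(subtract(letters, vowels))
	if err != nil {
		panic(err)
	}
	fmt.Println(re.MatchString("b"), re.MatchString("a"))
}
```

Resolving the set algebra up front keeps the engine untouched: whatever the UnicodeSet expression looked like, the engine only ever sees an ordinary character class.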