Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp: match additional Unicode properties #14509

Open
s4y opened this issue Feb 25, 2016 · 7 comments
Open

regexp: match additional Unicode properties #14509

s4y opened this issue Feb 25, 2016 · 7 comments

Comments

@s4y
Copy link

@s4y s4y commented Feb 25, 2016

The docs (and the RE2 docs) describe how to match Unicode properties by using \p{…} inside a character class:

[snip]

[\p{Name}]named Unicode property inside character class (≡ \p{Name})
[^\p{Name}]named Unicode property inside negated character class (≡ \P{Name})

[snip]

The current implementation doesn't seem to do it, though. The code which matches \p uses a single unicodeTable() helper function which only looks at unicode.Categories and unicode.Scripts. There probably needs to be a separate helper, or a flag passed to this one, which makes it look at properties.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 25, 2016

Can you provide a small self-contained test case showing the problem? Thanks.

@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Feb 25, 2016
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 25, 2016

CC @mpvl

@s4y

This comment has been minimized.

Copy link
Author

@s4y s4y commented Feb 25, 2016

Sure thing.

package main

import (
    "log"
    "os"
    "regexp"
)

func main() {
    matched, err := regexp.MatchString(`[\p{ASCII_Hex_Digit}]`, "A")
    if err != nil {
        log.Fatal(err)
    }
    if !matched {
        os.Exit(1)
    }
}

The above program should exit cleanly but prints an error instead:

error parsing regexp: invalid character class range: `\p{ASCII_Hex_Digit}``
@junyer

This comment has been minimized.

Copy link
Contributor

@junyer junyer commented Feb 26, 2016

FWIW, the Unicode support in RE2 is also categories and scripts only. (google/re2@a6b34ea recently added the ability to build against ICU in order to have full Unicode properties support.)

@mpvl

This comment has been minimized.

Copy link
Member

@mpvl mpvl commented Feb 26, 2016

In principle I don't see an issue with extending unicodetable to also check for references in Properties. There are a few drawbacks:

  1. it will unconditionally pull in additional data (about ~10k as a very rough first estimate)
  2. it doesn't stop there: there are many more property lists that are not in the core unicode package and would be weird not to support if the existing ones are supported (e.g. Alphabetic and Math are derived properties and are currently not defined in the unicode package).

If we want to go the same route as Google re2, this means pulling in ALL unicode binary property data. This adds up (all properties marked as binary in http://userguide.icu-project.org/strings/properties). My first estimate is that there are about 25 missing tables, roughly double what is in unicode.Properties now. A very wild guess is that supporting this would add about 30k worth of tables total. Not the worst, but still 30k.

So the possible solutions are:

  1. Included Properties and extend it to include the missing tables.
  2. Included Properties and compute the tables for derived properties other tables on the fly.
  3. implement a mechanism for users to add more tables.
  4. Don't support Properties at all.

It all depends what is considered to be an acceptable table increase. My first wild estimate is supporting all properties would add about 30k in tables to anyone using the regexp packages. Also people using unicode.Properties would be saddled up with 20k of additional (but possibly useful and welcome) data.

I think 1 is the best choice as it is the simplest and makes the sets of Properties defined in unicode a bit more comprehensive and consistent.

@rsc rsc modified the milestones: Go1.8, Go1.7 May 18, 2016
@rsc rsc changed the title regexp: Matching Unicode properties with [\p{...}] not supported even though docs say it is regexp: match additional Unicode properties May 18, 2016
@rsc rsc modified the milestones: Go1.9Early, Go1.8 Oct 18, 2016
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.9Early May 3, 2017
@alercah

This comment has been minimized.

Copy link

@alercah alercah commented May 21, 2017

At the very least, the documentation should be updated to clearly reflect the valid values for \p, regardless of the implementation decision.

A plugin approach might be an option, similar to the way that image packages work, so that you only need to pull in the data for properties you care about. For instance, say, load emoji properties in unicode/properties/emoji, and only allow them to work if you load regexp/properties/emoji in your binary.

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented May 22, 2017

Current behavior is intended (scripts and categories only) but maybe we should reconsider. Will chat with @mpvl.

@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017
@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 29, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
8 participants
You can’t perform that action at this time.