[[:punct:]] and \p{Punct} #42

Closed
k-takata opened this issue Aug 9, 2014 · 4 comments

@k-takata
Owner

k-takata commented Aug 9, 2014

Perl's document (perlrecharclass) says that:

\p{PosixPunct} and [[:punct:]] in the ASCII range match all non-controls, non-alphanumeric, non-space characters: [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]

The similarly named property, \p{Punct}, matches a somewhat different set in the ASCII range, namely [-!"#%&'()*,./:;?@[\\\]_{}]. That is, it is missing the nine characters [$+<=>^`|~].

In current Onigmo, [[:punct:]] and \p{Punct} are the same in the ASCII range, and their behavior depends on the encoding.
If the encoding is a Unicode encoding, [[:punct:]] and \p{Punct} don't match the nine characters.
If the encoding is not a Unicode encoding, [[:punct:]] and \p{Punct} match the nine characters.
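For reference, the nine characters from the perlrecharclass quote can be derived from Unicode general categories. This is only a sketch in Python using the standard `string` and `unicodedata` modules; it does not call Onigmo, and it assumes that \p{Punct} corresponds to the general categories starting with "P":

```python
import string
import unicodedata

# POSIX punctuation in the ASCII range (string.punctuation holds these 32 characters).
posix_punct = set(string.punctuation)

# Assumption: \p{Punct} means "Unicode general category P*" (Pc, Pd, Ps, Pe, Po, ...).
unicode_punct = {ch for ch in posix_punct
                 if unicodedata.category(ch).startswith("P")}

# What is left over are ASCII symbols (categories Sc, Sk, Sm).
symbols_only = posix_punct - unicode_punct

print("".join(sorted(symbols_only)))  # prints $+<=>^`|~
```

The leftover set is exactly the nine characters that a Unicode-category-based definition of punctuation leaves out.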

Is it OK?

k-takata added the spec label Aug 9, 2014
@tom-lord
Contributor

I think this is wrong; both Unicode and non-Unicode should match the nine characters.

http://search.cpan.org/~shay/perl-5.20.2/pod/perlreref.pod

I believe the difference should actually be that under a Unicode encoding, [[:punct:]] should additionally match non-ASCII punctuation. The "symbols" ($+<=>^|~`) should always be matched.

k-takata added a commit that referenced this issue Oct 15, 2016
Now /(?u)[[:punct:]]/ and /\p{XPosixPunct}/ have the same meaning when
Unicode encodings are used. On the other hand, /\p{Punct}/ is not
changed.

    /(?u)[[:punct:]]/ == /\p{XPosixPunct}/ == /[\p{Punct}$+<=>^`|~]/

\p{XPosixPunct} can be used only with Unicode encodings. For other
encodings, /[[:punct:]]/ is the same as /\p{Punct}/. They both
include the nine characters: "$+<=>^`|~".
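
The equivalence in the commit message can be modeled outside Onigmo. The sketch below is Python and only approximates the formula /[\p{Punct}$+<=>^`|~]/. The helper names `is_punct` and `is_xposix_punct` are made up for illustration, and \p{Punct} is assumed to mean Unicode general category P*:

```python
import unicodedata

# The nine ASCII symbols listed in the commit message.
ASCII_SYMBOLS = set("$+<=>^`|~")

def is_punct(ch: str) -> bool:
    # Hypothetical model of \p{Punct} on a Unicode encoding: category P*.
    return unicodedata.category(ch).startswith("P")

def is_xposix_punct(ch: str) -> bool:
    # Hypothetical model of \p{XPosixPunct}: \p{Punct} plus the nine symbols.
    return is_punct(ch) or ch in ASCII_SYMBOLS

for ch in ["!", "$", "\u3001", "\u20ac", "a"]:
    print(repr(ch), is_punct(ch), is_xposix_punct(ch))
```

In this model "$" is matched only by the XPosixPunct predicate, the ideographic comma U+3001 is matched by both, and the euro sign U+20AC is matched by neither, because only the ASCII symbols are added.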
@k-takata
Owner Author

k-takata commented Oct 19, 2016

I have decided to change the behavior of [[:punct:]] on Unicode encodings, and have already committed the change to the devel-6.0 branch.
Now [[:punct:]] matches the nine characters $+<=>^`|~ on all encodings.
The new property \p{XPosixPunct} can be used on Unicode encodings. It is the same as (?u)[[:punct:]].
However, \p{Punct} still behaves differently on Unicode and non-Unicode encodings: it matches the nine characters on non-Unicode encodings and doesn't match them on Unicode encodings.
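
As a quick summary, the behavior described above can be written as a small lookup table. This is a Python sketch that only restates the comment and the commit message; the name `MATCHES_NINE` and the encoding labels are illustrative, and nothing here queries Onigmo itself:

```python
# (construct, encoding kind) -> does it match the nine characters $+<=>^`|~ ?
# None means the construct is not available for that kind of encoding.
MATCHES_NINE = {
    ("[[:punct:]]", "unicode"): True,
    ("[[:punct:]]", "non-unicode"): True,
    (r"\p{XPosixPunct}", "unicode"): True,
    (r"\p{XPosixPunct}", "non-unicode"): None,
    (r"\p{Punct}", "unicode"): False,
    (r"\p{Punct}", "non-unicode"): True,
}

for (construct, encoding), result in MATCHES_NINE.items():
    print(f"{construct:18} {encoding:12} {result}")
```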

@k-takata
Owner Author

Closing.

@k-takata
Owner Author

k-takata commented Dec 1, 2016
