
[[:punct:]] isn't matching all expected symbols #268

Closed
david-waterworth opened this issue Aug 26, 2022 · 6 comments
Comments

@david-waterworth

It appears that the [[:punct:]] class does not match the following characters: $+<>=^`|~
This differs from the behaviour of other regex libraries I've used (including the online regex101 tool).

According to https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions, :punct: should match "all graphic characters except letters and digits".

I'm using the Rust onig bindings; the code below produces no matches with Oniguruma but works with the alternative regex crate. Are you using a different definition of the POSIX character classes? If so, I cannot find a reference to it in the documentation.
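
(The original snippet isn't preserved in this thread; the following is a minimal sketch of the comparison described above, assuming the onig and regex crates.)

fn main() {
    let text = "$+<>=^`|~";

    // onig (default UTF-8 encoding): reports no match for these characters.
    let onig_re = onig::Regex::new("[[:punct:]]+").unwrap();
    println!("onig:  {:?}", onig_re.find(text));

    // regex crate: [[:punct:]] is an ASCII class and matches all of them.
    let regex_re = regex::Regex::new("[[:punct:]]+").unwrap();
    println!("regex: {:?}", regex_re.find(text).map(|m| m.as_str()));
}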

@iwillspeak
Contributor

I've managed to reproduce this. It appears to be down to the encoding used. The onig crate defaults to UTF-8 because Rust strings are UTF-8 encoded. If the search is performed with the ASCII encoding instead, those characters are considered POSIX :punct::

use onig::*;

fn main() {
    {
        // Default UTF-8 encoding: [[:punct:]] does not match these symbol characters.
        let regex = Regex::with_options(
            "[[:punct:]]+",
            RegexOptions::REGEX_OPTION_NONE,
            Syntax::posix_extended(),
        )
        .unwrap();

        match regex.find("$+<>=^`|~") {
            Some(mat) => println!("{:?}", &mat),
            None => println!("NO match"),
        }
    }

    {
        // Explicit ASCII encoding: the same pattern does match them.
        let regex = Regex::with_options_and_encoding(
            EncodedBytes::ascii(b"[[:punct:]]+"),
            RegexOptions::REGEX_OPTION_NONE,
            Syntax::posix_extended(),
        )
        .unwrap();

        match regex.find_with_encoding(EncodedBytes::ascii(b"$+<>=^`|~")) {
            Some(mat) => println!("{:?}", &mat),
            None => println!("NO match"),
        }
    }
}

In this example, find* boils down to a call to onig_search_with_param with an empty param and SEARCH_OPTION_NONE:

https://github.com/rust-onig/rust-onig/blob/36dd97a8db49c8f089449f0605da627fcfc6d74b/onig/src/lib.rs#L782-L784

https://github.com/rust-onig/rust-onig/blob/36dd97a8db49c8f089449f0605da627fcfc6d74b/onig/src/lib.rs#L723-L735

I'm not sure if this is an issue with the way that onig uses encodings, or with the way that the UTF8 encoding treats those characters.

@iwillspeak
Contributor

I can reproduce this without onig just by setting the encoding in samples/simple.c to ONIG_ENCODING_UTF8 and using the same pattern and search text.

@kkos
Owner

kkos commented Aug 27, 2022

The current punct data can be found in CR_Punct in oniguruma/src/unicode_property_data.c.
It is generated from the Unicode data at
https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt
I will now see if I can change the value of [[:punct:]] as you suggest.

@iwillspeak
Contributor

I've been doing some digging on Unicode Explorer. It seems that this is because those characters aren't punctuation as far as Unicode is concerned. E.g. Unicode Explorer indicates that ~ (tilde) is a Symbol, not a punctuation character:

https://unicode-explorer.com/c/007E

For comparison, here's full stop, which is matched by [:punct:] in UTF-8:

https://unicode-explorer.com/c/002E

I guess the question here is should Oniguruma consider Symbol characters as [:punct:]?
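
This can also be checked from the Rust bindings; a small sketch, assuming the onig crate's default Unicode-aware syntax:

use onig::Regex;

fn main() {
    let punct = Regex::new(r"\p{P}").unwrap();  // Unicode Punctuation
    let symbol = Regex::new(r"\p{S}").unwrap(); // Unicode Symbol

    // Tilde is classified as a Symbol, not Punctuation, in the Unicode tables.
    println!("\\p{{P}} matches '~': {}", punct.find("~").is_some());  // false
    println!("\\p{{S}} matches '~': {}", symbol.find("~").is_some()); // true
}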

@david-waterworth
Author

david-waterworth commented Aug 27, 2022

So it looks to me like it depends on whether you define punctuation as "Standard" or "POSIX Compatible", using the table from Annex C of UTS #18:

The following table shows recommended assignments for compatibility property names, for use in Regular Expressions. The standard recommendation is shown in the column labeled "Standard"; applications should use this definition wherever possible. If populated with a different value, the column labeled "POSIX Compatible" shows modifications to the standard recommendation required to meet the formal requirements of [POSIX], and also to maintain (as much as possible) compatibility with the POSIX usage in practice. That modification involves some compromises, because POSIX does not have as fine-grained a set of character properties as in the Unicode Standard, and also has some additional constraints. So, for example, POSIX does not allow more than 20 characters to be categorized as digits, whereas there are many more than 20 digit characters in Unicode.

[image: excerpt of the UTS #18 Annex C compatibility-property table, showing the "Standard" and "POSIX Compatible" definitions for punct]

(I've only copied part of the table)

As you mention, the difference is that Unicode defines a separate Symbol category, whereas POSIX doesn't distinguish between punctuation and symbols.

So which is correct I couldn't say - UTS #18 does say to prefer the Standard definition, but most engines I've used seem to follow the POSIX one. I suspect that's historical, and it tends to work less well for non-English text.

In my case I think using \p{P} and \p{S} rather than [[:punct:]] is the better option, as it's clearer what it contains.
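
A minimal sketch of that approach, assuming the onig crate and its default Unicode-aware syntax:

use onig::Regex;

fn main() {
    // Match Unicode punctuation (\p{P}) and symbols (\p{S}) explicitly,
    // rather than relying on how [[:punct:]] is defined for the encoding.
    let re = Regex::new(r"[\p{P}\p{S}]+").unwrap();

    match re.find("$+<>=^`|~") {
        Some(range) => println!("matched byte range {:?}", range),
        None => println!("no match"),
    }
}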

@kkos
Owner

kkos commented Aug 28, 2022

Changed the definition of [:punct:] in Unicode encodings from \p{P} to \p{PosixPunct} = \p{P} + \p{S}.
(PosixPunct is a new addition.)
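
With a build of Oniguruma that includes this change, the pattern from the original report should now match in UTF-8 mode; a quick sketch, assuming the onig crate is linked against the updated library:

use onig::Regex;

fn main() {
    // Under the new definition, [[:punct:]] in Unicode encodings covers
    // \p{P} plus \p{S}, so these symbol characters are matched.
    let re = Regex::new("[[:punct:]]+").unwrap();
    assert!(re.find("$+<>=^`|~").is_some());
    println!("matched with the updated [[:punct:]] definition");
}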

@kkos kkos closed this as completed in a6faca7 Aug 28, 2022