
[[:punct:]] isn't matching all expected symbols #268

Closed
david-waterworth opened this issue Aug 26, 2022 · 6 comments
Comments

@david-waterworth

It appears that the [[:punct:]] class does not match the following characters: $+<>=^`|~
This differs from the behaviour of other regex libraries I've used (including the online regex101 tool).

According to https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions, :punct: should match "all graphic characters except letters and digits".

I'm using the Rust onig bindings; the code below produces no matches with Oniguruma but works with the alternative regex crate. Are you using a different definition of the POSIX character classes? If so, I cannot find a reference to it in the documentation.
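
(The original snippet isn't preserved in this thread; the following is a minimal sketch of the comparison described above, assuming the onig and regex crates.)

fn main() {
    let text = "$+<>=^`|~";

    // onig (default UTF-8 encoding): reports no match for these characters.
    let onig_re = onig::Regex::new("[[:punct:]]+").unwrap();
    println!("onig:  {:?}", onig_re.find(text));

    // regex crate: [[:punct:]] is an ASCII class and matches all of them.
    let regex_re = regex::Regex::new("[[:punct:]]+").unwrap();
    println!("regex: {:?}", regex_re.find(text).map(|m| m.as_str()));
}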

@iwillspeak
Contributor

I've managed to reproduce this. It appears to be down to the encoding used. The onig crate defaults to UTF-8 because Rust strings are UTF-8 encoded. If the search is performed with the ASCII encoding instead, those characters are considered POSIX :punct::

use onig::*;

fn main() {
    {
        // Default UTF-8 encoding: [[:punct:]] does not match these symbol characters.
        let regex = Regex::with_options(
            "[[:punct:]]+",
            RegexOptions::REGEX_OPTION_NONE,
            Syntax::posix_extended(),
        )
        .unwrap();

        match regex.find("$+<>=^`|~") {
            Some(mat) => println!("{:?}", &mat),
            None => println!("NO match"),
        }
    }

    {
        // Explicit ASCII encoding: the same pattern does match them.
        let regex = Regex::with_options_and_encoding(
            EncodedBytes::ascii(b"[[:punct:]]+"),
            RegexOptions::REGEX_OPTION_NONE,
            Syntax::posix_extended(),
        )
        .unwrap();

        match regex.find_with_encoding(EncodedBytes::ascii(b"$+<>=^`|~")) {
            Some(mat) => println!("{:?}", &mat),
            None => println!("NO match"),
        }
    }
}

In this example, find* boils down to a call to onig_search_with_param with an empty param and SEARCH_OPTION_NONE:

https://github.com/rust-onig/rust-onig/blob/36dd97a8db49c8f089449f0605da627fcfc6d74b/onig/src/lib.rs#L782-L784

https://github.com/rust-onig/rust-onig/blob/36dd97a8db49c8f089449f0605da627fcfc6d74b/onig/src/lib.rs#L723-L735

I'm not sure if this is an issue with the way that onig uses encodings, or with the way that the UTF8 encoding treats those characters.

@iwillspeak
Contributor

I can reproduce this without onig just by setting the encoding in samples/simple.c to ONIG_ENCODING_UTF8 and using the same pattern and search text.

@kkos
Owner

kkos commented Aug 27, 2022

The current punct data can be found in CR_Punct in oniguruma/src/unicode_property_data.c.
It is generated from the Unicode data at
https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt
I will now see if I can change the value of [[:punct:]] as you suggest.

@iwillspeak
Contributor

I've been doing some digging on Unicode Explorer. It seems that this is because those characters aren't punctuation as far as Unicode is concerned. E.g. Unicode Explorer indicates that ~ (tilde) is a Symbol, not a punctuation character:

https://unicode-explorer.com/c/007E

For comparison, here's full stop, which is matched by [:punct:] in UTF-8:

https://unicode-explorer.com/c/002E

I guess the question here is should Oniguruma consider Symbol characters as [:punct:]?
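
This can also be checked from the Rust bindings; a small sketch, assuming the onig crate's default Unicode-aware syntax:

use onig::Regex;

fn main() {
    let punct = Regex::new(r"\p{P}").unwrap();  // Unicode Punctuation
    let symbol = Regex::new(r"\p{S}").unwrap(); // Unicode Symbol

    // Tilde is classified as a Symbol, not Punctuation, in the Unicode tables.
    println!("\\p{{P}} matches '~': {}", punct.find("~").is_some());  // false
    println!("\\p{{S}} matches '~': {}", symbol.find("~").is_some()); // true
}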

@david-waterworth
Author

david-waterworth commented Aug 27, 2022

So it looks to me like it depends on whether you define punctuation as "Standard" or "POSIX Compatible", using the table from Annex C of UTS #18:

The following table shows recommended assignments for compatibility property names, for use in Regular Expressions. The standard recommendation is shown in the column labeled "Standard"; applications should use this definition wherever possible. If populated with a different value, the column labeled "POSIX Compatible" shows modifications to the standard recommendation required to meet the formal requirements of [POSIX], and also to maintain (as much as possible) compatibility with the POSIX usage in practice. That modification involves some compromises, because POSIX does not have as fine-grained a set of character properties as in the Unicode Standard, and also has some additional constraints. So, for example, POSIX does not allow more than 20 characters to be categorized as digits, whereas there are many more than 20 digit characters in Unicode.

[image: excerpt of the UTS #18 Annex C compatibility-property table, showing the "Standard" and "POSIX Compatible" definitions for punct]

(I've only copied part of the table)

As you mention, the difference is that Unicode defines a separate Symbol category, whereas POSIX doesn't distinguish between punctuation and symbols.

So which is correct I couldn't say - UTS #18 does say to prefer the Standard definition, but most engines I've used seem to follow the POSIX one. I suspect that's historical, and it tends to work less well for non-English text.

In my case I think using \p{P} and \p{S} rather than [[:punct:]] is the better option, as it's clearer what it contains.
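
A minimal sketch of that approach, assuming the onig crate and its default Unicode-aware syntax:

use onig::Regex;

fn main() {
    // Match Unicode punctuation (\p{P}) and symbols (\p{S}) explicitly,
    // rather than relying on how [[:punct:]] is defined for the encoding.
    let re = Regex::new(r"[\p{P}\p{S}]+").unwrap();

    match re.find("$+<>=^`|~") {
        Some(range) => println!("matched byte range {:?}", range),
        None => println!("no match"),
    }
}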

@kkos
Owner

kkos commented Aug 28, 2022

Changed the definition of [:punct:] in Unicode encodings from \p{P} to \p{PosixPunct} = \p{P} + \p{S}.
(PosixPunct is a new addition.)
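
With a build of Oniguruma that includes this change, the pattern from the original report should now match in UTF-8 mode; a quick sketch, assuming the onig crate is linked against the updated library:

use onig::Regex;

fn main() {
    // Under the new definition, [[:punct:]] in Unicode encodings covers
    // \p{P} plus \p{S}, so these symbol characters are matched.
    let re = Regex::new("[[:punct:]]+").unwrap();
    assert!(re.find("$+<>=^`|~").is_some());
    println!("matched with the updated [[:punct:]] definition");
}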

@kkos kkos closed this as completed in a6faca7 Aug 28, 2022