[[:punct:]] isn't matching all expected symbols #268
Comments
I've managed to reproduce this. It appears to be down to the encoding used. The
    use onig::*;

    fn main() {
        // Pattern and haystack passed as UTF-8 strings (the default encoding).
        {
            let regex = Regex::with_options(
                "[[:punct:]]+",
                RegexOptions::REGEX_OPTION_NONE,
                Syntax::posix_extended(),
            )
            .unwrap();
            match regex.find("$+<>=^`|~") {
                Some(mat) => println!("{:?}", &mat),
                None => println!("NO match"),
            }
        }
        // Same pattern and haystack, explicitly ASCII-encoded.
        {
            let regex = Regex::with_options_and_encoding(
                EncodedBytes::ascii(b"[[:punct:]]+"),
                RegexOptions::REGEX_OPTION_NONE,
                Syntax::posix_extended(),
            )
            .unwrap();
            match regex.find_with_encoding(EncodedBytes::ascii(b"$+<>=^`|~")) {
                Some(mat) => println!("{:?}", &mat),
                None => println!("NO match"),
            }
        }
    }
In this example I'm not sure if this is an issue with the way that |
I can reproduce this without |
The current punct data can be found below. |
I've been doing some digging on Unicode Explorer. It seems that this is because those characters aren't punctuation as far as Unicode is concerned. E.g. Unicode Explorer indicates that https://unicode-explorer.com/c/007E For comparison here's full stop, which is matched by https://unicode-explorer.com/c/002E I guess the question here is should Oniguruma consider |
So it looks to me that it depends on whether you define punctuation as "Standard" or "POSIX Compatible", using the table from Annex C of UTS#18
(I've only copied part of the table) As you mention, the difference is that Unicode defines symbols as a separate category, whereas POSIX doesn't make a distinction between punctuation and symbols. So which is correct I couldn't say - it does say prefer In my case I think using |
Changed the definition of [:punct:] in Unicode encodings from \p{P} to \p{PosixPunct} = \p{P} + \p{S}. |
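To make the change concrete for the ASCII range, here is a minimal sketch using only the Rust standard library (no Oniguruma calls). The helper names and the hard-coded symbol table are illustrative assumptions, not Oniguruma's data. The table lists the ASCII characters whose Unicode general category is Symbol (Sc, Sm or Sk), which are the nine characters from the report. Everything else that POSIX treats as punctuation falls under \p{P}.

```rust
/// ASCII characters whose Unicode general category is Symbol (S):
/// `$` is Sc, `+ < = > | ~` are Sm, `^` and the backtick are Sk.
/// Hard-coded for illustration; not taken from Oniguruma's tables.
const ASCII_SYMBOLS: &str = "$+<=>^`|~";

/// Hypothetical helper: true if `c` is an ASCII character in category S.
fn is_ascii_unicode_symbol(c: char) -> bool {
    ASCII_SYMBOLS.contains(c)
}

/// Hypothetical helper: true if `c` is an ASCII character in category P,
/// i.e. ASCII punctuation minus the symbols listed above.
fn is_ascii_unicode_punct(c: char) -> bool {
    c.is_ascii_punctuation() && !is_ascii_unicode_symbol(c)
}

fn main() {
    let mut punct = String::new();
    let mut symbol = String::new();
    for c in (0u8..128).map(char::from) {
        if is_ascii_unicode_punct(c) {
            punct.push(c);
        } else if is_ascii_unicode_symbol(c) {
            symbol.push(c);
        }
    }
    // With the old definition only the first group matched [:punct:];
    // with \p{P} + \p{S} both groups do.
    println!("\\p{{P}} (ASCII): {}", punct);
    println!("\\p{{S}} (ASCII): {}", symbol);
}
```

The union of the two printed groups is the full set of 32 ASCII punctuation characters, which is what the POSIX-compatible definition is meant to cover.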
It appears that the [[:punct:]] class does not match the following characters: $+<>=^`|~ This is different behaviour from other regex libraries I've used (including the online regex101 tool).
According to https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions the [:punct:] class should be "all graphic characters except letters and digits".
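As a quick sanity check of that definition, here is a small sketch in plain Rust (standard library only, no regex crate involved). The function name is a hypothetical helper, and the check is limited to ASCII. It applies "graphic, but not a letter or digit" directly and reports whether each of the characters above qualifies.

```rust
/// Hypothetical helper: POSIX-style punct for ASCII, i.e. a graphic
/// (printable, non-space) character that is neither a letter nor a digit.
fn is_posix_punct_ascii(c: char) -> bool {
    c.is_ascii_graphic() && !c.is_ascii_alphanumeric()
}

fn main() {
    // The characters from the report.
    let reported = "$+<>=^`|~";
    for c in reported.chars() {
        println!("{:?}: posix punct = {}", c, is_posix_punct_ascii(c));
    }

    // Every ASCII character that satisfies the definition.
    let all: String = (0u8..128)
        .map(char::from)
        .filter(|&c| is_posix_punct_ascii(c))
        .collect();
    println!("{} characters: {}", all.len(), all);
}
```

Under this reading all nine reported characters count as punctuation, which matches the behaviour I see from the other libraries.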
I'm using the Rust onig bindings; the code below produces no matches using Oniguruma but works with the alternative regex crate. Are you using a different definition of the POSIX character classes? If so, I cannot find a reference in the documentation.