Regex matching errors when using \W character class and /i option #4

k-takata · 2012-04-18T12:04:32Z

see: http://bugs.ruby-lang.org/issues/4044

k-takata · 2012-12-09T04:57:18Z

Also related:
http://bugs.ruby-lang.org/issues/7533
http://bugs.ruby-lang.org/issues/7534

k-takata · 2014-07-02T14:29:12Z

I tested several patterns with Perl 5.14 and 5.16:

                                    Perl 5.14  Perl 5.16
"\x{17f}" =~ /(?iu)\w/            → match
"\x{17f}" =~ /(?iu)[\w]/          → match
"\x{17f}" =~ /(?iu)\W/            → unmatch
"\x{17f}" =~ /(?iu)[\W]/          → unmatch

"\x{17f}" =~ /(?ia)\w/            → unmatch ★
"\x{17f}" =~ /(?ia)[\w]/          → match ★   unmatch
"\x{17f}" =~ /(?ia)\W/            → match
"\x{17f}" =~ /(?ia)[\W]/          → match

"s" =~ /(?ia)\w/                  → match
"s" =~ /(?ia)[\w]/                → match
"s" =~ /(?ia)\W/                  → unmatch
"s" =~ /(?ia)[\W]/                → unmatch

"\x{17f}" =~ /(?ia)\p{ASCII}/     → unmatch
"\x{17f}" =~ /(?ia)[\p{ASCII}]/   → unmatch
"\x{17f}" =~ /(?ia)\P{ASCII}/     → match
"\x{17f}" =~ /(?ia)[\P{ASCII}]/   → match

"s" =~ /(?ia)\p{ASCII}/           → match
"s" =~ /(?ia)[\p{ASCII}]/         → match
"s" =~ /(?ia)\P{ASCII}/           → unmatch
"s" =~ /(?ia)[\P{ASCII}]/         → unmatch

"\x{17f}" =~ /(?iaa)\w/           → unmatch
"\x{17f}" =~ /(?iaa)[\w]/         → unmatch
"\x{17f}" =~ /(?iaa)\W/           → match
"\x{17f}" =~ /(?iaa)[\W]/         → match

★: Be careful with these results.

k-takata · 2014-07-02T14:30:48Z

Other test patterns:

                                    Perl 5.14  Perl 5.16
# LATIN SMALL LETTER LONG S
"\x{17f}" =~ /(?u)[[:lower:]]/    → match
"\x{17f}" =~ /(?a)[[:lower:]]/    → unmatch ★
"\x{17f}" =~ /(?aa)[[:lower:]]/   → unmatch

"\x{17f}" =~ /(?iu)[[:lower:]]/   → match
"\x{17f}" =~ /(?ia)[[:lower:]]/   → match ★   unmatch
"\x{17f}" =~ /(?iaa)[[:lower:]]/  → unmatch

# KELVIN SIGN
"\x{212a}" =~ /(?u)[[:upper:]]/   → match
"\x{212a}" =~ /(?a)[[:upper:]]/   → unmatch ★
"\x{212a}" =~ /(?aa)[[:upper:]]/  → unmatch

"\x{212a}" =~ /(?iu)[[:upper:]]/  → match
"\x{212a}" =~ /(?ia)[[:upper:]]/  → match ★   unmatch
"\x{212a}" =~ /(?iaa)[[:upper:]]/ → unmatch

# LATIN SMALL LETTER SHARP S (eszett)
"\x{df}" =~ /(?u)[[:lower:]]/     → match
"\x{df}" =~ /(?a)[[:lower:]]/     → unmatch ★
"\x{df}" =~ /(?aa)[[:lower:]]/    → unmatch

"\x{df}" =~ /(?iu)[[:lower:]]/    → match
"\x{df}" =~ /(?ia)[[:lower:]]/    → unmatch ★
"\x{df}" =~ /(?iaa)[[:lower:]]/   → unmatch

k-takata · 2014-07-02T14:33:02Z

Perl 5.14 has some inconsistencies, but they are fixed in Perl 5.16.
It is probably better to use Perl 5.16 or later as the reference.

k-takata · 2014-07-02T14:39:47Z

Test script for Onigmo:
https://gist.github.com/k-takata/85cf4de23d8194b2da34

Preparation for issue #4.

Issue #4

k-takata · 2014-07-05T04:30:15Z

More test patterns with Perl 5.16:

"\x{17f}" =~ /(?iu)[^\w]/    → unmatch
"\x{17f}" =~ /(?ia)[^\w]/    → match
"\x{17f}" =~ /(?iaa)[^\w]/   → match

"\x{17f}" =~ /(?iu)[^s]/     → unmatch
"\x{17f}" =~ /(?ia)[^s]/     → unmatch
"\x{17f}" =~ /(?iaa)[^s]/    → match

"\x{212a}" =~ /(?iu)[^\w]/   → unmatch
"\x{212a}" =~ /(?ia)[^\w]/   → match
"\x{212a}" =~ /(?iaa)[^\w]/  → match

"\x{212a}" =~ /(?iu)[^k]/    → unmatch
"\x{212a}" =~ /(?ia)[^k]/    → unmatch
"\x{212a}" =~ /(?iaa)[^k]/   → match

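The Perl 5.16 rows above seem to follow a simple rule: under (?a), \w is limited to ASCII but a literal letter still case-folds across the ASCII boundary, while under (?aa) neither crosses. The Python sketch below is a toy model of that rule only, not Perl's or Onigmo's implementation. The names `positive_match`, `expected_negated_match` and `CROSS_FOLD` are hypothetical, and only the two code points from the table are covered.

```python
import string

ASCII_WORD = set(string.ascii_letters + string.digits + "_")

# Non-ASCII code points from the table and their ASCII case-fold partners.
CROSS_FOLD = {
    "\u017f": "s",  # LATIN SMALL LETTER LONG S
    "\u212a": "k",  # KELVIN SIGN
}


def positive_match(flags, item, ch):
    """Model whether (?i<flags>)[item] matches ch; flags is "u", "a" or "aa"."""
    if item == r"\w":
        if flags == "u":
            # Rough stand-in for Unicode \w: letters, digits and underscore.
            return ch == "_" or ch.isalnum()
        # (?a) and (?aa): \w is restricted to the ASCII word characters.
        return ch in ASCII_WORD
    # Otherwise item is a literal ASCII letter compared case-insensitively.
    if ch in CROSS_FOLD:
        # Only (?aa) forbids folding across the ASCII boundary.
        return flags != "aa" and CROSS_FOLD[ch] == item.lower()
    return ch.lower() == item.lower()


def expected_negated_match(flags, item, ch):
    """Model whether (?i<flags>)[^item] matches the single character ch."""
    return not positive_match(flags, item, ch)


for flags in ("u", "a", "aa"):
    for item, ch in ((r"\w", "\u017f"), ("s", "\u017f"),
                     (r"\w", "\u212a"), ("k", "\u212a")):
        matched = expected_negated_match(flags, item, ch)
        result = "match" if matched else "unmatch"
        print(f"(?i{flags})[^{item}] vs U+{ord(ch):04X}: {result}")
```

The loop prints one line per row of the table, so the model can be compared against the Perl results by eye.
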
\p{ASCII}, [[:ascii:]], \p{Word}, \w, [[:word:]] and their negated patterns inside a character class must be handled specially. They should not match across ASCII/non-ASCII boundary. Exclude them from ASCII/non-ASCII case folding.

All POSIX brackets should not match across ASCII boundary when ASCII flag is on.

/(?ia)[[:lower:]][[:upper:]]/ =~ "Ab" failed.

\p{ASCII}, \w, all POSIX brackets and their negated patterns inside a character class must be handled specially. They should not match across ASCII/non-ASCII boundary. Exclude them from ASCII/non-ASCII case folding.
* \p{ASCII} and [[:ascii:]] should not match across ASCII boundary. They don't depend on ASCII flag.
* \w and all POSIX brackets should not match across ASCII boundary when ASCII flag is on.
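
The rule in this commit message can be read as a small decision function. The Python sketch below is only an illustration of that reading, not code from regparse.c. The function names `normalize` and `ascii_bounded` are hypothetical, and treating \p{Word} as unaffected by the ASCII flag is taken from the later commit "\p{Word} should not be affected by ASCII flag".

```python
import re

# Matches a POSIX bracket such as [[:lower:]] or its negation [[:^lower:]].
_POSIX = re.compile(r"\[\[:\^?(\w+):\]\]")


def normalize(element):
    """Map a negated class element to its positive form."""
    if element == r"\W":
        return r"\w"
    if element.startswith(r"\P{"):
        return r"\p{" + element[3:]
    m = _POSIX.fullmatch(element)
    if m:
        return "[[:%s:]]" % m.group(1)
    return element


def ascii_bounded(element, ascii_flag):
    """Return True if the element must not be case-folded across the
    ASCII/non-ASCII boundary when it appears inside a character class."""
    e = normalize(element)
    if e in (r"\p{ASCII}", "[[:ascii:]]"):
        return True          # independent of the ASCII flag
    if e == r"\w" or _POSIX.fullmatch(e):
        return ascii_flag    # only when the ASCII flag is on
    return False             # literals, \p{Word}, other properties


for elem in (r"\p{ASCII}", r"\W", "[[:^upper:]]", r"\p{Word}", "s"):
    print(elem, ascii_bounded(elem, ascii_flag=True),
          ascii_bounded(elem, ascii_flag=False))
```
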

Fix character class with ignore case. (Issue #4) Conflicts: regparse.c

k-takata · 2014-08-08T23:55:17Z

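Details of this bug.

Symptom

(?ia)\w and (?ia)[\w] behave differently.
The former matches only within the ASCII range, whereas the latter also matches U+017F ("ſ", LATIN SMALL LETTER LONG S) and U+212A ("K", KELVIN SIGN).
When the ASCII flag (?a) is specified, \w should match only within the ASCII range, but once it is placed inside a character class it also matches characters outside the ASCII range.
Other constructs with the same problem are \p{ASCII} and [[:ascii:]] (independent of (?a)), and the remaining POSIX classes (only when (?a) is specified).

Affected versions

Oniguruma 5.9.x as bundled with Ruby 1.9.1 to 1.9.3
(In the original Oniguruma 5.9.x, \w matches over the Unicode range, so U+017F and U+212A match anyway.)
Onigmo 5.10.0 to 5.14.2

Cause

When the ignore case flag (?i) is specified, parsing a character class first expands character properties and POSIX classes into individual characters, then applies case folding, and builds a character class with upper and lower case expanded. (E.g. (?ia)[[:upper:]] → (?i)[A-Z] → [A-Za-z])
During that step, case folding is performed across the ASCII boundary even for those character properties and classes for which it must not be.

Fix

When parsing a character class and expanding character properties and POSIX classes into individual characters, determine whether case folding across the ASCII boundary is allowed, and build a separate character class (asc_cc) that collects only the characters for which it is allowed.
When case folding the original character class, check whether each character is contained in asc_cc. If it is not, do not case fold it across the ASCII boundary.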
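
The asc_cc approach described above can be sketched in Python as follows. This is a toy model under stated assumptions, not the actual regparse.c code: only `asc_cc` is a name from the description, while `fold_class`, `case_variants` and `FOLD_GROUPS` are hypothetical, the fold table covers just the two code points discussed in this issue, and negated classes are not modeled.

```python
import string

ASCII_WORD = set(string.ascii_letters + string.digits + "_")

# Toy case-fold groups; a real engine uses the full Unicode fold tables.
FOLD_GROUPS = [
    {"s", "S", "\u017f"},  # U+017F LATIN SMALL LETTER LONG S
    {"k", "K", "\u212a"},  # U+212A KELVIN SIGN
]


def is_ascii(ch):
    return ord(ch) < 0x80


def case_variants(ch):
    """Return every character that is case-equivalent to ch in the toy table."""
    for group in FOLD_GROUPS:
        if ch in group:
            return group
    if is_ascii(ch):
        return {ch, ch.lower(), ch.upper()}
    return {ch}


def fold_class(items):
    """Build an ignore-case character class.

    items is a list of (chars, may_cross) pairs: the expanded characters of
    one class element and whether that element may fold across the ASCII
    boundary (False for \\w under (?a), True for a plain literal).
    """
    cc = set()      # the class as written, expanded to characters
    asc_cc = set()  # characters allowed to fold across the ASCII boundary
    for chars, may_cross in items:
        cc |= chars
        if may_cross:
            asc_cc |= chars

    folded = set(cc)
    for ch in cc:
        for variant in case_variants(ch):
            crosses = is_ascii(ch) != is_ascii(variant)
            if crosses and ch not in asc_cc:
                continue  # skip folds that cross the ASCII boundary
            folded.add(variant)
    return folded


word_only = fold_class([(ASCII_WORD, False)])   # models (?ia)[\w]
literal_s = fold_class([({"s"}, True)])         # models (?ia)[s]
print("\u017f" in word_only, "\u017f" in literal_s)
```

In this model the class built from \w never picks up U+017F or U+212A, while a literal s still folds to U+017F, which is the distinction the fix is meant to preserve.
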

* Use Onigmo (Oniguruma-mod) 5.15.0 for bregonig.dll. https://github.com/k-takata/Onigmo/tree/Onigmo-5.15.0_for_bregonig
  - Support Unicode 7.0
  - Merge Oniguruma 5.9.5
  - Fix a crash when a large number of groups is used k-takata/Onigmo#24
  - Fix /\x{1ffc}/i =~ "\x1ff3" failing to match
  - Fix /[a-c#]+\W/ =~ "def#" failing to match in UTF-16/32
  - Fix /(?i)\u0149\u0149/ =~ "\u0149\u0149" failing to match k-takata/Onigmo#40
  - Fix the problem when \w is used inside a character class with the /i option k-takata/Onigmo#4
  - Fix character properties ignoring the /i option k-takata/Onigmo#41
  - Fix "ab" =~ /(?!^a).*b/ failing to match k-takata/Onigmo#44

k-takata added a commit that referenced this issue Jul 2, 2014

add_ctype_to_cc: change char_prop parameter to ascii_range.

61d77b6

Preparation for issue #4.

k-takata added a commit that referenced this issue Jul 2, 2014

respect ascii flag when applying case fold to a char class

913a6e5

Issue #4

k-takata added a commit that referenced this issue Jul 4, 2014

add_ctype_to_cc: change char_prop parameter to ascii_range.

62b2486

Preparation for issue #4.

k-takata added a commit that referenced this issue Jul 4, 2014

respect ascii flag when applying case fold to a char class

a66c08b

Issue #4

k-takata added a commit that referenced this issue Aug 6, 2014

Fix for POSIX brackets (Issue #4)

f764a5f

All POSIX brackets should not match across ASCII boundary when ASCII flag is on.

k-takata added a commit that referenced this issue Aug 6, 2014

Remove unused code (Issue #4)

5d7f6f7

k-takata added a commit that referenced this issue Aug 7, 2014

\p{Word} should not be affected by ASCII flag (Issue #4)

9879b87

k-takata added a commit that referenced this issue Aug 7, 2014

Add more tests (Issue #4)

fba69ed

k-takata added a commit that referenced this issue Aug 8, 2014

Some POSIX brackets still fail (Issue #4)

40ea8cb

/(?ia)[[:lower:]][[:upper:]]/ =~ "Ab" failed.

k-takata added a commit that referenced this issue Aug 8, 2014

Remove unused code (Issue #4)

30e6d9a

k-takata added a commit that referenced this issue Aug 8, 2014

Merge branch 'master' into ruby-2.x

9cb01b2

Fix character class with ignore case. (Issue #4) Conflicts: regparse.c

k-takata closed this as completed in 1225fe1 Aug 8, 2014

k-takata mentioned this issue Dec 4, 2016

Regex matching errors when using \W character class and /i option (contd.) #76

Open
