Regex matching errors when using \W character class and /i option #4

Closed
k-takata opened this issue Apr 18, 2012 · 7 comments

see: http://bugs.ruby-lang.org/issues/4044

k-takata commented Dec 9, 2012

k-takata commented Jul 2, 2014

I tested several patterns with Perl 5.14 and 5.16:

                                    Perl 5.14  Perl 5.16
"\x{17f}" =~ /(?iu)\w/            → match
"\x{17f}" =~ /(?iu)[\w]/          → match
"\x{17f}" =~ /(?iu)\W/            → unmatch
"\x{17f}" =~ /(?iu)[\W]/          → unmatch

"\x{17f}" =~ /(?ia)\w/            → unmatch ★
"\x{17f}" =~ /(?ia)[\w]/          → match ★   unmatch
"\x{17f}" =~ /(?ia)\W/            → match
"\x{17f}" =~ /(?ia)[\W]/          → match

"s" =~ /(?ia)\w/                  → match
"s" =~ /(?ia)[\w]/                → match
"s" =~ /(?ia)\W/                  → unmatch
"s" =~ /(?ia)[\W]/                → unmatch

"\x{17f}" =~ /(?ia)\p{ASCII}/     → unmatch
"\x{17f}" =~ /(?ia)[\p{ASCII}]/   → unmatch
"\x{17f}" =~ /(?ia)\P{ASCII}/     → match
"\x{17f}" =~ /(?ia)[\P{ASCII}]/   → match

"s" =~ /(?ia)\p{ASCII}/           → match
"s" =~ /(?ia)[\p{ASCII}]/         → match
"s" =~ /(?ia)\P{ASCII}/           → unmatch
"s" =~ /(?ia)[\P{ASCII}]/         → unmatch

"\x{17f}" =~ /(?iaa)\w/           → unmatch
"\x{17f}" =~ /(?iaa)[\w]/         → unmatch
"\x{17f}" =~ /(?iaa)\W/           → match
"\x{17f}" =~ /(?iaa)[\W]/         → match

★: Be careful with these results.

k-takata commented Jul 2, 2014

Other test patterns:

                                    Perl 5.14  Perl 5.16
# LATIN SMALL LETTER LONG S
"\x{17f}" =~ /(?u)[[:lower:]]/    → match
"\x{17f}" =~ /(?a)[[:lower:]]/    → unmatch ★
"\x{17f}" =~ /(?aa)[[:lower:]]/   → unmatch

"\x{17f}" =~ /(?iu)[[:lower:]]/   → match
"\x{17f}" =~ /(?ia)[[:lower:]]/   → match ★   unmatch
"\x{17f}" =~ /(?iaa)[[:lower:]]/  → unmatch

# KELVIN SIGN
"\x{212a}" =~ /(?u)[[:upper:]]/   → match
"\x{212a}" =~ /(?a)[[:upper:]]/   → unmatch ★
"\x{212a}" =~ /(?aa)[[:upper:]]/  → unmatch

"\x{212a}" =~ /(?iu)[[:upper:]]/  → match
"\x{212a}" =~ /(?ia)[[:upper:]]/  → match ★   unmatch
"\x{212a}" =~ /(?iaa)[[:upper:]]/ → unmatch

# LATIN SMALL LETTER SHARP S (eszett)
"\x{df}" =~ /(?u)[[:lower:]]/     → match
"\x{df}" =~ /(?a)[[:lower:]]/     → unmatch ★
"\x{df}" =~ /(?aa)[[:lower:]]/    → unmatch

"\x{df}" =~ /(?iu)[[:lower:]]/    → match
"\x{df}" =~ /(?ia)[[:lower:]]/    → unmatch ★
"\x{df}" =~ /(?iaa)[[:lower:]]/   → unmatch

k-takata commented Jul 2, 2014

Perl 5.14 has some inconsistencies, but they are fixed in Perl 5.16.
It is probably better to use Perl 5.16 or later as the reference.

k-takata commented Jul 2, 2014

Test script for Onigmo:
https://gist.github.com/k-takata/85cf4de23d8194b2da34

k-takata added a commit that referenced this issue Jul 2, 2014
k-takata added a commit that referenced this issue Jul 4, 2014

k-takata commented Jul 5, 2014

More test patterns with Perl 5.16:

"\x{17f}" =~ /(?iu)[^\w]/    → unmatch
"\x{17f}" =~ /(?ia)[^\w]/    → match
"\x{17f}" =~ /(?iaa)[^\w]/   → match

"\x{17f}" =~ /(?iu)[^s]/     → unmatch
"\x{17f}" =~ /(?ia)[^s]/     → unmatch
"\x{17f}" =~ /(?iaa)[^s]/    → match

"\x{212a}" =~ /(?iu)[^\w]/   → unmatch
"\x{212a}" =~ /(?ia)[^\w]/   → match
"\x{212a}" =~ /(?iaa)[^\w]/  → match

"\x{212a}" =~ /(?iu)[^k]/    → unmatch
"\x{212a}" =~ /(?ia)[^k]/    → unmatch
"\x{212a}" =~ /(?iaa)[^k]/   → match

k-takata added a commit that referenced this issue Aug 6, 2014
\p{ASCII}, [[:ascii:]], \p{Word}, \w, [[:word:]] and their negated
patterns inside a character class must be handled specially. They should
not match across ASCII/non-ASCII boundary. Exclude them from
ASCII/non-ASCII case folding.
k-takata added a commit that referenced this issue Aug 6, 2014
All POSIX brackets should not match across ASCII boundary when ASCII
flag is on.
k-takata added a commit that referenced this issue Aug 6, 2014
k-takata added a commit that referenced this issue Aug 7, 2014
k-takata added a commit that referenced this issue Aug 8, 2014
/(?ia)[[:lower:]][[:upper:]]/ =~ "Ab" failed.
k-takata added a commit that referenced this issue Aug 8, 2014
\p{ASCII}, \w, all POSIX brackets and their negated patterns inside a
character class must be handled specially. They should not match across
ASCII/non-ASCII boundary. Exclude them from ASCII/non-ASCII case folding.

* \p{ASCII} and [[:ascii:]] should not match across ASCII boundary. They
  don't depend on ASCII flag.
* \w and all POSIX brackets should not match across ASCII boundary when
  ASCII flag is on.
k-takata added a commit that referenced this issue Aug 8, 2014
k-takata added a commit that referenced this issue Aug 8, 2014
Fix character class with ignore case. (Issue #4)

Conflicts:
	regparse.c

k-takata commented Aug 8, 2014

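Details of this bug (originally written in Japanese).

Symptom

(?ia)\w and (?ia)[\w] behave differently.
The former matches only within the ASCII range, whereas the latter also matches U+017F ("ſ", LATIN SMALL LETTER LONG S) and U+212A ("K", KELVIN SIGN).
When the ASCII flag (?a) is specified, \w should match only within the ASCII range, but once it is placed inside a character class it also matches characters outside ASCII.
The same problem affects \p{ASCII} and [[:ascii:]] (independent of (?a)) and the other POSIX classes (only when (?a) is specified).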
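
One way to observe the difference on a given Ruby build (Ruby has bundled Onigmo since 2.0) is a small probe script. This is only a sketch: the helper name `probe` is made up here, and what it prints for the bracketed form depends on which Onigmo version the interpreter bundles.

```ruby
# Compare \w written bare with \w written inside a character class,
# both under the ignore-case and ASCII flags.
LONG_S = "\u017F"   # LATIN SMALL LETTER LONG S
KELVIN = "\u212A"   # KELVIN SIGN

# Hypothetical helper: true if the pattern source matches the subject.
def probe(pattern_source, subject)
  Regexp.new(pattern_source).match?(subject)
end

[LONG_S, KELVIN, "s"].each do |subject|
  ['(?ia)\w', '(?ia)[\w]'].each do |src|
    result = probe(src, subject) ? "match" : "unmatch"
    printf("%-12s U+%04X  %s\n", src, subject.ord, result)
  end
end
```

If the bare and bracketed forms print different results for U+017F or U+212A, the engine shows the symptom described above.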

Affected versions

  • Oniguruma 5.9.x as bundled with Ruby 1.9.1 through 1.9.3
    (In the original Oniguruma 5.9.x, \w matches across the Unicode range, so U+017F and U+212A match in the first place.)
  • Onigmo 5.10.0 through 5.14.2

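Cause

When the ignore-case flag (?i) is specified, parsing a character class first expands character properties and POSIX classes into individual characters, then applies case folding, producing a character class with upper and lower case expanded. (E.g. (?ia)[[:upper:]] → (?i)[A-Z] → [A-Za-z])
In doing so, the parser case-folds across the ASCII boundary even for character properties and similar items that must not be folded beyond the ASCII range.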
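
The failure mode can be sketched with a toy model in Ruby. This is not Onigmo's code: the character class is modelled as a Set of code points, `FOLD_GROUPS` is a hand-picked case-equivalence table covering only s/S/ſ and k/K/K, and the function names are hypothetical.

```ruby
require "set"

# Toy case-equivalence groups (illustrative only; real tables are far larger).
FOLD_GROUPS = [
  [0x53, 0x73, 0x17F],   # S, s, LATIN SMALL LETTER LONG S
  [0x4B, 0x6B, 0x212A],  # K, k, KELVIN SIGN
].freeze

# Step 1: expand an ASCII-only [[:upper:]] into individual code points (A-Z).
def expand_ascii_upper
  Set.new(0x41..0x5A)
end

# Step 2 (buggy): add every case equivalent of every member,
# without checking whether the equivalent lies outside ASCII.
def naive_case_fold(cc)
  folded = cc.dup
  cc.each do |cp|
    folded << (cp ^ 0x20)  # plain ASCII partner: A <-> a, B <-> b, ...
    FOLD_GROUPS.each { |group| folded.merge(group) if group.include?(cp) }
  end
  folded
end

cc = naive_case_fold(expand_ascii_upper)
puts cc.include?(0x61)    # true: 'a' is added, as intended
puts cc.include?(0x17F)   # true: U+017F leaks in although (?a) was requested
```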

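Fix

While parsing a character class and expanding character properties and POSIX classes into individual characters, decide for each whether case folding across the ASCII boundary is allowed, and build a separate character class (asc_cc) that collects only the characters for which it is allowed.
When case-folding the original character class, check whether each character is in asc_cc; if it is not, skip any case fold that would cross the ASCII boundary.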
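
A minimal sketch of this idea in Ruby, under the same toy assumptions as any small model: code points live in a Set, `FOLD_GROUPS` is a hand-picked table, and the `[code_points, may_cross]` item format stands in for what a parser might produce. Only the name asc_cc comes from the description above; everything else is hypothetical and the actual change is in regparse.c.

```ruby
require "set"

# Toy case-equivalence groups (illustrative only).
FOLD_GROUPS = [
  [0x53, 0x73, 0x17F],   # S, s, LATIN SMALL LETTER LONG S
  [0x4B, 0x6B, 0x212A],  # K, k, KELVIN SIGN
].freeze

def ascii?(cp)
  cp < 0x80
end

# items: [code_points, may_cross] pairs; may_cross says whether members of
# that item may be case-folded across the ASCII boundary.
def fold_class(items)
  cc = Set.new       # every member of the class
  asc_cc = Set.new   # members allowed to fold across the ASCII boundary
  items.each do |code_points, may_cross|
    cc.merge(code_points)
    asc_cc.merge(code_points) if may_cross
  end

  folded = cc.dup
  cc.each do |cp|
    FOLD_GROUPS.each do |group|
      next unless group.include?(cp)
      group.each do |eq|
        crosses = ascii?(cp) != ascii?(eq)
        next if crosses && !asc_cc.include?(cp)  # forbidden cross-boundary fold
        folded << eq
      end
    end
  end
  folded
end

ASCII_WORD = [*0x30..0x39, *0x41..0x5A, 0x5F, *0x61..0x7A].freeze

word_class = fold_class([[ASCII_WORD, false]])  # models (?ia)[\w]
literal_s  = fold_class([[[0x73], true]])       # models (?ia)[s]

puts word_class.include?(0x17F)  # false: \w stays inside ASCII
puts literal_s.include?(0x17F)   # true: a literal s may still fold to U+017F
```

The two printed values line up with the Perl 5.16 results listed earlier in this thread, where U+017F matches (?ia)[^\w] but not (?ia)[^s].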

k-takata added a commit to k-takata/bregonig that referenced this issue Sep 13, 2014
* Use Onigmo (Oniguruma-mod) 5.15.0 for bregonig.dll.
  https://github.com/k-takata/Onigmo/tree/Onigmo-5.15.0_for_bregonig
  - Support Unicode 7.0
  - Merge Oniguruma 5.9.5
  - Fix a crash when a large number of groups is used
    k-takata/Onigmo#24
  - Fix /\x{1ffc}/i =~ "\x{1ff3}" failing to match
  - Fix /[a-c#]+\W/ =~ "def#" failing to match in UTF-16/32
  - Fix /(?i)\u0149\u0149/ =~ "\u0149\u0149" failing to match
    k-takata/Onigmo#40
  - Fix problems when \w is used inside a character class with the /i option
    k-takata/Onigmo#4
  - Fix character properties ignoring the /i option
    k-takata/Onigmo#41
  - Fix "ab" =~ /(?!^a).*b/ failing to match
    k-takata/Onigmo#44