Regex matching errors when using \W character class and /i option #4

Closed
k-takata opened this Issue Apr 18, 2012 · 7 comments

Owner

k-takata commented Apr 18, 2012

k-takata Jul 2, 2014

Owner

I tested several patterns with Perl 5.14 and 5.16:

                                    Perl 5.14  Perl 5.16
"\x{17f}" =~ /(?iu)\w/            → match
"\x{17f}" =~ /(?iu)[\w]/          → match
"\x{17f}" =~ /(?iu)\W/            → unmatch
"\x{17f}" =~ /(?iu)[\W]/          → unmatch

"\x{17f}" =~ /(?ia)\w/            → unmatch ★
"\x{17f}" =~ /(?ia)[\w]/          → match ★   unmatch
"\x{17f}" =~ /(?ia)\W/            → match
"\x{17f}" =~ /(?ia)[\W]/          → match

"s" =~ /(?ia)\w/                  → match
"s" =~ /(?ia)[\w]/                → match
"s" =~ /(?ia)\W/                  → unmatch
"s" =~ /(?ia)[\W]/                → unmatch

"\x{17f}" =~ /(?ia)\p{ASCII}/     → unmatch
"\x{17f}" =~ /(?ia)[\p{ASCII}]/   → unmatch
"\x{17f}" =~ /(?ia)\P{ASCII}/     → match
"\x{17f}" =~ /(?ia)[\P{ASCII}]/   → match

"s" =~ /(?ia)\p{ASCII}/           → match
"s" =~ /(?ia)[\p{ASCII}]/         → match
"s" =~ /(?ia)\P{ASCII}/           → unmatch
"s" =~ /(?ia)[\P{ASCII}]/         → unmatch

"\x{17f}" =~ /(?iaa)\w/           → unmatch
"\x{17f}" =~ /(?iaa)[\w]/         → unmatch
"\x{17f}" =~ /(?iaa)\W/           → match
"\x{17f}" =~ /(?iaa)[\W]/         → match

★: Be careful with these results.

k-takata Jul 2, 2014

Owner

Other test patterns:

                                    Perl 5.14  Perl 5.16
# LATIN SMALL LETTER LONG S
"\x{17f}" =~ /(?u)[[:lower:]]/    → match
"\x{17f}" =~ /(?a)[[:lower:]]/    → unmatch ★
"\x{17f}" =~ /(?aa)[[:lower:]]/   → unmatch

"\x{17f}" =~ /(?iu)[[:lower:]]/   → match
"\x{17f}" =~ /(?ia)[[:lower:]]/   → match ★   unmatch
"\x{17f}" =~ /(?iaa)[[:lower:]]/  → unmatch

# KELVIN SIGN
"\x{212a}" =~ /(?u)[[:upper:]]/   → match
"\x{212a}" =~ /(?a)[[:upper:]]/   → unmatch ★
"\x{212a}" =~ /(?aa)[[:upper:]]/  → unmatch

"\x{212a}" =~ /(?iu)[[:upper:]]/  → match
"\x{212a}" =~ /(?ia)[[:upper:]]/  → match ★   unmatch
"\x{212a}" =~ /(?iaa)[[:upper:]]/ → unmatch

# LATIN SMALL LETTER SHARP S (eszett)
"\x{df}" =~ /(?u)[[:lower:]]/     → match
"\x{df}" =~ /(?a)[[:lower:]]/     → unmatch ★
"\x{df}" =~ /(?aa)[[:lower:]]/    → unmatch

"\x{df}" =~ /(?iu)[[:lower:]]/    → match
"\x{df}" =~ /(?ia)[[:lower:]]/    → unmatch ★
"\x{df}" =~ /(?iaa)[[:lower:]]/   → unmatch

k-takata Jul 2, 2014

Owner

Perl 5.14 has some inconsistencies, but they are fixed in Perl 5.16.
It is probably better to use Perl 5.16 or later as the reference.

Owner

k-takata commented Jul 2, 2014

k-takata added a commit that referenced this issue Jul 2, 2014

k-takata added a commit that referenced this issue Jul 4, 2014

k-takata Jul 5, 2014

Owner

More test patterns with Perl 5.16:

"\x{17f}" =~ /(?iu)[^\w]/    → unmatch
"\x{17f}" =~ /(?ia)[^\w]/    → match
"\x{17f}" =~ /(?iaa)[^\w]/   → match

"\x{17f}" =~ /(?iu)[^s]/     → unmatch
"\x{17f}" =~ /(?ia)[^s]/     → unmatch
"\x{17f}" =~ /(?iaa)[^s]/    → match

"\x{212a}" =~ /(?iu)[^\w]/   → unmatch
"\x{212a}" =~ /(?ia)[^\w]/   → match
"\x{212a}" =~ /(?iaa)[^\w]/  → match

"\x{212a}" =~ /(?iu)[^k]/    → unmatch
"\x{212a}" =~ /(?ia)[^k]/    → unmatch
"\x{212a}" =~ /(?iaa)[^k]/   → match

k-takata added a commit that referenced this issue Aug 6, 2014

Fix character class with ignore case (Issue #4)
\p{ASCII}, [[:ascii:]], \p{Word}, \w, [[:word:]] and their negated
patterns inside a character class must be handled specially. They should
not match across ASCII/non-ASCII boundary. Exclude them from
ASCII/non-ASCII case folding.

k-takata added a commit that referenced this issue Aug 6, 2014

Fix for POSIX brackets (Issue #4)
POSIX brackets should not match across the ASCII boundary when the ASCII
flag is on.

k-takata added a commit that referenced this issue Aug 6, 2014

k-takata added a commit that referenced this issue Aug 7, 2014

k-takata added a commit that referenced this issue Aug 8, 2014

Some POSIX brackets still fail (Issue #4)
/(?ia)[[:lower:]][[:upper:]]/ =~ "Ab" failed.

k-takata added a commit that referenced this issue Aug 8, 2014

Fix character class with ignore case (Issue #4)
\p{ASCII}, \w, all POSIX brackets and their negated patterns inside a
character class must be handled specially. They should not match across
ASCII/non-ASCII boundary. Exclude them from ASCII/non-ASCII case folding.

* \p{ASCII} and [[:ascii:]] should not match across the ASCII boundary.
  They don't depend on the ASCII flag.
* \w and all POSIX brackets should not match across the ASCII boundary
  when the ASCII flag is on.
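
One way to read the two rules above is as a small decision table. The Python sketch below is only an illustration of that reading, not code from Onigmo: the function name `may_fold_across_ascii` and the use of pattern strings as labels for class items are hypothetical. It models a single ASCII flag, (?a), and does not model (?aa). Letting literal characters fold across the boundary is an assumption taken from the Perl 5.16 results for /(?ia)[^s]/ earlier in this thread.

```python
# Hypothetical labels for items that can appear inside a character class.
# These never fold across the ASCII boundary, whatever the flags are.
ALWAYS_ASCII_BOUND = {r"\p{ASCII}", r"\P{ASCII}", "[[:ascii:]]", "[[:^ascii:]]"}


def may_fold_across_ascii(item: str, ascii_flag: bool) -> bool:
    """Return True if case folding for `item` may cross the ASCII boundary.

    `item` is a label such as r"\\w" or "[[:upper:]]"; `ascii_flag` says
    whether (?a) is in effect. Illustrative only, not an Onigmo API.
    """
    if item in ALWAYS_ASCII_BOUND:
        return False  # rule 1: independent of the ASCII flag
    is_word = item in (r"\w", r"\W")
    is_posix_bracket = item.startswith("[[:") and item.endswith(":]]")
    if is_word or is_posix_bracket:
        return not ascii_flag  # rule 2: restricted only when (?a) is on
    # Assumption: anything else (e.g. a literal "s") is not restricted here.
    return True
```

With this reading, `may_fold_across_ascii("[[:upper:]]", ascii_flag=True)` is false, so U+212A (KELVIN SIGN) would not be pulled into (?ia)[[:upper:]] through case folding.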

k-takata added a commit that referenced this issue Aug 8, 2014

k-takata added a commit that referenced this issue Aug 8, 2014

Merge branch 'master' into ruby-2.x
Fix character class with ignore case. (Issue #4)

Conflicts:
	regparse.c

@k-takata k-takata closed this in 1225fe1 Aug 8, 2014

k-takata Aug 8, 2014

Owner

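Details of this bug (translated from Japanese).

Symptom

(?ia)\w and (?ia)[\w] behave differently.
The former matches only within the ASCII range, whereas the latter also matches U+017F ("ſ", LATIN SMALL LETTER LONG S) and U+212a ("K", KELVIN SIGN).
When the ASCII flag (?a) is specified, \w should match only within the ASCII range, but once it is placed inside a character class it also matches characters outside the ASCII range.
The same problem affects \p{ASCII} and [[:ascii:]] (independent of (?a)) and the other POSIX classes (only when (?a) is specified).

Affected versions

  • Oniguruma 5.9.x as bundled with Ruby 1.9.1 to 1.9.3
    (In the original Oniguruma 5.9.x, \w matches over the Unicode range, so U+017F and U+212a match by design.)
  • Onigmo 5.10.0 to 5.14.2

Cause

When the ignore-case flag (?i) is specified, the parser handles a character class by first expanding character properties and POSIX classes into individual characters and then applying case folding, which yields a character class with both cases expanded. (E.g. (?ia)[[:upper:]] → (?i)[A-Z] → [A-Za-z])
During that step, case folding is carried out across the ASCII boundary even for character properties and similar items for which it must not be.

Fix

When parsing a character class and expanding character properties and POSIX classes into individual characters, decide whether case folding across the ASCII boundary is allowed, and build a separate character class (asc_cc) that collects only the characters for which it is allowed.
When case folding the original character class, check whether each character is contained in asc_cc; if it is not, do not case fold it across the ASCII boundary.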
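
The asc_cc check described above could be sketched roughly as follows. This is a minimal Python illustration, not the actual change to regparse.c: the `FOLD` table is a tiny hand-written stand-in for the real Unicode case-folding data, and `fold_class` and `is_ascii` are hypothetical names. Only `asc_cc` is taken from the description.

```python
# Tiny hand-written fold table (assumption: NOT the full Unicode data).
# Each character maps to the characters it is case-equivalent to.
FOLD = {
    "s": {"S", "\u017f"},       # U+017F LATIN SMALL LETTER LONG S
    "S": {"s", "\u017f"},
    "\u017f": {"s", "S"},
    "k": {"K", "\u212a"},       # U+212A KELVIN SIGN
    "K": {"k", "\u212a"},
    "\u212a": {"k", "K"},
}


def is_ascii(ch: str) -> bool:
    return ord(ch) < 0x80


def fold_class(chars: set, asc_cc: set) -> set:
    """Expand `chars` with case variants.

    A variant on the other side of the ASCII boundary is added only when
    the source character is a member of `asc_cc`.
    """
    result = set(chars)
    for ch in chars:
        for other in FOLD.get(ch, ()):
            crosses_boundary = is_ascii(ch) != is_ascii(other)
            if crosses_boundary and ch not in asc_cc:
                continue  # e.g. "s" that came from \w under (?a)
            result.add(other)
    return result


# "s" and "k" expanded from \w under (?ia): asc_cc is empty, so the
# result stays inside ASCII.
from_word_class = fold_class({"s", "k"}, asc_cc=set())

# A literal "s" next to a "k" from \w: only "s" may cross the boundary.
mixed = fold_class({"s", "k"}, asc_cc={"s"})
```

In the first call neither U+017F nor U+212A is added. In the second call U+017F is added because "s" is in asc_cc, while U+212A is still left out.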

k-takata added a commit to k-takata/bregonig that referenced this issue Sep 13, 2014

Ver.3.06
* Use Onigmo (Oniguruma-mod) 5.15.0 for bregonig.dll.
  https://github.com/k-takata/Onigmo/tree/Onigmo-5.15.0_for_bregonig
  - Support Unicode 7.0
  - Merge Oniguruma 5.9.5
  - Fix a crash when a large number of groups is used
    k-takata/Onigmo#24
  - Fix /\x{1ffc}/i =~ "\x1ff3" failing to match
  - Fix /[a-c#]+\W/ =~ "def#" failing to match in UTF-16/32
  - Fix /(?i)\u0149\u0149/ =~ "\u0149\u0149" failing to match
    k-takata/Onigmo#40
  - Fix a problem when \w is used inside a character class with the /i option
    k-takata/Onigmo#4
  - Fix character properties ignoring the /i option
    k-takata/Onigmo#41
  - Fix "ab" =~ /(?!^a).*b/ failing to match
    k-takata/Onigmo#44