ASCII only word , ignore case , Character class #264

Closed
tonco-miyazawa opened this issue Jun 22, 2022 · 8 comments

@tonco-miyazawa

tonco-miyazawa commented Jun 22, 2022

These tests were run with commit 50cdc3e applied.

-------- /test/test_utf8.c -------

  // Issue #264
  n("(?iI)s", "\xc5\xbf");
  n("(?iI)[s]", "\xc5\xbf");    // FAIL

  n("(?iI:s)", "\xc5\xbf");
  n("(?iI:[s])", "\xc5\xbf");    // FAIL

  x2("(?iI)(?:[[:word:]])", "\xc5\xbf", 0, 2);
  n("(?iI)(?W:[[:word:]])", "\xc5\xbf");     // FAIL

  n("(?iI)(?W:\\w)", "\xc5\xbf");
  n("(?iI)(?W:[\\w])", "\xc5\xbf");     // FAIL

  n("(?iI)(?W:\\p{Word})", "\xc5\xbf");
  n("(?iI)(?W:[\\p{Word}])", "\xc5\xbf");     // FAIL

  /* Are these by specification, or are they bugs? All of these tests currently report "OK". */
  x2("(?i)(?W:[[:word:]])", "\xc5\xbf", 0, 2);
  n("(?i)(?W:\\p{Word})", "\xc5\xbf");
  n("(?i)(?W:\\w)", "\xc5\xbf");
  x2("(?i)(?W:[\\w])", "\xc5\xbf", 0, 2);

  /* These work fine. */
  n("(?I)(?W:[[:word:]])", "\xc5\xbf");
  n("(?W:[[:word:]])", "\xc5\xbf");
  x2("(?i)s", "\xc5\xbf", 0, 2);
  x2("(?i)[s]", "\xc5\xbf", 0, 2);

I think these are bugs, but I apologize if this is the intended behavior.
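
For readers following the tests above: the subject string "\xc5\xbf" is the UTF-8 encoding of U+017F (LATIN SMALL LETTER LONG S, "ſ"), whose Unicode simple case fold is "s". That is why a case-insensitive "s" can match it. The following is a minimal, stdlib-only C sketch of that relationship. The function names decode_utf8_short and fold_simple are hypothetical and are not part of Oniguruma, and the fold table covers only ASCII letters and U+017F.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence of 1 or 2 bytes (enough for "\xc5\xbf").
   Returns the code point, or 0xFFFFFFFF for input this sketch does not
   handle. Overlong 2-byte forms are not rejected; this is illustrative only. */
static uint32_t decode_utf8_short(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    if (len >= 1 && s[0] < 0x80)
        return s[0];
    if (len >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80)
        return ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
    return 0xFFFFFFFFu;
}

/* Simple case folding, restricted to what this example needs:
   ASCII upper-case letters and U+017F (long s), which folds to 's'. */
static uint32_t fold_simple(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + ('a' - 'A');
    if (cp == 0x017F)
        return 's';
    return cp;
}
```

Under these assumptions, decoding the two bytes and folding the result yields 's', which is the comparison a case-insensitive matcher effectively makes for (?i)s.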

kkos added a commit that referenced this issue Jun 23, 2022
@tonco-miyazawa
Author

tonco-miyazawa commented Jun 23, 2022

Thank you for the fix. I will check the operation later.

@kkos
Owner

kkos commented Jun 25, 2022

x2("(?i)(?W:[[:word:]])", "\xc5\xbf", 0, 2);
n("(?i)(?W:\\p{Word})", "\xc5\xbf");
n("(?i)(?W:\\w)", "\xc5\xbf");
x2("(?i)(?W:[\\w])", "\xc5\xbf", 0, 2);

I was not aware of any new problems with (?W) and (?i).
Since \w is implemented in different ways, it is difficult to make them behave exactly the same.
I decided to summarize by saying that the behavior differs depending on whether it is in a character class or not.
IGNORECASE has no effect on word character type if it is not in a character class. (This makes no difference when (?W) is not specified.)
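
The rule described above can be illustrated with a toy model. This is only a sketch of the stated behavior, not Oniguruma's implementation: the names is_ascii_word, fold_simple and word_type_matches are hypothetical, the ASCII word set is assumed to be [A-Za-z0-9_], and the fold table covers only ASCII letters and U+017F ("ſ").

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASCII-only word character, assumed here to be [A-Za-z0-9_]. */
static bool is_ascii_word(uint32_t cp)
{
    return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') ||
           (cp >= '0' && cp <= '9') || cp == '_';
}

/* Simple case folding, restricted to ASCII letters and U+017F (long s). */
static uint32_t fold_simple(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + ('a' - 'A');
    if (cp == 0x017F)
        return 's';
    return cp;
}

/* Toy model of a word type under (?W).
   ignorecase: (?i) is in effect.
   in_class:   the word type is written inside a character class [...].
   Outside a character class, ignorecase is deliberately not consulted. */
static bool word_type_matches(uint32_t cp, bool ignorecase, bool in_class)
{
    assert(cp <= 0x10FFFF); /* must be a valid Unicode code point */
    if (is_ascii_word(cp))
        return true;
    if (ignorecase && in_class)
        return is_ascii_word(fold_simple(cp));
    return false;
}
```

With this model, U+017F matches for (?i)(?W:[[:word:]]) and (?i)(?W:[\w]) but not for (?i)(?W:\w), which agrees with the four results quoted at the top of this comment.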

@tonco-miyazawa

This comment was marked as off-topic.

@kkos
Owner

kkos commented Jun 26, 2022

My English was not clear and I don't think my message got across, so I will write it again.

I agree that it is odd that these four results do not match.
However, for the word type, it is difficult to have exactly the same behavior because of the way it is implemented.
Therefore, only the word type has a special specification for IGNORECASE. But if you are not using the (?W) option, you probably don't need to worry about it.

@tonco-miyazawa
Author

tonco-miyazawa commented Jun 26, 2022

Sorry for the inconvenience.

it is difficult to have exactly the same behavior

This is not very desirable because it results in non-intuitive behavior, but I think
it cannot be helped given the complex constraints.

Those who use the (?W) option will read /doc/RE carefully, so if a note is added there,
I think it will not be much of a problem.


About how the behavior of \p{Word} differs from \w and [[:word:]]

perlre.pod
https://metacpan.org/dist/perl/view/pod/perlre.pod#/a-(and-/aa)

With /a, one can write \d with confidence that it will only match ASCII characters, and should the need arise to match beyond ASCII, you can instead use \p{Digit} (or \p{Word} for \w).

perlreref.pod
https://metacpan.org/dist/perl/view/pod/perlreref.pod#OPERATORS

a restrict \d, \s, \w and [:posix:] to match ASCII only
aa (two a's) also /i matches exclude ASCII/non-ASCII

In other words, Perl behaves as follows.

\w and [[:word:]] : "/a" , "/aa" apply to these.

\p{Word} : Not affected by "/a" and "/aa".

I have one question. Unlike Perl, oniguruma also applies the (?W) option to \p{Word}.
Is this difference from Perl intentional?

/doc/RE - oniguruma

W: ASCII only word (\w, \p{Word}, [[:word:]])



The following tests check whether each pattern matches "ſ" ( "ſ" = \xc5\xbf ) in Perl 5.35.11.

/s/i        # match:  (ſ)
/s/ia       # match:  (ſ)
/s/iaa      # --search fail--

/[s]/i      # match:  (ſ)
/[s]/ia     # match:  (ſ)
/[s]/iaa    # --search fail--

/\w/i      # match:  (ſ)
/\w/ia     # --search fail--
/\w/iaa    # --search fail--

/[\w]/i      # match:  (ſ)
/[\w]/ia     # --search fail--
/[\w]/iaa    # --search fail--

/[[:word:]]/i       # match:  (ſ)
/[[:word:]]/ia      # --search fail--
/[[:word:]]/iaa     # --search fail--

/\p{Word}/i      # match:  (ſ)
/\p{Word}/ia     # match:  (ſ)
/\p{Word}/iaa    # match:  (ſ)

/[\p{Word}]/i      # match:  (ſ)
/[\p{Word}]/ia     # match:  (ſ)
/[\p{Word}]/iaa    # match:  (ſ)
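
The table above can be condensed into a small decision function. This is only a summary of the results reported in this comment, written as a C sketch; the enum and function names are hypothetical, and nothing here calls Perl or Oniguruma.

```c
#include <assert.h>
#include <stdbool.h>

/* The three groups of constructs that behave alike in the table. */
enum construct {
    LITERAL_S,        /* s, [s]                      */
    POSIX_WORD,       /* \w, [\w], [[:word:]]        */
    UNICODE_PROPERTY  /* \p{Word}, [\p{Word}]        */
};

/* Perl character-set modifier combined with /i. */
enum charset_mod { MOD_NONE, MOD_A, MOD_AA };

/* Does the construct match U+017F under /i plus the given modifier,
   according to the results reported above? */
static bool matches_long_s_under_i(enum construct c, enum charset_mod m)
{
    switch (c) {
    case LITERAL_S:
        return m != MOD_AA;   /* only /aa blocks ASCII/non-ASCII folding */
    case POSIX_WORD:
        return m == MOD_NONE; /* /a and /aa restrict these to ASCII */
    case UNICODE_PROPERTY:
        return true;          /* not affected by /a or /aa */
    }
    assert(!"unknown construct");
    return false;
}
```

Read this way, /a restricts only the POSIX-style word constructs, /aa additionally blocks case-insensitive matching between ASCII and non-ASCII characters, and \p{Word} is unaffected by either.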

@kkos
Owner

kkos commented Jun 27, 2022

Documentation will be written when the specifications are finalized.
Thanks for the explanation of the /a, /aa options in Perl.
On that point, the differences between Oniguruma and Perl are not intentional.
When I thought about that option, I had little interest in Perl's /a, /aa options.
I might change it so that it only gets close when ONIG_SYNTAX_PERL.

@tonco-miyazawa
Author

tonco-miyazawa commented Jun 28, 2022

I had little interest in Perl's /a, /aa options.

I see. So this can be considered a specification unique to oniguruma.

I might change it so that it only gets close when ONIG_SYNTAX_PERL.

That is a tricky problem when backward compatibility is taken into account.

Documentation will be written when the specifications are finalized.

I understand. Thank you for your answer.


* I collapsed the following because it proposes a measure that has already been applied.

Using (?-i) served as a temporary workaround in a past Onigmo issue (92).
It would be nice if applying (?-i) internally solved this, but I am not sure whether it would.
The following tests check whether each pattern matches "ſ" ( "ſ" = \xc5\xbf ) in Perl 5.35.11.

(?-i) Results

/s/        #  ** search fail **
/s/a       #  ** search fail **
/s/aa      #  ** search fail **

/[s]/      #  ** search fail **
/[s]/a     #  ** search fail **
/[s]/aa    #  ** search fail **

/\w/      #  match: (ſ)
/\w/a     #  ** search fail **
/\w/aa    #  ** search fail **

/[\w]/      #  match: (ſ)
/[\w]/a     #  ** search fail **
/[\w]/aa    #  ** search fail **

/[[:word:]]/       #  match: (ſ)
/[[:word:]]/a      #  ** search fail **
/[[:word:]]/aa     #  ** search fail **

/\p{Word}/      #  match: (ſ)
/\p{Word}/a     #  match: (ſ)
/\p{Word}/aa    #  match: (ſ)

/[\p{Word}]/      #  match: (ſ)
/[\p{Word}]/a     #  match: (ſ)
/[\p{Word}]/aa    #  match: (ſ)

@tonco-miyazawa
Author

tonco-miyazawa commented Jun 29, 2022

I'm sorry, I overlooked part of your reply.

IGNORECASE has no effect on word character type if it is not in a character class.

Until now, I didn't understand that this was synonymous with applying (?-i). My apologies.
I'm always amazed at how quickly you fix things. Thank you for the fix.


[ Added: Aug 28, 2023 ] This thread was hard to read so I edited it to make it easier to read.

kkos added a commit that referenced this issue Jul 2, 2022
@kkos kkos closed this as completed Jul 2, 2022