Questions #12

hediyi · 2016-07-27T03:55:50Z

Hi @kkos, I want to create a wiki page for this, and I just want to make sure I get it right. :)

Outside character classes, characters that must be escaped to be used literally are:

^$.?*+()[\|

In character classes, characters that must be escaped to be used literally are:

^-]\ (please correct me if I get it wrong)

^ and - can get away with this by "clever placement."

What really confuses me is [ and ] in character classes (though I think they both should be escaped for clarity):

In Ruby 2.3, [ must be escaped in classes, and an unescaped ] raises a warning but is allowed.
But in some other implementations (Atom) that use Oniguruma, [ can be used unescaped in classes but ] can't. 😂
In POSIX, ] can be used literally by placing it at the start of a class. In Oniguruma, it can't, right?

kkos · 2016-07-28T09:16:56Z

> Outside character classes, characters that must be escaped to be used literally are: ^$.?*+()[\|

Yes.

> In character classes, characters that must be escaped to be used literally are: ^-]\

plus [

> But in some other implementations (Atom) that use Oniguruma, [ can be used unescaped in classes but ] can't.

No.
I don't know what syntax Atom uses, but I suppose it is ONIG_SYNTAX_DEFAULT.
In a char-class, [ means the start of a nested char-class. You must escape [ to use it as a normal character.
You must escape ] in a char-class unless it appears at the beginning of the char-class.

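#include <stdio.h>
#include "oniguruma.h"

/* Print Oniguruma warnings to stderr. */
static void warn_func(const char* s)
{
  fprintf(stderr, "WARN: %s\n", s);
}

extern int main(int argc, char* argv[])
{
  onig_set_warn_func(warn_func);

  /* exec() is a helper, omitted here, that compiles the pattern "[]a]"
     and matches it against the string "]". */
  exec(ONIG_ENCODING_UTF8, ONIG_ENCODING_UTF8, ONIG_OPTION_NONE,
       "[]a]", "]");

  return 0;
}

(exec() code omitted)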

result:
WARN: character class has ']' without escape: /[]a]/
match at 0  (UTF-8)
0: (0-1)
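
The escaping rules discussed above can be sketched as a small helper. This is only a minimal sketch in plain C that does not call Oniguruma at all. The function names `escape_outside_class` and `escape_inside_class` are hypothetical, and the two character sets are simply the ones listed in this thread, assuming ONIG_SYNTAX_DEFAULT.

```c
#include <stddef.h>
#include <string.h>

/* Characters this thread lists as needing a backslash outside a character class. */
static const char OUTSIDE_CLASS[] = "^$.?*+()[\\|";
/* Characters this thread lists as needing a backslash inside a character class. */
static const char INSIDE_CLASS[] = "^-]\\[";

/* Copy src into dst, putting a backslash before every character found in
   `special`. Returns the number of bytes written (not counting the
   terminating NUL), or -1 if dst is too small. */
static int escape_with(const char *src, const char *special,
                       char *dst, size_t dst_size)
{
  size_t n = 0;

  if (dst_size == 0) return -1;
  for (; *src != '\0'; src++) {
    size_t need = (strchr(special, *src) != NULL) ? 2 : 1;
    if (n + need + 1 > dst_size) return -1;  /* keep room for the NUL */
    if (need == 2) dst[n++] = '\\';
    dst[n++] = *src;
  }
  dst[n] = '\0';
  return (int)n;
}

/* Hypothetical helper: escape text for literal use outside a character class. */
int escape_outside_class(const char *src, char *dst, size_t dst_size)
{
  return escape_with(src, OUTSIDE_CLASS, dst, dst_size);
}

/* Hypothetical helper: escape text for literal use inside a character class. */
int escape_inside_class(const char *src, char *dst, size_t dst_size)
{
  return escape_with(src, INSIDE_CLASS, dst, dst_size);
}
```

Escaping every listed character, rather than relying on placement of ^, - or ], avoids the warning shown in the result above.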

hediyi · 2016-07-29T01:49:31Z

✨ Thanks very much for the clarification! ✨

https://github.com/kkos/oniguruma/wiki/Characters-That-Must-Be-Escaped-to-Be-Used-Literally Did I miss anything?

And could you keep this open just in case I have further questions? :))

kkos · 2016-08-03T09:14:26Z

Thank you.
The content is correct.

hediyi · 2016-08-13T09:33:34Z

Isn't \w equivalent to [0-9A-Z_a-z]? I thought it was, but I found it can match much much more than that.

kkos · 2016-08-13T14:37:37Z

\w is equivalent to [0-9A-Z_a-z] if you are using an ASCII encoding.
In Unicode encodings (UTF-8, UTF-16, UTF-32), \w matches many more code points.
The Unicode word code point data is defined in CR_Word[] in src/unicode_property_data.c

hediyi · 2016-08-14T08:55:35Z

So in UTF-8 encoding, \w can match the characters that are mapped to the code points defined in CR_Word[], right?

kkos · 2016-08-14T14:46:12Z

Yes. And 654 is the number of code ranges.
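
To make the idea of a code-range table concrete, here is a minimal sketch of a lookup over such a table. It assumes a layout in which the first element is the number of ranges, followed by (from, to) pairs in ascending order. The type `CodePoint`, the table `ASCII_WORD_RANGES` and the function `in_code_ranges` are hypothetical stand-ins, not Oniguruma's actual code; the real data is in CR_Word[] in src/unicode_property_data.c.

```c
#include <stdint.h>

typedef uint32_t CodePoint;  /* hypothetical stand-in for a code point type */

/* Hypothetical table: range count first, then (from, to) pairs.
   These four ranges correspond to [0-9A-Z_a-z]. */
static const CodePoint ASCII_WORD_RANGES[] = {
  4,
  0x0030, 0x0039,  /* 0-9 */
  0x0041, 0x005a,  /* A-Z */
  0x005f, 0x005f,  /* _   */
  0x0061, 0x007a   /* a-z */
};

/* Binary search over the sorted ranges.
   Returns 1 if code falls inside any range, 0 otherwise. */
int in_code_ranges(const CodePoint *table, CodePoint code)
{
  CodePoint low = 0;
  CodePoint high = table[0];  /* number of ranges */

  while (low < high) {
    CodePoint mid = low + (high - low) / 2;
    CodePoint from = table[1 + 2 * mid];
    CodePoint to = table[2 + 2 * mid];

    if (code < from) high = mid;
    else if (code > to) low = mid + 1;
    else return 1;
  }
  return 0;
}
```

With a Unicode table the same kind of lookup would cover far more ranges, which is consistent with \w matching much more than [0-9A-Z_a-z] under UTF-8.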

hediyi · 2016-08-15T06:59:41Z

Thanks for your replies. They really helped me a lot!

So similarly, in UTF-8 encoding, \d can match stuff specified in CR_Digit[], right? And can I assume [0-9_A-Za-z] is "faster" than \w in UTF-8?

Do you mind if I help edit the doc of RE? It would just be some small refinements to make it easier to understand.

kkos · 2016-08-15T23:37:22Z

I do not know the difference in speed.
Because single-byte code ranges and multi-byte code ranges are handled separately in the compiled regexp code, there may not be much difference.

I have added you as a collaborator.
Please edit the files in doc.

kkos · 2016-08-15T23:46:17Z

And please edit the develop branch, not the master branch.

hediyi · 2016-08-16T00:35:28Z

👌 😉

hediyi · 2016-08-16T08:33:09Z

About \G, does it mean "where the current match attempt begins", i.e., either \A or "where the last match left off"?

kkos · 2016-08-16T14:44:23Z

onig_search() has a search-range argument (start, range) and a string argument (str, end).
The search range is the range of positions at which a match may start.

\G means the start position of the search range.
\A means the start position of the string.
In most cases, the search range and the string start at the same position, so \G == \A.
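
The distinction can be modeled without the library. The sketch below is a hand-coded stand-in, not Oniguruma code: `match_at_start` imitates what a hypothetical pattern such as `\G[0-9]+;` would accept, and `count_anchored_matches` imitates a caller that moves the search start to the end of the previous match. Both function names are assumptions made for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Hand-coded stand-in for matching "\G[0-9]+;" at `start`.
   Returns a pointer to the end of the match, or NULL if no match
   begins exactly at `start`. */
static const char *match_at_start(const char *start, const char *end)
{
  const char *p = start;

  while (p < end && isdigit((unsigned char)*p)) p++;
  if (p == start || p == end || *p != ';') return NULL;
  return p + 1;
}

/* Count consecutive matches in [str, end), moving the search start to the
   end of each match before the next attempt. */
int count_anchored_matches(const char *str, const char *end)
{
  int count = 0;
  const char *start = str;  /* first attempt: start == str, so \G == \A */

  while (start < end) {
    const char *next = match_at_start(start, end);
    if (next == NULL) break;  /* \G does not allow skipping ahead */
    count++;
    start = next;
  }
  return count;
}
```

For the input `12;34;x56;` this model stops after two matches, because the third attempt starts at `x` and the anchor prevents the search from jumping ahead to `56;`.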

hediyi · 2016-08-17T13:42:21Z

I tried digging into the definition of onig_search, but I could understand only part of it.

I was trying to expand on and clarify some definitions in doc/RE; one of them is the definition of \G:

\G matching start position

\G is useful in a regexp applied to the same string more than once. To define it from the user's standpoint: the first time the regexp is applied, it matches at the beginning of the string; later it matches where the last match ended, right? So I want to change the definition to something like

beginning of the current search attempt

Does it look correct to you?

kkos · 2016-08-17T19:12:33Z

Yes. You are right.

hediyi · 2016-08-19T08:41:36Z

I was wondering, ~~apart from that [:...:] is only available in character class, what really is different between Unicode properties and the POSIX notation, why do we need them both~~? I mean, Unicode properties are so much more powerful, why do we still need the POSIX notation?

kkos · 2016-08-19T14:56:07Z

Thank you for improving the document.

The POSIX bracket is poorer than Unicode properties.
It is no longer necessary.
But it was included in the GNU regex library.
My first goal was to make a GNU regex compatible library, and after that I introduced character property functions, etc., from Perl.

hediyi · 2016-08-20T01:22:17Z

👌 No problem, and I can get to know more about how Oniguruma processes regexps along the way. 😌 So are you planning to deprecate POSIX brackets?

New question: in doc/RE,

In the back reference by the multiplex definition name,

What is a multiplex definition name?

Mind if I email you with the questions instead?

kkos · 2016-08-21T03:08:15Z

I will remove POSIX brackets in version 7.0 if they are removed in Perl6.
But I don't know whether they have been removed or not.

You can assign one name to two or more groups.

hediyi changed the title ~~Questions about escaping~~ Questions Aug 3, 2016

kkos added the question label Aug 23, 2016

hediyi closed this as completed Apr 19, 2017
