Questions #12

Closed
hediyi opened this issue Jul 27, 2016 · 19 comments
@hediyi
Collaborator

hediyi commented Jul 27, 2016

Hi @kkos, I want to create a wiki page for this, and I just want to make sure I get it right. :)

  • Outside character classes, characters that must be escaped to be used literally are:

    ^$.?*+()[\|

  • In character classes, characters that must be escaped to be used literally are:

    ^-]\ (please correct me if I get it wrong)

    ^ and - can get away with this by "clever placement."

What really confuses me is [ and ] in character classes (though I think they both should be escaped for clarity):

  • In Ruby 2.3, [ must be escaped in classes and unescaped ] raises warnings but is allowed.
  • But in some other implementations that use Oniguruma (e.g. Atom), [ can be used unescaped in classes but ] can't. 😂
  • In POSIX, ] can be used literally by placing it at the start of a class. In Oniguruma, it can't, right?
@kkos
Owner

kkos commented Jul 28, 2016

  1. Outside character classes, characters that must be escaped to be used literally are: ^$.?*+()[\|
    Yes.
  2. In character classes, characters that must be escaped to be used literally are: ^-]\
    plus [
  3. But in some other implementations that use Oniguruma (e.g. Atom), [ can be used unescaped in classes but ] can't.
    No.
    I don't know what syntax Atom uses, but I suppose it is ONIG_SYNTAX_DEFAULT.
    Inside a character class, [ starts a nested character class, so you must escape [ to use it as a normal character.
    You must escape ] in a character class unless it is the first character of the class.
#include <stdio.h>
#include "oniguruma.h"

/* Report Oniguruma warnings on stderr. */
static void warn_func(const char* s)
{
  fprintf(stderr, "WARN: %s\n", s);
}

int main(int argc, char* argv[])
{
  onig_set_warn_func(warn_func);

  /* Pattern "[]a]" applied to the string "]". */
  exec(ONIG_ENCODING_UTF8, ONIG_ENCODING_UTF8, ONIG_OPTION_NONE,
       "[]a]", "]");

  return 0;
}

(exec() code omitted)

result:
WARN: character class has ']' without escape: /[]a]/
match at 0  (UTF-8)
0: (0-1)
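
The escaping rules confirmed above could be applied mechanically with a small helper. The following is only a sketch in plain C: `escape_with`, `escape_outside_class`, and `escape_in_class` are hypothetical names, not part of the Oniguruma API, and the character lists are just the ones named in this thread, assuming ONIG_SYNTAX_DEFAULT.

```c
#include <stddef.h>
#include <string.h>

/* Characters this thread lists as needing a backslash to be literal. */
static const char OUTSIDE_CLASS[] = "^$.?*+()[\\|";
static const char INSIDE_CLASS[]  = "^-]\\[";

/*
 * Copy src into dst, putting a backslash before every character found in
 * `special`. Returns the number of bytes written (excluding the NUL), or
 * (size_t)-1 if dst (capacity dst_size) is too small.
 */
static size_t escape_with(const char *special, const char *src,
                          char *dst, size_t dst_size)
{
  size_t n = 0;
  for (; *src != '\0'; src++) {
    size_t need = (strchr(special, *src) != NULL) ? 2 : 1;
    if (n + need >= dst_size) return (size_t)-1;  /* keep room for the NUL */
    if (need == 2) dst[n++] = '\\';
    dst[n++] = *src;
  }
  if (n >= dst_size) return (size_t)-1;           /* empty src, zero-size dst */
  dst[n] = '\0';
  return n;
}

/* Escape text for literal use outside a character class. */
size_t escape_outside_class(const char *src, char *dst, size_t dst_size)
{
  return escape_with(OUTSIDE_CLASS, src, dst, dst_size);
}

/* Escape text for literal use inside a character class. */
size_t escape_in_class(const char *src, char *dst, size_t dst_size)
{
  return escape_with(INSIDE_CLASS, src, dst, dst_size);
}
```

Escaping every listed character, rather than relying on "clever placement" of ^, - or ], keeps the result free of the warning shown in the output above.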

@hediyi
Collaborator Author

hediyi commented Jul 29, 2016

✨ Thanks very much for the clarification! ✨

https://github.com/kkos/oniguruma/wiki/Characters-That-Must-Be-Escaped-to-Be-Used-Literally Did I miss anything?

And could you keep this open just in case I have further questions? :))

@kkos
Owner

kkos commented Aug 3, 2016

Thank you.
The content is correct.

hediyi changed the title from "Questions about escaping" to "Questions" on Aug 3, 2016
@hediyi
Collaborator Author

hediyi commented Aug 13, 2016

Isn't \w equivalent to [0-9A-Z_a-z]? I thought it was, but I found it can match much much more than that.

@kkos
Owner

kkos commented Aug 13, 2016

\w is equivalent to [0-9A-Z_a-z] if you are using ASCII encoding.
In the Unicode encodings (UTF-8, UTF-16, UTF-32), \w matches many more code points.
The Unicode word code point data is defined in CR_Word[] in src/unicode_property_data.c
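
For the ASCII case only, the equivalence stated above can be modelled as a plain C predicate. This sketch does not call Oniguruma; `is_ascii_word` and `ascii_word_prefix_len` are hypothetical names chosen for illustration. It also shows why text that \w accepts under UTF-8 can fall outside [0-9A-Z_a-z]: the bytes of a multi-byte character are not in the ASCII set.

```c
#include <stddef.h>

/* Returns 1 if c is in [0-9A-Z_a-z], else 0. */
int is_ascii_word(unsigned char c)
{
  return (c >= '0' && c <= '9') ||
         (c >= 'A' && c <= 'Z') ||
         (c >= 'a' && c <= 'z') ||
         c == '_';
}

/* Number of leading bytes of s that are in [0-9A-Z_a-z]. */
size_t ascii_word_prefix_len(const char *s)
{
  size_t n = 0;
  while (s[n] != '\0' && is_ascii_word((unsigned char)s[n])) n++;
  return n;
}
```

For the UTF-8 bytes of "café", `ascii_word_prefix_len` stops after "caf", because the two bytes that encode "é" are outside the ASCII class.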

@hediyi
Collaborator Author

hediyi commented Aug 14, 2016

So in UTF-8 encoding, \w can match the characters that are mapped to the code points defined in CR_Word[], right?

@kkos
Owner

kkos commented Aug 14, 2016

Yes. And 654 is the number of code ranges.

@hediyi
Collaborator Author

hediyi commented Aug 15, 2016

Thanks for your replies. They really helped me a lot!

So similarly, in UTF-8 encoding, \d can match stuff specified in CR_Digit[], right? And can I assume [0-9_A-Za-z] is "faster" than \w in UTF-8?

Do you mind if I help edit the doc of RE? It would just be some small refinements to make it easier to understand.

@kkos
Owner

kkos commented Aug 15, 2016

I do not know the speed difference.
Because single-byte and multi-byte code ranges are kept separate in the compiled regexp code, there may not be much difference.

I have added you as a collaborator.
Please edit the files in doc.

@kkos
Owner

kkos commented Aug 15, 2016

And please edit the develop branch, not the master branch.

@hediyi
Collaborator Author

hediyi commented Aug 16, 2016

👌 😉

@hediyi
Collaborator Author

hediyi commented Aug 16, 2016

About \G, does it mean "where the current match attempt begins", i.e., either \A or "where the last match left off"?

@kkos
Owner

kkos commented Aug 16, 2016

onig_search() takes search-range arguments (start, range) and string arguments (str, end).
The search range is the range of positions at which a match may start.

\G means the start position of the search range.
\A means the start position of the string.
In most cases the search range and the string are the same, so \G == \A.
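
The distinction can be illustrated with a toy model in plain C. This sketch does not call Oniguruma: `search_digits` is a hypothetical function that imitates searching for \d+ or \G\d+, and its `start` and `range` parameters only mimic the roles described above, with the assumption that `range` is exclusive.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Toy model of searching str[0..len) for \d+ (anchored_at_g == 0) or
 * \G\d+ (anchored_at_g != 0). Match start positions are tried from
 * `start` up to, but not including, `range`.
 * Returns the match start, or -1 if nothing matches; on success the
 * position just past the match is stored in *match_end.
 */
long search_digits(const char *str, size_t len, size_t start, size_t range,
                   int anchored_at_g, size_t *match_end)
{
  size_t pos;
  for (pos = start; pos < range && pos < len; pos++) {
    size_t e = pos;
    while (e < len && isdigit((unsigned char)str[e])) e++;
    if (e > pos) {             /* at least one digit matched at pos */
      *match_end = e;
      return (long)pos;
    }
    if (anchored_at_g) break;  /* \G only allows a match at `start` */
  }
  return -1;
}
```

On the string "12 34", the anchored search succeeds with start = 0, where \G and \A coincide. With start = 2 it fails, because position 2 holds a space, while the unanchored search moves on and finds "34" at position 3.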

@hediyi
Collaborator Author

hediyi commented Aug 17, 2016

I tried digging around the definition of onig_search, but I could understand only part of it.

I was trying to expand on and clarify some definitions in doc/RE. One of them is \G:

\G matching start position

\G is useful in a regexp applied to the same string more than once. To define it from the user's standpoint: the first time the regexp is applied, it matches at the beginning of the string; later, it matches where the last match ended, right? So I wanna change the definition to something like

beginning of the current search attempt

Does it look correct to you?

@kkos
Owner

kkos commented Aug 17, 2016

Yes. You are right.

@hediyi
Collaborator Author

hediyi commented Aug 19, 2016

I was wondering: apart from the fact that [:...:] is only available in character classes, what is really different between Unicode properties and the POSIX notation? Why do we need them both? I mean, Unicode properties are so much more powerful, so why do we still need the POSIX notation?

@kkos
Owner

kkos commented Aug 19, 2016

Thank you for improving the document.

The POSIX bracket is poorer than the Unicode property.
It is no longer necessary.
But it was included in the GNU regex library.
My first goal was to make a GNU regex compatible library, and after that I introduced character properties and other features from Perl.

@hediyi
Collaborator Author

hediyi commented Aug 20, 2016

👌 No problem, and I can get to know more about how Oniguruma processes regexps along the way. 😌 So are you planning to deprecate POSIX brackets?

New question: in doc/RE,

In the back reference by the multiplex definition name,

What is a multiplex definition name?

Mind if I email you with the questions instead?

@kkos
Owner

kkos commented Aug 21, 2016

I will remove POSIX brackets in version 7.0 if they are removed in Perl 6.
But I don't know whether they have been removed or not.

You can assign one name to two or more groups.

kkos added the question label Aug 23, 2016
hediyi closed this as completed Apr 19, 2017