Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

With ONIG_SYNTAX_DEFAULT pattern '(?s)ABC' fails with 'undefined group option'. #297

Closed
spershin opened this issue Jun 3, 2024 · 12 comments

Comments

@spershin
Copy link

spershin commented Jun 3, 2024

We are trying to use Oni as Regex library in Velox for Presto Native Execution.
Another issue we've encountered is:

With ONIG_SYNTAX_DEFAULT pattern (?s)ABC fails with undefined group option.

I confirmed that group option 's' works with ONIG_SYNTAX_JAVA, however the latter does not support named capture groups like ?<digit>, which is also critical for us.

I was wondering if it is possible to add support of 's' to ONIG_SYNTAX_DEFAULT.

Thank you!

@spershin spershin changed the title With ONIG_SYNTAX_DEFAULT pattern '(?s)ABC' fails with 'undefined group option Top-level Expression'. With ONIG_SYNTAX_DEFAULT pattern '(?s)ABC' fails with 'undefined group option'. Jun 3, 2024
@tonco-miyazawa
Copy link

Please use (?m) .

  • ONIG_SYNTAX_DEFAULT == ONIG_SYNTAX_ONIGURUMA

ONIG_SYNTAX_DEFAULT default (== ONIG_SYNTAX_ONIGURUMA)

  • (?s) and (?m)

oniguruma/doc/RE

Lines 520 to 528 in 1b37592

-----------------------------
A-1. Syntax-dependent options
+ ONIG_SYNTAX_ONIGURUMA
(?m): dot (.) also matches newline
+ ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
(?s): dot (.) also matches newline
(?m): ^ matches after newline, $ matches before newline

  • ONIG_SYNTAX_ONIGURUMA : " ^ matches after newline, $ matches before newline " is valid.

oniguruma/doc/RE

Lines 172 to 183 in 1b37592

5. Anchors
^ beginning of the line
$ end of the line
\b word boundary
\B non-word boundary
\A beginning of string
\Z end of string, or before newline at the end
\z end of string
\G where the current search attempt begins
\K keep (keep start position of the result string)

@spershin
Copy link
Author

spershin commented Jun 4, 2024

@tonco-miyazawa

Thank you for replying.

I would gladly follow your advice if it was possible.
Our situation is the following:

We are migrating from Presto Java to Presto Native (C++) with the latter using Velox library.
Presto Java uses JONI (Java port of Oni) for regex implementation.
Velox uses RE2, which is missing a lot of crucial functionality such as backtracking.

So, we are looking/investigating at adopting Oni instead.
Why?
Because we have hundreds of customers with established workloads and there is no way we can change their queries - too many queries, too many customers, too much negative impact if we break them.
So, ideally we would like to have functionality and features in Oni to be as close to JONI as possible.
This gives us a lot of headache, honestly. :)

Would it be possible to also support ?s along with ?m?
I don't know the code and never did anything around regex much up until now, so cannot really tell how hard it can be.

We could, potentially, scan and rewrite the pattern before using it with Oni, however in such cases how can we sure that we can safely rewrite any ?s sequence into ?m? Also, what it is a combined sequence, like ?isq or something like that? Such rewrite becomes complicated and error-prone without deep knowledge of the patterns and syntax.

Thank you!

@tonco-miyazawa
Copy link

Sorry, I'm an end user so I don't know the details.

I think you can add an option 'ONIG_SYN_OP2_QMARK_LT_NAMED_GROUP' to ONIG_SYNTAX_JAVA.

oniguruma/src/regsyntax.c

Lines 146 to 169 in fd33056

OnigSyntaxType OnigSyntaxJava = {
(( SYN_GNU_REGEX_OP | ONIG_SYN_OP_QMARK_NON_GREEDY |
ONIG_SYN_OP_ESC_CONTROL_CHARS | ONIG_SYN_OP_ESC_C_CONTROL |
ONIG_SYN_OP_ESC_OCTAL3 | ONIG_SYN_OP_ESC_X_HEX2 )
& ~ONIG_SYN_OP_ESC_LTGT_WORD_BEGIN_END )
, ( ONIG_SYN_OP2_ESC_CAPITAL_Q_QUOTE | ONIG_SYN_OP2_QMARK_GROUP_EFFECT |
ONIG_SYN_OP2_OPTION_PERL | ONIG_SYN_OP2_PLUS_POSSESSIVE_REPEAT |
ONIG_SYN_OP2_PLUS_POSSESSIVE_INTERVAL | ONIG_SYN_OP2_CCLASS_SET_OP |
ONIG_SYN_OP2_ESC_V_VTAB | ONIG_SYN_OP2_ESC_U_HEX4 |
ONIG_SYN_OP2_ESC_P_BRACE_CHAR_PROPERTY )
, ( SYN_GNU_REGEX_BV | ONIG_SYN_ISOLATED_OPTION_CONTINUE_BRANCH |
ONIG_SYN_DIFFERENT_LEN_ALT_LOOK_BEHIND |
ONIG_SYN_VARIABLE_LEN_LOOK_BEHIND )
, ONIG_OPTION_SINGLELINE
,
{
(OnigCodePoint )'\\' /* esc */
, (OnigCodePoint )ONIG_INEFFECTIVE_META_CHAR /* anychar '.' */
, (OnigCodePoint )ONIG_INEFFECTIVE_META_CHAR /* anytime '*' */
, (OnigCodePoint )ONIG_INEFFECTIVE_META_CHAR /* zero or one time '?' */
, (OnigCodePoint )ONIG_INEFFECTIVE_META_CHAR /* one or more time '+' */
, (OnigCodePoint )ONIG_INEFFECTIVE_META_CHAR /* anychar anytime */
}
};

I think #296 is very helpful.
Please see fd33056. This is the fix for #296.

This adds ONIG_SYN_OP2_QMARK_GROUP_EFFECT to OnigSyntaxEmacs.

These options are documented in SYNTAX.md .
https://github.com/kkos/oniguruma/blob/master/doc/SYNTAX.md

Or do you want to use ONIG_SYNTAX_DEFAULT?
In this case, I think a fix to the regular expression parser is needed.

@tonco-miyazawa
Copy link

@spershin

Is the syntax you used "Syntax Java" in JONI ?
Please make sure that the syntax is not something other than "Syntax Java".

JONI : options of "Syntax Java"
https://github.com/jruby/joni/blob/3750dcbf58cb5cfad76c1a3ae11cc2be65ba7fe1/src/org/joni/Syntax.java#L624-L652

I did a test to add "ONIG_SYN_OP2_QMARK_LT_NAMED_GROUP" to oniguruma's "ONIG_SYNTAX_JAVA".
The test was successful.

I think it would be good to make the syntax options of JONI and oniguruma similar.

Would it be possible to also support ?s along with ?m?

This is a question that should be answered by the Owner so I cannot answer it, sorry.

scan and rewrite the pattern before using it with Oni

This is very difficult and not recommended.


Please try "ONIG_SYNTAX_PERL_NG" of oniguruma.
(?s) and "naming capture groups" are available.

@spershin
Copy link
Author

spershin commented Jun 4, 2024

@tonco-miyazawa

This seems useful. Thank you.
I haven't read it though thoroughly yet as I'm continuing to investigate other possible issues of using Oni in Velox before spending considerable effort at fixing any of them.

To confirm (or disprove) that I understand it right:

Does it mean that we can build our own OnigSyntaxType structure that can encompass multiple features that we need?
For example, in this case we could take ONIG_SYNTAX_DEFAULT as the base and then add a bit so ?s is activated?

Or, as you suggested, if ONIG_SYNTAX_JAVA is good for us as well and is only missing ?<digit> support, then we can add ONIG_SYN_OP2_QMARK_LT_NAMED_GROUP to it or to our custom syntax based on the ONIG_SYNTAX_JAVA.

Thank you!

@spershin
Copy link
Author

spershin commented Jun 4, 2024

@tonco-miyazawa

We use Java syntax, it seems:

regex = new Regex(pattern.getBytes(), 0, pattern.length(), Option.DEFAULT, NonStrictUTF8Encoding.INSTANCE, Syntax.Java);

However, we are most likely be using a forked repo and not the master JONI.
Trying to confirm this now.
Update: Presto Java uses https://github.com/airlift/joni fork of JONI.

The path to the JONI jar is like that:

.../.m2/repository/io/airlift/joni/2.1.5.3/joni-2.1.5.3.jar!/io/airlift/joni/Regex.class

JONI repo does not have such version in the list of tags.

I agree the syntax should be the same, if possible, however, already there are some serious differences, like the named capture group.

I'll try ONIG_SYNTAX_PERL_NG syntax too, thank you!

@tonco-miyazawa
Copy link

I ported the options from JONI-2.1.5.3 's "Syntax.Java" to oniguruma's "OnigSyntaxJava" .

/*  JONI-2.1.5.3 's options for oniguruma-6.9.9 */
OnigSyntaxType OnigSyntaxJava = {
  (( SYN_GNU_REGEX_OP | ONIG_SYN_OP_QMARK_NON_GREEDY |
     ONIG_SYN_OP_ESC_CONTROL_CHARS | ONIG_SYN_OP_ESC_C_CONTROL |
     ONIG_SYN_OP_ESC_OCTAL3 | ONIG_SYN_OP_ESC_X_HEX2 |
     ONIG_SYN_OP_ESC_X_BRACE_HEX8 )
   & ~ONIG_SYN_OP_ESC_LTGT_WORD_BEGIN_END )
  , ( ONIG_SYN_OP2_ESC_CAPITAL_Q_QUOTE | ONIG_SYN_OP2_QMARK_GROUP_EFFECT |
      ONIG_SYN_OP2_OPTION_PERL | ONIG_SYN_OP2_PLUS_POSSESSIVE_REPEAT |
      ONIG_SYN_OP2_PLUS_POSSESSIVE_INTERVAL | ONIG_SYN_OP2_CCLASS_SET_OP |
      ONIG_SYN_OP2_QMARK_LT_NAMED_GROUP | ONIG_SYN_OP2_ESC_K_NAMED_BACKREF |
      ONIG_SYN_OP2_ESC_V_VTAB | ONIG_SYN_OP2_ESC_U_HEX4 |
      ONIG_SYN_OP2_ESC_P_BRACE_CHAR_PROPERTY )
  , ( SYN_GNU_REGEX_BV | ONIG_SYN_ISOLATED_OPTION_CONTINUE_BRANCH |
      ONIG_SYN_DIFFERENT_LEN_ALT_LOOK_BEHIND )
  , ONIG_OPTION_SINGLELINE
  ,
  {
      (OnigCodePoint )'\\'                       /* esc */
    , (OnigCodePoint )ONIG_INEFFECTIVE_META_CHAR /* anychar '.'  */
    , (OnigCodePoint )ONIG_INEFFECTIVE_META_CHAR /* anytime '*'  */
    , (OnigCodePoint )ONIG_INEFFECTIVE_META_CHAR /* zero or one time '?' */
    , (OnigCodePoint )ONIG_INEFFECTIVE_META_CHAR /* one or more time '+' */
    , (OnigCodePoint )ONIG_INEFFECTIVE_META_CHAR /* anychar anytime */
  }
};

[added option]
ONIG_SYN_OP_ESC_X_BRACE_HEX8                /* \x{7HHHHHHH} */
ONIG_SYN_OP2_QMARK_LT_NAMED_GROUP     /* (?<name>...) */
ONIG_SYN_OP2_ESC_K_NAMED_BACKREF        /* \k<name> */

[removed opiton]
ONIG_SYN_VARIABLE_LEN_LOOK_BEHIND        /* (?<=a+|..) */

[ This does not exist in oniguruma ]
OP2_ESC_P_CHAR_CHAR_PROPERTY              /* \pX, \PX */

[This does not exist in JONI-2.1.5.3]
ONIG_SYN_ISOLATED_OPTION_CONTINUE_BRANCH      /* ..(?i)...|... */

SYNTAX.md
https://github.com/kkos/oniguruma/blob/master/doc/SYNTAX.md


I tested using this syntax and it was successful.
So there is no need to try "ONIG_SYNTAX_PERL_NG" syntax.


Does it mean that we can build our own OnigSyntaxType structure that can encompass multiple features that we need?

For example, in this case we could take ONIG_SYNTAX_DEFAULT as the base and then add a bit so ?s is activated?

Or, as you suggested, if ONIG_SYNTAX_JAVA is good for us as well and is only missing ?<digit> support, then we can add ONIG_SYN_OP2_QMARK_LT_NAMED_GROUP to it or to our custom syntax based on the ONIG_SYNTAX_JAVA.

Sorry, I don't have the knowledge to answer these questions.
I will let you know if I find out anything.

@tonco-miyazawa
Copy link

@spershin

Sorry for keeping you waiting.

Does it mean that we can build our own OnigSyntaxType structure that can encompass multiple features that we need?

For example, in this case we could take ONIG_SYNTAX_DEFAULT as the base and then add a bit so ?s is activated?

Or, as you suggested, if ONIG_SYNTAX_JAVA is good for us as well and is only missing ?<digit> support, then we can add ONIG_SYN_OP2_QMARK_LT_NAMED_GROUP to it or to our custom syntax based on the ONIG_SYNTAX_JAVA.

The answer to all of these questions is "yes".
I have actually done this and confirmed that it works.

If you want to write code based on "ONIG_SYNTAX_JAVA", I recommend using the JONI-2.1.5.3 syntax above as a starting point. And please use it as your own syntax.


If you want to use (?s) in "ONIG_SYNTAX_DEFAULT" , please replace the following option:

ONIG_SYN_OP2_OPTION_ONIGURUMA |

Please exchange "ONIG_SYN_OP2_OPTION_ONIGURUMA" for "ONIG_SYN_OP2_OPTION_PERL" .
Below are descriptions of these options:

SYNTAX.md : ONIG_SYN_OP2_OPTION_ONIGURUMA
https://github.com/kkos/oniguruma/blob/master/doc/SYNTAX.md#30-onig_syn_op2_option_oniguruma-enable-options-imxwsdpy-and--imxwdsp

SYNTAX.md : ONIG_SYN_OP2_OPTION_PERL
https://github.com/kkos/oniguruma/blob/master/doc/SYNTAX.md#2-onig_syn_op2_option_perl-enable-options-imsx-and--imsx

However, this swap makes (?W) , (?D) , (?S) , (?P) invalid.


If you want to change only (?s) and (?m) , please change the code below.

oniguruma/src/regparse.c

Lines 8281 to 8302 in f2f9c69

case '-': neg = TRUE; break;
case 'x': OPTION_NEGATE(option, ONIG_OPTION_EXTEND, neg); break;
case 'i': OPTION_NEGATE(option, ONIG_OPTION_IGNORECASE, neg); break;
case 's':
if (IS_SYNTAX_OP2(env->syntax, ONIG_SYN_OP2_OPTION_PERL)) {
OPTION_NEGATE(option, ONIG_OPTION_MULTILINE, neg);
}
else
return ONIGERR_UNDEFINED_GROUP_OPTION;
break;
case 'm':
if (IS_SYNTAX_OP2(env->syntax, ONIG_SYN_OP2_OPTION_PERL)) {
OPTION_NEGATE(option, ONIG_OPTION_SINGLELINE, (neg == FALSE ? TRUE : FALSE));
}
else if (IS_SYNTAX_OP2(env->syntax,
ONIG_SYN_OP2_OPTION_ONIGURUMA|ONIG_SYN_OP2_OPTION_RUBY)) {
OPTION_NEGATE(option, ONIG_OPTION_MULTILINE, neg);
}
else
return ONIGERR_UNDEFINED_GROUP_OPTION;
break;

'm' : (?m) , (?-m)
's' : (?s) , (?-s)
'-' negates the effect.

Below is an explanation of "ONIG_OPTION_SINGLELINE" and "ONIG_OPTION_MULTILINE" .

oniguruma/doc/API

Lines 79 to 80 in f2f9c69

ONIG_OPTION_SINGLELINE '^' -> '\A', '$' -> '\Z'
ONIG_OPTION_MULTILINE '.' match with newline

But such a change may be confusing to users. If I were a user, I would like to be able to choose between the JONI and oniguruma syntax. And oniguruma can do that.


If you want to replace the default syntax, please change the following:

OnigSyntaxType* OnigDefaultSyntax = ONIG_SYNTAX_ONIGURUMA;

syntax = ONIG_SYNTAX_ONIGURUMA;


If you want to create your own syntax, please refer to the following.
"OnigSyntaxJava" is registered like this, you can register your own syntax in the same way.

oniguruma.h

ONIG_EXTERN OnigSyntaxType OnigSyntaxJava;

#define ONIG_SYNTAX_JAVA (&OnigSyntaxJava)

onigposix.h

ONIG_EXTERN OnigSyntaxType OnigSyntaxJava;

#define ONIG_SYNTAX_JAVA (&OnigSyntaxJava)

regsyntax.c

OnigSyntaxType OnigSyntaxJava = {

@spershin
Copy link
Author

spershin commented Jun 8, 2024

@tonco-miyazawa
Thank you very much for so much detailed information!
This is VERY helpful.
I am going to try out the options you suggested (especially the one you took from JONI) and see if we are lucky enough to have most of the options working like they do in JONI.

I'm sorry that I wasn't able to reply here earlier - was busy implementing the remaining regex functions based on Oni and testing them, collecting issues I can find.

@tonco-miyazawa
Copy link

tonco-miyazawa commented Jun 8, 2024

@spershin

I'm glad I could be of help to you. Don't worry about the reply.

I think the next problem will be that \pX and \PX cannot be used in oniguruma.
This is too difficult for me.

I will close this thread as it has served its purpose. Thank you.


EDIT: I can't seem to close this thread.

@kkos kkos closed this as completed Jun 9, 2024
@spershin
Copy link
Author

@tonco-miyazawa
Using your pointers on how the syntax structures work I managed to make (?s) work in ONIG_SYNTAX_DEFAULT.
Now, I'm yet to do more testing to see if there are possible side effects of this, but unit tests pass.

The code:

static OnigSyntaxType myOnigSyntax = *ONIG_SYNTAX_DEFAULT;
myOnigSyntax.op2 |= ONIG_SYN_OP2_OPTION_PERL;
gOnigSyntax = &myOnigSyntax;

where gOnigSyntax is
void* gOnigSyntax = ONIG_SYNTAX_DEFAULT;
and is used in onig_new().

Somewhat relevant:
I'm also adding

  myOnigSyntax.options |= ONIG_OPTION_SINGLELINE;

To treat string as a single line, which ONIG_SYNTAX_DEFAULT does not do and it does not match JONI in this behavior.

@tonco-miyazawa
Copy link

@spershin

unit tests pass.

Congratulations! Good job!

myOnigSyntax.options |= ONIG_OPTION_SINGLELINE;
To treat string as a single line, which ONIG_SYNTAX_DEFAULT does not do
and it does not match JONI in this behavior.

I have tested this and found that this option works correctly.
This is a very confusing subject and it's easy to get confused.
Because we are confused by settings like:

"ONIG_OPTION_SINGLELINE" and "ONIG_OPTION_MULTILINE",

"ONIG_SYN_OP2_OPTION_ONIGURUMA" and "ONIG_SYN_OP2_OPTION_PERL"

"(?s)" and "(?m)"

So I created a test to check if the option is valid or Invalid.
Below is the test for "/test/test_syntax.c".
If the option is Invalid, these tests will return "FAIL" .

  ////////////  Test to see if an option is valid /////////////
  Syntax = ONIG_SYNTAX_DEFAULT;

  x2("$", "123\n456", 7, 7);           // 7-7 == ONIG_OPTION_SINGLELINE is valid.
  
  x2("\\A.*\\z", "123\n456", 0, 7);    // 0-7 == ONIG_OPTION_MULTILINE is valid.
  ///////////////////////////////////////////////////////////////

Below is the command for testing above.
Please copy "/test/test_syntax.c" to "Documents" folder. ( This is the same as issue290. )

$ gcc ~/Documents/test_syntax.c  -L/usr/local/lib -lonig
$ ./a.out

Also, please check your onig_new() setting.

oniguruma/doc/API

Lines 76 to 104 in bcfbab2

4 option: compile time options.
ONIG_OPTION_NONE no option
ONIG_OPTION_SINGLELINE '^' -> '\A', '$' -> '\Z'
ONIG_OPTION_MULTILINE '.' match with newline
ONIG_OPTION_IGNORECASE ambiguity match on
ONIG_OPTION_EXTEND extended pattern form
ONIG_OPTION_FIND_LONGEST find longest match
ONIG_OPTION_FIND_NOT_EMPTY ignore empty match
ONIG_OPTION_NEGATE_SINGLELINE clear ONIG_OPTION_SINGLELINE which is enabled on ONIG_SYNTAX_POSIX_BASIC/POSIX_EXTENDED/PERL/PERL_NG/PYTHON/JAVA
ONIG_OPTION_DONT_CAPTURE_GROUP only named group captured.
ONIG_OPTION_CAPTURE_GROUP named and no-named group captured.
ONIG_OPTION_IGNORECASE_IS_ASCII Limit IGNORECASE((?i)) to a range of ASCII characters
ONIG_OPTION_WORD_IS_ASCII ASCII only word (\w, \p{Word}, [[:word:]])
ASCII only word bound (\b)
ONIG_OPTION_DIGIT_IS_ASCII ASCII only digit (\d, \p{Digit}, [[:digit:]])
ONIG_OPTION_SPACE_IS_ASCII ASCII only space (\s, \p{Space}, [[:space:]])
ONIG_OPTION_POSIX_IS_ASCII ASCII only POSIX properties
(includes word, digit, space)
(alnum, alpha, blank, cntrl, digit, graph,
lower, print, punct, space, upper, xdigit,
word)
ONIG_OPTION_TEXT_SEGMENT_EXTENDED_GRAPHEME_CLUSTER Extended Grapheme Cluster mode
ONIG_OPTION_TEXT_SEGMENT_WORD Word mode
* The ONIG_OPTION_FIND_LONGEST option doesn't work properly during backward search of onig_search().

Also, see below for the values ​​of "ONIG_OPTION_DEFAULT" .

#define ONIG_OPTION_DEFAULT ONIG_OPTION_NONE

You can also check "ONIG_OPTION_SINGLELINE" by using "ONIG_OPTION_NEGATE_SINGLELINE" in onig_new() .
Try using this option to see if it changes your results.

ONIG_OPTION_NEGATE_SINGLELINE clear ONIG_OPTION_SINGLELINE which is enabled on ONIG_SYNTAX_POSIX_BASIC/POSIX_EXTENDED/PERL/PERL_NG/PYTHON/JAVA

I hope your work goes well. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants