Unexpected regex match #192

Closed
alexdima opened this issue Apr 21, 2020 · 1 comment

In the sample/bug_fix.c file, insert the following:

  exec(ONIG_ENCODING_UTF8, ONIG_OPTION_FIND_LONGEST,
       "(?x)\n  (?<!\\+\\+|--)(?<=[({\\[,?=>:*]|&&|\\|\\||\\?|\\*\\/|^await|[^\\._$[:alnum:]]await|^return|[^\\._$[:alnum:]]return|^default|[^\\._$[:alnum:]]default|^yield|[^\\._$[:alnum:]]yield|^)\\s*\n  (?!<\\s*[_$[:alpha:]][_$[:alnum:]]*((\\s+extends\\s+[^=>])|,)) # look ahead is not type parameter of arrow\n  (?=(<)\\s*(?:([_$[:alpha:]][-_$[:alnum:].]*)(?<!\\.|-)(:))?((?:[a-z][a-z0-9]*|([_$[:alpha:]][-_$[:alnum:].]*))(?<!\\.|-))(?=((<\\s*)|(\\s+))(?!\\?)|\\/?>))", "    while (i < len && f(array[i]))");

I would expect to see in the output:

search fail (UTF-8)

But instead there is printed:

match at 12  (UTF-8)
0: (12-13)
1: (-1--1)
2: (-1--1)
3: (13-14)
4: (-1--1)
5: (-1--1)
6: (15-18)
7: (-1--1)
8: (18-19)
9: (-1--1)
10: (18-19)

I ran git bisect, which pointed to 32e688b:

32e688b0c71bd3912867c5107a5efdbcdde3ff2c is the first bad commit
commit 32e688b0c71bd3912867c5107a5efdbcdde3ff2c
Author: K.Kosako <kosako@sofnec.co.jp>
Date:   Tue Jan 21 08:37:19 2020 +0900

    optimize for min len == 0 and min len is sure case

I apologize for the very complex regular expression and for not taking the time to simplify it, but I believe the problem is related to the look-behind that starts with (?<=[({\\[ ... . It is almost as if that look-behind is not checked at all. I hit this regular expression in VS Code's JavaScript TM grammar while validating and updating VS Code's oniguruma version with a fresh WASM implementation.

Thank you for your great work, and special thanks for the new RegSet API, which might save a lot of time in parsing TM grammars.

alexdima added a commit to microsoft/vscode-oniguruma that referenced this issue Apr 22, 2020
kkos closed this as completed in 47af49a Apr 22, 2020
kkos added a commit that referenced this issue Apr 22, 2020
kkos reopened this Apr 22, 2020
kkos commented Apr 22, 2020

Thank you for your research.
Fixed.
The bug: if a look-behind contains a branch with a character length of 0 and that branch includes an anchor, the whole look-behind is ignored.

This may be quite serious, so an urgent release may be needed.
However, there may be other problems, so please wait a while.

kkos closed this as completed in 32b1c5a Apr 26, 2020
kkos added a commit that referenced this issue Apr 26, 2020
bob-beck pushed a commit to openbsd/ports that referenced this issue Apr 27, 2020
"This is a bug that if the look-behind contains a branch with a
character length of 0 and an anchor is included in the branch, the whole
look-behind is ignored."