v6.9.2 breaks php parsing in vscode and atom #146

Closed
PF4Public opened this issue Jul 2, 2019 · 9 comments

@PF4Public

vscode gives the following error:

[renderer1] [error] invalid code point value: Error: invalid code point value
at Object.createOnigScanner (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:67:24)
at Grammar.createOnigScanner (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2478:30)
at RegExpSourceList.compile (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:1853:38)
at BeginEndRule.compile (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2008:45)
at matchRule (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2724:28)
at matchRuleOrInjections (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2743:23)
at scanNext (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2833:17)
at _tokenizeString (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2826:9)
at Grammar._tokenize (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2582:25)
at Grammar.tokenizeLine2 (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2552:22)
at D.tokenize2 (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:4365:788)
at h._updateTokensUntilLine (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:1173:735)
at h._tokenizeOneLine (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:1173:187)
at P._revalidateTokensNow (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:1203:644)
at P._warmUpTokens (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:1203:327)
at P._tokenizationListener.C.TokenizationRegistry.onDidChange.e (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:1177:857)
at d.fire (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:77:983)
at r.fire (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:627:245)
at register (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:627:336)
at _promises.set.t.then.t (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:627:543)

atom gives a similar error.

This does not happen in v6.9.1.

See elprans/electron-overlay#41 for reference.

@kkos
Owner

kkos commented Jul 2, 2019

I changed the maximum number of bytes in the UTF-8 encoding from 6 to 4, to conform to RFC 3629 (de6342d).
I suspect this is related to the error, but I cannot conclude anything because the regular expression pattern is unknown.
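
As a rough illustration of the change described above, here is a minimal Python sketch of the two length rules. It is not Oniguruma's actual code_to_mbclen; the function names are hypothetical, and surrogate code points are deliberately not treated specially.

```python
UNICODE_MAX = 0x10FFFF  # highest code point permitted by RFC 3629


def legacy_utf8_len(code: int) -> int:
    """Byte length under the older scheme that allowed up to 6 bytes."""
    if code < 0:
        raise ValueError("negative code point")
    limits = (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF)
    for length, limit in enumerate(limits, start=1):
        if code <= limit:
            return length
    raise ValueError("code point does not fit in 6 bytes")


def rfc3629_utf8_len(code: int) -> int:
    """Byte length under RFC 3629: anything above U+10FFFF is rejected.

    Assumption: surrogates (U+D800..U+DFFF) are not checked in this sketch.
    """
    if code > UNICODE_MAX:
        raise ValueError("invalid code point value")
    return legacy_utf8_len(code)


if __name__ == "__main__":
    print(legacy_utf8_len(0x7FFFFFFF))  # length under the old 6-byte rule
    try:
        rfc3629_utf8_len(0x7FFFFFFF)
    except ValueError as exc:
        print("rejected:", exc)
```

Under the 4-byte rule, a value such as 0x7FFFFFFF has no valid length at all, which would be consistent with the "invalid code point value" error in the stack trace.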

@PF4Public
Author

Sorry for the long delay.

Is there any way to find out the cause and fix it? Is there any way I could help?

@kkos
Owner

kkos commented Jul 21, 2019

I am not familiar with atom-overlay, vscode-textmate, etc., so I cannot investigate them.

I created an issue_146 branch:
https://github.com/kkos/oniguruma/tree/issue_146
In this branch, when an error occurs in onig_new(), the pattern is written to standard error.
I don't know whether the patterns will appear correctly in your execution environment, but that's all I can do.

@PF4Public
Author

Thank you for your modifications, but they didn't catch the regexp.

Modifying the source further gave some results.

As you said, code_to_mbclen(OnigCodePoint code) in utf8.c does check the length. Disabling the check seemingly fixes syntax highlighting in vscode and atom.

Adding a print just before return ONIGERR_INVALID_CODE_POINT_VALUE; showed that the offending code point is 2147483647, which is 0x7FFFFFFF, which is strange.

With the 4-byte check disabled, I dumped every pattern that onig_new receives. To my surprise, not only was 0x7FFFFFFF never mentioned, but even 0x7F didn't occur a single time.

Digging further, I caught a lot of occurrences of 2147483647 in regparse.c, in case TK_CODE_POINT: of parse_char_class. The beginning of the dump follows:

code: 45
code: 45
code: 45
code: 9
code: 45
code: 45
code: 9
code: 9
code: 183
code: 192
code: 214
code: 216
code: 246
code: 248
code: 893
code: 895
code: 8191
code: 8204
code: 8205
code: 8255
code: 8256
code: 8304
code: 8591
code: 11264
code: 12271
code: 12289
code: 55295
code: 63744
code: 64975
code: 65008
code: 65533
code: 65536
code: 983039
code: 183
code: 192
code: 214
code: 216
code: 246
code: 248
code: 893
code: 895
code: 8191
code: 8204
code: 8205
code: 8255
code: 8256
code: 8304
code: 8591
code: 11264
code: 12271
code: 12289
code: 55295
code: 63744
code: 64975
code: 65008
code: 65533
code: 65536
code: 983039
code: 183
code: 192
code: 214
code: 216
code: 246
code: 248
code: 893
code: 895
code: 8191
code: 8204
code: 8205
code: 8255
code: 8256
code: 8304
code: 8591
code: 11264
code: 12271
code: 12289
code: 55295
code: 63744
code: 64975
code: 65008
code: 65533
code: 65536
code: 983039
code: 183
code: 192
code: 214
code: 216
code: 246
code: 248
code: 893
code: 895
code: 8191
code: 8204
code: 8205
code: 8255
code: 8256
code: 8304
code: 8591
code: 11264
code: 12271
code: 12289
code: 55295
code: 63744
code: 64975
code: 65008
code: 65533
code: 65536
code: 983039
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 45
code: 45
code: 45
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647

Unfortunately, at this point I have no idea how to interpret this result and what to do next.
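
One way to read a dump like the one above is to flag every value beyond the Unicode maximum of 0x10FFFF and print it in hex. The following Python sketch does that for a short excerpt; summarize_dump is a hypothetical helper, and the "code: N" line format is taken from the dump itself.

```python
UNICODE_MAX = 0x10FFFF  # highest valid Unicode code point


def summarize_dump(dump: str) -> dict:
    """Count how often each out-of-range value appears in a 'code: N' dump."""
    counts = {}
    for line in dump.splitlines():
        label, _, value = line.partition(":")
        if label.strip() != "code":
            continue  # ignore anything that is not a dump line
        code = int(value)
        if code > UNICODE_MAX:
            counts[code] = counts.get(code, 0) + 1
    return counts


# Short excerpt of the dump above, not the full output.
sample = """code: 45
code: 65536
code: 983039
code: 127
code: 2147483647
code: 127
code: 2147483647"""

if __name__ == "__main__":
    for code, count in summarize_dump(sample).items():
        print(f"{code} = {code:#x}, seen {count} times")
```

In this excerpt only 2147483647 (0x7fffffff, i.e. 2**31 - 1) is flagged, and it always follows 127 (0x7f), which suggests the two values are the endpoints of a single character-class range.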

@kkos
Owner

kkos commented Jul 26, 2019

0x7F is a valid code point in UTF-8.
In your code point output, the only invalid value appears to be 0x7FFFFFFF.
You should ask the PHP mode file maintainer to change 0x7FFFFFFF to 0x1FFFFF.
I do not know who that person is.

@PF4Public
Author

You should ask the PHP mode file maintainer to change 0x7FFFFFFF to 0x1FFFFF.

May I ask you to rephrase this? I don't get, what PHP mode is :( Is it an extension, responsible for PHP or something?

@Thom1729

Thom1729 commented Aug 2, 2019

@PF4Public

May I ask you to rephrase this? I don't get, what PHP mode is :( Is it an extension, responsible for PHP or something?

I believe that @kkos is referring to the syntax definition used for highlighting PHP. In some editors, these are known as “modes”.

At a glance, it seems that this is Atom's PHP syntax definition. You can see \x{7f}-\x{7fffffff} all over the place in that file. As I understand it, the intent is to match any Unicode character whose UTF-8 representation contains only bytes >= 0x7f. By coincidence, this is every character whose code point is >= 0x7f. See atom/language-php#302 for more discussion.

The problem is that \x{7fffffff} is not a valid Unicode code point, so Oniguruma won't accept the expressions. Instead, the expressions should use \x{7f}-\x{10ffff}. There should be no downside to this, because there are no characters with code points greater than 0x10ffff.
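
To make the suggested change concrete, here is a small Python sketch that rewrites any \x{...} escape above 0x10ffff to \x{10ffff} in a pattern's source text. The helper clamp_code_points and the sample character class are hypothetical and not part of any grammar or of Oniguruma; the sketch works on the pattern text only and does not distinguish an escaped backslash followed by x{...}.

```python
import re

UNICODE_MAX = 0x10FFFF  # highest valid Unicode code point

# Matches \x{H...} escapes in pattern source text; the hex digits are captured.
_HEX_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")


def clamp_code_points(pattern: str) -> str:
    """Replace every \\x{...} escape above U+10FFFF with \\x{10ffff}."""

    def fix(match):
        code = int(match.group(1), 16)
        if code > UNICODE_MAX:
            return r"\x{10ffff}"
        return match.group(0)  # in-range escapes are left untouched

    return _HEX_ESCAPE.sub(fix, pattern)


if __name__ == "__main__":
    # Illustrative character class, not copied from the actual grammar file.
    print(clamp_code_points(r"[a-zA-Z_\x{7f}-\x{7fffffff}]"))
```

Because no character has a code point above 0x10ffff, the rewritten range should match the same set of characters as the original.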

Why did this break? Before 6.9.2, \x{7fffffff} would have been interpreted as referring to a code point with a six-byte UTF-8 encoding. In 6.9.2, a code point that would require five or six bytes will instead result in an error.
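
For reference, the six-byte form mentioned here can be reproduced with a short Python sketch of the original, pre-RFC 3629 UTF-8 scheme. legacy_utf8_encode is a hypothetical name and this is not Oniguruma's implementation.

```python
def legacy_utf8_encode(code: int) -> bytes:
    """Encode with the original UTF-8 scheme that allowed up to 6 bytes."""
    if code < 0 or code > 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if code <= 0x7F:
        return bytes([code])  # single-byte (ASCII) form
    # (upper limit, lead-byte marker) for 2- to 6-byte sequences
    table = [(0x7FF, 0xC0), (0xFFFF, 0xE0), (0x1FFFFF, 0xF0),
             (0x3FFFFFF, 0xF8), (0x7FFFFFFF, 0xFC)]
    for length, (limit, lead) in enumerate(table, start=2):
        if code <= limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code & 0x3F))  # continuation byte: 10xxxxxx
        code >>= 6
    return bytes([lead | code] + tail[::-1])


if __name__ == "__main__":
    print(legacy_utf8_encode(0x7FFFFFFF).hex())  # six-byte form
    print(legacy_utf8_encode(0x10FFFF).hex())    # four-byte form
```

For code points up to 0x10ffff the result agrees with a standard UTF-8 encoder; only larger values produce the five- and six-byte sequences that RFC 3629 no longer permits.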

@kkos
Owner

kkos commented Aug 3, 2019

Thanks, @Thom1729.
To correct my previous comment: it is better to use 0x10FFFF instead of 0x1FFFFF as the upper bound of the Unicode range.

By the way, in the current master branch, I added ONIG_SYN_ALLOW_INVALID_CODE_END_OF_RANGE_IN_CC and enabled it in the DEFAULT syntax only.
This allows an invalid character code value at the end of a range in a character class.

@PF4Public
Author

@kkos, @Thom1729 Thank you for your replies. After updating to oniguruma-6.9.3 this seems to be fixed…
