v6.9.2 breaks php parsing in vscode and atom #146

PF4Public · 2019-07-02T16:40:42Z

vscode gives the following error:

[renderer1] [error] invalid code point value: Error: invalid code point value
at Object.createOnigScanner (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:67:24)
at Grammar.createOnigScanner (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2478:30)
at RegExpSourceList.compile (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:1853:38)
at BeginEndRule.compile (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2008:45)
at matchRule (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2724:28)
at matchRuleOrInjections (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2743:23)
at scanNext (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2833:17)
at _tokenizeString (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2826:9)
at Grammar._tokenize (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2582:25)
at Grammar.tokenizeLine2 (/usr/lib64/vscode/node_modules.asar/vscode-textmate/release/main.js:2552:22)
at D.tokenize2 (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:4365:788)
at h._updateTokensUntilLine (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:1173:735)
at h._tokenizeOneLine (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:1173:187)
at P._revalidateTokensNow (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:1203:644)
at P._warmUpTokens (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:1203:327)
at P._tokenizationListener.C.TokenizationRegistry.onDidChange.e (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:1177:857)
at d.fire (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:77:983)
at r.fire (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:627:245)
at register (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:627:336)
at _promises.set.t.then.t (file:///usr/lib64/vscode/out/vs/workbench/workbench.main.js:627:543)

atom gives a similar error

This does not happen in v6.9.1

See elprans/electron-overlay#41 for a reference

kkos · 2019-07-02T23:50:24Z

I changed the maximum number of bytes in the UTF-8 encoding from 6 to 4, to conform to RFC 3629 (de6342d).
I suspect this is related to the error, but I cannot confirm it because I do not know which regular expression pattern is involved.
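
To make the effect of that change concrete, here is a minimal Python sketch of how the byte length of a code point differs between the older 6-byte scheme and a 4-byte limit. It is not Oniguruma's code: the function name utf8_length and the max_bytes parameter are made up for illustration, and the thresholds are simply the bit capacities of 1- to 6-byte UTF-8 sequences.

```python
# Illustrative only; not taken from Oniguruma's utf8.c.
# Largest value that fits in a UTF-8 sequence of 1..6 bytes
# (the 5- and 6-byte forms exist only in the pre-RFC 3629 definition).
CAPACITY = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def utf8_length(code, max_bytes=4):
    """Return the sequence length needed for `code`, or None if it does not fit.

    Note: this checks bit capacity only. RFC 3629 additionally caps valid
    code points at 0x10FFFF, which is below the 4-byte capacity of 0x1FFFFF.
    """
    if code < 0:
        return None
    for length, limit in enumerate(CAPACITY[:max_bytes], start=1):
        if code <= limit:
            return length
    return None


print(utf8_length(0x7F))                      # 1
print(utf8_length(0x10FFFF))                  # 4
print(utf8_length(0x7FFFFFFF))                # None: does not fit in 4 bytes
print(utf8_length(0x7FFFFFFF, max_bytes=6))   # 6 under the older scheme
```

Under a 4-byte limit, any value that previously needed five or six bytes has no valid length, which is the kind of input that can now produce an "invalid code point value" error.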

PF4Public · 2019-07-20T13:24:07Z

Sorry for the long delay.

Is there any way to find out the cause and fix it? Is there any way I could help?

kkos · 2019-07-21T10:21:09Z

I am not familiar with atom-overlay, vscode-textmate, etc., so I cannot investigate them.

I created an issue_146 branch:
https://github.com/kkos/oniguruma/tree/issue_146
In this branch, when an error occurs in onig_new(), the pattern is written to standard error.
I don't know whether the patterns will show up correctly in your environment, but that is all I can do.

PF4Public · 2019-07-24T23:10:26Z

Thank you for your modifications, but this didn't catch the regexp.

Modifying the source further gave some results.

As you said, code_to_mbclen(OnigCodePoint code) in utf8.c checks the length. Disabling that check seemingly fixes syntax highlighting in vscode and atom.

Adding a print just before return ONIGERR_INVALID_CODE_POINT_VALUE; revealed the offending code point: 2147483647, i.e. 0x7FFFFFFF, which is strange.

With the 4-byte check disabled, I dumped every pattern that onig_new receives. To my surprise, not only was 0x7FFFFFFF never mentioned, but even 0x7F did not occur a single time.

Digging further, I caught many occurrences of 2147483647 in regparse.c, in the case TK_CODE_POINT: branch of parse_char_class. The beginning of the dump follows:

code: 45
code: 45
code: 45
code: 9
code: 45
code: 45
code: 9
code: 9
code: 183
code: 192
code: 214
code: 216
code: 246
code: 248
code: 893
code: 895
code: 8191
code: 8204
code: 8205
code: 8255
code: 8256
code: 8304
code: 8591
code: 11264
code: 12271
code: 12289
code: 55295
code: 63744
code: 64975
code: 65008
code: 65533
code: 65536
code: 983039
code: 183
code: 192
code: 214
code: 216
code: 246
code: 248
code: 893
code: 895
code: 8191
code: 8204
code: 8205
code: 8255
code: 8256
code: 8304
code: 8591
code: 11264
code: 12271
code: 12289
code: 55295
code: 63744
code: 64975
code: 65008
code: 65533
code: 65536
code: 983039
code: 183
code: 192
code: 214
code: 216
code: 246
code: 248
code: 893
code: 895
code: 8191
code: 8204
code: 8205
code: 8255
code: 8256
code: 8304
code: 8591
code: 11264
code: 12271
code: 12289
code: 55295
code: 63744
code: 64975
code: 65008
code: 65533
code: 65536
code: 983039
code: 183
code: 192
code: 214
code: 216
code: 246
code: 248
code: 893
code: 895
code: 8191
code: 8204
code: 8205
code: 8255
code: 8256
code: 8304
code: 8591
code: 11264
code: 12271
code: 12289
code: 55295
code: 63744
code: 64975
code: 65008
code: 65533
code: 65536
code: 983039
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 45
code: 45
code: 45
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647
code: 127
code: 2147483647

Unfortunately, at this point I have no idea how to interpret this result and what to do next.

kkos · 2019-07-26T01:59:39Z

0x7f is a valid code in UTF-8.
In your code point output, the only invalid code appears to be 0x7FFFFFFF.
You should ask the maintainer of the PHP mode file to change 0x7FFFFFFF to 0x1FFFFF.
I do not know who that is.
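
That reading of the dump can be checked mechanically. Below is a small Python sketch, assuming the dump is available as text with one "code: N" entry per line, as in the output above. The function name invalid_codes is hypothetical, and the limit used is the Unicode maximum 0x10FFFF rather than a value taken from Oniguruma.

```python
MAX_UNICODE = 0x10FFFF  # highest valid Unicode code point


def invalid_codes(dump_text):
    """Return distinct values above MAX_UNICODE from lines like 'code: 183'."""
    found = []
    for line in dump_text.splitlines():
        label, sep, value = line.partition(":")
        if not sep or label.strip() != "code":
            continue  # not a dump line
        try:
            code = int(value)
        except ValueError:
            continue  # malformed number; skip it
        if code > MAX_UNICODE and code not in found:
            found.append(code)
    return found


# A short excerpt in the same format as the dump above.
sample = "code: 45\ncode: 983039\ncode: 127\ncode: 2147483647\ncode: 127\ncode: 2147483647\n"
print([hex(c) for c in invalid_codes(sample)])  # ['0x7fffffff']
```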

PF4Public · 2019-07-26T19:42:23Z

> You should ask the PHP mode file maintainer to change 0x7FFFFFFF to 0x1FFFFF.

May I ask you to rephrase this? I don't get what PHP mode is :( Is it an extension responsible for PHP or something?

Thom1729 · 2019-08-02T18:37:50Z

@PF4Public

> May I ask you to rephrase this? I don't get what PHP mode is :( Is it an extension responsible for PHP or something?

I believe that @kkos is referring to the syntax definition used for highlighting PHP. In some editors, these are known as “modes”.

At a glance, it seems that this is Atom's PHP syntax definition. You can see \x{7f}-\x{7fffffff} all over the place in that file. As I understand it, the intent is to match any Unicode character whose UTF-8 representation contains only bytes >= 0x7f. By coincidence, this is every character whose code point is >= 0x7f. See atom/language-php#302 for more discussion.

The problem is that \x{7fffffff} is not a valid Unicode code point, so Oniguruma won't accept the expressions. Instead, the expressions should use \x{7f}-\x{10ffff}. There should be no downside to this, because there are no characters with code points greater than 0x10ffff.
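
As a rough illustration of the suggested change, the Python sketch below rewrites the upper bound in a pattern string. It is only a sketch: the function name fix_upper_bound and the sample pattern are hypothetical, and it treats the grammar's regular expressions as plain text without parsing the grammar file itself.

```python
import re

# Matches the literal text \x{7fffffff}, in any letter case.
_OLD_UPPER = re.compile(r"\\x\{7fffffff\}", re.IGNORECASE)


def fix_upper_bound(pattern):
    """Replace \\x{7fffffff} with \\x{10ffff} in a regex source string."""
    # A function is used as the replacement so that the backslash in the
    # new text is inserted literally instead of being read as an escape.
    return _OLD_UPPER.sub(lambda match: r"\x{10ffff}", pattern)


# Hypothetical character class in the style described above.
before = r"[a-zA-Z_\x{7f}-\x{7fffffff}]"
print(fix_upper_bound(before))  # [a-zA-Z_\x{7f}-\x{10ffff}]
```

Patterns that do not contain the old bound are returned unchanged, so the function can be applied to every pattern in a grammar without special-casing.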

Why did this break? Before 6.9.2, \x{7fffffff} would have been interpreted as referring to a code point with a six-byte UTF-8 encoding. In 6.9.2, a code point that would require five or six bytes instead results in an error.

kkos · 2019-08-03T14:25:58Z

Thanks @Thom1729.
A correction to my previous comment:
it is better to use 0x10FFFF rather than 0x1FFFFF as the upper bound of the Unicode range.

By the way, in the current master branch, I added ONIG_SYN_ALLOW_INVALID_CODE_END_OF_RANGE_IN_CC and enabled it in DEFAULT syntax only.
This allows invalid character code values at the end of a range in a character class.

PF4Public · 2019-08-22T21:02:44Z

@kkos, @Thom1729 Thank you for your replies. After updating to oniguruma-6.9.3 this seems to be fixed…

PF4Public closed this as completed Aug 22, 2019

IonBazan mentioned this issue Oct 21, 2021

PHP8 Attributes syntax highlight github-linguist/linguist#5522

Closed
