Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(chore) Clean up all regexs to be UTF-8 compliant/ready #2759

Merged
merged 24 commits into from Nov 2, 2020

Conversation

joshgoebel
Copy link
Member

@joshgoebel joshgoebel commented Oct 16, 2020

Work toward #2756.

Cleans up a lot of incorrect (unnecessary escaped) regex and would not compile with the u flag. After that makes some rather large performance improvements (with utf8 turned on at least) to yaml and mipsasm. It looks like the mipasm rules have been wrong all alone... as far as I can determine they are intended to match a literal . (otherwise they are far too broad) but were matching any character - which seems to terribly slow down the whole grammar in u mode.

This consisted mostly of:

  • Most unescaped { and }
  • Lots of unneeded escapes for -, <, >, and others.
  • Converting strings to regex if it made them simpler, easier to read (editor syntax coloring)

I plan to review the PR myself line by line, but can't imagine it'll be fun. I did try to change the minimal amount necessary. Often I turned strings into regex if it made them easier to read and see what I was doing. Once in a while I touched a nearby regex.

I imagine the fact that all tests still pass is a pretty good indication this is 99% correct. :-)

Note: This doesn't actually turn on UTF8 anywhere... it just fixes all the regex so that if u is added inside the main mode compiler everything "just works". It is still needed to be reviewed what else might need to be done on the road to UTF8 support.

@joshgoebel joshgoebel added this to the 10.4 milestone Oct 16, 2020
@joshgoebel joshgoebel requested review from allejo and egor-rogov Oct 16, 2020
@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented Oct 16, 2020

No rush, I'm pushing this off until 10.4. 10.3 is big enough.

src/languages/qml.js Outdated Show resolved Hide resolved
@@ -22,7 +22,7 @@ export default function(hljs) {
},
// YAML block
{
begin: '(\s+)?---$', end: '\\.\\.\\.$',
begin: /---$/, end: '\\.\\.\\.$',
Copy link
Member Author

@joshgoebel joshgoebel Oct 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was broken before \s vs \\s and when fixed seemed to break things so without any context going to go with the simpler rule for now.

className: 'string',
begin: /0\'\\s/ // 0'\s
begin: /0'\\s/ // 0'\s
Copy link
Member Author

@joshgoebel joshgoebel Oct 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is greek to me, but I think it's correct... anyone know Prolog? The "\s" is literal?

Copy link
Member Author

@joshgoebel joshgoebel Oct 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me now I think. Code might be:

% \s literal (escaped)
0'\s
% character "b"
0'b
% character "'" (escaped)
0'\'

I'm assuming 0'' is invalid.

Copy link
Member

@allejo allejo Nov 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, looks like 0' is the way of getting the character code of a char. So 0'\s returns 32. And 0'' is valid, it returns 39

Screenshot and confirmation courtesy of @Adwitiya-Singh

image

More info, https://www.swi-prolog.org/pldoc/man?section=charescapes

Copy link
Member Author

@joshgoebel joshgoebel Nov 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you are correct, 0'' is valid, it returns 39

Valid or invalid? Although I suppose the regex currently allows it either way, LOL... and I'm not even sure if that is bad or not.

Copy link
Member

@allejo allejo Nov 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I misread your original comment 😅 0'' is valid as seen in the screenshot. But since the regex allows it, works for me

src/languages/yaml.js Outdated Show resolved Hide resolved
@@ -144,7 +144,7 @@ export default function(hljs) {
}, //*/

{
begin: '\\b(' + COMMON_COMMANDS.split(' ').join('|') + ')([\\s\[\(]|\])',
begin: '\\b(' + COMMON_COMMANDS.split(' ').join('|') + ')([\\s[(\\]|])',
Copy link
Member Author

@joshgoebel joshgoebel Oct 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I broke this? Not sure what it's supposed to be doing.

Copy link
Member Author

@joshgoebel joshgoebel Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure it's fine, added a comment to explain it.

@@ -31,29 +31,29 @@ export default function(hljs) {
className: 'literal',
variants: [
{
begin: '\\b(?:PI|TWO_PI|PI_BY_TWO|DEG_TO_RAD|RAD_TO_DEG|SQRT2)\\b'
begin: '\\b(PI|TWO_PI|PI_BY_TWO|DEG_TO_RAD|RAD_TO_DEG|SQRT2)\\b'
Copy link
Member Author

@joshgoebel joshgoebel Oct 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are speed optimizations believe it or not the benchmark was faster without the ":" to mark it as a non capturing group.

src/lib/mode_compiler.js Outdated Show resolved Hide resolved
src/highlight.js Outdated
@@ -579,12 +579,17 @@ const HLJS = function(hljs) {
@param {Array<string>} [languageSubset]
@returns {AutoHighlightResult}
*/
let ts = {};
Copy link
Member Author

@joshgoebel joshgoebel Oct 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be reverted before merge. This whole PR changes nothing in the core library (other than grammars).

@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented Oct 25, 2020

I've 99% review this myself but could still use another set of eyes. :)

@joshgoebel joshgoebel changed the title (chore) Clean up all regexs to be UTF-8 compliant (chore) Clean up all regexs to be UTF-8 compliant/ready Oct 27, 2020
src/languages/ebnf.js Outdated Show resolved Hide resolved
src/languages/gams.js Outdated Show resolved Hide resolved
src/languages/javascript.js Show resolved Hide resolved
src/languages/less.js Show resolved Hide resolved
src/languages/lisp.js Show resolved Hide resolved
src/languages/parser3.js Show resolved Hide resolved
src/languages/routeros.js Show resolved Hide resolved
src/languages/routeros.js Outdated Show resolved Hide resolved
src/languages/stylus.js Outdated Show resolved Hide resolved
src/languages/xquery.js Show resolved Hide resolved
src/languages/yaml.js Outdated Show resolved Hide resolved
src/languages/yaml.js Outdated Show resolved Hide resolved
Co-authored-by: Vladimir Jimenez <allejo@me.com>
@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented Nov 2, 2020

Anything else? :)

allejo
allejo approved these changes Nov 2, 2020
Copy link
Member

@allejo allejo left a comment

Just don't forget to revert this but this looks good to me

@joshgoebel joshgoebel merged commit 0cee2f3 into highlightjs:master Nov 2, 2020
1 check was pending
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants