enh(python) Add support for unicode identifiers #3280
For a bit of context here, I'm working on internationalizing an interactive online textbook. Code like
def enRatónPesionado(x, y):
    Círculo(x, y, 50, relleno='azulMarino', opacidad=x)
isn't highlighting properly (and as you can see here, good support for Spanish-language programmers is rare, which is a shame)
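To illustrate why code like this fails to highlight: without the `u` flag, JavaScript's `\w` matches only ASCII letters, digits, and underscore, so an identifier pattern built on it stops at the first accented letter. The pattern below is a simplified stand-in chosen for illustration, not necessarily the exact pattern the Python grammar used.

```javascript
// Simplified ASCII-only identifier pattern (assumed for illustration).
const ASCII_IDENT_RE = /[A-Za-z_]\w*/;

const source = "def enRatónPesionado(x, y):";

// Skip the "def " keyword and match the function name that follows.
const name = source.slice(4);
const matched = name.match(ASCII_IDENT_RE)[0];

console.log(matched); // "enRat": the match stops before "ó"
```

The rest of the name is then tokenized separately, which is why the function title is not highlighted as a single unit.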
It looks like I can't request a review, but I'd love to get some eyes on this @joshgoebel
Some very high level thoughts:
Previously I changed the entire library to
So if you're interested in pursuing this further, I think the best bang-for-the-buck way to get some useful info would be:
Doing this quickly right now (with just a single run) I see ~7.4s HEAD vs ~8.5s with So I'd be curious to know:
@austin-schick For starters perhaps see if you can reproduce my results with
The parsing engine often combines 100s of regexes into a single huge "either" regex... so just saying "this single regex is /u but not the others" isn't easily possible. Also, many of our regexes are still in string form, giving us no place to even store this the Hence the easiest way to split
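A minimal sketch of the constraint described above, assuming a hypothetical `joinEither` helper and made-up rule sources (neither is taken from the highlight.js codebase): once several patterns are joined into one alternation, the flags apply to the whole `RegExp` object, so unicode mode cannot be enabled for a single alternative.

```javascript
// Hypothetical per-rule patterns, kept as source strings.
const ruleSources = ["\\bdef\\b", "[A-Za-z_]\\w*", "\\d+"];

// Wrap each source in a capture group and join into one alternation.
function joinEither(sources, flags) {
  const combined = sources.map((s) => `(${s})`).join("|");
  return new RegExp(combined, flags);
}

// Flags belong to the whole RegExp object: either every alternative
// is compiled in unicode mode, or none is.
const plain = joinEither(ruleSources, "m");
const unicode = joinEither(ruleSources, "mu");

console.log(plain.unicode);   // false
console.log(unicode.unicode); // true

// The rule that matched is recovered from the first defined capture group.
function matchedRule(re, text) {
  const m = re.exec(text);
  return m ? m.findIndex((g, i) => i > 0 && g !== undefined) - 1 : -1;
}

console.log(matchedRule(plain, "42")); // 2
```

String-form patterns also carry no flags of their own, which is the other half of the problem: the flag has to be decided wherever the final `RegExp` is constructed.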
Thanks @joshgoebel! I'm responding to a subset of your comments to start out.
I ran
Makes sense! What's your list of supported browsers / versions? I'm not concerned about the browser support of
I believe that's what I'm doing with this PR. Are we on the same page?
I'm pretty sure you aren't testing the FULL change until you do it globally. You're only touching Python in this PR but the tests are running thousands of highlights, of which only a tiny percent (1/190 or so) are Python. It's still possible Python is slower, but it's getting lost with all the other tests being the same speed (because they aren't using In my testing I turned on
Well, that Firefox issue is perhaps more problematic than the speed. One could make an argument that we should trade performance for accuracy (though I'm not sure all users would agree). Generally we support "all modern green-field browsers, going a few years back"... which until now has meant great browser coverage, but also no conditional support or polyfills... saying we only support Firefox from mid-2020 forward would be a change. @highlightjs/core Though, since most browsers have auto-update now (and Firefox generally isn't tied to OS versions) am I worrying too much about older Firefox versions? I wonder.
Yes, indeed, it'd be a per-grammar flag as you're doing here.
Here is my test case: https://github.com/joshgoebel/highlight.js/tree/slower_utf8 I think we need to somehow show that the 15% loss is incorrect, or switch to a discussion I proposed above:
And yes I do realize IF we went per-grammar (which has its own small costs, I'd prefer if it could be done globally) we'd only pay that cost on "modern" languages that support UTF-8 identifiers, but I'd also guess that would be most modern languages - so if our common bundle was say 70% modern languages then I assume we'd still be paying a measurable cost even if we only turn on support for those. We must of course also keep in mind that performance "leaks" over into all other grammars during auto-detection... so someone highlighting Lua (via auto-detect) with the CC @highlightjs/core
Browsers do usually have an auto-update feature on personal computers, but I've been in enterprise environments where that is controlled by the IT department, and they may not always be on top of updates. I think if we want to adopt a policy, we can do something similar to what Autoprefixer, Create React App, and friends do. They use something like browserslist and support "last 2 versions" (or X in our case) of each browser.
I also see this 15% to 20% speed decrease running on your branch, so I agree that it may not be good to make this a global change. However, if we just focus on this PR, which is just targeting Python, we get:
This may not be the way forward for every language, or individual languages may need optimization, but I don't see any reason not to take this approach for Python in particular.
I'm much less worried about true enterprise environments. If an enterprise wants to purposely use an older browser then they need to deal with the problems that causes. If Highlight.js is "enterprise critical" then such an enterprise may need to maintain a compatible version just for their use... we shouldn't punish the rest of the world just for the sake of the enterprise. I know at least one person running us thru Babel to produce an ES5 build... which we've long since moved on from (thankfully). That's fine for them if it works, but not something the core team needs to deal with.
I'm not sure how we'd enforce this exactly (does Babel or something have some sort of "check only" tooling?), but I think I like the spirit of it... If by "version" you mean "release" here... Firefox seems to issue releases every 1 - 1.5 months (recently)... if you extrapolate back a year that's ~9-12 releases... I think I'm semi-comfortable saying "last few" or "past 6 months" etc... I'm slightly more sympathetic to those on older Mac OS using a Safari with no upgrade path, but Safari has supported this stuff since v11.1 in early 2018. I'm only slightly bothered because it feels like this is the first time we're drawing a line in the sand... otherwise I think the past several years' worth of modern browsers work just fine... but I suppose I can get over it.
That's a fair redirect... and thanks for that. :-) Can you provide a benchmark script/scaffold or something I could use to quickly confirm the performance numbers you are seeing on this side - do you have something semi-automated for all those 10x, 100x, 1500x you mention? Perhaps there is indeed something else entirely at work here - some other grammar - or a particular regex that is somehow ill-suited for UTF-8 mode that is skewing the results when the change is global.
If I can confirm those performance numbers then I believe I would concur. This PR would still need some small changes, but I could work with you on those once I double check the numbers.
Awesome, thank you! I'll put together a script for easy performance testing and send that over Thursday or Friday.
Here you go, @joshgoebel! You can run I think the easiest way to run the tests on main is to cherry-pick 62e64d5 onto main and then run the script exactly the same way
Whoops, you'll want to cherry-pick this commit too (or just uncomment the line it uncomments) ^
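The script itself is not reproduced in this thread. As a rough sketch of what such a harness could look like, the code below times repeated runs at the 10x, 100x, and 1500x levels mentioned earlier. Everything here is an assumption for illustration: `timeRuns` and `countMatches` are hypothetical helpers, and the workload is a plain regex scan over synthetic text rather than a call into the highlighter.

```javascript
// Times `runs` calls of `fn` and returns elapsed milliseconds (Node.js).
function timeRuns(fn, runs) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < runs; i++) fn();
  const end = process.hrtime.bigint();
  return Number(end - start) / 1e6;
}

// Stand-in workload: scan a synthetic source text for identifiers.
// A real comparison would run the highlighter on a real file instead.
const text = "def enRatónPesionado(x, y):\n    return x + y\n".repeat(50);

function countMatches(re) {
  re.lastIndex = 0; // global regexes keep state between exec calls
  let count = 0;
  while (re.exec(text) !== null) count++;
  return count;
}

const asciiRe = /[A-Za-z_]\w*/g;
const unicodeRe = /[\p{XID_Start}_]\p{XID_Continue}*/gu;

for (const runs of [10, 100, 1500]) {
  const a = timeRuns(() => countMatches(asciiRe), runs);
  const u = timeRuns(() => countMatches(unicodeRe), runs);
  console.log(`${runs}x: ascii ${a.toFixed(1)}ms, unicode ${u.toFixed(1)}ms`);
}
```

Single runs are noisy, so repeating each level several times and comparing medians gives a more trustworthy picture than one measurement.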
Sorry I've been busy, I should get to this later this week.
Awesome, looking forward to your review :)
@joshgoebel Any update on this? With the fall semester starting, it'd be great to get this working now
Ok, I don't see any performance regression here... let me drop a few comments on the PR itself so we can move forward! Thanks for the ping.
src/languages/python.js
Outdated
@@ -5,10 +5,14 @@ Website: https://www.python.org
 Category: common
 */

-import { UNDERSCORE_IDENT_RE } from '../lib/modes.js';
+import { UNICODE_SUPPORTED, UNDERSCORE_IDENT_RE } from '../lib/modes.js';
We don't need a check for this (we can assume we only support browsers with unicode regex support), and we don't need the import at all anymore I don't think.
import { UNICODE_SUPPORTED, UNDERSCORE_IDENT_RE } from '../lib/modes.js';
src/languages/python.js
Outdated
 import * as regex from '../lib/regex.js';

 export default function(hljs) {
+  const PY_IDENT_RE = UNICODE_SUPPORTED ?
Please remove the conditional and just write out the regex as a regex with /u.
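A sketch of what an unconditional literal could look like, assuming the `XID_Start` and `XID_Continue` unicode properties are used to approximate Python's identifier rules (Python defines its identifiers in terms of these properties). The exact pattern merged in the PR is not shown in this thread, so treat this one as illustrative.

```javascript
// Identifier: an XID_Start character or underscore, then any number of
// XID_Continue characters. Property escapes require the /u flag.
const PY_IDENT_RE = /[\p{XID_Start}_]\p{XID_Continue}*/u;

console.log("enRatónPesionado(x, y)".match(PY_IDENT_RE)[0]); // "enRatónPesionado"
console.log("Círculo(x, y, 50)".match(PY_IDENT_RE)[0]);      // "Círculo"
console.log(PY_IDENT_RE.test("50"));                          // false
```

Writing the pattern as a literal means an engine without property-escape support fails when the file is parsed, which is consistent with dropping the runtime check and assuming supported browsers.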
src/languages/python.js
Outdated
@@ -358,6 +362,7 @@ export default function(hljs) {
 'gyp',
 'ipython'
 ],
+unicode: UNICODE_SUPPORTED,
Let's go with unicodeRegex: true...
-unicode: UNICODE_SUPPORTED,
+unicodeRegex: true,
And of course the compiler will need the name updated there.
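For context, a per-grammar flag like this only has an effect if the code that builds regexes reads it. The sketch below shows one way that could work; `langRe` is written here as a hypothetical helper, and the grammar objects are minimal stand-ins rather than real language definitions.

```javascript
// Hypothetical helper: builds a RegExp for a grammar, adding the "u" flag
// only when the grammar opts in via `unicodeRegex: true`.
function langRe(grammar, source, global = false) {
  const flags =
    "m" +
    (grammar.case_insensitive ? "i" : "") +
    (grammar.unicodeRegex ? "u" : "") +
    (global ? "g" : "");
  return new RegExp(source, flags);
}

const python = { name: "Python", unicodeRegex: true };
const lua = { name: "Lua" };

console.log(langRe(python, "\\p{XID_Start}").flags); // "mu"
console.log(langRe(lua, "\\w+", true).flags);        // "gm"
```

Keeping the decision in one helper means grammars that do not opt in are compiled exactly as before, which matches the per-language rollout discussed in this thread.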
src/lib/modes.js
Outdated
@@ -4,6 +4,16 @@ import * as regex from './regex.js';
 /** @typedef {import('highlight.js').Mode} Mode */
 /** @typedef {import('highlight.js').ModeCallback} ModeCallback */

+// Determine browser support for RegExp unicode property escapes
Remove, etc.
You can just add new commits on top... I can clean up before merging... and figure out whether I want the perf stuff in this PR or make a sep one for that...
I know I took a while to review this - if you're too busy to finish it up I may pick it up in the next few days and push it to completion myself, since now the XML stuff is potentially waiting on this.
Thanks for helping to push this to the finish line @joshgoebel. I had a busy two weeks, but I'm a little more open now. Let me know if there's anything else I can do
@austin-schick Thanks so much for the PR!
Changes
Adds support for unicode Python identifiers. As discussed in #2756, Python identifiers have supported a wider range of characters than the JavaScript regex \w since Python 3.
This PR starts using the /u flag in regexes for Python only, and introduces a mechanism for converting one language to unicode at a time. It also allows hljs to fall back to regexes without unicode property escapes if the user's browser does not yet support them.
I did a bit of performance testing in node against this file from the bs4 library. Highlighting the file 1500 times was marginally faster with unicode mode enabled: 18278ms vs 18615ms with unicode mode disabled.
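A minimal sketch of the fallback idea described above, assuming detection is done by trying to construct a regex that uses a property escape. The function name and the fallback pattern are illustrative; the PR's actual detection code is not shown in this thread.

```javascript
// Returns true when the runtime can compile unicode property escapes.
function supportsUnicodePropertyEscapes() {
  try {
    new RegExp("\\p{L}", "u");
    return true;
  } catch (err) {
    return false; // engines without support throw a SyntaxError
  }
}

// Pick the identifier pattern based on runtime support.
const UNICODE_SUPPORTED = supportsUnicodePropertyEscapes();
const identRe = UNICODE_SUPPORTED
  ? new RegExp("[\\p{XID_Start}_]\\p{XID_Continue}*", "u")
  : /[A-Za-z_]\w*/;

console.log(UNICODE_SUPPORTED, "Círculo".match(identRe)[0]);
```

The unicode pattern is built with the `RegExp` constructor rather than a literal because a literal containing `\p{...}` under the `u` flag would be a parse-time error in an engine that lacks support, before any detection code could run.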
Checklist
CHANGES.md