Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enh(python) Add support for unicode identifiers #3280

Merged

Conversation

austin-schick
Copy link
Contributor

@austin-schick austin-schick commented Jul 19, 2021

Changes

Adds support for unicode Python identifiers. As discussed in #2756, Python identifiers have supported a wider range of characters than the Javascript regex \w since Python 3.

This PR starts using the /u flag in regexes for Python only, and introduces a mechanism for converting one language to unicode at a time. It also allows hljs to fall back to regexes without unicode property escapes if the user's browser does not yet support them.

I did a bit of performance testing in node against this file from the bs4 library. Highlighting the file 1500 times was marginally faster with unicode mode enabled -- 18278ms vs 18615ms with unicode mode disabled.

Checklist

  • Added markup tests
  • Updated the changelog at CHANGES.md

@austin-schick austin-schick changed the title Add support for diacritics in Python identifiers fix(python) Add support for diacritics in Python identifiers Jul 20, 2021
@austin-schick austin-schick marked this pull request as draft July 20, 2021 00:29
@austin-schick austin-schick changed the title fix(python) Add support for diacritics in Python identifiers fix(python) Add support for unicode Python identifiers Jul 20, 2021
@austin-schick
Copy link
Contributor Author

For a bit of context here, I'm working on internationalizing an interactive online textbook. Code like

def enRatónPesionado(x, y):
    Círculo(x, y, 50, relleno='azulMarino', opacidad=x)

isn't highlighting properly (and as you can see here, good support for Spanish-language programmers is rare, which is a shame)

@austin-schick austin-schick marked this pull request as ready for review July 20, 2021 17:16
@austin-schick
Copy link
Contributor Author

It looks like I can't request a review, but I'd love to get some eyes on this @joshgoebel

@joshgoebel
Copy link
Member

joshgoebel commented Jul 20, 2021

Some very high level thoughts:

  • We're not interested in conditional behavior across different browsers. If all browsers we support do not support /u yet then it's simply off the table until they do (like negative lookbehind, because of Safari lagging). At a glance though I don't think this is an issue, so we shouldn't need a conditional...
  • It's not just Python you'd have to test (for performance) but EVERY potential input source language. Auto-detect is a big feature which means that any grammars end up being run against code from many different languages. A change to Python that makes auto-detection of C++ much slower is a problem, even if it has 0 ill effects for Python highlighting.
  • It's also possible Unicode is faster (or similar) at run-time yet slower at startup. Start-up concerns me less than run-time but if it's considerably slower it's still relevant since in-browser usage is a large portion of our use case - and there start-up time matters.

Previously I changed the entirely library to /u and saw 10-20% slower runs of the test suite (Node)... that's not am exact benchmark, but it's something. I feel like the performance tests we really need here are:

  • benchmarked versions of our auto-detection tests (it's possible ./tools/CheckAutoDetect.js could also serve this purpose for Node, though it would also include start-up time)
  • benchmark highlighting all our markup test content with individual grammars... (comparing UTF mode to non-UTF) (the markup tests are a larger sample size than the auto-detect tests, and targeted at a single grammar, ie: C code is only run against the C grammar)
  • possibly benchmark some larger files (with samples for our common subset of languages)

So if you're interesting in pursuing this further I think the best bang for the buck course to get some useful info would be:

  • Add some actual benchmarking to CheckAutoDetect.js... or it's possible just running in 5-6 times in a Bash script might be roughly informative.
  • Benchmark current HEAD vs HEAD + /u (for all regex). (see mode_compiler.js line ~33 os so)

Doing this quickly right now (with just a single run) I see ~7.4s HEAD vs ~8.5s with /u... (that's without using any complex UTF-8 matchers, which I assume would be slower, though perhaps I'm mistaken). That's a 15% slow-down (akin to what I was seeing before), which is not great - and why #2756 got stuck in the first place.

So I'd be curious to know:

  • Is this start-up cost or runtime cost [or both] (modifying the script itself to run the tests say 5-10 times vs 1 should prove educational here)
  • Is there some way that those users wishing to highlight UTF-8 identifiers could pay this performance penalty? I'm afraid the answer here will be no since once UTF-8 starts sneaking into all our regex then we're forced to use /u everywhere... and having two entirely different regex that are conditional isn't a route we'd want to go either.
  • We could also support /u on a per-grammar basis (like we currently do with case-sensitivity), adding some complexity to try and avoid paying the 15% performance penalty everywhere.

@austin-schick For starters perhaps see if you can reproduce my results with /u and checkAutoDetect?

@joshgoebel
Copy link
Member

joshgoebel commented Jul 20, 2021

The parsing engine often combines 100s of regex into a huge single "either" regex... so just saying "this single regex is /u not others" isn't easily possible. Also, many of our regex are still in string form giving us no place to even store this the /u information. IDENTs especially are commonly string form for easy appending. So switching to regex everywhere just to know which are /u and which aren't is no small task. Even then we'd be forced to trigger /u if a single regex in a contains group needed it... so 1 regex out of 100 might force all 100 to use /u...

Hence the easiest way to split /u vs not is at the grammar level (as we already do with case).

@austin-schick
Copy link
Contributor Author

austin-schick commented Jul 21, 2021

Thanks @joshgoebel! I'm responding to a subset of your comments to start out.

For starters perhaps see if you can reproduce my results with /u and checkAutoDetect?

I ran time (for i in {1..10}; do node checkAutoDetect.js; done) on this branch and main. This branch took 70.78s to run, main took 73.69s to run. So we actually got a small speedup, but there seems to be a few seconds of variation in the tests.

We're not interested in conditional behavior across different browsers

Makes sense! What's your list of supported browsers / versions? I'm not concerned about the browser support of \u, but rather the support of \p{} escapes (here), which weren't supported in Firefox until mid last year.

Hence the easiest way to split /u vs not is at the grammar level (as we already do with case).

I believe that's what I'm doing with this PR. Are we on the same page?

@joshgoebel
Copy link
Member

. So we actually got a small speedup, but there seems to be a few seconds of variation in the tests.

I'm pretty sure you aren't testing the FULL change until you do it globally. You're only touching Python in this PR but the tests are running thousands of highlights, of which only a tiny percent (1/190 or so) are Python. It's still possible Python is slower, but it's getting lost with all the other tests being the same speed (because they aren't using /u).

In my testing I turned on /u GLOBALLY (one line patch - though I had to fix a few broken regex)... and saw the 15% speed decrease - which to me means that (until we can prove otherwise) that we're looking at a measurable slowdown if this support is added to all languages.

Makes sense! What's your list of supported browsers / versions? I'm not concerned about the browser support of \u, but rather the support of \p{} escapes (here), which weren't supported in Firefox until mid last year.

Well, that Firefox issue is perhaps more problematic than the speed. One could make an argument that we should trade performance for accuracy (though I'm not sure all users would agree). Generally we support "all modern green-field browsers, going a few years back"... which until now has meant great browser coverage, but also no conditional support or polyfills... saying we only support Firefox from mid-2020 forward would be a change.

@highlightjs/core Though, since most browsers have auto-update now (and Firefox generally isn't tied to OS versions) am I worrying too much about older Firefox versions? I wonder.

I believe that's what I'm doing with this PR. Are we on the same page?

Yes, indeed, it'd be a per grammar flag as you're doing here.

@joshgoebel
Copy link
Member

joshgoebel commented Jul 21, 2021

Here is my test case:

https://github.com/joshgoebel/highlight.js/tree/slower_utf8

I think we need to somehow show that the 15% loss is incorrect, or switch to a discussion I proposed above:

One could make an argument that we should trade performance for accuracy (though I'm not sure all users would agree).


And yes I do realize IF we went per-grammar (which has it's own small costs, I'd prefer if it could be done globally) we'd only pay that cost on "modern" languages that support UTF-8 identifiers, but I'd also guess that would be most modern languages - so if our common bundle was say 70% modern languages then I assume we'd still be paying a measurable cost even if we only turn on support for those.

We must of course also keep in mind that performance "leaks" over into all other grammars during auto-detection... so someone highlighting Lua (via auto-detect) with the common build would still pay that 70% of 15% price, etc.

CC @highlightjs/core

@allejo
Copy link
Member

allejo commented Jul 21, 2021

Though, since most browsers have auto-update now (and Firefox generally isn't tied to OS versions) am I worrying too much about older Firefox versions? I wonder.

Browsers do usually have an auto-update feature on personal computers but I've been in enterprise environments where that is controlled by the IT department and they may not always be on top of updates.

I think if we want to adopt a policy, we can do what similar to what autoprefixer, Create React App, and friends do. They use something like browserslist and support "last 2 versions" (or X in our case) of each browser.

@austin-schick
Copy link
Contributor Author

austin-schick commented Jul 21, 2021

In my testing I turned on /u GLOBALLY (one line patch - though I had to fix a few broken regex)... and saw the 15% speed decrease

I also see this 15% to 20% speed decrease running on your branch, so I agree that it may not be good to make this a global change.

However, if we just focus on this PR, which is just targeting Python, we get:

  • Better accuracy in the highlighting (with real world use cases)
  • No performance difference in the Python markup tests (58.81s here vs 63.12s main, run 100x)
  • No performance difference in global checkAutoDetect (70.78s here vs 73.69s main, run 10x)
  • Minimal performance difference in Python-only checkAutoDetect (58.70s here vs 56.89s main, run 100x)
  • No performance difference on a large real-world file (18.28s here vs 18.61s main, run 1500x)

This may not be the way forward for every language, or individual languages may need optimization, but I don't see any reason not to take this approach for Python in particular

@joshgoebel
Copy link
Member

joshgoebel commented Jul 21, 2021

Browsers do usually have an auto-update feature on personal computers but I've been in enterprise environments where that is controlled by the IT department and they may not always be on top of updates.

I'm much less worried about true enterprise environments. If an enterprise wants to purposely use an older browser then they need to deal with the problems that causes. If Highlight.js is "enterprise critical" then such an enterprise may need to maintain a compatible version just for their use... we shouldn't punish the rest of the world just for the sake of the enterprise. I know at least one person running us thru Babel to produce an ES5 build... which we've long since moved on from (thankfully). That's fine for them if it works, but not something the core team needs to deal with.

support "last 2 versions" (or X in our case) of each browser.

I'm not sure how we'd enforce this exactly (does Babel or something have some sort of "check only" tooling?), but I think I like the spirit of it... If by "version" you mean "release" here... Firefox seems to issue releases every 1 - 1.5 months (recently)... if you extrapolate back a year that's ~9-12 releases... I think I'm semi-comfortable saying "last few" or "past 6 months" etc... I'm slightly more sympathetic for those on older Mac OS using a Safari with no upgrade path, but Safari has supported this stuff since v11.1 in early 2018.

I'm only slightly bothered because it feels like this is the first time we're drawing a line in the sand... otherwise I think we support the past several years worth of modern browsers work just fine... but I suppose I can get over it.

@joshgoebel
Copy link
Member

However, if we just focus on this PR, which is just targeting Python, we get:

That's a fair redirect... and thanks for that. :-) Can you provide a benchmark script/scaffold or something I could use to quickly confirm the performance numbers you are seeing on this side - do you have something semi-automated for all those 10x, 100x, 1500x you mention? Perhaps there is indeed something else entirely at work here - some other grammar - or a particular regex that is somehow ill-suited for UTF-8 mode that is skewing the results when the change is global.

... I don't see any reason not to take this approach for Python in particular.

If I can confirm those performance numbers then I believe I would concur. This PR would still need some small changes, but I could work with you on those once I double check the numbers.

@austin-schick
Copy link
Contributor Author

Awesome, thank you! I'll put together a script for easy performance testing and send that over Thursday or Friday.

@austin-schick
Copy link
Contributor Author

Here you go, @joshgoebel! You can run node perf.js from the tools directory to get four different performance metrics. I'm hoping I wrote this in a way that's easily extensible so you can apply it to #3257 as well.

I think the easiest way to run the tests on main is to cherry pick 62e64d5 onto main and then run the script exactly the same way

@austin-schick
Copy link
Contributor Author

Whoops, you'll want to cherry pick this commit too (or just uncomment the line it uncomments) ^

@austin-schick austin-schick changed the title fix(python) Add support for unicode Python identifiers enh(python) Add support for unicode identifiers Jul 27, 2021
@joshgoebel
Copy link
Member

Sorry I've been busy, I should get to this later this week.

@austin-schick
Copy link
Contributor Author

Awesome, looking forward to your review :)

@austin-schick
Copy link
Contributor Author

@joshgoebel Any update on this? With the fall semester starting, it'd be great to get this working now

@joshgoebel
Copy link
Member

Ok, I don't see any performance regression here... let me drop a few comments on the PR itself so we van move forward!

Thanks for the ping.

@@ -5,10 +5,14 @@ Website: https://www.python.org
Category: common
*/

import { UNDERSCORE_IDENT_RE } from '../lib/modes.js';
import { UNICODE_SUPPORTED, UNDERSCORE_IDENT_RE } from '../lib/modes.js';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need a check for this (we can assume we only support browsers with unicode regex support), and we don't need the import at all anymore I don't think.

Suggested change
import { UNICODE_SUPPORTED, UNDERSCORE_IDENT_RE } from '../lib/modes.js';

import * as regex from '../lib/regex.js';

export default function(hljs) {
const PY_IDENT_RE = UNICODE_SUPPORTED ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove conditional and just write out the regex as a regex with /u.

@@ -358,6 +362,7 @@ export default function(hljs) {
'gyp',
'ipython'
],
unicode: UNICODE_SUPPORTED,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets go with unicodeRegex: true...

Suggested change
unicode: UNICODE_SUPPORTED,
unicodeRegex: true,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And of course compiler will need the name updated there.

src/lib/modes.js Outdated
@@ -4,6 +4,16 @@ import * as regex from './regex.js';
/** @typedef {import('highlight.js').Mode} Mode */
/** @typedef {import('highlight.js').ModeCallback} ModeCallback */

// Determine browser support for RegExp unicode property escapes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove, etc.

@joshgoebel
Copy link
Member

You can just add new commits on top... I can clean up before merging... and figure out whether I want the perf stuff in this PR or make a sep one for that...

@joshgoebel
Copy link
Member

I know I took a while to review this - if you're too busy to finish it up I may pick it up in the next few days and push it to completion myself, since now the XML stuff is potentially waiting on this.

@austin-schick
Copy link
Contributor Author

Thanks for helping to push this to the finish line @joshgoebel. I had a busy two weeks, but I'm a little more open now. Let me know if there's anything else I can do

@joshgoebel
Copy link
Member

@austin-schick Thanks for much for the PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants