handle surrogate pairs in character offsets #2509

minrk · 2017-05-19T22:46:48Z

CodeMirror and JavaScript index strings by UTF-16 code units, but the Jupyter protocol expects Unicode *character* offsets, so we need to translate back and forth.

closes jupyter/jupyter_client#259
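
The translation described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the function names `codeUnitToCodePointOffset` and `codePointToCodeUnitOffset` are hypothetical, and the sketch assumes a JavaScript engine with ES2015 string iteration.

```javascript
// Hypothetical helpers for illustration; names are not taken from the patch.

// Convert a UTF-16 code unit index (what JavaScript and CodeMirror use)
// into a Unicode code point offset (what the Jupyter protocol expects).
function codeUnitToCodePointOffset(text, unitIndex) {
    // Iterating a string yields whole code points, so a surrogate pair
    // in the prefix is counted once instead of twice.
    return Array.from(text.slice(0, unitIndex)).length;
}

// Convert a Unicode code point offset back into a UTF-16 code unit index.
function codePointToCodeUnitOffset(text, pointOffset) {
    let units = 0;
    let points = 0;
    for (const ch of text) {
        if (points >= pointOffset) {
            break;
        }
        units += ch.length; // 1 for BMP characters, 2 for surrogate pairs
        points += 1;
    }
    return units;
}
```

For text without non-BMP characters both functions return their input unchanged, so the conversion only matters when surrogate pairs appear before the cursor.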

stevengj · 2017-05-20T12:10:09Z

Can you increment something in the protocol version so that in IJulia we can disable our workaround if this patch is present?

minrk · 2017-05-20T20:40:38Z

Since this is fixing a bug in this repo where it failed to implement the existing spec correctly, I'm not sure that updating the protocol version is appropriate. Unless we want to retroactively change the v5.1 protocol spec to describe the buggy behavior and then define v5.2 to be exactly as v5.1 is now.

stevengj · 2017-05-21T11:38:20Z

I'm just wondering how to implement support for both the new and old versions of Jupyter.

Arguably, this is a de facto change in the protocol, since up to now Jupyter has apparently always been using UTF-16 code units for cursor positions.

Carreau · 2017-05-21T16:25:02Z

Well, technically it's only the notebook frontend that does that. We would have to check how nteract, hydrogen, qtconsole, jupyterlab and sagemathcloud behave. If they all behave incorrectly, then it may make sense. But the notebook is just one of the implementations.

blink1073 · 2017-05-22T11:56:07Z

I opened jupyterlab/jupyterlab#2255 to track this in JupyterLab.

stevengj · 2017-05-22T14:18:58Z

Hydrogen seems to have the same bug, since it passes code.length as the cursor_pos: my understanding is that this is the length in UTF-16 code units, not Unicode characters (& CoffeeScript's string indexing is identical to JavaScript's).
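
To illustrate the difference, here is a small sketch comparing the two ways of measuring a string in JavaScript. The `content` object is a hypothetical `complete_request` payload with the cursor at the end of the code; it is not taken from any frontend's source.

```javascript
// Two MATHEMATICAL BOLD SMALL A characters (U+1D41A), both outside the BMP.
const code = "\u{1D41A}\u{1D41A}";

// String.length counts UTF-16 code units, so each character counts twice.
const utf16Length = code.length;

// The string iterator walks whole code points, so this counts characters.
const codePointLength = Array.from(code).length;

// Hypothetical complete_request content with the cursor at the end of the code,
// expressed in Unicode characters as the protocol expects.
const content = { code: code, cursor_pos: codePointLength };
```

Here `utf16Length` is 4 while `codePointLength` is 2, so a frontend that sends `code.length` reports a cursor position past the end of the text as the kernel sees it.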

stevengj · 2017-05-22T14:24:01Z

sagemathcloud seems to have copied the IPython notebook code, so it appears to have the same bug.

stevengj · 2017-05-22T14:26:06Z

nteract seems to have the same bug.

stevengj · 2017-05-22T14:34:27Z

qtconsole seems to use whatever Python's indexing is? On Python 3.3 or later, this will be unicode characters, but on previous Python versions it will be UTF-16 code units. Or maybe it's set by Qt? A little hard for me to tell without trying it.

stevengj · 2017-05-22T14:35:56Z

In short, regarding @Carreau's comment, it seems that almost all existing Jupyter clients in fact implement the buggy UTF-16 behavior. So, it might make sense to retroactively change the spec to reflect this behavior (UTF-16 code units), and then change the spec going forward to use codepoints.

(Or, alternatively, change the spec retroactively to use UTF-16 code units and keep it that way going forward. Either way, some kernels and/or clients will have to compensate.)

stevengj · 2017-05-22T18:15:48Z

I just tried qtconsole on my Mac with Python 2.7 and it seems to have the same bug. Pasting 𝐚𝐚 followed by tab results in a complete_request message with code="𝐚𝐚" and cursor_pos=4. Dunno about Python 3.

rgbkrk · 2017-05-22T18:31:33Z

> nteract seems to have the same bug.

Noted, adding an issue there to track.

minrk · 2017-05-22T22:18:14Z

I've tested with QtConsole on py27 as well, and it was correct. Python 2 can store unicode as UCS-2 or UCS-4 (reflected in sys.maxunicode), chosen at build time (one of the many things fixed in Python 3). On Linux, I believe UCS-4 is the default on 64-bit builds (~all users these days) and UCS-2 on 32-bit. It's UCS-2 by default on most darwin builds, for some reason. If the build uses UCS-4, offsets will be correct; if it uses UCS-2, they will have the same issue as js. So for the most part, Python frontends behave correctly, but javascript frontends have this bug.

Encoding this bug into released versions of the spec means that all of the correctly-behaving frontends (Python-based ones, such as QtConsole, jupyter-console, etc.) would retroactively get the inverse bug introduced, breaking code that was correct.

This is a case where it would have been really nice to include the frontend name and version in message headers, because changing the spec to backdate the bug would introduce that same bug into packages that didn't have problems, which causes problems of its own.

stevengj · 2017-05-23T02:57:31Z

Given that there is no way to detect the frontend, in IJulia we've opted to just assume that the bug is present. This will be true for the vast majority of our users (of whom qtconsole users are a minuscule minority).

My biggest concern is what to do going forward: once you merge this fix, how do we detect it if you don't bump the protocol version? (And bumping the protocol version would also notify kernels and frontends of the effective change in wire protocol.)

I agree that enshrining the bug retroactively in the spec is ugly, and would technically break things, but apparently only for Linux or python3-based qtconsole and jupyter-console users who use non-BMP characters in their source code. This is apparently an incredibly small fraction of users since no one has noticed this problem until now. (Unicode identifiers are still pretty uncommon to begin with, especially outside of Julia, and on top of that you only see this problem for "astral plane" chars.)

minrk · 2017-05-23T16:32:11Z

@stevengj yeah, I'm trying to figure out the best solution and nothing seems perfect. Bumping the version does make sense, but I'd describe previous versions as ones that MAY have this bug (i.e. it's ambiguous), with a note about affected frontends, rather than backdating the definition to the buggy implementations. Kernels can then make their own choice about which frontends they want to interact with incorrectly for ambiguous versions.

I'll draft an update to the protocol with a proposal and cc folks here.

Carreau · 2017-05-23T17:00:22Z

We can still add a key in the metadata field of the completion request that says "this is fixed". It does not bump the spec and can include client names. We just need a pseudo-standard until we get the next spec version, right?

minrk · 2017-05-23T17:09:50Z

That's another option, to include a 'unicode_fixed' key or similar. I'm not sure a temporary pseudo-standard helps if we are doing a minor bump anyway, since we can push that out pretty quickly. I'm okay with defining 5.2 as "5.1 with this ambiguity resolved" immediately, for instance.
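
As a rough sketch of how a kernel could act on such a version bump: the code below assumes the message header carries a `version` string (for example "5.0" or "5.2") and that the kernel chooses to treat `cursor_pos` from anything older than 5.2 as UTF-16 code units. That policy and the function names are assumptions for illustration; each kernel would make its own choice for the ambiguous versions.

```javascript
// Hypothetical kernel-side helpers; names and policy are assumptions.

// Parse a "major.minor" protocol version string such as "5.2".
function parseProtocolVersion(version) {
    const [major, minor = 0] = String(version).split(".").map(Number);
    return { major, minor };
}

// True when cursor_pos can be trusted to be in Unicode code points,
// i.e. the sender identifies as protocol 5.2 or later.
function cursorPosIsUnambiguous(header) {
    const { major, minor } = parseProtocolVersion(header.version);
    return major > 5 || (major === 5 && minor >= 2);
}

// Return cursor_pos as a code point offset. For ambiguous (pre-5.2)
// senders this sketch assumes UTF-16 code units, which matches the
// JavaScript frontends but is wrong for frontends that were already correct.
function normalizeCursorPos(header, code, cursorPos) {
    if (cursorPosIsUnambiguous(header)) {
        return cursorPos;
    }
    return Array.from(code.slice(0, cursorPos)).length;
}
```

When the code contains no non-BMP characters, both interpretations agree, so the choice only affects cells with surrogate pairs before the cursor.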

Carreau · 2017-05-23T17:21:46Z

This definition of 5.2 suits me as well. I was just offering an alternative.

BTW, isn't the server negotiating the protocol version? In which case other clients that talk to the kernel via the server may be wrong, in which case bumping the version number is not useful?

minrk · 2017-05-23T18:10:08Z

Thanks! Good to have alternatives to compare.

> isn't the server negotiating the protocol version? In which case other clients that talk to the kernel via the server may be wrong, in which case bumping the version number is not useful?

I was just looking into this, and the protocol version comes from the javascript, so it's currently identifying as 5.0 and bumping the version in the js to 5.2 should do everything right, as far as I can tell. If this weren't the case, I think the extra key would probably be a more appropriate band-aid.

The Python frontends, on the other hand, are importing the version from jupyter_client right now, so protocol version info will not correctly signal changes in the application after an upgrade to jupyter_client. But in general, the version should be coming from the individual applications, not the defaults in jupyter_client, so we can and should fix this in jupyter_console and qtconsole. But these frontends also don't have the issue on Python 3, so it's of less immediate importance. The answer to unicode issues on Python 2 has always been to upgrade to Python 3 :).

minrk added 2 commits May 19, 2017 15:46

handle surrogate pairs

a5e64e1

CodeMirror / javascript use utf16 code unit offsets, but Jupyter protocol expects unicode *character* offsets, so we need to translate back and forth.

handle surrogate pairs in tooltip cursor_pos

2da75c3

minrk mentioned this pull request May 19, 2017

incorrect cursor_pos for completion requests with non-BMP Unicode characters jupyter/jupyter_client#259

Closed

minrk mentioned this pull request May 20, 2017

incorrect cursor position for bold math text codemirror/codemirror5#4750

Closed

blink1073 mentioned this pull request May 22, 2017

Handle surrogate pairs in character offsets jupyterlab/jupyterlab#2255

Closed

blink1073 approved these changes May 22, 2017

rgbkrk mentioned this pull request May 22, 2017

handle surrogate pairs in character offsets nteract/nteract#1706

Closed

takluyver merged commit df2fc69 into jupyter:master May 23, 2017

takluyver added this to the 5.1 milestone May 23, 2017

This was referenced May 23, 2017

incorrect cursor_pos for non-BMP characters in Jupyter protocol nteract/hydrogen#807

Closed

incorrect cursor_pos for non-BMP characters in Jupyter protocol sagemath/cloud#207

Closed

minrk deleted the surrogates branch May 23, 2017 17:24

minrk mentioned this pull request May 23, 2017

describe cursor_pos ambiguity and bump protocol to 5.2 jupyter/jupyter_client#262

Merged

minrk mentioned this pull request Jun 8, 2017

Protect against hypothetical future where javascript stops using surrogate pairs #2560

Merged

github-actions bot added the status:resolved-locked label Apr 7, 2021

github-actions bot locked as resolved and limited conversation to collaborators Apr 7, 2021
