Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle surrogate pairs in character offsets #2509

Merged
merged 2 commits into from May 23, 2017
Merged

Conversation

minrk
Copy link
Member

@minrk minrk commented May 19, 2017

CodeMirror / javascript use utf16 code units for indices, but Jupyter protocol expects unicode character offsets, so we need to translate back and forth.

closes jupyter/jupyter_client#259

minrk added 2 commits May 19, 2017 15:46
CodeMirror / javascript use utf16 code unit offsets,
but Jupyter protocol expects unicode *character* offsets,
so we need to translate back and forth.
@stevengj
Copy link

stevengj commented May 20, 2017

Can you increment something in the protocol version so that in IJulia we can disable our workaround if this patch is present?

@minrk
Copy link
Member Author

minrk commented May 20, 2017

Since this is fixing a bug in this repo where it failed to implement the existing spec correctly, I'm not sure that updating the protocol version is appropriate. Unless we want to retroactively change the v5.1 protocol spec to describe the buggy behavior and then define v5.2 to be exactly as v5.1 is now.

@stevengj
Copy link

I'm just wondering how to implement support for both the new and old versions of Jupyter.

Arguably, this is a de facto change in the protocol, since up to now Jupyter has apparently always been using UTF-16 code units for cursor positions.

@Carreau
Copy link
Member

Carreau commented May 21, 2017 via email

@blink1073
Copy link
Member

I opened jupyterlab/jupyterlab#2255 to track this in JupyterLab.

@stevengj
Copy link

stevengj commented May 22, 2017

Hydrogen seems to have the same bug, since it passes code.length as the cursor_pos: my understanding is that this is the length in UTF-16 code units, not Unicode characters (& CoffeeScript's string indexing is identical to JavaScript's).

@stevengj
Copy link

sagemathcloud seems to have copied the IPython notebook code, so it appears to have the same bug.

@stevengj
Copy link

nteract seems to have the same bug.

@stevengj
Copy link

stevengj commented May 22, 2017

qtconsole seems to use whatever Python's indexing is? On Python 3.3 or later, this will be unicode characters, but on previous Python versions it will be UTF-16 code units. Or maybe it's set by Qt? A little hard for me to tell without trying it.

@stevengj
Copy link

stevengj commented May 22, 2017

In short, regarding @Carreau's comment, it seems that almost all existing Jupyter clients in fact implement the buggy UTF-16 behavior. So, it might make sense to retroactively change the spec to reflect this behavior (UTF-16 code units), and then change the spec going forward to use codepoints.

(Or, alternatively, change the spec retroactively to use UTF-16 code units and keep it that way going forward. Either way, some kernels and/or clients will have to compensate.)

@stevengj
Copy link

stevengj commented May 22, 2017

I just tried qtconsole on my Mac with Python 2.7 and it seems to have the same bug. Pasting 𝐚𝐚 followed by tab results in a complete_request message with code="𝐚𝐚" and cursor_pos=4. Dunno about Python 3.

@rgbkrk
Copy link
Member

rgbkrk commented May 22, 2017

nteract seems to have the same bug.

Noted, adding an issue there to track.

@minrk
Copy link
Member Author

minrk commented May 22, 2017

I've tested with QtConsole on py27 as well, and it was correct. Python 2 has the option to store unicode as UCS-2 or UCS-4 (reflected in sys.maxunicode), chosen at build time (one of the many things fixed in Python 3). On Linux, I believe UCS-4 is default on 64b builds (~all users these days) and UCS-2 on 32b. It's UCS-2 by default on most darwin builds, for some reason. If it uses UCS-4, offsets will be correct, if it's UCS-2 they will have the same issue as js. So for the most part, Python frontends behave correctly, but javascript frontends have this bug.

Encoding this bug into released versions of the spec means that all of the correctly-behaving frontends (Python-based ones, such as QtConsole, jupyter-console, etc.) would retroactively get the inverse bug introduced, breaking code that was correct.

This is a case where it would have been really nice to include the frontend name and version in message headers, because changing the spec to backdate bugs introducing the same bug into packages that didn't have problems causes problems of its own.

@stevengj
Copy link

stevengj commented May 23, 2017

Given that there is no way to detect the frontend, in IJulia we've opted to just assume that the bug is present. This will be true for the vast majority of our users (of whom qtconsole users are a minuscule minority).

My biggest concern is what to do going forward: once you merge this fix, how do we detect it if you don't bump the protocol version? (And bumping the protocol version would also notify kernels and frontends of the effective change in wire protocol.)

I agree that enshrining the bug retroactively in the spec is ugly, and would technically break things, but apparently only for Linux or python3-based qtconsole and jupyter-console users who use non-BMP characters in their source code. This is apparently an incredibly small fraction of users since no one has noticed this problem until now. (Unicode identifiers are still pretty uncommon to begin with, especially outside of Julia, and on top of that you only see this problem for "astral plane" chars.)

@minrk
Copy link
Member Author

minrk commented May 23, 2017

@stevengj yeah, I'm trying to figure out the best solution and nothing seems perfect. Bumping the version does make sense, but I'd describe the bug in previous versions as MAY have this bug (i.e. it's ambiguous) with a note about affected frontends, rather than backdating the definition to the buggy implementations, and let kernels make their choice about which frontends they want to interact with incorrectly for ambiguous versions.

I'll draft an update to the protocol with a proposal and cc folks here.

@Carreau
Copy link
Member

Carreau commented May 23, 2017

we can still add a key in the metadata field on completion request that says "this is fixed". It does not bump the spec and can include client names. We just need a pseudo-standard until we get the next spec version right ?

@minrk
Copy link
Member Author

minrk commented May 23, 2017

That's another option, to include a 'unicode_fixed' key or similar. I'm not sure a temporary pseudo-standard helps if we are doing a minor bump anyway, since we can push that out pretty quickly. I'm okay with defining 5.2 as "5.1 with this ambiguity resolved" immediately, for instance.

@Carreau
Copy link
Member

Carreau commented May 23, 2017

This definition of 5.2 suits me as well. I was just offering alternative.

BTW, isn't the server negotiating the protocol version ? In which case other clients that talk to the kernel via the server may be wrong, in which case bumping the version number is not useful ?

@minrk minrk deleted the surrogates branch May 23, 2017 17:24
@minrk
Copy link
Member Author

minrk commented May 23, 2017

Thanks! Good to have alternatives to compare.

isn't the server negotiating the protocol version ? In which case other clients that talk to the kernel via the server may be wrong, in which case bumping the version number is not useful ?

I was just looking into this, and the protocol version comes from the javascript, so it's currently identifying as 5.0 and bumping the version in the js to 5.2 should do everything right, as far as I can tell. If this weren't the case, I think the extra key would probably be a more appropriate band-aid.

The Python frontends, on the other hand, are importing the version from jupyter_client right now, so protocol version info will not correctly signal changes in the application after an upgrade to jupyter_client. But in general, the version should be coming from the individual applications, not the defaults in jupyter_client, so we can and should fix this in jupyter_console and qtconsole. But these frontends also don't have the issue on Python 3, so it's of less immediate importance. The answer to unicode issues on Python 2 has always been to upgrade to Python 3 :).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

incorrect cursor_pos for completion requests with non-BMP Unicode characters
6 participants