New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handle surrogate pairs in character offsets #2509
Conversation
CodeMirror / javascript use utf16 code unit offsets, but Jupyter protocol expects unicode *character* offsets, so we need to translate back and forth.
Can you increment something in the protocol version so that in IJulia we can disable our workaround if this patch is present? |
Since this is fixing a bug in this repo where it failed to implement the existing spec correctly, I'm not sure that updating the protocol version is appropriate. Unless we want to retroactively change the v5.1 protocol spec to describe the buggy behavior and then define v5.2 to be exactly as v5.1 is now. |
I'm just wondering how to implement support for both the new and old versions of Jupyter. Arguably, this is a de facto change in the protocol, since up to now Jupyter has apparently always been using UTF-16 code units for cursor positions. |
Well technically it's only the notebook frontend that does that. We would
have to check how nteract, hydrogen, qtconsole, jupyterlab and
sagemathcloud behave. If they all behaved inappropriately then it may make
sens. But he notebook is just one of the implementation.
…On May 21, 2017 04:38, "Steven G. Johnson" ***@***.***> wrote:
I'm just wondering how to implement support for both the new and old
versions of Jupyter.
Arguably, this is a de facto change in the protocol, since up to now
Jupyter has apparently always been using UTF-16 code units for cursor
positions.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#2509 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAUez9qkqqigm6sRdcD4s-Wdfx1vvS2lks5r8CIsgaJpZM4NhFxD>
.
|
I opened jupyterlab/jupyterlab#2255 to track this in JupyterLab. |
Hydrogen seems to have the same bug, since it passes |
sagemathcloud seems to have copied the IPython notebook code, so it appears to have the same bug. |
nteract seems to have the same bug. |
qtconsole seems to use whatever Python's indexing is? On Python 3.3 or later, this will be unicode characters, but on previous Python versions it will be UTF-16 code units. Or maybe it's set by Qt? A little hard for me to tell without trying it. |
In short, regarding @Carreau's comment, it seems that almost all existing Jupyter clients in fact implement the buggy UTF-16 behavior. So, it might make sense to retroactively change the spec to reflect this behavior (UTF-16 code units), and then change the spec going forward to use codepoints. (Or, alternatively, change the spec retroactively to use UTF-16 code units and keep it that way going forward. Either way, some kernels and/or clients will have to compensate.) |
I just tried qtconsole on my Mac with Python 2.7 and it seems to have the same bug. Pasting |
Noted, adding an issue there to track. |
I've tested with QtConsole on py27 as well, and it was correct. Python 2 has the option to store unicode as UCS-2 or UCS-4 (reflected in Encoding this bug into released versions of the spec means that all of the correctly-behaving frontends (Python-based ones, such as QtConsole, jupyter-console, etc.) would retroactively get the inverse bug introduced, breaking code that was correct. This is a case where it would have been really nice to include the frontend name and version in message headers, because changing the spec to backdate bugs introducing the same bug into packages that didn't have problems causes problems of its own. |
Given that there is no way to detect the frontend, in IJulia we've opted to just assume that the bug is present. This will be true for the vast majority of our users (of whom qtconsole users are a minuscule minority). My biggest concern is what to do going forward: once you merge this fix, how do we detect it if you don't bump the protocol version? (And bumping the protocol version would also notify kernels and frontends of the effective change in wire protocol.) I agree that enshrining the bug retroactively in the spec is ugly, and would technically break things, but apparently only for Linux or python3-based qtconsole and jupyter-console users who use non-BMP characters in their source code. This is apparently an incredibly small fraction of users since no one has noticed this problem until now. (Unicode identifiers are still pretty uncommon to begin with, especially outside of Julia, and on top of that you only see this problem for "astral plane" chars.) |
@stevengj yeah, I'm trying to figure out the best solution and nothing seems perfect. Bumping the version does make sense, but I'd describe the bug in previous versions as MAY have this bug (i.e. it's ambiguous) with a note about affected frontends, rather than backdating the definition to the buggy implementations, and let kernels make their choice about which frontends they want to interact with incorrectly for ambiguous versions. I'll draft an update to the protocol with a proposal and cc folks here. |
we can still add a key in the metadata field on completion request that says "this is fixed". It does not bump the spec and can include client names. We just need a pseudo-standard until we get the next spec version right ? |
That's another option, to include a 'unicode_fixed' key or similar. I'm not sure a temporary pseudo-standard helps if we are doing a minor bump anyway, since we can push that out pretty quickly. I'm okay with defining 5.2 as "5.1 with this ambiguity resolved" immediately, for instance. |
This definition of 5.2 suits me as well. I was just offering alternative. BTW, isn't the server negotiating the protocol version ? In which case other clients that talk to the kernel via the server may be wrong, in which case bumping the version number is not useful ? |
Thanks! Good to have alternatives to compare.
I was just looking into this, and the protocol version comes from the javascript, so it's currently identifying as 5.0 and bumping the version in the js to 5.2 should do everything right, as far as I can tell. If this weren't the case, I think the extra key would probably be a more appropriate band-aid. The Python frontends, on the other hand, are importing the version from jupyter_client right now, so protocol version info will not correctly signal changes in the application after an upgrade to jupyter_client. But in general, the version should be coming from the individual applications, not the defaults in jupyter_client, so we can and should fix this in jupyter_console and qtconsole. But these frontends also don't have the issue on Python 3, so it's of less immediate importance. The answer to unicode issues on Python 2 has always been to upgrade to Python 3 :). |
CodeMirror / javascript use utf16 code units for indices, but Jupyter protocol expects unicode character offsets, so we need to translate back and forth.
closes jupyter/jupyter_client#259