Error when inputting UTF8 CJK characters #88

Closed
ipython opened this Issue May 10, 2010 · 17 comments

Collaborator

ghost commented May 10, 2010

Original Launchpad bug 264587: https://bugs.launchpad.net/ipython/+bug/264587
Reported by: ngu-kho (ngU khO).

I'm using a UTF-8 locale on my system, and it has worked well for years. But when I try to input CJK characters into the IPython console in gnome-terminal via scim (an input method platform), some of the characters come out wrong in the console.

An example is the CJK character '选' ('\u9009'), which should be encoded as '\xe9\x80\x89' in UTF-8. However, typing this character into IPython always fails. The character is displayed as several spaces followed by two question marks, each inside a diamond (the � character):

In [1]: s = raw_input()
�� (I was actually inputting '选', however this character could be displayed correctly)

And the string that is read is not the one I typed (the first byte of the original string becomes several spaces):

In [3]: s
Out[3]: ' \x80\x89'
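
For reference, the byte-level effect described above can be reproduced in a few lines. This is only a sketch in Python 3 (the report itself used Python 2), and replacing the lead byte with a single space is an assumption that mimics the `' \x80\x89'` result shown, not IPython's actual code path.

```python
# '选' is U+9009; UTF-8 encodes it as three bytes.
char = "\u9009"
encoded = char.encode("utf-8")
print(encoded)  # b'\xe9\x80\x89'

# Assumption for illustration: the lead byte 0xE9 is swallowed and
# replaced by a space, leaving two orphaned continuation bytes.
damaged = b" " + encoded[1:]

# Continuation bytes without a lead byte are invalid UTF-8, so a lenient
# decoder shows each one as U+FFFD (the diamond question mark).
shown = damaged.decode("utf-8", errors="replace")
print(repr(shown))
```

The two replacement characters correspond to the two question marks in diamonds that the terminal displayed.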

Moreover, making an assignment to such characters may cause IPython to exit:

In [1]: s = ' ��'
WARNING:


You or a %run:ed script called sys.stdin.close() or sys.stdout.close()!
Exiting IPython!

Such things do not happen in the original Python console (/usr/bin/python). And it should not be a scim problem, since the same thing happens when I paste the character from the clipboard instead of typing it.

Attached is a screenshot of the problem.

Collaborator

ghost commented May 10, 2010

[ LP comment 1 by: ngU khO, on 2008-09-04 05:18:18.813823+00:00 ]

Collaborator

ghost commented May 10, 2010

[ LP comment 2 by: ngU khO, on 2008-09-04 05:20:23.926034+00:00 ]

Sorry I missed a 'not' in the line after 'In[1]'. The sentence in the brackets should be:
(I was actually inputting '选', however this character could NOT be displayed correctly)

Owner

takluyver commented Mar 23, 2011

Duplicate of #58.

@takluyver takluyver closed this Mar 23, 2011

rspeer commented Apr 13, 2011

Bug #58 seems unrelated, and I can replicate the crash using the current git version of IPython.

~/src/ipython|master$ ./ipython.py
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
Type "copyright", "credits" or "license" for more information.

IPython 0.11.dev -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]: a = u'地震'
WARNING:


You or a %run:ed script called sys.stdin.close() or sys.stdout.close()!
Exiting IPython!

Note that the miscellaneous spaces and diamond question marks no longer appear, but the crash still happens.

Owner

takluyver commented Apr 13, 2011

My previous comment seems to have linked to bug 58 on another project for some reason, but bug #58 on IPython is also about entering characters starting with the byte \xe9. It's something related to readline, and we disabled a line of the configuration, which is why you no longer get the spaces.

I'm afraid I can't replicate the crash. What system are you on? What's your terminal encoding? Can you check what range of characters causes it? If you've customised your .inputrc file, can you remove it so you're using a default readline config (you might need to restart your terminal after doing that)? If it still fails, can you try commenting out lines 237-252 of IPython/core/interactiveshell.py, and see which, if any, are linked to the problem? Thanks.

rspeer commented Apr 18, 2011

I'm running IPython 0.11.dev on Mac OS 10.6. The terminal encoding is UTF-8.

I've installed readline from PyPI so that I don't get the "Leopard libedit" nonsense, but I haven't customized .inputrc in any way.

Unfortunately, I'm SSHed into that Mac half the time, and because other programs also fail at Unicode, I can't actually paste in those literal characters to reproduce the issue until I'm sitting in front of the computer again. But I'll try to let you know what effect commenting out those lines has when I get the chance.

@takluyver takluyver reopened this Apr 18, 2011

Owner

takluyver commented Apr 18, 2011

OK, reopened for now until we know what's going on.

Owner

minrk commented Apr 22, 2011

I don't believe this and #58 are unrelated; I think they are symptoms of the exact same underlying issue, which is indeed fixed in master.

readline bindings map \M-i to \xe9, which causes Unicode characters whose UTF-8 encoding contains that byte to misbehave (3-byte characters become invalid because the byte is treated as the key binding and replaced). Now that we no longer bind that keyseq, the crash does not occur, at least not on my UTF-8 OS X machine (apparently identical to rspeer's), nor on my UTF-8 Linux system.

If anyone does, however, bind \xe9 (which is 'é') to any command or macro, they will likely encounter this error.
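
The collision can be checked with a little arithmetic. A minimal sketch in Python 3, assuming the meta modifier is encoded by setting the high bit of the key's character code; `meta_byte` is a hypothetical helper name, not part of readline or IPython.

```python
def meta_byte(key: str) -> int:
    """Byte value for Meta+key when meta is encoded by setting bit 7."""
    return ord(key) | 0x80

# First byte of the UTF-8 encoding of '选' (U+9009).
lead = "\u9009".encode("utf-8")[0]

print(hex(meta_byte("i")), hex(lead))  # 0xe9 0xe9
print(meta_byte("i") == lead)          # True

# The same byte value is 'é' in Latin-1, which is why \xe9 reads as 'é'.
print("\u00e9".encode("latin-1"))      # b'\xe9'
```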

rspeer commented Apr 23, 2011

I said "#58 is unrelated" because following some links to "#58" was bringing up a different bug. Your link works fine.

Here's what I've found. Even though Mac OS natively supports UTF-8, it by default has LC_ALL=C in its locale settings. The behavior of trying to paste high Unicode text into IPython is quite inconsistent when this is the case -- since experimenting with environment variables, I can't reproduce the previous crash, but now it just ignores the characters.

When LC_ALL="en_US.UTF-8", IPython handles this text correctly.

Owner

minrk commented Apr 23, 2011

Ah, thanks for clarifying. Sorry I misunderstood your comment.

I don't have LC_ALL set anywhere, but if the Terminal config option 'Set locale environment variables on startup' is checked (it is checked by default), then LANG=en_US.UTF-8, and IPython's unicode behavior appears exactly as it should.

Can you describe what you mean by 'inconsistent', and can you compare IPython's behavior to Python's?

If that box is unchecked, or I otherwise override LANG or LC_ALL to 'C' or blank, or I ssh to the machine, then stdin/out get 'ascii' as their encoding, and the behavior doesn't seem inconsistent, though I do get ASCII encode errors if I try to print unicode characters, and I can't type or paste unicode characters at all. But this is exactly right if stdin/out are ASCII, and not different from Python's own behavior, or even bash, if they are started in an environment with LC_ALL=C.

However, I still can't make this bug occur with any combination of terminals and environment variables I've tried yet.
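
A quick way to compare environments like the ones described above is to print the locale variables next to the encodings Python reports. This is a minimal sketch in Python 3; `encoding_report` is a hypothetical helper name, and recent Python 3 versions may choose UTF-8 even under LC_ALL=C, so the values can differ from the Python 2.6 behavior discussed here.

```python
import locale
import os
import sys

def encoding_report() -> dict:
    """Collect the settings that decide how terminal input is decoded."""
    return {
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "LANG": os.environ.get("LANG"),
        # getattr guards against stdin/stdout being replaced or missing.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(False),
    }

for name, value in encoding_report().items():
    print(f"{name:10} {value!r}")
```

Running it once in a local terminal and once over ssh should show whether the two sessions really see the same settings.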

rspeer commented May 9, 2011

What I mean by "inconsistent" is that the same input will produce different results at different times. For example, I cannot reproduce my crash of April 12 in any terminal environment, and I've gone a few weeks without a crash -- but now I get a new one.

It is triggered by the single character \uff1a, the double-width colon; when pasted into IPython, it deletes the last several characters and inserts two question marks. Here is the result of doing the same thing in python and ipython.

$ python
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> a = u'：'
>>> a
u'\uff1a'
>>> exit()

$ ipython
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
Type "copyright", "credits" or "license" for more information.

IPython 0.11.dev -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

In [1]: a ??'
WARNING:


You or a %run:ed script called sys.stdin.close() or sys.stdout.close()!
Exiting IPython!

Owner

takluyver commented May 10, 2011

Same problem, different byte. This time it's the \M-o combination, which is evidently byte \xef, and maps to four backspaces.

Do we just disable that as well? Is there a better way of working out which readline key combinations will interfere with UTF-8 code? Is it only meta- key combinations that cause the problem?
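
One way to approach the second question is to classify the byte a meta combination produces by the role it plays in UTF-8. A rough sketch in Python 3, assuming meta sets the high bit of the key's code; `utf8_role` is a hypothetical helper, and the ranges are those of well-formed UTF-8.

```python
def utf8_role(byte: int) -> str:
    """Describe where a byte value can occur in well-formed UTF-8."""
    if byte < 0x80:
        return "ASCII (single-byte character)"
    if byte <= 0xBF:
        return "continuation byte"
    if 0xC2 <= byte <= 0xDF:
        return "lead byte of a 2-byte sequence"
    if 0xE0 <= byte <= 0xEF:
        return "lead byte of a 3-byte sequence"
    if 0xF0 <= byte <= 0xF4:
        return "lead byte of a 4-byte sequence"
    return "never valid in UTF-8"  # 0xC0, 0xC1, 0xF5-0xFF

for key in "io":
    byte = ord(key) | 0x80  # assumed meta encoding: set the high bit
    print(f"\\M-{key} -> {byte:#04x}: {utf8_role(byte)}")

# The fullwidth colon U+FF1A starts with the \M-o byte.
print("\uff1a".encode("utf-8"))  # b'\xef\xbc\x9a'
```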

rspeer commented May 10, 2011

The meta modifier in readline adds 128 to the character code. This means that it cannot possibly coexist with UTF-8, so it would surprise me if meta-key combinations have ever worked correctly in the last decade. Actually, it conflicts with any 8-bit encoding, so make that two decades.

Disabling them all sounds like a good idea.

Control-key modifiers seem to map to characters 0-31 and 127, which are safe. Are there other sets of modified characters in readline besides control and meta?
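
The claim that control codes are safe can be checked exhaustively, since UTF-8 never reuses bytes below 0x80 inside a multi-byte sequence. A sketch in Python 3; `multibyte_bytes_used` is a hypothetical helper, and the loop visits every code point, so it takes a moment to run.

```python
import sys

def multibyte_bytes_used() -> set:
    """Every byte value that appears in a multi-byte UTF-8 sequence."""
    used = set()
    for cp in range(0x80, sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        used.update(chr(cp).encode("utf-8"))
    return used

used = multibyte_bytes_used()
control = set(range(0x00, 0x20)) | {0x7F}

print(control & used)             # set(): control codes never collide
print(min(used) >= 0x80)          # True: only high-bit bytes are used
print((ord("i") | 0x80) in used)  # True: the \M-i byte does collide
print((ord("o") | 0x80) in used)  # True: so does the \M-o byte
```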

Owner

minrk commented May 10, 2011

Ah, thanks for the clarification. We should definitely remove the meta-bindings.

These are the only \M bindings we have, and they are both indentation:

    '"\M-o": "\d\d\d\d"',
    '"\M-I": "\d\d\d\d"',

Has anybody ever used these? Is it an emacs thing? Should we just remap them to \C- or throw them away?

I'd vote for just removing them and calling it done. Users can make their own unsafe parse_and_bind() calls, but that's not IPython's problem; they can do the same in bash.
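
Removing the bindings amounts to filtering them out of the list before it is handed to readline. A minimal sketch in Python 3, not the actual commit: the binding list here is hypothetical apart from the two "\M-" entries quoted above, and `drop_meta_bindings` is an illustrative name.

```python
import re

# Hypothetical binding list; only the two "\M-" entries come from the thread.
BINDINGS = [
    r'"\C-l": clear-screen',
    r'"\M-o": "\d\d\d\d"',
    r'"\M-I": "\d\d\d\d"',
    r'"\e[A": history-search-backward',
]

# Match a \M- prefix inside the quoted key sequence only, not in the macro.
META_KEYSEQ = re.compile(r'^\s*"[^"]*\\M-')

def drop_meta_bindings(bindings):
    """Return only the bindings whose key sequence has no \\M- prefix."""
    return [b for b in bindings if not META_KEYSEQ.match(b)]

safe = drop_meta_bindings(BINDINGS)
for line in safe:
    print(line)
```

Each remaining line could then be passed to `readline.parse_and_bind`, which is left out here so the sketch runs on platforms without the readline module.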

Contributor

epatters commented May 10, 2011

M-I is apparently an Emacs binding for indentation, although I didn't know about it until I tried it out just now.

M-o is a completely different function in Emacs.

Owner

minrk commented May 10, 2011

Okay, I'll just remove them with a note, closing this Issue. Thanks for the help, @rspeer!

@minrk minrk closed this in b2146ae May 10, 2011

Owner

takluyver commented May 10, 2011

We did have M-i, which was mapped to indentation and which we removed earlier in relation to this bug. Apart from Ctrl-modified codes, we now only have \e[A and \e[B, which I think represent the up and down arrow keys.

mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this issue Nov 3, 2014

Remove "\M-" readline bindings
They conflict with 8-bit encodings

closes gh-88