unicode bug - encoding input #25

ipython · 2010-05-10T09:29:16Z

Original Launchpad bug 339642: https://bugs.launchpad.net/ipython/+bug/339642
Reported by: vsevolod-solovyov (Murkt).

Default Python shell:

>>> u'абвгд'
u'\u0430\u0431\u0432\u0433\u0434'

IPython 0.9.1:

>>> u'абвгд'
u'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4'
>>> 'абвгд'.decode('utf8')
u'\u0430\u0431\u0432\u0433\u0434'

sys.stdin.encoding is 'UTF-8'.
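
The escaped output above is what you would get if the UTF-8 bytes of the input were read back one byte per character. A minimal sketch of that mechanism follows. It is written for Python 3 (an assumption, since the report is about Python 2), it uses Latin-1 only as a stand-in for a byte-to-code-point mapping, and none of the names are IPython code:

```python
# Python 3 sketch; variable names are illustrative, not taken from IPython.
text = "абвгд"

# What a UTF-8 terminal sends for the five Cyrillic letters: ten bytes.
utf8_bytes = text.encode("utf-8")

# Treating every byte as its own code point reproduces the broken literal.
mojibake = utf8_bytes.decode("latin-1")

# Undoing the mistake recovers the intended text, much like the
# 'абвгд'.decode('utf8') workaround shown in the report.
recovered = mojibake.encode("latin-1").decode("utf-8")

print(ascii(text))      # '\u0430\u0431\u0432\u0433\u0434'
print(ascii(mojibake))  # '\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4'
```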

How to fix: remove line 2022 from IPython/iplib.py (in the 0.9.1 release). Here it is:

--- a/iplib.py
+++ b/iplib.py
@@ -2019,7 +2019,6 @@
 # this allows execution of indented pasted code. It is tempting
 # to add '\n' at the end of source to run commands like ' a=1'
 # directly, but this fails for more complicated scenarios
-source=source.encode(self.stdin_encoding)
 if source[:1] in [' ', '\t']:
     source = 'if 1:\n%s' % source

A quick check did not turn up any bugs introduced by this change.

Additionally, I checked ipython-wx and ipythonx; the latter doesn't have this bug.

ipython · 2010-05-10T09:29:18Z

[ LP comment 1 by: Murkt, on 2009-03-08 20:07:25.828411+00:00 ]

This line in trunk: http://bazaar.launchpad.net/~ipython-dev/ipython/trunk/annotate/head%3A/IPython//iplib.py#L2031

ipython · 2010-05-10T09:29:19Z

[ LP comment 2 by: Murkt, on 2009-03-08 20:54:18.976119+00:00 ]

This bug was already noticed in 2006: http://lists.ipython.scipy.org/pipermail/ipython-dev/2006-August/002305.html

ipython · 2010-05-10T09:29:20Z

[ LP comment 3 by: Sergey Kishchenko, on 2009-03-09 11:57:58.714729+00:00 ]

I confirm this bug. The attached patch fixed the issue for me.

ipython · 2010-05-10T09:29:21Z

[ LP comment 4 by: Laurent Dufrechou, on 2009-03-17 20:20:24.848465+00:00 ]

Also related:
https://bugs.launchpad.net/bugs/290677

ipython · 2010-05-10T09:29:22Z

[ LP comment 5 by: Fernando Perez, on 2009-03-17 20:24:00.727454+00:00 ]

That's indeed a bug, but the patch is removing a line that was put in there explicitly for some reason. So what I'd like to have, before committing this, is a set of tests in a file named test_unicode.py, that encapsulates all of the recent unicode work.

Unfortunately a lot of these unicode fixes have been made in a completely ad-hoc manner, as people report problems, but we don't have a centralized list of cases to check against. This may be a reasonable fix, for all I know, but I'm afraid that if we apply it we'll get back 10 old bugs again. I don't know, maybe not, but there's simply no way to be sure.

I'm one of the most ignorant of our bunch in unicode issues, blissfully living in the stupidity of the ascii world. It would be great if one of us who knows more about this stuff could at least write a set of simple unicode tests that catch many of the recently reported encoding problems. Jorgen, Ville, any chance you guys could take this up at some point? You know about it a lot more than I do...

ipython · 2010-05-10T09:29:23Z

[ LP comment 6 by: Jörgen Stenarson, on 2009-03-17 20:28:38.963967+00:00 ]

The proposed patch does not work for me on win32, with or without pyreadline.

sys.stdin.encoding == "cp1252"

Standard python:

c:\python>python
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

>>> "åäö"
'\xe5\xe4\xf6'
>>> u"åäö"
u'\xe5\xe4\xf6'

IPython from trunk:

c:\python>ipython
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 0.9.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.

In [1]: "åäö"
Out[1]: '\xe5\xe4\xf6'

In [2]: u"åäö"
Out[2]: u'\xe5\xe4\xf6'

In [3]:
Do you really want to exit ([y]/n)?

IPython with proposed change:

c:\python>ipython
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 0.9.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.

In [1]: "åäö"
Out[1]: '\xc3\xa5\xc3\xa4\xc3\xb6'

In [2]: u"åäö"
Out[2]: u'\xe5\xe4\xf6'

In [3]:
Do you really want to exit ([y]/n)?
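
The two IPython transcripts differ only in which bytes end up inside the plain (non-unicode) string literal. A small Python 3 sketch of the two byte sequences involved follows. It only compares encodings and does not model IPython's input path:

```python
# Python 3 sketch: the same three characters under the two encodings in play.
text = "åäö"

# One byte per character, as printed by stock Python and unpatched IPython
# on this cp1252 console.
cp1252_bytes = text.encode("cp1252")

# Two bytes per character, as printed by IPython with the proposed change.
utf8_bytes = text.encode("utf-8")

print(cp1252_bytes)  # b'\xe5\xe4\xf6'
print(utf8_bytes)    # b'\xc3\xa5\xc3\xa4\xc3\xb6'
```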

ipython · 2010-05-10T09:29:23Z

[ LP comment 7 by: Rodrigo Senra, on 2009-03-24 03:46:20.603219+00:00 ]

This bug is still alive and kicking.
The problem is in iplib.py: source=source.encode(self.stdin_encoding)

This is wrong whenever there is a unicode string in source.

A simple:

x = u"ação"

with the offending line becomes:

x = u'a\xc3\xa7\xc3\xa3o'

Notice that the encoding is done in place, and the u"" prefix is kept after the encoding. This is wrong.
I have removed the line, and it is now working for me. I do not know enough about IPython internals to predict side effects. I hope this helps.
regards,
Rod Senra

ipython · 2010-05-10T09:29:24Z

[ LP comment 8 by: INADA Naoki, on 2009-04-12 00:25:48.950763+00:00 ]

This is another patch, which handles encoded byte-string literals and unicode literals correctly.

    source=source.encode(self.stdin_encoding)
    if source[:1] in [u' ', u'\t']:
        source = u'if 1:\n%s' % source

   source = '# coding: %s\n%s' % (self.stdin_encoding, source)
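
The last line of that patch prepends an encoding declaration (PEP 263) so that the compiler knows how to decode the bytes that follow. Below is a hedged Python 3 sketch of the idea. `stdin_encoding` is a local stand-in for `self.stdin_encoding`, and nothing here is IPython code:

```python
# Python 3 sketch of compiling byte source with an encoding declaration.
stdin_encoding = "cp1252"  # assumed console encoding, as in the win32 report

# The source as raw bytes in the console encoding.
source = "x = 'åäö'".encode(stdin_encoding)

# Prepend the declaration so compile() decodes the bytes with that codec.
with_cookie = ("# coding: %s\n" % stdin_encoding).encode("ascii") + source

namespace = {}
exec(compile(with_cookie, "<input>", "exec"), namespace)
print(ascii(namespace["x"]))  # '\xe5\xe4\xf6'
```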

ipython · 2010-05-10T09:29:25Z

[ LP comment 9 by: Fernando Perez, on 2009-04-14 07:20:50+00:00 ]

Can anyone provide a set of tests that we can actually run
automatically for this? Honestly, until we have actual tests, this is
like playing whack-a-mole blind: the problems will just keep
resurfacing... What we need is a test file for unicode that can be
run reliably, by anyone, and that shows the various issues...

As I said earlier, it's quite possible that the various proposed fixes
work for someone, but without actual tests that we can include,
there's no way to know what they may break for someone else (as has
happened in the past).

Sorry to seem like a curmudgeon: I really appreciate people
contributing ideas and even code. But we need to fix these unicode
problems the right way, else we'll be hunting them forever.

ipython · 2010-05-10T09:29:26Z

[ LP comment 10 by: Brian Granger, on 2009-04-14 17:38:38+00:00 ]

Definitely, I don't like playing whack-a-mole blind. These types of
bug fixes definitely need tests before fixes get committed.

Brian

Brian E. Granger, Ph.D.
Assistant Professor of Physics
Cal Poly State University, San Luis Obispo
bgranger@calpoly.edu
ellisonbg@gmail.com

ipython · 2010-05-10T09:29:27Z

[ LP comment 11 by: Jörgen Stenarson, on 2009-04-14 18:16:27+00:00 ]

I agree, but part of the problem is getting the correct visual output in
the shell, and that may be difficult to check automatically. I also have
a feeling that this problem is platform dependent, so the tests would
need to run on several platforms to confirm that the bug has been fixed.

/Jörgen

ipython · 2010-05-10T09:29:28Z

[ LP comment 12 by: Fernando Perez, on 2009-04-14 21:58:00+00:00 ]

Well, even if we have a special file we need to re-run by hand, that
would be better than little snippets as we have. At least the file
can be run by the test suite automatically and not crashing is a good
start. Core developers can then re-run it by hand (we can put an "if
__name__" main section at the bottom for this) to check visually.
This is basically what we are doing now with snippets all over the
mailing list, I'm just suggesting that unless all those checks are:

- collected in one file
- auto-executed

we'll never get anywhere reliable on these unicode problems. We can
then have a note to manually do

%run test_unicode

ourselves for the full visual verification.
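
A file along those lines might look like the sketch below. It is written for Python 3, and it calls compile() directly rather than going through IPython's input handling, which is an assumption that a real test_unicode.py would need to replace. The helper names are hypothetical:

```python
import unittest

# Sample inputs taken from the reports in this thread.
SAMPLES = ["абвгд", "åäö", "ação", "pequeño"]


def eval_literal(text):
    """Compile and evaluate a quoted unicode literal containing `text`.

    Assumption: a real test would push the source through IPython's own
    input path instead of calling compile() directly.
    """
    source = "u'%s'" % text
    return eval(compile(source, "<test_unicode>", "eval"))


class UnicodeLiteralTest(unittest.TestCase):
    """Automatic part: picked up and run by the test suite."""

    def test_literals_round_trip(self):
        for text in SAMPLES:
            self.assertEqual(eval_literal(text), text)


def visual_report():
    """Return one escaped line per sample for checking by eye."""
    return ["%s -> %s" % (ascii(text), ascii(eval_literal(text)))
            for text in SAMPLES]


if __name__ == "__main__":
    # Manual part, e.g. via `%run test_unicode`: print the escaped results.
    for line in visual_report():
        print(line)
```

The test class covers the "not crashing" baseline, while the `__main__` section is only there for the by-hand visual pass.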

Cheers,

f

ipython · 2010-05-10T09:29:29Z

[ LP comment 13 by: gdamjan, on 2009-05-02 01:04:38.272380+00:00 ]

I can confirm this bug and the solution given.

Now, obviously the bug is in IPython's input handling. How do you write test cases for that?

ipython · 2010-05-10T09:29:29Z

[ LP comment 14 by: Andy Mikhailenko, on 2009-05-14 20:52:29.106176+00:00 ]

Confirming. The encoding is "UTF-8" in all cases; IPython prints garbled "unicode" strings, and this renders the program almost unusable.

Anyone got ideas about how to test this? I guess IPython developers possess a bit more knowledge of the immense innards of the package than reporters of the bug do, so users could expect at least some guidelines for writing tests, could they?

Maybe we should allow the bug-related behaviour to be tuned in user settings until the bug is finally fixed? This may also help with testing.

ipython · 2010-05-10T09:29:30Z

[ LP comment 15 by: pawciobiel, on 2009-09-04 00:24:55.923328+00:00 ]

Confirming.

core/iplib.py
2201
--- source=source.encode(self.stdin_encoding)

Apart from the above, shouldn't the input be decoded if it's not unicode?
(A similar issue existed in python2.5/code.py.)
core/iplib.py
2332,2334d2330
< line = raw_input_original(prompt)
< if not isinstance(line, unicode):
<     line = line.decode(self.stdin_encoding)
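
The three lines in that diff decode whatever the raw input call returned when it is not already unicode. The same guard is sketched below as a standalone Python 3 function. Here `bytes` plays the role of the Python 2 `str`, and `to_text` is a hypothetical helper name:

```python
def to_text(line, encoding):
    """Return `line` as text, decoding it first if it arrived as bytes."""
    if isinstance(line, bytes):
        return line.decode(encoding)
    return line


# Bytes from a cp1252 console are decoded; text passes through unchanged.
decoded = to_text(b"\xe5\xe4\xf6", "cp1252")
unchanged = to_text("åäö", "cp1252")
```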

cheers,

ipython · 2010-05-10T09:29:31Z

[ LP comment 16 by: INADA Naoki, on 2009-10-08 03:52:22.535452+00:00 ]

I managed to fix this bug on the Python side: http://bugs.python.org/issue5911
But even if the Python issue is fixed in Python 2.7, the problem will remain in Python 2.6 and earlier.

ipython · 2010-05-10T09:29:32Z

[ LP comment 17 by: t0ster, on 2010-04-26 16:45:21.798105+00:00 ]

The patches worked for me; removing 'source=source.encode(self.stdin_encoding)' helped on Mac OS X 10.6.

Thanks

fperez · 2010-09-30T01:14:13Z

On Launchpad, Thorsten Glaser wrote:

I didn’t use the patch from LP: #290677 due to
http://bugs.python.org/issue5911
but wrote a workaround.

It may or may not touch all places needed and not break anything unrelated,
I searched for a good place to do so actually, but can’t guarantee anything.
Feedback extremely welcome.

It at least fixes the two Trac things for me.

His patch is available here:
http://launchpadlibrarian.net/56767748/patch-IPython_iplib_py

It's unfortunately too late in the release cycle for 0.10.1 to properly test this, but if more testing shows it to be stable, we can push a 0.10.2 with this as a fix.

fperez · 2010-10-29T23:33:49Z

Brian saw this on Python 2.6, Mac OS X 10.5:

======================================================================
ERROR: test_unicode (IPython.core.tests.test_inputsplitter.InputSplitterTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/bgranger/Documents/Computation/IPython/code/ipython/IPython/core/tests/test_inputsplitter.py", line 353, in test_unicode
    self.isp.push("u'\xc3\xa9'")
  File "/Users/bgranger/Documents/Computation/IPython/code/ipython/IPython/core/inputsplitter.py", line 374, in push
    self._store(lines)
  File "/Users/bgranger/Documents/Computation/IPython/code/ipython/IPython/core/inputsplitter.py", line 607, in _store
    setattr(self, store, self._set_source(buffer))
  File "/Users/bgranger/Documents/Computation/IPython/code/ipython/IPython/core/inputsplitter.py", line 610, in _set_source
    return ''.join(buffer).encode(self.encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

----------------------------------------------------------------------
Ran 270 tests in 1.974s

I don't see it on linux, but let's make sure it's fixed across all platforms after reworking the unicode machinery.

fperez · 2010-10-31T03:18:29Z

Robert Kern notes on the list:

The code is just wrong (at least on Python 2) since it calls .encode() on a byte string, not a unicode string. You've never decoded it.
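
In Python 2, calling `.encode()` on a byte string first decodes it implicitly with the default ASCII codec, which is where the `UnicodeDecodeError` in the traceback comes from. A rough emulation in Python 3 follows. The Python 3 choice is an assumption made so that the sketch runs on a current interpreter, and `py2_str_encode` is a made-up helper:

```python
def py2_str_encode(raw, encoding):
    """Mimic Python 2 str.encode(): implicit ASCII decode, then encode."""
    return raw.decode("ascii").encode(encoding)


# The bytes a UTF-8 terminal would send for the test input u'é'.
raw = "u'é'".encode("utf-8")

try:
    py2_str_encode(raw, "utf-8")
except UnicodeDecodeError as exc:
    failure = exc  # byte 0xc3 at position 2, as in the traceback
else:
    failure = None

# Decoding explicitly first gives .encode() a real unicode string to work on.
fixed = raw.decode("utf-8").encode("utf-8")
```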

takluyver · 2011-03-25T02:01:50Z

These unicode issues should now be fixed. Please reopen if you can still replicate them.

jrus · 2011-05-17T01:31:46Z

I still get this issue, Python 2.7, IPython 0.10.2, Mac OS X 10.6.7, python readline 6.1.0

$ ipython
Python 2.7.1 (r271:86882M, Nov 30 2010, 10:35:34)
IPython 0.10.2 -- An enhanced Interactive Python.

In [1]: 'pequeño'
Out[1]: 'peque\xc3\xb1o'

In [2]: u'pequeño'
Out[2]: u'peque\xc3\xb1o'

In [3]: print 'pequeño'
pequeño

In [4]: print u'pequeño'
pequeÃ±o

vs Python:

$ python
Python 2.7.1 (r271:86882M, Nov 30 2010, 10:35:34) 

>>> 'pequeño'
'peque\xc3\xb1o'

>>> u'pequeño'
u'peque\xf1o'

>>> print 'pequeño'
pequeño

>>> print u'pequeño'
pequeño

minrk · 2011-05-17T05:44:20Z

The bug has been fixed in master (soon to be 0.11). It will not be fixed in 0.10.

takluyver closed this as completed Mar 25, 2011
