unicode bug - encoding input #25

Closed
ipython opened this Issue May 10, 2010 · 23 comments

@ghost

ghost commented May 10, 2010

Original Launchpad bug 339642: https://bugs.launchpad.net/ipython/+bug/339642
Reported by: vsevolod-solovyov (Murkt).

Default Python shell:

>>> u'абвгд'
u'\u0430\u0431\u0432\u0433\u0434'

IPython 0.9.1:

u'абвгд'
u'\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4'
'абвгд'.decode('utf8')
u'\u0430\u0431\u0432\u0433\u0434'

sys.stdin.encoding is 'UTF-8'.

How to fix: remove line 2022 from IPython/iplib.py (in the 0.9.1 release). Here is the patch:

--- a/iplib.py
+++ b/iplib.py
@@ -2019,7 +2019,6 @@
     # this allows execution of indented pasted code. It is tempting
     # to add '\n' at the end of source to run commands like ' a=1'
     # directly, but this fails for more complicated scenarios
-    source=source.encode(self.stdin_encoding)
     if source[:1] in [' ', '\t']:
         source = 'if 1:\n%s' % source
    

A quick check didn't reveal any bugs introduced by this change.

Additionally, I checked ipython-wx and ipythonx; the latter doesn't have this bug.
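
The symptom in this report can be reproduced outside IPython. What follows is a minimal sketch in Python 3 (the report itself concerns Python 2), assuming the mangling amounts to the UTF-8 bytes of the input being read back one byte per character. The helper names `mangle` and `unmangle` are illustrative only and are not part of IPython.

```python
def mangle(text, stdin_encoding="utf-8"):
    """Encode text, then misread every byte as one code point (Latin-1)."""
    return text.encode(stdin_encoding).decode("latin-1")


def unmangle(garbled, stdin_encoding="utf-8"):
    """Undo mangle(); mirrors the 'абвгд'.decode('utf8') workaround in the report."""
    return garbled.encode("latin-1").decode(stdin_encoding)


original = "абвгд"
garbled = mangle(original)

print(ascii(garbled))                 # '\xd0\xb0\xd0\xb1\xd0\xb2\xd0\xb3\xd0\xb4'
print(unmangle(garbled) == original)  # True
```

The escaped output matches what IPython 0.9.1 shows for the unicode literal, which is consistent with the `encode` call being applied to the source before the literal is parsed.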

@ghost

ghost commented May 10, 2010

[ LP comment 1 by: Murkt, on 2009-03-08 20:07:25.828411+00:00 ]

This line in trunk: http://bazaar.launchpad.net/~ipython-dev/ipython/trunk/annotate/head%3A/IPython//iplib.py#L2031

@ghost

ghost commented May 10, 2010

[ LP comment 2 by: Murkt, on 2009-03-08 20:54:18.976119+00:00 ]

This bug was first noticed in 2006: http://lists.ipython.scipy.org/pipermail/ipython-dev/2006-August/002305.html

@ghost

ghost commented May 10, 2010

[ LP comment 3 by: Sergey Kishchenko, on 2009-03-09 11:57:58.714729+00:00 ]

I can confirm this bug. The attached patch fixed the issue for me.

@ghost

ghost commented May 10, 2010

[ LP comment 4 by: Laurent Dufrechou, on 2009-03-17 20:20:24.848465+00:00 ]

Also related:
https://bugs.launchpad.net/bugs/290677

@ghost

ghost commented May 10, 2010

[ LP comment 5 by: Fernando Perez, on 2009-03-17 20:24:00.727454+00:00 ]

That's indeed a bug, but the patch is removing a line that was put in there explicitly for some reason. So what I'd like to have, before committing this, is a set of tests in a file named test_unicode.py, that encapsulates all of the recent unicode work.

Unfortunately a lot of these unicode fixes have been made in a completely ad-hoc manner, as people report problems, but we don't have a centralized list of cases to check against. This may be a reasonable fix, for all I know, but I'm afraid that if we apply it we'll get back 10 old bugs again. I don't know, maybe not, but there's simply no way to be sure.

I'm one of the most ignorant of our bunch in unicode issues, blissfully living in the stupidity of the ascii world. It would be great if one of us who knows more about this stuff could at least write a set of simple unicode tests that catch many of the recently reported encoding problems. Jorgen, Ville, any chance you guys could take this up at some point? You know about it a lot more than I do...

@ghost

ghost commented May 10, 2010

[ LP comment 6 by: Jörgen Stenarson, on 2009-03-17 20:28:38.963967+00:00 ]

The proposed patch does not work for me on win32 with or without pyreadline

sys.stdin.encoding == "cp1252"

Standard python:

c:\python>python
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

>>> "åäö"
'\xe5\xe4\xf6'
>>> u"åäö"
u'\xe5\xe4\xf6'

IPython from trunk:

c:\python>ipython
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 0.9.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.

In [1]: "åäö"
Out[1]: '\xe5\xe4\xf6'

In [2]: u"åäö"
Out[2]: u'\xe5\xe4\xf6'

In [3]:
Do you really want to exit ([y]/n)?

IPython with proposed change:

c:\python>ipython
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

IPython 0.9.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object'. ?object also works, ?? prints more.

In [1]: "åäö"
Out[1]: '\xc3\xa5\xc3\xa4\xc3\xb6'

In [2]: u"åäö"
Out[2]: u'\xe5\xe4\xf6'

In [3]:
Do you really want to exit ([y]/n)?
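
The difference between the two IPython sessions comes down to which encoding the bytes of a plain string literal end up in. Here is a small Python 3 sketch, assuming (as the outputs above suggest) that with the `encode` line the literal carries cp1252 bytes, and that without it the literal carries UTF-8 bytes. The variable names are illustrative.

```python
text = "åäö"

# Literal bytes when the source is first encoded to the console encoding.
with_encode_line = text.encode("cp1252")
# Literal bytes observed after the encode line was removed.
without_encode_line = text.encode("utf-8")

print(with_encode_line)     # b'\xe5\xe4\xf6'
print(without_encode_line)  # b'\xc3\xa5\xc3\xa4\xc3\xb6'

# For these characters the cp1252 byte values equal the Unicode code points.
print([ord(ch) for ch in text] == list(with_encode_line))  # True
```

The last line may explain why the original bug is not visible on this console: when each cp1252 byte happens to equal the character's code point, a unicode literal built from the raw bytes still looks correct.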

@ghost

ghost commented May 10, 2010

[ LP comment 7 by: Rodrigo Senra, on 2009-03-24 03:46:20.603219+00:00 ]

This bug is still alive and kicking.
The problem is in iplib.py: source=source.encode(self.stdin_encoding)

This is wrong whenever there is a unicode string in source.

A simple:

x = u"ação"

with the offending line becomes:

x = u'a\xc3\xa7\xc3\xa3o'

Notice that the encoding is done in place, and the u"" prefix is kept after the encoding. This is wrong.
I have removed the line, and it is now working for me. I do not know enough about IPython internals to predict side effects. I hope this helps.
regards,
Rod Senra

@ghost

ghost commented May 10, 2010

[ LP comment 8 by: INADA Naoki, on 2009-04-12 00:25:48.950763+00:00 ]

Here is another patch that handles encoded byte string literals and unicode literals correctly.

     source=source.encode(self.stdin_encoding)
     if source[:1] in [u' ', u'\t']:
         source = u'if 1:\n%s' % source
+    source = '# coding: %s\n%s' % (self.stdin_encoding, source)
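
The idea behind this patch is that an encoding declaration (PEP 263) tells the compiler how to decode byte source. Here is a hedged sketch of that mechanism in Python 3, where `compile()` also accepts bytes and honours the declaration. The function `run_with_coding_cookie` is a hypothetical stand-in, not IPython code, and the patch itself targets Python 2.

```python
def run_with_coding_cookie(source_text, stdin_encoding):
    """Encode the source, prepend an encoding declaration, then execute it."""
    encoded = source_text.encode(stdin_encoding)
    cookie = ("# coding: %s\n" % stdin_encoding).encode("ascii")
    namespace = {}
    # compile() reads the declaration on the first line to decode the bytes.
    exec(compile(cookie + encoded, "<input>", "exec"), namespace)
    return namespace


ns = run_with_coding_cookie("x = 'åäö'\n", "cp1252")
print(ns["x"] == "åäö")  # True
```

One side effect worth keeping in mind: prepending a line shifts every line number reported for the user's input by one.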
    
@ghost

ghost commented May 10, 2010

[ LP comment 9 by: Fernando Perez, on 2009-04-14 07:20:50+00:00 ]

Can anyone provide a set of tests that we can actually run
automatically for this? Honestly, until we have actual tests, this is
like playing whack-a-mole blind: the problems will just keep
resurfacing... What we need is a test file for unicode that can be
run reliably, by anyone, and that shows the various issues...

As I said earlier, it's quite possible that the various proposed fixes
work for someone, but without actual tests that we can include,
there's no way to know what they may break for someone else (as has
happened in the past).

Sorry to seem like a curmudgeon: I really appreciate people
contributing ideas and even code. But we need to fix these unicode
problems the right way, else we'll be hunting them forever.

@ghost

ghost commented May 10, 2010

[ LP comment 10 by: Brian Granger, on 2009-04-14 17:38:38+00:00 ]

Definitely, I don't like playing whack-a-mole blind. These types of
bug fixes definitely need tests before fixes get committed.

Brian

Brian E. Granger, Ph.D.
Assistant Professor of Physics
Cal Poly State University, San Luis Obispo
bgranger@calpoly.edu
ellisonbg@gmail.com

@ghost

ghost commented May 10, 2010

[ LP comment 11 by: Jörgen Stenarson, on 2009-04-14 18:16:27+00:00 ]

I agree, but part of the problem here is getting the correct visual
output in the shell, and this may be difficult to check automatically.
I have a feeling that this problem is also platform dependent, making
it necessary to run the tests on several platforms to see that the bug
has been fixed.

/Jörgen

@ghost

ghost commented May 10, 2010

[ LP comment 12 by: Fernando Perez, on 2009-04-14 21:58:00+00:00 ]

Well, even if we have a special file we need to re-run by hand, that
would be better than little snippets as we have. At least the file
can be run by the test suite automatically and not crashing is a good
start. Core developers can then re-run it by hand (we can put an "if
__name__" main section at the bottom for this) to check visually.
This is basically what we are doing now with snippets all over the
mailing list, I'm just suggesting that unless all those checks are:

  • collected in one file
  • auto-executed

we'll never get anywhere reliable on these unicode problems. We can
then have a note to manually do

%run test_unicode

ourselves for the full visual verification.
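
A rough sketch of what such a file could look like follows. It is written in Python 3 for illustration; the case list is taken from strings mentioned in this thread, and `evaluate_literal` is a hypothetical stand-in for the shell's input path. A real `test_unicode.py` would have to drive IPython's own input machinery instead.

```python
import unittest

# (text, stdin encoding) pairs collected from the reports in this thread.
CASES = [
    ("абвгд", "utf-8"),
    ("åäö", "cp1252"),
    ("ação", "utf-8"),
    ("pequeño", "utf-8"),
]


def evaluate_literal(text, stdin_encoding):
    """Hypothetical input path: build source, encode it, compile, execute."""
    source = "# coding: %s\nvalue = '%s'\n" % (stdin_encoding, text)
    namespace = {}
    exec(compile(source.encode(stdin_encoding), "<test>", "exec"), namespace)
    return namespace["value"]


class UnicodeInputTest(unittest.TestCase):
    def test_literals_round_trip(self):
        # Auto-executed part: every literal must survive the round trip.
        for text, encoding in CASES:
            with self.subTest(text=ascii(text), encoding=encoding):
                self.assertEqual(evaluate_literal(text, encoding), text)


def visual_check():
    """Print each case in escaped form for manual inspection."""
    for text, encoding in CASES:
        print(encoding, ascii(text), ascii(evaluate_literal(text, encoding)))


if __name__ == "__main__":
    visual_check()
```

Keeping the cases in one list means each newly reported string becomes a one-line addition, and the same list feeds both the automated run and the manual check.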

Cheers,

f

@ghost

ghost commented May 10, 2010

[ LP comment 13 by: gdamjan, on 2009-05-02 01:04:38.272380+00:00 ]

I can confirm this bug and the solution given.

Now, obviously the bug is in the input handling of IPython. How do you make test cases for that?

@ghost

ghost commented May 10, 2010

[ LP comment 14 by: Andy Mikhailenko, on 2009-05-14 20:52:29.106176+00:00 ]

Confirming. "UTF-8" in all cases, IPython prints screwed up "unicode" strings and this renders the program almost unusable.

Anyone got ideas about how to test this? I guess IPython developers possess a bit more knowledge of the immense innards of the package than reporters of the bug do, so users could expect at least some guidelines for writing tests, could they?

Maybe we should allow the bug-related behaviour to be tuned in user settings until the bug is finally fixed? This may also help with testing.

@ghost

ghost commented May 10, 2010

[ LP comment 15 by: pawciobiel, on 2009-09-04 00:24:55.923328+00:00 ]

Confirming.

core/iplib.py
2201
--- source=source.encode(self.stdin_encoding)

Apart from the above, shouldn't the input be decoded if it's not unicode?
(A similar issue existed in python2.5/code.py.)
core/iplib.py
2332,2334d2330
< line = raw_input_original(prompt)
< if not isinstance(line, unicode):
<     line = line.decode(self.stdin_encoding)

cheers,
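
The "decode only if it is not already unicode" guard suggested above can be sketched as follows. This is a Python 3 illustration in which `bytes` plays the role of Python 2's `str`; the name `to_text` and the default encoding are assumptions for the example, not IPython API.

```python
def to_text(line, stdin_encoding="utf-8"):
    """Return the line as text, decoding only if it arrived as raw bytes."""
    if isinstance(line, bytes):
        return line.decode(stdin_encoding)
    return line


print(to_text(b"x = u'\xc3\xa9'") == "x = u'é'")  # True: bytes are decoded
print(to_text("x = u'é'") == "x = u'é'")          # True: text passes through
```

Decoding once at the input boundary means later stages never have to guess whether they hold bytes or text.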

@ghost

ghost commented May 10, 2010

[ LP comment 16 by: INADA Naoki, on 2009-10-08 03:52:22.535452+00:00 ]

I managed to fix this bug on the Python side: http://bugs.python.org/issue5911
But even if the Python issue is fixed in Python 2.7, the problem remains in Python 2.6 and lower.

@ghost

ghost commented May 10, 2010

[ LP comment 17 by: t0ster, on 2010-04-26 16:45:21.798105+00:00 ]

The patches worked for me; removing 'source=source.encode(self.stdin_encoding)' helped on Mac OS X 10.6.

Thanks

Owner

fperez commented Sep 30, 2010

On Launchpad, Thorsten Glaser wrote:

I didn’t use the patch from LP: #290677 due to
http://bugs.python.org/issue5911
but wrote a workaround.

It may or may not touch all places needed and not break anything unrelated,
I searched for a good place to do so actually, but can’t guarantee anything.
Feedback extremely welcome.

It at least fixes the two Trac things for me.

His patch is available here:
http://launchpadlibrarian.net/56767748/patch-IPython_iplib_py

It's unfortunately too late in the release cycle for 0.10.1 to properly test this, but if more testing shows it to be stable, we can push a 0.10.2 with this as a fix.

Owner

fperez commented Oct 29, 2010

Brian saw this on Python 2.6, Mac OS X 10.5:

======================================================================
ERROR: test_unicode (IPython.core.tests.test_inputsplitter.InputSplitterTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/bgranger/Documents/Computation/IPython/code/ipython/IPython/core/tests/test_inputsplitter.py", line 353, in test_unicode
    self.isp.push("u'\xc3\xa9'")
  File "/Users/bgranger/Documents/Computation/IPython/code/ipython/IPython/core/inputsplitter.py", line 374, in push
    self._store(lines)
  File "/Users/bgranger/Documents/Computation/IPython/code/ipython/IPython/core/inputsplitter.py", line 607, in _store
    setattr(self, store, self._set_source(buffer))
  File "/Users/bgranger/Documents/Computation/IPython/code/ipython/IPython/core/inputsplitter.py", line 610, in _set_source
    return ''.join(buffer).encode(self.encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

----------------------------------------------------------------------
Ran 270 tests in 1.974s

I don't see it on linux, but let's make sure it's fixed across all platforms after reworking the unicode machinery.

Owner

fperez commented Oct 31, 2010

Robert Kern notes on the list:

The code is just wrong (at least on Python 2) since it calls .encode() on a byte string, not a unicode string. You've never decoded it.
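
To illustrate the point: on Python 2, calling `.encode()` on a byte string first decodes it implicitly with the default codec (normally ASCII), which is where the `UnicodeDecodeError` in the traceback comes from. The sketch below makes that hidden step explicit in Python 3; it is an illustration of the failure, not the actual fix that went into IPython.

```python
raw = b"u'\xc3\xa9'"  # the bytes pushed by test_unicode in the traceback

try:
    # Explicit version of the implicit ASCII decode Python 2 performs
    # before it can encode a byte string.
    raw.decode("ascii")
except UnicodeDecodeError as err:
    failure = err
else:
    failure = None

print(failure)  # reports byte 0xc3 at position 2

# Decoding with the real input encoding first avoids the error.
text = raw.decode("utf-8")
print(text == "u'é'")  # True
```

Once the input has been decoded this way, encoding it again is well defined for any target codec that covers the characters.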

Owner

takluyver commented Mar 25, 2011

These unicode issues should now be fixed. Please reopen if you can still replicate them.

@takluyver takluyver closed this Mar 25, 2011

jrus commented May 17, 2011

I still get this issue, Python 2.7, IPython 0.10.2, Mac OS X 10.6.7, python readline 6.1.0

$ ipython
Python 2.7.1 (r271:86882M, Nov 30 2010, 10:35:34)
IPython 0.10.2 -- An enhanced Interactive Python.

In [1]: 'pequeño'
Out[1]: 'peque\xc3\xb1o'

In [2]: u'pequeño'
Out[2]: u'peque\xc3\xb1o'

In [3]: print 'pequeño'
pequeño

In [4]: print u'pequeño'
pequeño

vs Python:

$ python
Python 2.7.1 (r271:86882M, Nov 30 2010, 10:35:34) 

>>> 'pequeño'
'peque\xc3\xb1o'

>>> u'pequeño'
u'peque\xf1o'

>>> print 'pequeño'
pequeño

>>> print u'pequeño'
pequeño
Owner

minrk commented May 17, 2011

The bug has been fixed in master (soon to be 0.11). It will not be fixed in 0.10.
