UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 #597

juliantaylor · 2011-07-19T21:33:06Z

Running the following results in a UnicodeDecodeError with the current git HEAD:

$ cat test.ipy
from BeautifulSoup import BeautifulSoup
s = """<td>     </td>
<td>xxxxxxxxxxx</td>
<td>yyyyyyyyy</td>
<td>4asfag</td>"""
soup = BeautifulSoup(s)
soup.findAll("td")

$ irunner --ipython test.ipy
...
----> 1 soup.findAll("td")

/home/jtaylor/tmp/ipython/IPython/core/displayhook.pyc in __call__(self, result)
    300             self.start_displayhook()
    301             self.write_output_prompt()
--> 302             format_dict = self.compute_format_data(result)
    303             self.write_format_data(format_dict)
    304             self.update_user_ns(result)

/home/jtaylor/tmp/ipython/IPython/core/displayhook.pyc in compute_format_data(self, result)
    213             MIME type representation of the object.
    214         """
--> 215         return self.shell.display_formatter.format(result)
    216 
    217     def write_format_data(self, format_dict):

/home/jtaylor/tmp/ipython/IPython/core/formatters.pyc in format(self, obj, include, exclude)
    120                     continue
    121             try:
--> 122                 data = formatter(obj)
    123             except:
    124                 # FIXME: log the exception

/home/jtaylor/tmp/ipython/IPython/core/formatters.pyc in __call__(self, obj)
    440             printer.pretty(obj)
    441             printer.flush()
--> 442             return stream.getvalue()
    443 
    444 

/usr/lib/python2.7/StringIO.pyc in getvalue(self)
    268         """
    269         if self.buflist:
--> 270             self.buf += ''.join(self.buflist)
    271             self.buflist = []
    272         return self.buf

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 4: ordinal not in range(128)

It works if the last line is printed explicitly instead of being displayed as a result.

juliantaylor · 2011-07-19T21:38:43Z

GitHub and pastebins mangle the test code.
Download it here:
https://github.com/downloads/juliantaylor/testing/test2.py

minrk · 2011-07-19T21:50:40Z

Hm, on what system? Is there any unicode in that file that I'm not seeing? Do you have any unicode paths that could be causing the problem?

I can run that example without error, even if I stick unicode content into it.

If I do add unicode content and export LC_ALL=C (force ascii), then I can see this error. That is, if the file uses an encoding other than that of stdin, I think unicode input is not properly respected.

takluyver · 2011-07-19T22:03:37Z

The first <td> block is filled with non-breaking spaces, am I right? I can replicate this with the file you downloaded, without changing the terminal encoding. It's not specific to irunner - I can get it inside IPython as well.

juliantaylor · 2011-07-19T22:04:52Z

Regular Ubuntu 11.04 (Natty) system with en_US.UTF-8.
There are some unicode characters in that td which are displayed as spaces. The file needs to be used exactly as is: if you remove some of the lines after the td with the "spaces", the error can no longer be reproduced.

This is what goes into /usr/lib/python2.7/StringIO.py(270) from /home/jtaylor/tmp/ipython/IPython/core/formatters.py(444):

['[', '<td>\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0</td>', ',', u'\n', ' ', '<td>xxxxxxxxxxx</td>', ',', u'\n', ' ', '<td>yyyyyyyyy</td>', ',', u'\n', ' ', '<td>4asfag</td>', ']']
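For reference, the `\xc2\xa0` pairs in the first element are the UTF-8 encoding of U+00A0, which fits the characters that display as spaces. A quick check, written for Python 3 (an assumption; the thread itself concerns Python 2.7):

```python
import unicodedata

# Raw bytes of the first <td> element, copied from the buffer list above.
raw = b"<td>\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0</td>"

# Decode as UTF-8 and strip the surrounding tags to get the cell content.
text = raw.decode("utf-8")
cell = text[len("<td>"):-len("</td>")]

# Collect the Unicode names of the characters in the cell.
names = {unicodedata.name(ch) for ch in cell}
print(len(cell), names)
```

The first byte of the first pair sits at index 4 of that element, right after `<td>`, which lines up with "position 4" in the error message.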

takluyver · 2011-07-19T22:10:09Z

To be specific, I think the issue is when one of the objects in a list has a unicode repr*, and the pretty printer tries to format the list onto several lines (it doesn't happen when the list is short enough to be displayed on one line).

* Unicode strings themselves have a (byte) string repr in Python 2, which was why the issue wasn't more obvious. BeautifulSoup objects are probably the commonest case of something with a unicode repr.

takluyver · 2011-07-19T22:14:54Z

The relevant bit from the docs:

The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called.

So it seems the pretty printer is encoding things to bytestrings, and that gets the StringIO in a muddle.
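Python 2 coerces byte strings to unicode with the ASCII codec when the two are mixed, and that is the step `''.join(self.buflist)` trips over. A minimal sketch of the behaviour, written for Python 3 where the decode has to be spelled out (`implicit_py2_join` is a hypothetical helper for illustration, not part of StringIO or IPython):

```python
def implicit_py2_join(pieces):
    """Mimic Python 2's ''.join() on a mixed list: each byte string is
    decoded with the ASCII codec before being joined as unicode."""
    return "".join(
        p.decode("ascii") if isinstance(p, bytes) else p
        for p in pieces
    )

# Shortened version of the buffer list from the report: byte strings
# from the pretty printer mixed with unicode newlines.
buflist = [b"[", b"<td>\xc2\xa0</td>", b",", "\n", b" ", b"<td>4asfag</td>", b"]"]

error = None
try:
    implicit_py2_join(buflist)
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

If every element is a byte string, Python 2 never attempts the decode, which would explain why a short list that fits on one line (no `u'\n'` separators) does not fail.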

takluyver · 2011-07-19T22:40:00Z

The linked commit above is a simple but inelegant fix for this issue. Ideally, pretty should use and return unicode, but that's a much bigger set of changes.
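One way to read "pretty should use and return unicode" is to normalize every fragment to text before it reaches the buffer. A rough sketch of that idea in Python 3 terms; `to_text` and the choice of UTF-8 with `errors="replace"` are assumptions for illustration, and this is not the change made in the linked commit:

```python
def to_text(piece, encoding="utf-8"):
    """Return piece as text, decoding byte strings with the given encoding.

    Hypothetical helper; undecodable bytes are replaced rather than raising.
    """
    if isinstance(piece, bytes):
        return piece.decode(encoding, errors="replace")
    return piece

# Same mixed fragments as in the report, shortened.
pieces = [b"[", b"<td>\xc2\xa0</td>", b",", "\n", b" ", b"<td>4asfag</td>", b"]"]

# Once every fragment is text, joining can no longer trigger an implicit decode.
joined = "".join(to_text(p) for p in pieces)
print(repr(joined))
```

The encoding would have to be chosen by the caller; guessing UTF-8 is only reasonable here because the captured bytes are known to be UTF-8.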

takluyver · 2011-07-22T19:29:20Z

Closed by 6b2de8f.

takluyver added a commit to takluyver/ipython that referenced this issue Jul 19, 2011

Fix bug with non-ascii reprs inside pretty-printed lists.

7a59ef3

Closes ipythongh-597

takluyver added a commit to takluyver/ipython that referenced this issue Jul 22, 2011

Comment explaining fix for ipythongh-597.

35869db

takluyver closed this as completed Jul 22, 2011

takluyver mentioned this issue Jul 22, 2011

Fix bug with non-ascii reprs inside pretty-printed lists. #600

Merged

mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this issue Nov 3, 2014

Fix bug with non-ascii reprs inside pretty-printed lists.

987ed6b

Closes ipythongh-597

mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this issue Nov 3, 2014

Comment explaining fix for ipythongh-597.

7777579
