UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 #597

Closed
juliantaylor opened this issue Jul 19, 2011 · 8 comments · Fixed by #600
@juliantaylor
Contributor

Running the following results in a UnicodeDecodeError with the current git head:

$ cat test.ipy
from BeautifulSoup import BeautifulSoup
s = """<td>     </td>
<td>xxxxxxxxxxx</td>
<td>yyyyyyyyy</td>
<td>4asfag</td>"""
soup = BeautifulSoup(s)
soup.findAll("td")

$ irunner --ipython test.ipy
...
----> 1 soup.findAll("td")

/home/jtaylor/tmp/ipython/IPython/core/displayhook.pyc in __call__(self, result)
    300             self.start_displayhook()
    301             self.write_output_prompt()
--> 302             format_dict = self.compute_format_data(result)
    303             self.write_format_data(format_dict)
    304             self.update_user_ns(result)

/home/jtaylor/tmp/ipython/IPython/core/displayhook.pyc in compute_format_data(self, result)
    213             MIME type representation of the object.
    214         """
--> 215         return self.shell.display_formatter.format(result)
    216 
    217     def write_format_data(self, format_dict):

/home/jtaylor/tmp/ipython/IPython/core/formatters.pyc in format(self, obj, include, exclude)
    120                     continue
    121             try:
--> 122                 data = formatter(obj)
    123             except:
    124                 # FIXME: log the exception

/home/jtaylor/tmp/ipython/IPython/core/formatters.pyc in __call__(self, obj)
    440             printer.pretty(obj)
    441             printer.flush()
--> 442             return stream.getvalue()
    443 
    444 

/usr/lib/python2.7/StringIO.pyc in getvalue(self)
    268         """
    269         if self.buflist:
--> 270             self.buf += ''.join(self.buflist)
    271             self.buflist = []
    272         return self.buf

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 4: ordinal not in range(128)

It works if the last line is printed instead of being displayed as a result.

@juliantaylor
Contributor Author

GitHub and pastebins mangle the test code.
Download it here:
https://github.com/downloads/juliantaylor/testing/test2.py

@minrk
Member

minrk commented Jul 19, 2011

Hm, on what system? Is there any unicode in that file that I'm not seeing? Do you have any unicode paths that could be causing the problem?

I can run that example without error, even if I stick unicode content into it.

If I do add unicode content and export LC_ALL=C (force ascii), then I can see this error. That is, if the file uses an encoding other than that of stdin, I think unicode input is not properly respected.

@takluyver
Member

The first <td> block is filled with non-breaking spaces, am I right? I can replicate this with the file you downloaded, without changing the terminal encoding. It's not specific to irunner - I can get it inside IPython as well.

@juliantaylor
Contributor Author

A regular Ubuntu 11.04 (natty) system with en_US.UTF-8.
There are some unicode characters in that td which are displayed as spaces. The file needs to be used exactly as is: if you remove some of the lines after the td with the "spaces", the error can no longer be reproduced.

This is what goes into /usr/lib/python2.7/StringIO.py(270) from /home/jtaylor/tmp/ipython/IPython/core/formatters.py(444):

['[', '<td>\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0</td>', ',', u'\n', ' ', '<td>xxxxxxxxxxx</td>', ',', u'\n', ' ', '<td>yyyyyyyyy</td>', ',', u'\n', ' ', '<td>4asfag</td>', ']']
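The failure can be reproduced without IPython or BeautifulSoup. The following is a minimal sketch, not the code path inside StringIO itself. In Python 2, ''.join() on a list that mixes str and unicode implicitly decodes every byte string as ASCII; the sketch performs that decode explicitly so that it behaves the same way on Python 3. The helper name join_like_python2 and the shortened fragment list are made up for illustration.

```python
# Fragments similar to the buflist above: byte strings mixed with text.
# b'\xc2\xa0' is the UTF-8 encoding of U+00A0 (no-break space).
fragments = [b'[', b'<td>\xc2\xa0\xc2\xa0</td>', b',', u'\n', b' ',
             b'<td>xxxxxxxxxxx</td>', b']']


def join_like_python2(parts):
    """Emulate Python 2's ''.join() on mixed str/unicode parts.

    Hypothetical helper: once any part is unicode, Python 2 decodes each
    byte string with the default ASCII codec. Here the decode is explicit.
    """
    return u''.join(
        p.decode('ascii') if isinstance(p, bytes) else p
        for p in parts
    )


error = None
try:
    join_like_python2(fragments)
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The reported position is an offset within the offending fragment: byte 0xc2 sits right after the four ASCII bytes of `<td>`.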

@takluyver
Member

To be specific, I think the issue is when one of the objects in a list has a unicode repr*, and the pretty printer tries to format the list onto several lines (it doesn't happen when the list is short enough to be displayed on one line).

* Unicode strings themselves have a (byte) string repr in Python 2, which was why the issue wasn't more obvious. BeautifulSoup objects are probably the commonest case of something with a unicode repr.

@takluyver
Member

The relevant bit from the docs:

The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called.

So it seems the pretty printer is encoding things to bytestrings, and that gets the StringIO in a muddle.
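One way to keep a buffer out of that muddle is to convert every fragment to text before it is written. The sketch below illustrates the idea only and is not taken from the linked commits. It assumes the byte strings are UTF-8, which matches the reporter's en_US.UTF-8 locale but is a guess in general, and it uses io.StringIO, which only accepts text. The helper name to_text is hypothetical.

```python
import io


def to_text(fragment, encoding='utf-8'):
    """Return fragment as text, decoding byte strings with `encoding`.

    Hypothetical helper; undecodable bytes are replaced rather than raised.
    """
    if isinstance(fragment, bytes):
        return fragment.decode(encoding, 'replace')
    return fragment


fragments = [b'[', b'<td>\xc2\xa0</td>', b',', u'\n', b' ',
             b'<td>4asfag</td>', b']']

stream = io.StringIO()
for fragment in fragments:
    stream.write(to_text(fragment))  # only text ever reaches the buffer

output = stream.getvalue()
print(repr(output))
```

Because the buffer only ever holds one string type, getvalue() has nothing to decode implicitly, and the no-break space survives as a single character.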

takluyver added a commit to takluyver/ipython that referenced this issue Jul 19, 2011
@takluyver
Member

The linked commit above is a simple but inelegant fix for this issue. Ideally, pretty should use and return unicode, but that's a much bigger set of changes.

takluyver added a commit to takluyver/ipython that referenced this issue Jul 22, 2011
@takluyver
Member

Closed by 6b2de8f.

mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this issue Nov 3, 2014
3 participants