Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

Add `encoding` attribute to `OutStream` class. #2500

Merged
merged 1 commit into from

4 participants

@bfroehle

This means that sys.stdout.encoding returns something useful instead of raising an AttributeError.

Closes #2220.

@takluyver
Owner

I wonder if explicitly setting UTF-8 makes more sense than using DEFAULT_ENCODING? As far as I'm aware, the Qt console and the notebook both work in UTF-8 even in Windows.

@bfroehle

Yeah, I have no idea. In write we do use encoding.DEFAULT_ENCODING.

But this is a completely backwards usage in some sense, as in a regular file object, .encoding is used by .write and print to control how unicode text is output.

Since we never encode the text (and instead just keep it with type unicode), I'm not sure encoding is even well-defined.

@y-p

My use-case is defining a repr() for objects so that unicode
code points render properly when I work interactively in ipython.

In py2.x repr() must return a str() not unicode().
if a repr returns a unicode obj, repr(obj) tries to encode it using "ascii" resulting in a
UnicodeEncodeError. so repr returning a correctly-encoded string is the only choice.

With the current implementation of write, the encoding to use is exactly
that encoding which write uses to decode strings back to unicode, before
pushing data over the socket.

I think encoding.DEFAULT_ENCODING is the right value to use here,
at least with write as it is.

@takluyver
Owner

Hmm, but I suspect write() should be decoding using UTF-8:

  1. This means code can output any characters, instead of being restricted to the subset in the current code page.
  2. If the user types print "þø→ŧ", I think decoding as UTF-8 is necessary for the output to match the input.
@y-p

True, although if a character is not supported, then it won't display correctly
no matter where the replacement occurs.

@takluyver
Owner

But the Qt console and the notebook don't have to rely on Windows' character set - both should be able to display arbitrary unicode characters.

@bfroehle

For the following reasons, I suggest that we close this issue and mark #2220 as invalid.

  1. It's not obvious what the appropriate value is.
  2. The value is never used.
  3. It's not a required attribute. "The attribute is read-only and may not be present on all file-like objects. It may also be None, in which case the file uses the system default encoding for converting Unicode strings."
@bfroehle

On the other hand io.TextIOBase.encoding should always exist in Python 3. In that case, we just want to choose something so that stdout.write(unicode_string) and write(unicode_string.encode(stdout.encoding)) are equivalent.

@y-p

You just wouldn't want a situtation where sys.stdout.encoding (specifying what write expects) is different
from what the display is capable of, because that would mislead clients about capabilities.

if that's not a possibility, so much the better.

@y-p

that would be a problem, since sys.getdefaultencoding() returns ascii,
encoding.py actually tries locale.getpreferedencoding() first.

@y-p

'utf-8' is good, I just can't the bit of code that explicitly states that qt and nb always use utf-8 and nothing but.

@takluyver
Owner

I suggest: @bfroehle , can you change this, and write() to UTF-8. Then we'll want a Windows user to test that it does the right thing with various bits of unicode. @y-p , are you on Windows? Or @jstenar is our unicode-on-Windows expert.

@y-p

I have access to a windows machine, would gladly test whatever you'd like.

@takluyver
Owner

It's rather confusing: when you encode something to utf-8 to return from repr(), OutStream will decode it to unicode to pack it into JSON (all strings in JSON are unicode). In serialising the JSON, it is then encoded once again to UTF-8. The frontend unserialises, again getting unicode strings, which it passes to the display mechanism (Qt or the browser), which should be unicode capable.

@y-p

what's confusing about that? :)

@takluyver
Owner

Perhaps 'convoluted' would be a better word ;-)

@bfroehle bfroehle Use 'UTF-8' as the `encoding` attribute in `OutStream` class.
This means that `sys.stdout.encoding` returns something useful instead
of raising an AttributeError. 'UTF-8' was chosen because our version of
write actually wants a unicode string (i.e., not `bytes`) and we want
something universal which an encode the entire range of unicode characters.
f1c1155
@bfroehle

@takluyver & @y-p I think this is what you both were suggesting... Let's give this a thorough test in Windows now.

@y-p

Will do.

@y-p

Test suite passes on 2.7 on win7 (no errors beyond those of current git master)

and this works for me, using qtconsole

import sys 
class A():
    def __repr__(self):
        return u"שלום".encode(sys.stdout.encoding)
A()

שלום

as a sidenote, settting up for tests was pretty painful. no CI server?

@y-p

and of course:

sys.stdout.encoding == 'UTF-8'
@minrk
Owner

as a sidenote, settting up for tests was pretty painful. no CI server?

As usual, Windows is dramatically more painful than other platforms, principally because there are no IPython devs who use it. That said, I have found that easy_install nose was the only step I have ever had to take to be ready to test on my Windows VMs.

And we do have two CI environments running. Travis automatically runs all PRs (as seen in the 'Good to merge' notes on this and every PR), and we also have ShiningPanda, which recently started Windows tests (again, much more painful to do CI on Windows than other platforms).

@bfroehle

Before we merge this, we should note that this pull request brings us almost back to before 19d5c41 ("Handle unicode properly in IPython.zmq.iostream") (pull request #534, issue #529). @y-p those linked issues should give you a bunch of additional test cases to try out.

@y-p

@minrk, python, then setuptools (from web), then pip (that's what I use on linux),then nose, then setup.py develop, then all the deps by hand, then back to easy_install because some deps where egg-only, and then more binaries from
the web because the machine doesn't have a compiler installed. I obviously went off the well-traveled path somewhere.

@y-p

replicating the tests in #529:

In [7]: print u"åäö".encode('utf-8')
åäö

In [8]: print u"åäö"
åäö

In [5]: "åäö"
Out[5]: '\xc3\xa5\xc3\xa4\xc3\xb6'

for completeness

In [2]: u"åäö".encode("utf-8").decode("cp850")
Out[2]: u'\u251c\xd1\u251c\xf1\u251c\xc2'

In [3]: print u"åäö".encode("utf-8").decode("cp850")
├Ñ├ñ├Â

In [4]: print u"åäö".encode('cp850')
���

In [6]: print u"åäö".encode('cp1252')
���

where � is the unicode replacement symbol, which windows mangles for some reason, but
renders as the usual blotchy-question-mark in qtconsole.

@takluyver
Owner

@y-p : Could you also test in the qtconsole print "שלוםåäö" (without the u prefix)? I know that doesn't behave correctly in the terminal, but I think with this test it should work in the Qt console and the notebook.

Our Windows CI runs are currently failing, unfortunately. If you've got spare time, could you test #2478, which should address the test failure.

@y-p

In notebook.

In [16]: print "שלוםåäö"
שלוםåäö

In qtconsole, the display is right aligned because it begins with RTL script,
but that's not new to this patch. other then that it prints fine.

@takluyver
Owner

OK, then I think this is all OK. @bfroehle , anything else you want to check? If not, let's land this.

@bfroehle

Okay, hope it doesn't break anything! :) Merging now.

@bfroehle bfroehle merged commit 2e1736c into from
@minrk minrk referenced this pull request from a commit
@minrk minrk Backport PR #2500: Add `encoding` attribute to `OutStream` class.
This means that `sys.stdout.encoding` returns something useful instead of raising an AttributeError.

Closes #2220.
f635516
@takluyver takluyver referenced this pull request from a commit in takluyver/pandas
@y-p y-p ENH: try to detect encoding via sys.stdout first, then sys.stdin
when redirected to a file stdout is None, in ipython zmq frontends
stdin is None, and stdout (only recently in 0.14) is utf-8.

ipython/ipython#2500
7cc9779
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Oct 22, 2012
  1. @bfroehle

    Use 'UTF-8' as the `encoding` attribute in `OutStream` class.

    bfroehle authored
    This means that `sys.stdout.encoding` returns something useful instead
    of raising an AttributeError. 'UTF-8' was chosen because our version of
    write actually wants a unicode string (i.e., not `bytes`) and we want
    something universal which an encode the entire range of unicode characters.
This page is out of date. Refresh to see the latest.
Showing with 4 additions and 4 deletions.
  1. +4 −4 IPython/zmq/iostream.py
View
8 IPython/zmq/iostream.py
@@ -4,7 +4,7 @@
from session import extract_header, Message
-from IPython.utils import io, text, encoding
+from IPython.utils import io, text
from IPython.utils import py3compat
#-----------------------------------------------------------------------------
@@ -23,6 +23,7 @@ class OutStream(object):
topic=None
def __init__(self, session, pub_socket, name):
+ self.encoding = 'UTF-8'
self.session = session
self.pub_socket = pub_socket
self.name = name
@@ -73,9 +74,8 @@ def write(self, string):
else:
# Make sure that we're handling unicode
if not isinstance(string, unicode):
- enc = encoding.DEFAULT_ENCODING
- string = string.decode(enc, 'replace')
-
+ string = string.decode(self.encoding, 'replace')
+
self._buffer.write(string)
current_time = time.time()
if self._start <= 0:
Something went wrong with that request. Please try again.