Codepage of output from scripts and shell commands is not handled properly by qtconsole #768

Closed
jstenar opened this Issue Sep 6, 2011 · 14 comments

@jstenar
Member

jstenar commented Sep 6, 2011

On my machine, when running ls in a qtconsole, any non-ASCII characters in the output come out as garbage (a diamond-shaped question mark).

I have a test script at https://gist.github.com/1198529 that can be used to illustrate the problem.

In a regular IPython terminal I get the correct result for:

In [1]: %run run-encoding.py cp1252
Test data åäö

But, as expected, I get incorrect results for:

In [2]: %run run-encoding.py cp850
Test data †„”

In [3]: %run run-encoding.py utf-8
Test data åäö

However when running in qtconsole I get incorrect results in all three cases.

/Jörgen

Member

minrk commented Sep 6, 2011

The basic reason is that the 'encoding' associated with the qtconsole is sys.getdefaultencoding(), so just like you get the wrong answer in everything but cp1252 in your Windows terminal, you get the wrong answer in everything but the default encoding (generally ascii) in the qtconsole. The question marks are the result of s.decode(sys.getdefaultencoding(), 'replace').

The general idea is that if you are printing unicode, you should be printing unicode objects, which will behave correctly, not bytes objects, which have discarded the character meaning of their contents.
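
A minimal sketch of the effect described above, assuming Python 3 (where the text type is str rather than unicode). The helper name decode_as_ascii is hypothetical and only stands in for the s.decode(sys.getdefaultencoding(), 'replace') call when the default encoding is ascii.

```python
# Bytes carry no character meaning until they are decoded with the right codec.
TEXT = "åäö"


def decode_as_ascii(raw):
    """Decode bytes as ASCII, replacing every undecodable byte with U+FFFD."""
    return raw.decode("ascii", "replace")


for codec in ("cp1252", "cp850", "utf-8"):
    raw = TEXT.encode(codec)
    # ascii() keeps the demo printable even on a console that cannot show
    # U+FFFD, the "question mark in a diamond" replacement character.
    print(codec, raw, ascii(decode_as_ascii(raw)))
```

Every non-ASCII byte becomes one replacement character, so the UTF-8 form of the three letters (six bytes) turns into six diamonds, while a text object printed directly keeps its meaning.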

Member

jstenar commented Sep 6, 2011

minrk wrote 2011-09-06 21:00:

The basic reason is that the 'encoding' associated with the qtconsole is sys.getdefaultencoding(), so just like you get the wrong answer in everything but cp1252 in your Windows terminal, you get the wrong answer in everything but the default encoding (generally ascii) in the qtconsole. The question marks are the result of s.decode(sys.getdefaultencoding(), 'replace').

The general idea is that if you are printing unicode, you should be printing unicode objects, which will behave correctly, not bytes objects, which have discarded the character meaning of their contents.

As far as I can see, using unicode objects only works for scripts you run with %run, not for things you launch with ! or aliases.

By adding "print data" to the run-encoding.py script I get something that prints fine with %run, but not with !python run-encoding.py. Then I get:

In [5]: !python run-encoding.py cp1252
Traceback (most recent call last):
  File "run-encoding.py", line 8, in <module>
    print data
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)

I think it would be nice to be able to set the encoding expected from the stdout of ! commands and aliases. Most builtin DOS commands respect the codepage setting you use, so you should be able to determine a reasonable value, at least on a case-by-case basis.

/Jörgen
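
The request above can be sketched as follows, assuming Python 3. The function run_and_decode and the child command are hypothetical illustrations, not IPython's implementation of ! commands; the child simply writes "åäö" as cp850 bytes, the way a DOS command running under codepage 850 might.

```python
import subprocess
import sys


def run_and_decode(cmd, encoding):
    """Run cmd and decode its stdout bytes with an explicitly chosen encoding."""
    raw = subprocess.run(cmd, stdout=subprocess.PIPE, check=True).stdout
    return raw.decode(encoding, "replace")


# The child source is pure ASCII (the letters are written as escapes), so it
# can be passed on the command line regardless of the parent's locale.
CHILD = r"import sys; sys.stdout.buffer.write('\xe5\xe4\xf6'.encode('cp850'))"
CMD = [sys.executable, "-c", CHILD]

print(ascii(run_and_decode(CMD, "cp850")))   # decoded with the matching codepage
print(ascii(run_and_decode(CMD, "cp1252")))  # decoded with a mismatched codepage
```

With cp850 the original text comes back. With cp1252 the same three bytes decode to †„”, which matches the terminal output reported in the first comment.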

Member

minrk commented Sep 6, 2011

I was mistaken: we actually start with sys.stdin.encoding and fall back to sys.getdefaultencoding(), but sys.stdin.encoding is often None for subprocesses like the kernel.

In any case, I think if we give the OutStream object (what we replace sys.stdout with) a configurable encoding attribute, much of this should improve, and users could set it themselves.
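
A rough sketch of that idea, assuming Python 3. DecodingStream is a hypothetical stand-in, not IPython's actual OutStream class; it only shows how a configurable encoding attribute with the described default could be used to interpret bytes.

```python
import io
import sys


class DecodingStream:
    """Hypothetical stdout replacement with a configurable `encoding`."""

    def __init__(self, target, encoding=None):
        self.target = target
        # Default as described in the comment: stdin's encoding if it has
        # one, otherwise Python's default encoding.
        self.encoding = (
            encoding
            or getattr(sys.stdin, "encoding", None)
            or sys.getdefaultencoding()
        )

    def write(self, data):
        # Bytes are interpreted with the configured encoding; text passes
        # through unchanged.
        if isinstance(data, bytes):
            data = data.decode(self.encoding, "replace")
        return self.target.write(data)


buf = io.StringIO()
stream = DecodingStream(buf, encoding="cp1252")
stream.write(b"Test data \xe5\xe4\xf6")
print(ascii(buf.getvalue()))
```

Setting the attribute explicitly covers the case where sys.stdin.encoding is None and the default would otherwise carry no useful information.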

Member

takluyver commented Sep 6, 2011

It's not entirely clear what the 'correct' encoding is, because we're not limited by the terminal code page. If you do print "åäö", should we assume that to be in the encoding a terminal would force you to use, or UTF-8, or something else?

For external processes, I think we should decode the bytes as we read them from the other process, and assume that it's using the system code page. I thought we already did this, but I guess it must be going wrong somewhere.

Member

minrk commented Sep 6, 2011

We use sys.stdin.encoding, which can be (and often is for subprocesses) None. If we give the OutStream object an encoding with the same default behavior it currently has, it should improve the situation, allowing users to set it when stdin encoding doesn't tell us anything.

Member

minrk commented Sep 7, 2011

@jstenar, can you check if the code in PR #770 makes the behavior more reasonable for you? It adds checking the locale for encoding information, so if you change the locale, it will change the default interpretation of bytes objects.

Member

jstenar commented Sep 8, 2011

Min RK wrote 2011-09-08 01:13:

@jstenar, can you check if the code in PR #770 makes the behavior more reasonable for you? It adds checking the locale for encoding information, so if you change the locale, it will change the default interpretation of bytes objects.

Yes, it seems to work.

/Jörgen

Member

fperez commented Sep 12, 2011

I've just merged #770, which supposedly helped with this, but on Linux I still see problems. On the terminal I get:

In [4]: %run run-encoding.py utf-8
Test data åäö

but on the qtconsole I see the little question-mark-diamonds:

In [1]: %run run-encoding.py utf-8
Test data ������

So it seems we still have issues, no?

Member

minrk commented Sep 13, 2011

Arg, I switched getpreferredencoding() to getpreferredencoding(False), since I thought it was safer. Turns out the opposite makes the most sense, and fixes this particular case.
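
The fallback order discussed in this thread can be sketched like this, assuming Python 3. The function guess_encoding is a hypothetical illustration and not the code from PR #770. locale.getpreferredencoding() with its default argument may temporarily consult the user's locale settings, whereas getpreferredencoding(False) only reports on the locale that is currently active in the process.

```python
import locale
import sys


def guess_encoding(stream=None):
    """Best-guess encoding for interpreting bytes output (sketch only)."""
    if stream is None:
        stream = sys.stdin
    # 1. The stream's own encoding; often None for subprocesses.
    enc = getattr(stream, "encoding", None)
    if not enc:
        # 2. The user's preferred encoding according to the locale.
        enc = locale.getpreferredencoding()
    # 3. Last resort: Python's default encoding.
    return enc or sys.getdefaultencoding()


print(guess_encoding())
```

On current Python 3 versions both forms of the call usually agree, because the interpreter sets the locale at startup; the difference mattered more on the Python 2 versions in use when this issue was filed.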

Member

fperez commented Sep 13, 2011

@minrk, since #770 is already merged, do you want to just make that change in master? We can then retest this...

Member

minrk commented Sep 13, 2011

Sure, change pushed.

Member

fperez commented Sep 13, 2011

OK, with Min's fix, master now works for me both at the terminal and in the qtconsole. I should note that only utf-8 shows the output correctly; cp1252 still shows the diamonds on Linux. But I imagine that's correct on a Linux box...

So now that this has been merged, should we close the original issue? @jstenar?
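
A small Python 3 sketch of why that result is expected, assuming the Linux box uses a UTF-8 locale. The helper name decode_for_utf8_locale is hypothetical.

```python
CP1252_BYTES = "åäö".encode("cp1252")  # b'\xe5\xe4\xf6'
UTF8_BYTES = "åäö".encode("utf-8")


def decode_for_utf8_locale(raw):
    """Interpret bytes the way a console with a UTF-8 locale would."""
    return raw.decode("utf-8", "replace")


print(ascii(decode_for_utf8_locale(UTF8_BYTES)))
print(ascii(decode_for_utf8_locale(CP1252_BYTES)))
```

The cp1252 bytes are not valid UTF-8 sequences, so they decode to replacement characters, while the UTF-8 bytes round-trip cleanly.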

Member

jstenar commented Sep 13, 2011

Fernando Perez wrote 2011-09-13 02:37:

OK, with Min's fix, master does work for me now both at the terminal and the qtconsole. I should note that only utf-8 shows the output correctly, the cp1252 still shows the diamonds on linux. But I imagine that's correct on a linux box...

So now that this has been merged, should we close the original issue? @jstenar?

Yes, go ahead and close it.

/Jörgen

Member

minrk commented Sep 13, 2011

Closed by PR #770.

@minrk minrk closed this Sep 13, 2011
