ANSI and screen don't support Unicode #84

dcoshea · 2014-07-02T06:05:06Z

It would be nice if the ANSI and screen packages supported Unicode strings, or there were additional packages/classes that did. I only had to change str() to unicode() and add u in front of the quoted strings (and perhaps even some of that wasn't necessary) in screen.pretty() and put_abs() in 345eb58 to get this working quickly for me, but I didn't test all methods.

If you could suggest your preferred solution, I might be able to implement it and submit a pull request.


takluyver · 2014-07-02T20:20:06Z

Hmmm. I suspect the simple changes would fail with non-ASCII bytes. On some level, it would make sense to make it unicode only, because a screen displays text, not bytes, but that would break backwards compatibility. On Python 3, meanwhile, it probably already accepts only unicode, but quite possibly no one has used it on Python 3 yet.

What about adding an encoding parameter and attribute to screen? It would then handle unicode internally, decoding any input before storing it. encoding would probably default to latin-1, so that arbitrary byte sequences can be handled. @jquast, thoughts?
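A minimal sketch of why latin-1 is a safe default for arbitrary bytes: it maps each byte value to the code point with the same number, so decoding cannot fail and encoding restores the original bytes. The variable names are illustrative only and are not part of pexpect.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(bytearray(range(256)))

# latin-1 maps byte N to code point N, so decoding never raises.
text = all_bytes.decode('latin-1')

# Encoding the decoded text gives back exactly the original bytes.
round_trip = text.encode('latin-1')

# A stricter codec such as UTF-8 rejects part of the same input.
try:
    all_bytes.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```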

dcoshea · 2014-07-03T11:43:18Z

> Hmmm. I suspect the simple changes would fail with non-ascii bytes.

With the changes suggested, I am in fact reading in bytes, passing them to a CP437 decoder, passing the resulting unicode strings to ANSI, then later calling get_region() to extract the unicode string, converting it to UTF-8, and outputting it, and stuff is coming back out the way it went in, e.g.:

Welcome to CentOS for x86_64
         ┌────────────────────────┤ Scanning ├────────────────────────┐
         │                                                            │
         │ Looking for installation images on CD device /dev/sr0      │
         │                                                            │
         └────────────────────────────────────────────────────────────┘

so I think that is sort of proof it is working, at least for the methods I updated. I realize there are other methods that probably need to be changed too.
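The round trip described above (CP437 bytes in, unicode stored, UTF-8 out) can be sketched without pexpect at all. Here a plain list stands in for the screen buffer; it is an assumption for illustration, not the ANSI or screen API.

```python
# Box-drawing characters as they arrive from a CP437 terminal stream.
raw = b'\xda\xc4\xc4\xb4 Scanning \xc3\xc4\xc4\xbf'

# Step 1: decode the bytes to unicode before storing anything.
text = raw.decode('cp437')

# Step 2: store one unicode character per cell (stand-in for the screen).
cells = list(text)

# Step 3: read the cells back and encode for a UTF-8 terminal or file.
utf8_out = u''.join(cells).encode('utf-8')
```

Because each cell holds a whole character rather than a byte, the three-byte UTF-8 form of each box-drawing character only appears at the final encoding step.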

With the encoding parameter, I assume you mean that any data passed in to ANSI would automatically be decoded, and any data returned would automatically be encoded? That seems reasonable, but what would I do if I actually wanted to pass in unicode strings I'd already decoded myself and/or get them back as unicode? It just so happens I want to do the decoding myself, because the stream's encoding changes. I'm not sure if there's an easy way to obtain some kind of null/dummy encoder/decoder?

takluyver · 2014-07-03T22:42:27Z

I was thinking that it would decode if it got bytes, and use unicode directly if it got that, and the principal output would be unicode. str(s) on Python 2 would have to convert back to bytes, though.

dcoshea · 2014-07-04T00:36:24Z

And we could provide __unicode__() so that you can use unicode(s) on Python 2.

This all sounds usable to me, thanks!

Should I go ahead and try to implement this myself?

takluyver · 2014-07-04T00:44:17Z

Sure, go for it. I'm at a conference all next week, so I might not get to review it immediately, but I'll look at it soon.

dcoshea · 2014-07-07T05:33:44Z

> I was thinking that it would decode if it got bytes, and use unicode directly if it got that, and the principal output would be unicode.

On further thought, that doesn't seem symmetrical, nor backwards-compatible. Perhaps it would be more appropriate for the principal output to be bytes, and if you want unicode, you can use unicode(screen)?

I also figured that since, in my case, I don't want any encoding/decoding to occur, I would allow the codec to be specified as "None", meaning that all functions only accept unicode and only return unicode. My code to handle this looks like this:

    # Fragment of the screen class; needs "import codecs" at module level.
    def __init__(self, r=24, c=80, codec='latin-1', codec_errors='replace'):
        # [...]
        if codec is not None:
            self.decoder = codecs.getdecoder(codec)
            self.encoder = codecs.getencoder(codec)
            self.codec_errors = codec_errors
        else:
            # codec=None: no encoding/decoding, unicode in and out.
            self.decoder = None
            self.encoder = None
            self.codec_errors = None
        # [...]

    def _decode(self, s):
        '''This converts from the external coding system (as passed to
        the constructor) to the internal one (unicode).'''
        if self.decoder is not None:
            return self.decoder(s, self.codec_errors)[0]
        else:
            return unicode(s)

    def _encode(self, s):
        '''This converts from the internal coding system (unicode) to
        the external one (as passed to the constructor).'''
        if self.encoder is not None:
            return self.encoder(s, self.codec_errors)[0]
        else:
            return unicode(s)

put_abs() can call _decode(); __str__(), dump(), pretty() etc. can call _encode().

I realize this is of no use in Python 3 where str == unicode; I'll leave it for someone else to add support for bytes instances being passed in if they need it.

Does this sound reasonable?

takluyver · 2014-07-07T15:30:53Z

That's roughly what I was thinking, but:

- It shouldn't try to decode if it's already unicode.
- I would probably return unicode from methods like dump() and pretty(). It's not breaking backwards compatibility much, and we're planning to do a 4.0 release anyway, so some minor breaks in backwards compatibility are OK.
- It shouldn't refer to unicode, because that won't work on Python 3. I would do an isinstance(s, bytes) check to decide what to do (bytes is defined as an alias for str on Python 2).
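The three points above could be combined roughly as follows. This is only a sketch under stated assumptions: CodecHelper and its method names are hypothetical, not the code that was eventually merged, and passing bytes through untouched when codec is None is one possible choice among several.

```python
class CodecHelper(object):
    """Hypothetical helper; decides what to do with isinstance(s, bytes)."""

    def __init__(self, codec='latin-1', codec_errors='replace'):
        self.codec = codec
        self.codec_errors = codec_errors

    def decode(self, s):
        # Only bytes are decoded; text that is already unicode is returned
        # unchanged. The name "unicode" is never referenced, so the same
        # code runs on Python 2 (where bytes is str) and on Python 3.
        if self.codec is not None and isinstance(s, bytes):
            return s.decode(self.codec, self.codec_errors)
        return s

    def encode(self, s):
        # Used only where bytes are explicitly requested, e.g. __bytes__().
        if self.codec is None or isinstance(s, bytes):
            return s
        return s.encode(self.codec, self.codec_errors)


helper = CodecHelper()
from_bytes = helper.decode(b'caf\xe9')       # decoded with latin-1
from_text = helper.decode(u'\u2500\u2500')   # passed through untouched
lossy = helper.encode(u'\u2500')             # not in latin-1, replaced
```

With the 'replace' error scheme, characters the codec cannot represent are encoded as b'?' instead of raising an exception.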

dcoshea · 2014-07-08T00:33:34Z

> It shouldn't try to decode if it's already unicode.

I'm already making that check in the caller, although perhaps it would be better to put it into the _decode() function, I'll have a look.

> I would probably return unicode from methods like dump() and pretty().

So really everything other than __str__() then, like you said originally?

When will this 4.0 release be happening?

> It shouldn't refer to unicode [...]

Thanks, wasn't aware of that!

takluyver · 2014-07-08T01:26:00Z

> > I would probably return unicode from methods like dump() and pretty().
>
> So really everything other than __str__() then, like you said originally?

I think that makes most sense, yes. In Python 2, unicode mostly works like str anyway.

> When will this 4.0 release be happening?

We don't have an exact timeframe, but the two main aims are asyncio integration, which I already have PR #69 open for, and merging Windows support (issue #17). I don't think that should take too long (famous last words).

dcoshea · 2014-07-08T06:47:26Z

Thanks for the info.

I realize now that, unlike a solution where screen just starts accepting unicode only, if we're going to do encoding and decoding for the user, then ANSI has to be changed too, because its write() method splits the byte sequence up into individual bytes, passing them one at a time to write_ch(). Obviously for proper support of multi-byte encodings, the decoding has to be done before write() splits it up into characters.
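A small illustration of the ordering problem: if a multi-byte sequence is split into single bytes first and each byte is decoded on its own, the character is lost, whereas decoding first and splitting afterwards keeps it. The example uses UTF-8 and plain variables; it does not call ANSI.write() or write_ch().

```python
# U+2502 (a box-drawing vertical bar) is three bytes in UTF-8.
data = u'\u2502'.encode('utf-8')

# Wrong order: split into single bytes, then decode each byte separately.
single_bytes = [data[i:i + 1] for i in range(len(data))]
split_then_decode = [b.decode('utf-8', 'replace') for b in single_bytes]

# Right order: decode the whole sequence, then split into characters.
decode_then_split = list(data.decode('utf-8'))
```

In the first case every fragment is invalid on its own and becomes the replacement character U+FFFD; in the second case there is exactly one character to hand on.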

It seems like the scope of the work is getting a bit bigger than I expected. If non-backward-compatible changes are okay for a major release, would it be okay for these packages to just start treating their input as unicode and not perform decoding? I would assume it would be no more difficult for users to have to decode input before passing it to these packages than it would be for them to have to encode it after calling methods like get_region().

takluyver · 2014-07-08T13:47:41Z

I'll try to look into it. A major release means backwards incompatible changes are acceptable, but I still prefer to avoid them if it's practical.

jquast · 2014-07-11T08:14:19Z

Catching up. I'm very familiar with these kinds of things. The tests that we have for this module are a bit weak; I plan to provide a cp437-encoded interface screen and work from there. New keyword parameters like encoding="latin-1" sound good to me so far.

dcoshea · 2014-07-14T03:26:31Z

Thanks, are you planning to do this soon, and do you plan to allow (in Python 2) unicode instances to be passed in/out?

jquast · 2014-07-14T21:39:30Z

unicode-everywhere is definitely the intent, yes. Soon... trying my best :) I have a few things to wrap up in pexpect first, patches welcome :)

dcoshea · 2014-07-21T01:00:16Z

> patches welcome :)

I'd like to have the fairly trivial (at least as far as I thought) pull request #89 accepted, and file some other pull requests I have piled up here, before I continue with this bigger task.

dcoshea · 2014-07-23T11:04:29Z

Thanks for taking care of pull request #89!

I assume that, for input passed to screen and ANSI, an incremental decoder should be used?

jquast · 2014-07-23T17:43:34Z

yes, incremental decoder must be used, thanks.
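For reference, the standard library's codecs.getincrementaldecoder() returns a decoder class that buffers incomplete sequences between calls, which is what makes chunked input safe. A minimal sketch, with chunk boundaries chosen deliberately to fall inside a character:

```python
import codecs

# One decoder instance per input stream; it keeps partial sequences.
decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')

# The three-byte character U+2502 is split across chunk boundaries twice.
chunks = [b'\xe2', b'\x94\x82 ok \xe2\x94', b'\x82']

decoded = [decoder.decode(chunk) for chunk in chunks]

# final=True flushes the buffer; anything still incomplete is an error.
decoded.append(decoder.decode(b'', final=True))

result = u''.join(decoded)
```

The first call returns an empty string because the decoder is still waiting for the rest of the character; nothing is lost or replaced.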

dcoshea · 2014-07-24T05:13:54Z

Thanks.

I'm making some progress and I think I've implemented things as desired in a way that works in Python 2. For Python 3, I gather that prior to 3.3 the u'' syntax for string literals was not available. I assume I need to implement a workaround for this, i.e. supporting only 2.6, 2.7 and 3.3+ would not be acceptable?

takluyver · 2014-07-24T05:29:54Z

The next version of pexpect will only support Python 2.6, 3.3 and above, so you shouldn't need to work around the syntax for Unicode literals.

dcoshea · 2014-07-24T05:37:02Z

Great news, thanks!

This commit updates the screen and ANSI modules to support Unicode under Python 2.x. Under Python 3.x, it was already supported because strings are Unicode by default. Now, on both Python versions:

- The constructors accept a codec name (defaults to 'latin-1') and a scheme for handling encoding/decoding errors (defaults to 'replace'). The codec may be set to None to inhibit encoding/decoding.
- Unicode is now used internally for storing the screen contents.
- Methods that accept input characters will, if passed input of type 'bytes' (or, under Python 2.x, 'str'), use the specified codec to decode the input, otherwise treating it as Unicode.
- Methods that return screen contents now return Unicode, with the exception of __str__() under Python 2.x, and __bytes__() in all versions of Python, which return the screen contents encoded using the specified codec.

These changes are designed to work only with Python 2.6, 2.7, and 3.3 and later, specifically versions that provide both b'' and u'' string literals.

The check in ANSI for characters being printable is also removed, as this prevents non-ASCII characters being accepted, which is not compatible with the goal of adding Unicode support. This addresses issue pexpect#88.

dcoshea · 2014-07-24T12:39:43Z

Filed pull request #96 with a fix for this.

This commit updates the screen and ANSI modules to support Unicode under Python 2.x. Under Python 3.x, it was already supported because strings are Unicode by default. Now, on both Python versions:

- The constructors accept a codec name (defaults to 'latin-1') and a scheme for handling encoding/decoding errors (defaults to 'replace'). The codec may be set to None to inhibit encoding/decoding.
- Unicode is now used internally for storing the screen contents.
- Methods that accept input characters will, if passed input of type 'bytes' (or, under Python 2.x, 'str'), use the specified codec to decode the input, otherwise treating it as Unicode.
- Methods that return screen contents now return Unicode, with the exception of __str__() under Python 2.x, and __bytes__() in all versions of Python, which return the screen contents encoded using the specified codec.

These changes are designed to work only with Python 2.6, 2.7, and 3.3 and later, specifically versions that provide both b'' and u'' string literals.

The check in ANSI for characters being printable is also removed, as this prevents non-ASCII characters being accepted, which is not compatible with the goal of adding Unicode support. This addresses issue pexpect#83.
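The behaviour listed in the commit message can be illustrated with a toy one-row screen. TinyScreen, put() and the cells attribute are hypothetical names made up for this sketch; it is not the code from the pull request, and it leaves out the codec=None case.

```python
import sys

PY2 = sys.version_info[0] == 2


class TinyScreen(object):
    """Hypothetical one-row screen; not pexpect's screen class."""

    def __init__(self, cols=10, codec='latin-1', codec_errors='replace'):
        self.codec = codec
        self.codec_errors = codec_errors
        self.cells = [u' '] * cols          # contents are stored as unicode

    def put(self, col, s):
        # bytes input is decoded with the codec; unicode is used as-is.
        if isinstance(s, bytes):
            s = s.decode(self.codec, self.codec_errors)
        for offset, ch in enumerate(s):
            if col + offset < len(self.cells):
                self.cells[col + offset] = ch

    def dump(self):
        # Methods that return screen contents return unicode.
        return u''.join(self.cells)

    def __bytes__(self):
        # Encoded form, on every Python version.
        return self.dump().encode(self.codec, self.codec_errors)

    def __unicode__(self):
        # Lets unicode(screen) work on Python 2.
        return self.dump()

    def __str__(self):
        # str is bytes on Python 2 and unicode on Python 3.
        return self.__bytes__() if PY2 else self.dump()


s = TinyScreen(cols=6, codec='cp437')
s.put(0, b'\xb3 ok')    # bytes: decoded from CP437
s.put(5, u'\u2502')     # unicode: stored directly
```

Here s.dump() gives the unicode row, while s.__bytes__() gives the same row re-encoded as CP437.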

jquast · 2015-09-19T18:20:41Z

Closing. pexpect's terminal emulation code remains in the next release but is no longer being improved; it is marked deprecated by #240. I suggest that any terminal emulation / screen scraping efforts move to more concerted projects such as https://github.com/selectel/pyte

dcoshea mentioned this issue Jul 23, 2014

ANSI doesn't allow nonprintable characters #83

Closed

dcoshea mentioned this issue Jul 24, 2014

Exception when non-ASCII character appears in invalid ANSI escape sequence #88

Closed

jquast closed this as completed Sep 19, 2015
