Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ANSI and screen don't support Unicode #84

Closed
dcoshea opened this issue Jul 2, 2014 · 22 comments
Closed

ANSI and screen don't support Unicode #84

dcoshea opened this issue Jul 2, 2014 · 22 comments

Comments

@dcoshea
Copy link

dcoshea commented Jul 2, 2014

It would be nice if the ANSI and screen packages supported Unicode strings, or there were additional packages/classes that did. I only had to change str() to unicode() and add u in front of the quoted strings (and perhaps even some of that wasn't necessary) in screen.pretty() and put_abs() in 345eb58 to get this working quickly for me, but I didn't test all methods.

If you could suggest your preferred solution, I might be able to implement it and submit a pull request.

@takluyver
Copy link
Member

Hmmm. I suspect the simple changes would fail with non-ascii bytes. On some level, it would make sense to make it unicode only, because a screen does display text, not bytes, but that would break backwards compatibility. On Python 3, meanwhile, it probably already only accepts unicode, but quite possibly no-one has used it on Python 3 yet.

What about adding an encoding parameter & attribute to screen? It would then handle unicode internally, decoding any input before storing it. encoding would probably default to latin-1, so that arbitrary byte sequences can be handled. @jquast , thoughts?

@dcoshea
Copy link
Author

dcoshea commented Jul 3, 2014

Hmmm. I suspect the simple changes would fail with non-ascii bytes.

With the changes suggested, I am in fact reading in bytes, passing them to a CP437 decoder, passing the resulting unicode strings to ANSI, then later calling get_region() to extract the unicode string, converting it to UTF-8, and outputting it, and stuff is coming back out the way it went in, e.g.:

Welcome to CentOS for x86_64
         ┌────────────────────────┤ Scanning ├────────────────────────┐
         │                                                            │
         │ Looking for installation images on CD device /dev/sr0      │
         │                                                            │
         └────────────────────────────────────────────────────────────┘

so I think that is sort of proof it is working, at least for the methods I updated. I realize there are other methods that probably need to be changed too.

With the encoding parameter, I assume you mean that any data passed in to ANSI would automatically be decoded, and any data returned would automatically be encoded? That seems reasonable, but what would I do if I actually wanted to pass in unicode strings I'd already decoded myself and/or get them back as unicode? It just so happens I want to do the decoding myself, because the stream's encoding changes. I'm not sure if there's an easy way to obtain some kind of null/dummy encoder/decoder?

@takluyver
Copy link
Member

I was thinking that it would decode if it got bytes, and use unicode directly if it got that, and the principle output would be unicode. str(s) on Python 2 would have to convert back to bytes, though.

@dcoshea
Copy link
Author

dcoshea commented Jul 4, 2014

And we could provide __unicode__() so that you can use unicode(s) on Python 2.

This all sounds usable to me, thanks!

Should I go ahead and try to implement this myself?

@takluyver
Copy link
Member

Sure, go for it. I'm at a conference all next week, so I might not get to review it immediately, but I'll look at it soon.

@dcoshea
Copy link
Author

dcoshea commented Jul 7, 2014

I was thinking that it would decode if it got bytes, and use unicode directly if it got that, and the principle output would be unicode.

On further thought, that doesn't seem symmetrical, nor backwards-compatible. Perhaps it would be more appropriate for the principal output to be bytes, and if you want unicode, you can use unicode(screen)?

I also figured that since, in my case, I don't want any encoding/decoding to occur, I would allow the codec to be specified as "None", meaning that all functions only accept unicode and only return unicode. My code to handle this looks like this:

    def __init__ (self, r=24,c=80,codec='latin-1',codec_errors='replace'):
[...]
        if codec is not None:
            self.decoder = codecs.getdecoder(codec)
            self.encoder = codecs.getencoder(codec)
            self.codec_errors = codec_errors
        else:
            self.decoder = None
            self.encoder = None
            self.codec_errors = None
[...]
    def _decode (self, s):
        '''This converts from the external coding system (as passed to
        the constructor) to the internal one (unicode). '''
        if self.decoder is not None:
            return self.decoder (s,self.codec_errors)[0]
        else:
            return unicode(s)

    def _encode (self, s):
        '''This converts from the internal coding system (unicode) to
        the external one (as passed to the constructor). '''
        if self.encoder is not None:
            return self.encoder (s,self.codec_errors)[0]
        else:
            return unicode(s)

put_abs() can call _decode(); __str__(), dump(), pretty() etc. can call _encode().

I realize this is of no use in Python 3 where str == unicode; I'll leave it for someone else to add support for bytes instances being passed in if they need it.

Does this sound reasonable?

@takluyver
Copy link
Member

That's roughly what I was thinking, but:

  • It shouldn't try to decode if it's already unicode.
  • I would probably return unicode from methods like dump() and pretty(). It's not breaking backwards compatibility much, and we're planning to do a 4.0 release anyway, so some minor breaks in backwards compatibility are OK.
  • It shouldn't refer to unicode, because that won't work on Python 3. I would do an isinstance(s, bytes) check to decide what to do (bytes is defined as an alias for str on Python 2).

@dcoshea
Copy link
Author

dcoshea commented Jul 8, 2014

It shouldn't try to decode if it's already unicode.

I'm already making that check in the caller, although perhaps it would be better to put it into the _decode() function, I'll have a look.

I would probably return unicode from methods like dump() and pretty().

So really everything other than __str__() then, like you said originally?

When will this 4.0 release be happening?

It shouldn't refer to unicode [...]

Thanks, wasn't aware of that!

@takluyver
Copy link
Member

I would probably return unicode from methods like dump() and pretty().

So really everything other than str() then, like you said originally?

I think that makes most sense, yes. In Python 2, unicode mostly works like str anyway.

When will this 4.0 release be happening?

We don't have an exact timeframe, but the two main aims are asyncio integration, which I already have PR #69 open for, and merging Windows support (issue #17). I don't think that should take too long (famous last words).

@dcoshea
Copy link
Author

dcoshea commented Jul 8, 2014

Thanks for the info.

I realize now that, unlike a solution where screen just starts accepting unicode only, if we're going to do encoding and decoding for the user, then ANSI has to be changed too, because its write() method splits the byte sequence up into individual bytes, passing them one at a time to write_ch(). Obviously for proper support of multi-byte encodings, the decoding has to be done before write() splits it up into characters.

It seems like the scope of the work is getting a bigger than I expected. If non-backward-compatible changes are okay for a major release, would it be okay for these packages to just start treating their input as unicode and not perform decoding? I would assume it would be no more difficult for users to have to decode input before passing it to these packages than it would be for them to have to encode it after calling methods like get_region().

@takluyver
Copy link
Member

I'll try to look into it. A major release means backwards incompatible changes are acceptable, but I still prefer to avoid them if it's practical.

@jquast
Copy link
Member

jquast commented Jul 11, 2014

catching up.. I'm very familiar with these kinds of things, the tests that we have for this module are a bit weak, I'll plan to provide a cp437-encoded interface screen and work from there. New keyword parameters like encoding="latin-1" sounds good to me so far.

@dcoshea
Copy link
Author

dcoshea commented Jul 14, 2014

Thanks, are you planning to do this soon, and do you plan to allow (in Python 2) unicode instances to be passed in/out?

@jquast
Copy link
Member

jquast commented Jul 14, 2014

unicode-everywhere is definitely the intent, yes. Soon... trying my best :) I have a few things to wrap up in pexpect first, patches welcome :)

@dcoshea
Copy link
Author

dcoshea commented Jul 21, 2014

patches welcome :)

I'd like to have the fairly trivial (at least as far as I thought) pull request #89 accepted, and file some other pull requests I have piled up here, before I continue with this bigger task.

@dcoshea
Copy link
Author

dcoshea commented Jul 23, 2014

Thanks for taking care of pull request #89!

I assume that, for input passed to screen and ANSI, an incremental decoder should be used?

@jquast
Copy link
Member

jquast commented Jul 23, 2014

yes, incremental decoder must be used, thanks.

@dcoshea
Copy link
Author

dcoshea commented Jul 24, 2014

Thanks.

I'm making some progress and I think I've implemented things as desired in a way that works in Python 2. For Python 3, I gather that prior to 3.3 the u'' syntax for string literals was not available. I assume I need to implement a workaround for this, i.e. supporting only 2.6, 2.7 and 3.3+ would not be acceptable?

@takluyver
Copy link
Member

The next version of pexpect will only support python 2.6, 3.3 and above, so
you shouldn't need to work around the syntax for Unicode literals.

@dcoshea
Copy link
Author

dcoshea commented Jul 24, 2014

Great news, thanks!

dcoshea pushed a commit to dcoshea/pexpect that referenced this issue Jul 24, 2014
This commit updates the the screen and ANSI modules to support Unicode
under Python 2.x.  Under Python 3.x, it was already supported because
strings are Unicode by default.  Now, on both Python versions:

- The constructors accept a codec name (defaults to 'latin-1') and a
  scheme for handling encoding/decoding errors (defaults to
  'replace').  The codec may be set to None to inhibit
  encoding/decoding.

- Unicode is now used internally for storing the screen contents.

- Methods that accept input characters will, if passed input of type
  'bytes' (or, under Python 2.x, 'str'), use the specified codec to
  decode the input, otherwise treating it as Unicode.

- Methods that return screen contents now return Unicode, with the
  exception of __str__() under Python 2.x, and __bytes__() in all
  versions of Python, which return the screen contents encoded using
  the specified codec.

These changes are designed to work only with Python 2.6, 2.7, and 3.3
and later, specifically versions that provide both b'' and u'' string
literals.

The check in ANSI for characters being printable is also removed, as
this prevents non-ASCII characters being accepted, which is not
compatible with the goal of adding Unicode support.  This addresses
issue pexpect#88.
@dcoshea
Copy link
Author

dcoshea commented Jul 24, 2014

Filed pull request #96 with a fix for this.

dcoshea pushed a commit to dcoshea/pexpect that referenced this issue Jul 24, 2014
This commit updates the the screen and ANSI modules to support Unicode
under Python 2.x.  Under Python 3.x, it was already supported because
strings are Unicode by default.  Now, on both Python versions:

- The constructors accept a codec name (defaults to 'latin-1') and a
  scheme for handling encoding/decoding errors (defaults to
  'replace').  The codec may be set to None to inhibit
  encoding/decoding.

- Unicode is now used internally for storing the screen contents.

- Methods that accept input characters will, if passed input of type
  'bytes' (or, under Python 2.x, 'str'), use the specified codec to
  decode the input, otherwise treating it as Unicode.

- Methods that return screen contents now return Unicode, with the
  exception of __str__() under Python 2.x, and __bytes__() in all
  versions of Python, which return the screen contents encoded using
  the specified codec.

These changes are designed to work only with Python 2.6, 2.7, and 3.3
and later, specifically versions that provide both b'' and u'' string
literals.

The check in ANSI for characters being printable is also removed, as
this prevents non-ASCII characters being accepted, which is not
compatible with the goal of adding Unicode support.  This addresses
issue pexpect#83.
dcoshea pushed a commit to dcoshea/pexpect that referenced this issue Aug 5, 2014
This commit updates the the screen and ANSI modules to support Unicode
under Python 2.x.  Under Python 3.x, it was already supported because
strings are Unicode by default.  Now, on both Python versions:

- The constructors accept a codec name (defaults to 'latin-1') and a
  scheme for handling encoding/decoding errors (defaults to
  'replace').  The codec may be set to None to inhibit
  encoding/decoding.

- Unicode is now used internally for storing the screen contents.

- Methods that accept input characters will, if passed input of type
  'bytes' (or, under Python 2.x, 'str'), use the specified codec to
  decode the input, otherwise treating it as Unicode.

- Methods that return screen contents now return Unicode, with the
  exception of __str__() under Python 2.x, and __bytes__() in all
  versions of Python, which return the screen contents encoded using
  the specified codec.

These changes are designed to work only with Python 2.6, 2.7, and 3.3
and later, specifically versions that provide both b'' and u'' string
literals.

The check in ANSI for characters being printable is also removed, as
this prevents non-ASCII characters being accepted, which is not
compatible with the goal of adding Unicode support.  This addresses
issue pexpect#83.
@jquast
Copy link
Member

jquast commented Sep 19, 2015

Closing, pexpect's terminal emulation code remains next release but no longer improved, marked deprecated by #240 Suggest any terminal emulation / screen scraping code efforts moved to more concerted project efforts such as https://github.com/selectel/pyte

@jquast jquast closed this as completed Sep 19, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants