-
Notifications
You must be signed in to change notification settings - Fork 478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ANSI and screen don't support Unicode #84
Comments
Hmmm. I suspect the simple changes would fail with non-ascii bytes. On some level, it would make sense to make it unicode only, because a screen does display text, not bytes, but that would break backwards compatibility. On Python 3, meanwhile, it probably already only accepts unicode, but quite possibly no-one has used it on Python 3 yet. What about adding an encoding parameter & attribute to |
With the changes suggested, I am in fact reading in bytes, passing them to a CP437 decoder, passing the resulting unicode strings to ANSI, then later calling get_region() to extract the unicode string, converting it to UTF-8, and outputting it, and stuff is coming back out the way it went in, e.g.:
so I think that is sort of proof it is working, at least for the methods I updated. I realize there are other methods that probably need to be changed too. With the encoding parameter, I assume you mean that any data passed in to ANSI would automatically be decoded, and any data returned would automatically be encoded? That seems reasonable, but what would I do if I actually wanted to pass in unicode strings I'd already decoded myself and/or get them back as unicode? It just so happens I want to do the decoding myself, because the stream's encoding changes. I'm not sure if there's an easy way to obtain some kind of null/dummy encoder/decoder? |
I was thinking that it would decode if it got bytes, and use unicode directly if it got that, and the principle output would be unicode. |
And we could provide This all sounds usable to me, thanks! Should I go ahead and try to implement this myself? |
Sure, go for it. I'm at a conference all next week, so I might not get to review it immediately, but I'll look at it soon. |
On further thought, that doesn't seem symmetrical, nor backwards-compatible. Perhaps it would be more appropriate for the principal output to be bytes, and if you want unicode, you can use I also figured that since, in my case, I don't want any encoding/decoding to occur, I would allow the codec to be specified as "None", meaning that all functions only accept unicode and only return unicode. My code to handle this looks like this:
I realize this is of no use in Python 3 where str == unicode; I'll leave it for someone else to add support for Does this sound reasonable? |
That's roughly what I was thinking, but:
|
I'm already making that check in the caller, although perhaps it would be better to put it into the _decode() function, I'll have a look.
So really everything other than When will this 4.0 release be happening?
Thanks, wasn't aware of that! |
I think that makes most sense, yes. In Python 2, unicode mostly works like str anyway.
We don't have an exact timeframe, but the two main aims are asyncio integration, which I already have PR #69 open for, and merging Windows support (issue #17). I don't think that should take too long (famous last words). |
Thanks for the info. I realize now that, unlike a solution where It seems like the scope of the work is getting a bigger than I expected. If non-backward-compatible changes are okay for a major release, would it be okay for these packages to just start treating their input as unicode and not perform decoding? I would assume it would be no more difficult for users to have to decode input before passing it to these packages than it would be for them to have to encode it after calling methods like |
I'll try to look into it. A major release means backwards incompatible changes are acceptable, but I still prefer to avoid them if it's practical. |
catching up.. I'm very familiar with these kinds of things, the tests that we have for this module are a bit weak, I'll plan to provide a cp437-encoded interface screen and work from there. New keyword parameters like encoding="latin-1" sounds good to me so far. |
Thanks, are you planning to do this soon, and do you plan to allow (in Python 2) |
unicode-everywhere is definitely the intent, yes. Soon... trying my best :) I have a few things to wrap up in pexpect first, patches welcome :) |
I'd like to have the fairly trivial (at least as far as I thought) pull request #89 accepted, and file some other pull requests I have piled up here, before I continue with this bigger task. |
Thanks for taking care of pull request #89! I assume that, for input passed to screen and ANSI, an incremental decoder should be used? |
yes, incremental decoder must be used, thanks. |
Thanks. I'm making some progress and I think I've implemented things as desired in a way that works in Python 2. For Python 3, I gather that prior to 3.3 the u'' syntax for string literals was not available. I assume I need to implement a workaround for this, i.e. supporting only 2.6, 2.7 and 3.3+ would not be acceptable? |
The next version of pexpect will only support python 2.6, 3.3 and above, so |
Great news, thanks! |
This commit updates the the screen and ANSI modules to support Unicode under Python 2.x. Under Python 3.x, it was already supported because strings are Unicode by default. Now, on both Python versions: - The constructors accept a codec name (defaults to 'latin-1') and a scheme for handling encoding/decoding errors (defaults to 'replace'). The codec may be set to None to inhibit encoding/decoding. - Unicode is now used internally for storing the screen contents. - Methods that accept input characters will, if passed input of type 'bytes' (or, under Python 2.x, 'str'), use the specified codec to decode the input, otherwise treating it as Unicode. - Methods that return screen contents now return Unicode, with the exception of __str__() under Python 2.x, and __bytes__() in all versions of Python, which return the screen contents encoded using the specified codec. These changes are designed to work only with Python 2.6, 2.7, and 3.3 and later, specifically versions that provide both b'' and u'' string literals. The check in ANSI for characters being printable is also removed, as this prevents non-ASCII characters being accepted, which is not compatible with the goal of adding Unicode support. This addresses issue pexpect#88.
Filed pull request #96 with a fix for this. |
This commit updates the the screen and ANSI modules to support Unicode under Python 2.x. Under Python 3.x, it was already supported because strings are Unicode by default. Now, on both Python versions: - The constructors accept a codec name (defaults to 'latin-1') and a scheme for handling encoding/decoding errors (defaults to 'replace'). The codec may be set to None to inhibit encoding/decoding. - Unicode is now used internally for storing the screen contents. - Methods that accept input characters will, if passed input of type 'bytes' (or, under Python 2.x, 'str'), use the specified codec to decode the input, otherwise treating it as Unicode. - Methods that return screen contents now return Unicode, with the exception of __str__() under Python 2.x, and __bytes__() in all versions of Python, which return the screen contents encoded using the specified codec. These changes are designed to work only with Python 2.6, 2.7, and 3.3 and later, specifically versions that provide both b'' and u'' string literals. The check in ANSI for characters being printable is also removed, as this prevents non-ASCII characters being accepted, which is not compatible with the goal of adding Unicode support. This addresses issue pexpect#83.
This commit updates the the screen and ANSI modules to support Unicode under Python 2.x. Under Python 3.x, it was already supported because strings are Unicode by default. Now, on both Python versions: - The constructors accept a codec name (defaults to 'latin-1') and a scheme for handling encoding/decoding errors (defaults to 'replace'). The codec may be set to None to inhibit encoding/decoding. - Unicode is now used internally for storing the screen contents. - Methods that accept input characters will, if passed input of type 'bytes' (or, under Python 2.x, 'str'), use the specified codec to decode the input, otherwise treating it as Unicode. - Methods that return screen contents now return Unicode, with the exception of __str__() under Python 2.x, and __bytes__() in all versions of Python, which return the screen contents encoded using the specified codec. These changes are designed to work only with Python 2.6, 2.7, and 3.3 and later, specifically versions that provide both b'' and u'' string literals. The check in ANSI for characters being printable is also removed, as this prevents non-ASCII characters being accepted, which is not compatible with the goal of adding Unicode support. This addresses issue pexpect#83.
Closing, pexpect's terminal emulation code remains next release but no longer improved, marked deprecated by #240 Suggest any terminal emulation / screen scraping code efforts moved to more concerted project efforts such as https://github.com/selectel/pyte |
It would be nice if the ANSI and screen packages supported Unicode strings, or there were additional packages/classes that did. I only had to change
str()
tounicode()
and addu
in front of the quoted strings (and perhaps even some of that wasn't necessary) inscreen.pretty()
andput_abs()
in 345eb58 to get this working quickly for me, but I didn't test all methods.If you could suggest your preferred solution, I might be able to implement it and submit a pull request.
The text was updated successfully, but these errors were encountered: