Fix object_repr slicing in Python2 #1403

technic · 2019-05-05T00:55:27Z

Sorry, this is a bit of necromancy (Python 2 reaches EOL soon).
With non-ASCII UTF-8 text we cannot simply slice Python 2 strings, because a character can span more than one byte. This change converts the string to unicode for slicing and back afterwards.
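
To illustrate the problem, here is a minimal sketch (not code from this pull request). It uses a Python 3 bytes object to stand in for a Python 2 str, and the sample text is made up:

```python
# A bytes object stands in for a Python 2 str holding UTF-8 data.
text = u"\u00e9" * 6            # six characters, two bytes each in UTF-8
raw = text.encode("utf-8")      # 12 bytes in total

# Slicing the bytes directly can cut a character in half.
broken = raw[:5]
try:
    broken.decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

# Decoding first, slicing characters, then encoding again stays valid.
safe = raw.decode("utf-8")[:5].encode("utf-8")
```

The byte-level slice ends in the middle of the third character, while the character-level slice keeps five whole characters.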


int19h · 2019-05-05T11:00:47Z

We can't assume it's UTF-8, though. It might not be a Unicode locale.

By default, if __repr__ returns a unicode instance, Python 2 will convert it to str using sys.getdefaultencoding(). This is usually either the default "ascii" or "utf-8", but not necessarily, so this code needs to do the same.

Also, doesn't this break on Python 3, since str wouldn't have decode()?

technic · 2019-05-05T11:09:01Z

I think I need to add something like if python2: for this change, but I don't know the best way to write such a check.

According to https://docs.python.org/2.7/reference/datamodel.html#object.__repr__ the __repr__ should return a string object. Not knowing the encoding is bad, but slicing is a bad idea either way.

int19h · 2019-05-05T21:58:38Z

The docs are a bit vague - they don't say str, specifically, just "string object", which can imply either string type. In this case, I think that's the intent, because the implementation of repr() explicitly performs the conversion if __repr__() returns unicode. NULL arguments in the call mean that it's going to use the default string encoding for that, same as calling encode() without arguments in Python.

The version check would be something like this in general:

if sys.version_info < (3,): ...

However, for strings, we usually test the type instead, e.g. https://github.com/Microsoft/ptvsd/blob/4706f63bae590e568a5859b4e5a2a6c109ddcbdf/src/ptvsd/__main__.py#L72-L73

This works because in 2.7, bytes is an alias for str.
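
As a rough sketch of the type test (the helper name to_text is made up here, and this is not the code behind the link above):

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it only if it is a byte string."""
    # On Python 2.7 bytes is str, so this branch handles plain str values;
    # on Python 3 it only triggers for real bytes objects.
    if isinstance(value, bytes):
        return value.decode(encoding, "replace")
    return value
```

The same function body then runs unchanged on both major versions, with no explicit version check.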

@fabioz, is this also the preferred approach for pydevd, given that it has to handle Jython etc as well?

technic · 2019-05-06T08:08:36Z

Ok, thanks for the information. Now I know how to switch this code on/off for 2/3.
But can we assume something about the encoding of the string, for example utf-8 or sys.getdefaultencoding()? Then we can use the errors='replace' option for decode. And for encode, the format should be what the other side of the debugger expects. Is that utf-8?

fabioz

Hi @technic, thank you very much for the pull request...

This is a bit of a tricky area, so, besides the comments I added, this pull request should have at least a test which covers the situation you're fixing (add more tests to ptvsd\src\ptvsd\_vendored\pydevd\tests_python\test_safe_repr.py).

Note that unfortunately those tests don't currently run automatically on the ptvsd repo: pydevd tests are only run on the pydevd repo, so you'll have to run them manually. If you want, you can provide the pull request in https://github.com/fabioz/PyDev.Debugger/ so that the tests are run, or you can just run them manually on your machine and I'll backport and make sure it runs on pydevd before merging here.

To run them manually you can do:

py.test src\ptvsd\_vendored\pydevd\tests_python\test_safe_repr.py  -n auto

(note that the -n auto requires https://pypi.org/project/pytest-xdist/ to run).

Note: tests should cover the following situations:

Test with binary string that can't be encoded
Test with binary string with multi-bytes
Test where __repr__ returns unicode
Test where __repr__ returns a wrong object (say, an int) -- the debugger shouldn't crash, just show the error that repr() would've thrown.
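
The last situation can be sketched as follows. This is only an illustration: describe is a hypothetical stand-in and not the pydevd safe_repr API, and the BadRepr class is made up for the example.

```python
class BadRepr(object):
    def __repr__(self):
        return 42  # wrong type on purpose: repr() raises TypeError


def describe(obj):
    """Hypothetical helper: return repr(obj), or the error repr() raised."""
    try:
        return repr(obj)
    except Exception as exc:
        # Report the failure as text instead of letting it propagate.
        return "<%s: %s>" % (type(exc).__name__, exc)


result = describe(BadRepr())
```

A test along these lines would assert that the result mentions the TypeError rather than letting the exception escape.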

src/ptvsd/_vendored/pydevd/_pydevd_bundle/pydevd_safe_repr.py

fabioz · 2019-05-06T10:59:21Z


+            obj_repr = unicode(obj_repr, 'utf-8', errors='replace')
+            yield obj_repr[:left_count].encode('utf-8')
+            yield '...'
+            yield obj_repr[-right_count:].encode('utf-8')


I think that you shouldn't use utf-8 as the encoding here. The encoding should be analogous to what repr would have used to encode the value if a unicode had been returned, which is sys.getdefaultencoding(). (I had suggested sys.stdout.encoding before, but that's not correct, as repr really depends on sys.getdefaultencoding() and not on the I/O encoding.)

Note: this encoding is only for the decode (the encode later on should be kept at utf-8 as it's what the debugger expects).

technic · 2019-05-06

I think decode would use sys.getdefaultencoding() by default.

int19h · 2019-05-06

On Python 2, it does if called with no arguments. On Python 3 it defaults to UTF-8, but it should never return bytes from repr(), so we should be good.

technic · 2019-05-07

Actually I would like to decode from utf-8 instead of the default, or at least try to decode from the default first and then from utf-8. In PyCharm the debugger works without sys.setdefaultencoding(), and I haven't had any issues with __repr__() returning a utf-8 bytestring so far.

int19h · 2019-05-07

Unicode is a mess in Python 2 in general, so there's no single right answer here. The best we can do is try to be consistent with what the language does elsewhere. In this case, there are two possibilities.

First one is to look at the implicit conversion from unicode to str. The fact that it's there in the code means that somebody is using it. Most likely inadvertently, when repr for an object just concatenates a bunch of strings, and one of them happens to be Unicode. In this case, implicit conversion to unicode will use the default encoding, so it makes sense for repr() to decode accordingly. To correctly support code that's doing that, we need to do the same.

On the other hand, we can look at what the standard Python REPL does when it prints objects out. And it basically just dumps them to stdout as is, which implies that the expected encoding is whatever sys.stdout.encoding is. If you're on Unix, sys.stdout.encoding is almost certainly UTF-8. But on Windows, it will be the current console codepage, which is usually not UTF-8. Consequently, __repr__ that always returns UTF-8 will not print correctly there. And conversely, if we treat it as if it were UTF-8, then any library that encodes it using sys.stdout.encoding - specifically so that it prints correctly - will not display correctly in debugger. Looking around, there's some code with e.g. return (...).encode(sys.stdout.encoding or 'utf-8') out there.
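
A small sketch of that mismatch, with a made-up sample word and code page 1252 assumed as the console encoding:

```python
word = u"caf\u00e9"

# A __repr__ that encodes its result for a console using code page 1252.
console_bytes = word.encode("cp1252")

# Decoding with the matching encoding recovers the text.
matching = console_bytes.decode("cp1252")

# Assuming UTF-8 instead loses the non-ASCII character: the lone 0xE9 byte
# is not valid UTF-8, so errors="replace" substitutes U+FFFD.
assumed_utf8 = console_bytes.decode("utf-8", "replace")
```

So a debugger that always assumes UTF-8 would show a replacement character where the console itself prints the text correctly.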

And since default encoding and stdout encoding are two distinct things, this also means that __repr__ that returns unicode won't always print correctly, either. So the language isn't fully consistent with itself.

I think of these two, we should go with sys.stdout.encoding if present (and default to UTF-8 otherwise). It makes more sense, since the default encoding is almost always ASCII, and changing it is not recommended.

Would this also solve the problem in your scenario?

Side note: Jupyter on Python 2 treats repr of outputs as UTF-8. But it also sets sys.stdout.encoding to UTF-8, so it's fully consistent with this behavior.

technic · 2019-05-07

I think using sys.stdout.encoding or 'utf-8' will work in my case because I have UTF-8 everywhere.

int19h · 2019-05-07

Perfect, looks like we can make everybody happy! :)

One slight problem with that exact idiom - I looked it up, and sys.stdout is not guaranteed to have encoding at all, if it is replaced (i.e. you might get AttributeError exception instead of None). So I think it'll need to be:

getattr(sys.stdout, 'encoding', None) or 'utf-8'

This will cover both the missing and the present-but-None case.
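
Putting the pieces of this thread together, a minimal sketch might look like the following. The names repr_encoding and truncate_repr are hypothetical and this is not the code that was eventually merged. It decodes with the stdout encoding (falling back to UTF-8), slices characters, and encodes the result as UTF-8 for the debugger.

```python
import sys


def repr_encoding(stream=None):
    """Pick the encoding used to decode a byte-string repr."""
    stream = sys.stdout if stream is None else stream
    # Covers a replaced stdout with no 'encoding' attribute and encoding=None.
    return getattr(stream, "encoding", None) or "utf-8"


def truncate_repr(obj_repr, left_count, right_count, encoding=None):
    """Shorten a repr to its first and last characters (right_count > 0)."""
    if isinstance(obj_repr, bytes):
        text = obj_repr.decode(encoding or repr_encoding(), "replace")
        # Slice characters, not bytes, then encode as UTF-8 for the debugger.
        return (text[:left_count].encode("utf-8") + b"..." +
                text[-right_count:].encode("utf-8"))
    return obj_repr[:left_count] + "..." + obj_repr[-right_count:]
```

Taking the stream as a parameter keeps the fallback logic easy to exercise with stand-in objects in tests.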

technic · 2019-05-06T17:10:13Z

So should I close this pull request in favor of a new one in the PyDev repository?

fabioz · 2019-05-07T11:27:39Z

> So should I close this pull request in favor of a new one in the PyDev repository?

That's up to you... (if you decide to keep it here I'll do the backport). But you do need to add the tests to the pull request.

technic · 2019-05-07T16:40:07Z

> But you do need to add the tests to the pull request.

I was hoping that somebody from the upstream team would take over this pull request :)
OK, I can still add the tests myself later.

technic · 2019-05-07T20:40:16Z

One question: I cannot find what the maximum line length is in the style guide...

fabioz · 2019-05-08T13:35:04Z

Ok, created #1407 to address the creation of tests before integrating this pull request.

fabioz · 2019-05-17T13:50:24Z

I'm closing this pull request in favor of: #1429

Fix object_repr slicing in Python2

9fe3b08

This is a bit of necromancy. In case of utf-8 non ascii characters we can not just slice python2 strings, because characters can be two-byte. This change converts it to unicode and back for slicing.

Apply obj_repr unicode fix only for python 2

d655928

fabioz suggested changes May 6, 2019


technic added 2 commits May 6, 2019 19:11

Fix typo

4320633

Don't use unicode for Python 3 compatibility

de2b446

Use sys.stdout.encoding or 'utf-8' for decoding

e34e94b

Check that sys.stdout has encoding attribute

c859142

fabioz mentioned this pull request May 14, 2019

Create tests and integrate #1403 #1407

Closed

fabioz closed this May 17, 2019
