Skip to content
This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

Fix object_repr slicing in Python2 #1403

Closed
wants to merge 6 commits into from
Closed

Conversation

technic
Copy link

@technic technic commented May 5, 2019

Sorry, this is a bit of necromancy (python 2 is eol soon).
In case of utf-8 non ascii characters we can not just slice python2 strings, because characters can be two-byte. This change converts it to unicode and back for slicing.

This is a bit of necromancy. In case of utf-8 non ascii characters we can not just slice python2 strings, because characters can be two-byte.
This change converts it to unicode and back for slicing.
@int19h
Copy link
Contributor

int19h commented May 5, 2019

We can't assume it's UTF-8, though. It might not be a Unicode locale.

By default, if __repr__ returns a unicode instance, Python 2 will convert it to str using sys.getdefaultencoding(). This is usually either the default "ascii" or "utf-8", but not necessarily, so this code needs to do the same.

Also, doesn't this break on Python 3, since str wouldn't have decode()?

@technic
Copy link
Author

technic commented May 5, 2019

I think I need to add something like if python2: for this change, but I don't know the best solution for such if.

According to https://docs.python.org/2.7/reference/datamodel.html#object.__repr__ the __repr__ should return a string object. If we don't know the encoding this is bad, but anyway slicing is bad idea.

@int19h
Copy link
Contributor

int19h commented May 5, 2019

The docs are a bit vague - they don't say str, specifically, just "string object", which can imply either string type. In this case, I think that's the intent, because the implementation of repr() explicitly performs the conversion if __repr__() returns unicode. NULL arguments in the call mean that it's going to use the default string encoding for that, same as calling decode() without arguments in Python.

The version check would be something like this in general:

if sys.version_info < (3,): ...

However, for strings, we usually test the type instead, e.g. https://github.com/Microsoft/ptvsd/blob/4706f63bae590e568a5859b4e5a2a6c109ddcbdf/src/ptvsd/__main__.py#L72-L73

This works because in 2.7, bytes is an alias for str.

@fabioz, is this also the preferred approach for pydevd, given that it has to handle Jython etc as well?

@technic
Copy link
Author

technic commented May 6, 2019

Ok, thanks for the information. Now I know how to switch this code on/of for 2/3.
But can we assume something about encoding of the string, for example utf-8 or sys.getdefaultencoding(). Then we can use errors='replace' option for decode. And for encode the format should be what the other side of debugger expects, is it utf-8?

Copy link
Contributor

@fabioz fabioz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @technic, thank you very much for the pull request...

This is a bit of a tricky area, so, besides the comments I added, this pull request should have at least a test which covers the situation you're fixing (add more tests to ptvsd\src\ptvsd\_vendored\pydevd\tests_python\test_safe_repr.py).

Note that unfortunately right now those tests don't currently run automatically on the ptvsd repo (pydevd tests are only run on the pydevd repo, so, you'll have to run them manually -- if you want, you can provide the pull request in https://github.com/fabioz/PyDev.Debugger/ so that tests are run -- or you can just run it manually in your machine and I'll backport and make sure it runs on pydevd before merging here).

-- to run manually you can do:

py.test src\ptvsd\_vendored\pydevd\tests_python\test_safe_repr.py  -n auto

(note that the -n auto requires https://pypi.org/project/pytest-xdist/ to run).

Note: tests should cover the following situations:

  • Test with binary string that can't be encoded
  • Test with binary string with multi-bytes
  • Test where __repr__ returns unicode
  • Test where __repr__ returns a wrong object (say, an int) -- the debugger shouldn't crash, just show the error that repr() would've thrown.

obj_repr = unicode(obj_repr, 'utf-8', errors='replace')
yield obj_repr[:left_count].encode('utf-8')
yield '...'
yield obj_repr[-right_count:].encode('utf-8')
Copy link
Contributor

@fabioz fabioz May 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you shouldn't use utf-8 as the encoding, rather, in this situation, the encoding used should be analogous to what repr would have used to encode it if a unicode was given (which is sys.getdefaultencoding() -- I had suggested sys.stdout.encoding before, but that's not correct as repr is really dependent on sys.getdefaultencoding and not the I/O encoding).

Note: this encoding is only for the decode (the encode later on should be kept at utf-8 as it's what the debugger expects).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think decode would use sys.getdefaultencoding() by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Python 2, it does if called with no arguments. On Python 3 it defaults to UTF-8, but it should never return bytes from repr(), so we should be good.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I would like to decode from utf-8 instead of default, or at least have try to decode from default and then from utf-8. Because in PyCharm debugger works without sys.setdefaultencoding() and I didn't have any issues with __repr__() returning utf-8 bytestring so far.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unicode is a mess in Python 2 in general, so there's no single right answer here. The best we can do is try to be consistent with what the language does elsewhere. In this case, there are two possibilities.

First one is to look at the implicit conversion from unicode to str. The fact that it's there in the code means that somebody is using it. Most likely inadvertently, when repr for an object just concatenates a bunch of strings, and one of them happens to be Unicode. In this case, implicit conversion to unicode will use the default encoding, so it makes sense for repr() to decode accordingly. To correctly support code that's doing that, we need to do the same.

On the other hand, we can look at what the standard Python REPL does when it prints objects out. And it basically just dumps them to stdout as is, which implies that the expected encoding is whatever sys.stdout.encoding is. If you're on Unix, sys.stdout.encoding is almost certainly UTF-8. But on Windows, it will be the current console codepage, which is usually not UTF-8. Consequently, __repr__ that always returns UTF-8 will not print correctly there. And conversely, if we treat it as if it were UTF-8, then any library that encodes it using sys.stdout.encoding - specifically so that it prints correctly - will not display correctly in debugger. Looking around, there's some code with e.g. return (...).encode(sys.stdout.encoding or 'utf-8') out there.

And since default encoding and stdout encoding are two distinct things, this also means that __repr__ that returns unicode won't always print correctly, either. So the language isn't fully consistent with itself.

I think of these two, we should go with sys.stdout.encoding if present (and default to UTF-8 otherwise). It makes more sense, since the default encoding is almost always ASCII, and changing it is not recommended.

Would this also solve the problem in your scenario?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: Jupyter on Python 2 treats repr of outputs as UTF-8. But it also sets sys.stdout.encoding to UTF-8, so it's fully consistent with this behavior.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using sys.stdout.encoding or 'utf-8' will work in my case because I have UTF-8 everywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, looks like we can make everybody happy! :)

One slight problem with that exact idiom - I looked it up, and sys.stdout is not guaranteed to have encoding at all, if it is replaced (i.e. you might get AttributeError exception instead of None). So I think it'll need to be:

getattr(sys.stdout, 'encoding', None) or 'utf-8'

This will cover both the missing and the present-but-None case.

@technic
Copy link
Author

technic commented May 6, 2019

So should I close the pull in favor for a new one in PyDev repository?

@fabioz
Copy link
Contributor

fabioz commented May 7, 2019

So should I close the pull in favor for a new one in PyDev repository?

That's up to you... (if you decide to keep it here I'll do the backport). But you do need to add the tests to the pull request.

@technic
Copy link
Author

technic commented May 7, 2019

But you do need to add the tests to the pull request.

I was hoping that somebody from the upstream team would take over this this pull request :)
OK, I still can add tests myself later.

@technic
Copy link
Author

technic commented May 7, 2019

One question, I cannot find what is maximum line length in the style guide...

@fabioz
Copy link
Contributor

fabioz commented May 8, 2019

Ok, created #1407 to address the creation of tests before integrating this pull request.

@fabioz
Copy link
Contributor

fabioz commented May 17, 2019

I'm closing this pull request in favor of: #1429

@fabioz fabioz closed this May 17, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants