Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot walk path on Linux/Python 2 made of non-unicode/non-fs-decodable bytes #130

Closed
pombredanne opened this issue Aug 15, 2017 · 13 comments

Comments

@pombredanne
Copy link

pombredanne commented Aug 15, 2017

On Linux/Python 2/path.py 10.3.1 I am trying to walkfiles() a path that contains a file name which is raw bytes. The specific of this path is that it is not in the fs.encoding (which is UTF-8) and therefore cannot be decoded to unicode as-is, unless I guess surrogate escape are used or something else.

With os.walk it works when the top is bytes, but fails if the top is unicode.

With path.py it fails both when using a bytes or unicode top input.
You can see the tests here: https://github.com/nexB/scancode-toolkit/pull/723/files#diff-ada144052a705a1e2fc3c96a033cc425R552

And the test failures are visible here:

@pombredanne
Copy link
Author

@jaraco I think this can be closed alright as you mentioned in #121 (comment)

In reviewing #61, I see that the "fix" I put in place only deferred the failure until a later point. In other words, there's no value in attempting to support undecodable paths. Path.py has no interest in supporting non-text paths, especially on old and broken Python versions. Therefore, I'm going to drop this crufty workaround.

And you removed support for surrogateescape decoding in Py2 with de58c0c

In anycase this was likely not complete without the corresponding specialized encoding back to bytes when surrogates are used which is broken on Python2 UTF-8 codec yielding mojibake.

@pombredanne
Copy link
Author

Actually you could even remove your surrogateescape decode handler from the code entirely since this is not used anymore. And you could state in the documentation that when using Python2 and Linux the input should be bytes and never unicode (and POSIX in general except may be for macOS).

@jaraco
Copy link
Owner

jaraco commented Aug 27, 2017

And you could state in the documentation that when using Python2 and Linux [or POSIX] the input should be bytes and never unicode.

I'm not sure I follow. path.py requires its inputs to be unicode and never bytes and requires the system to be configured in such a way that all file system paths managed by path.py can be reliably encoded and decoded with the filesystemencoding.

I worry a little bit that this choice I've made--that path.py should be able to treat all file system paths as text--is violating the user's expectation.

I'm open to suggestions here, but I do still feel that users of a library like this should have fairly constrained interfaces (i.e. always text) and try to abstract away the nuances of edge cases in file systems, so I'm okay with this library being inadequate for universal file handling as long as it handles 99.9% of the cases where non-ascii file names are present. I think that's where the library is now, but I'm unsure.

@pombredanne
Copy link
Author

I worry a little bit that this choice I've made--that path.py should be able to treat all file system paths as text--is violating the user's expectation.

On Python 2, on Linux with a file like this $ touch tmp/foo$'\261'bar you cannot get path.py to list this file from a unicode input (say of its parent directory tmp). These are the rare enough (but enough to bite me) cases where the filename is made of bytes that cannot be decoded using the filesystemencoding short of using surrogate pairs in the common case of an UTF-8 FS encoding.
To make this working properly this requires the use of something similar of the Python 3 os.fsencode and fs.decode at the boundaries which is a tad engaged to get right on Python 2 as this touches pretty much everything (and is essentially what has been done in the Python 3 code)

so I'm okay with this library being inadequate for universal file handling as long as it handles 99.9% of the cases where non-ascii file names are present. I think that's where the library is now, but I'm unsure.

This is where you are alright IMHO e.g. in the 99.9% case... The 0.1% case is something that I am hitting frequently enough to be worried about.

@pombredanne
Copy link
Author

pombredanne commented Feb 11, 2018

FWIW, if you ever want to fix this issues, Pi's @pjdelport Python2 backport of os.fsencode and os.fsdecode at https://github.com/pjdelport/backports.os is the one library that works and get it right: it is battle tested on a few billion real world files

@jaraco jaraco reopened this Feb 11, 2018
@jaraco
Copy link
Owner

jaraco commented Feb 11, 2018

@pombredanne I'd welcome supporting this - as long as it doesn't substantially compromise the primary use case. Here are compromises I'd like to avoid:

  • Requiring other packages or modules. Currently, the path module is standalone and can be copied into any project and still work. Any dependencies are optional. It would be acceptable for backports.os:python_version<"3.2" as a dependency as long as it degrades gracefully in the implementation if the dependency isn't present.
  • Many touch points in the code. Consider, for example, Listdir performance optimization - separate class #120, where a performance enhancement was proposed, but where I was loathe to incorporate a swath of changes across 10 methods, adding a cognitive burden to the basic behavior and eliminating the simpler, degenerate, but less efficient behavior. If os.fsdecode/encode behavior can be a mix-in class or otherwise be non-intrusive to the basic concepts, then we're in good shape.

If you would be willing to draft a PR with these goals in mind, I'd be glad to review and help shepherd it to a release.

@pombredanne
Copy link
Author

Adopting os.fsdecode/fsencode is a rather invasive change based on my experience (and you could see also the many places where this required changes in the Python 3 stdlib as another example). So IMHO it would only make sense if it was vendored rather than an optional lib. It may be possible to use some mixin, but that remains to be seen.

@pombredanne
Copy link
Author

Vendoring would likely be possible since it is small enough : https://github.com/pjdelport/backports.os/blob/master/src/backports/os.py e.g. ~200 lines including comments. TBD though.

@jaraco
Copy link
Owner

jaraco commented Jun 5, 2018

I'd like to explore this further, but with the aim of not wasting your time if it turns out to be too involved.

Could you possible provide a partial PR (or just description) with a sample of the kinds of changes this effort would involve? And importantly - which of those changes would be required if Python 3 could be relied upon?

@pombredanne
Copy link
Author

@jaraco I ended up implementing support for bytes-only paths on Linux in scancode-toolkit (not using path.py though of course, since it could not handle this).

IMHO, you will need to support both unicode and byte paths everywhere for this... which is likely engaged.

The gist of it is this kind of thing for me in ScanCode: each time that a path is processed, and at the boundary, I check if we are on Linux or not (which is not a great way). Then essentially store the path either as unicode or bytes dependening on which OS I am on and I use the backports.os from @pjdelport as the codec.
See https://github.com/nexB/scancode-toolkit/blob/e7e6bebbf8260dcb1b367913e2a2f6c12364b284/src/scancode/resource.py#L157

if on_linux:
    location = fsencode(location)
else:
    location = fsdecode(location)

This is a rather ugly approach and I would have rather used unicode throughout, but this was even more work. This is super intrusive as this happens at all the boundaries and each time a location needs to be manipulated as a string or joined, etc. so there were many places impacted.

In the case of path.py and since Path extends unicode on Python2, I would not know where to start...
but the key to something somehow working is to always use fs-encoded byte paths when accessing the fs on Linux and always use Unicode on Windows and Mac: so I am not sure your str/unicode subclass could also behave as bytes.

@pombredanne
Copy link
Author

pombredanne commented Jun 5, 2018

So checking out the code of path.py, it feels like self needs to behave either as unicode in general or as fs-encoded bytes on Linux and OSes with a bytes path.
I cannot fathom how to make this work. And any place where you handle a path unicode string and some other string today would likely be touched. And I not even sure you could sanely subclass unicode anymore.

@jaraco
Copy link
Owner

jaraco commented Sep 15, 2018

On Python 2, on Linux with a file like this $ touch tmp/foo$'\261'bar you cannot get path.py to list this file from a unicode input (say of its parent directory tmp).

I tried replicating your description to understand the problem better. I created this dockerfile:

from ubuntu:bionic
RUN apt update
RUN apt install -y python3 python3-pip
RUN ln -s $(which python3) /usr/local/bin/python
RUN python -m pip install path.py
RUN touch $(echo tmp/foo$'\261'bar)
CMD python -c 'import path; print(path.Path("tmp").listdir())'

Annoyingly, I had to use the touch $(echo ...) workaround because Docker munges the syntax otherwise. Even with that, it ends up with an extra dollar sign in the filename, but oh well.

Running that Dockerfile, I get this output:

path.py master $ docker build -q . -t testing/pathpy; docker run -it testing/pathpy
sha256:f713408861680a3706dacd84c4bb8f0457df56fe5d5be95eaa9a24b5899e53ae
[Path('tmp/foo$\udcb1bar')]

Indeed, we can see Python is creating "invalid character" surrogates when it can't decode the character 0xB1 (0o261, 177).

And furthermore, everything just works in Python 3:

root@b6b7acf43cc9:/# python
Python 3.6.5 (default, Apr  1 2018, 05:46:30)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import path
>>> f = path.Path('tmp').listdir()[0]
>>> f.write_text('foo')
>>> f.text()
'foo'

So I then switch to see where things fail on Python 2 with this new Dockerfile:

from ubuntu:bionic
RUN apt update
RUN apt install -y python python-pip
RUN python -m pip install path.py
RUN touch $(echo tmp/foo$'\261'bar)
CMD python -c 'import path; print(path.Path(u"tmp").listdir())'

And I'm able to replicate one of the concerns:

path.py master $ docker build -q . -t testing/pathpy; docker run -it testing/pathpy
sha256:4081ef4361928243682bc6aaea4da3b0bc4e9370f798106f3dbb9af9eaf4551b
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/path.py", line 557, in listdir
    if self._next_class(child).fnmatch(pattern)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb1 in position 4: ordinal not in range(128)

@jaraco
Copy link
Owner

jaraco commented Sep 15, 2018

@pombredanne Would you help me test with pip install git+https://github.com/jaraco/path.py@bugfix/130-unicode-bytes-on-linux-python2? Tests seem to indicate the technique is working (given that I was able to remove the xfail test). I'm hoping this OS class can be extended to monkey-patch the interface boundaries necessary to give Python 2 users a more robust experience.

I'm looking to drop support for Python 2, so if this technique is viable, I'd like to roll it out first. I'm eager for any feedback you have.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants