Cannot walk path on Linux/Python 2 made of non-unicode/non-fs-decodable bytes #130
Comments
@jaraco I think this can be closed alright as you mentioned in #121 (comment)
And you removed support for surrogateescape decoding in Py2 with de58c0c. In any case, this was likely not complete without the corresponding specialized encoding back to bytes when surrogates are used, which is broken in the Python 2 UTF-8 codec and yields mojibake.
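To illustrate the encoding problem mentioned above, here is a minimal sketch written for Python 3. It assumes that the `surrogatepass` error handler is a fair stand-in for what the Python 2 UTF-8 codec does with a lone surrogate; the file name used is made up for the example.

```python
# A text file name holding a lone surrogate, as produced when the raw
# byte 0xB1 is decoded with the "surrogateescape" error handler.
surrogate_text = "name\udcb1"

# Encoding with "surrogateescape" restores the original raw byte.
restored = surrogate_text.encode("utf-8", "surrogateescape")

# Encoding the surrogate as if it were an ordinary code point (used here
# as an assumed approximation of the Python 2 codec) produces three
# bytes that no longer match the name on disk.
py2_like = surrogate_text.encode("utf-8", "surrogatepass")

print(repr(restored))
print(repr(py2_like))
```

The two results differ, so a name decoded with surrogates and then encoded by a codec that is unaware of them would point at a file that does not exist.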
Actually you could even remove your surrogateescape decode handler from the code entirely since it is not used anymore. And you could state in the documentation that when using Python 2 on Linux the input should be bytes and never unicode (and on POSIX in general, except maybe for macOS).
I'm not sure I follow. I worry a little bit that this choice I've made--that path.py should be able to treat all file system paths as text--is violating the user's expectation. I'm open to suggestions here, but I do still feel that users of a library like this should have fairly constrained interfaces (i.e. always text) and try to abstract away the nuances of edge cases in file systems, so I'm okay with this library being inadequate for universal file handling as long as it handles 99.9% of the cases where non-ascii file names are present. I think that's where the library is now, but I'm unsure.
On Python 2, on Linux with a file like this
This is where you are alright IMHO, i.e. in the 99.9% case... The 0.1% case is something that I am hitting frequently enough to be worried about.
FWIW, if you ever want to fix this issue, Pi's @pjdelport Python2 backport of
@pombredanne I'd welcome supporting this - as long as it doesn't substantially compromise the primary use case. Here are compromises I'd like to avoid:
If you would be willing to draft a PR with these goals in mind, I'd be glad to review and help shepherd it to a release.
Adopting
Vendoring would likely be possible since it is small enough: https://github.com/pjdelport/backports.os/blob/master/src/backports/os.py e.g. ~200 lines including comments. TBD though.
I'd like to explore this further, but with the aim of not wasting your time if it turns out to be too involved. Could you possibly provide a partial PR (or just a description) with a sample of the kinds of changes this effort would involve? And importantly - which of those changes would be required if Python 3 could be relied upon?
@jaraco I ended up implementing support for bytes-only paths on Linux in scancode-toolkit (not using path.py though of course, since it could not handle this). IMHO, you will need to support both unicode and byte paths everywhere for this... which is likely engaged. The gist of it is this kind of thing for me in ScanCode: each time that a path is processed, and at the boundary, I check if we are on Linux or not (which is not a great way). Then essentially store the path either as unicode or bytes depending on which OS I am on and I use the
This is a rather ugly approach and I would have rather used unicode throughout, but this was even more work. This is super intrusive as this happens at all the boundaries and each time a location needs to be manipulated as a string or joined, etc., so there were many places impacted. In the case of path.py and since
So checking out the code of path.py, it feels like
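The boundary conversion described in this comment could be sketched roughly as follows. This is not ScanCode's actual code: `to_native_path`, the `ON_LINUX` flag, and the UTF-8 default are hypothetical names and assumptions made only for illustration.

```python
import sys

# Hypothetical flag: keep paths as bytes on Linux, as text elsewhere.
ON_LINUX = sys.platform.startswith("linux")


def to_native_path(path, on_linux=ON_LINUX, encoding="utf-8"):
    """Return `path` as bytes on Linux and as text on other platforms.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    if on_linux:
        # Bytes pass through untouched, so undecodable names survive.
        if isinstance(path, bytes):
            return path
        return path.encode(encoding)
    # Elsewhere, normalize everything to text.
    if isinstance(path, bytes):
        return path.decode(encoding)
    return path
```

Every function that accepts or returns a location would have to call a helper like this, which is why the approach touches so many places.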
I tried replicating your description to understand the problem better. I created this dockerfile:
Annoyingly, I had to use the
Running that Dockerfile, I get this output:
Indeed, we can see Python is creating "invalid character" surrogates when it can't decode the character 0xB1 (0o261, 177). And furthermore, everything just works in Python 3:
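The original output is not reproduced here. As a minimal sketch of the behaviour described, assuming a UTF-8 file system encoding and a made-up file name, the Python 3 decoding step looks like this (`os.fsdecode` applies the same error handler on POSIX using the actual file system encoding):

```python
# Raw file name bytes; 0xB1 on its own is not valid UTF-8.
raw_name = b"foo\xb1bar"

# Python 3 maps the undecodable byte to a lone surrogate (PEP 383).
text_name = raw_name.decode("utf-8", "surrogateescape")

# The escaped byte 0xB1 appears as code point U+DCB1.
escaped_code_point = ord(text_name[3])

# A strict encode refuses the surrogate; only the matching error
# handler can turn the text back into the original bytes.
try:
    text_name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

round_tripped = text_name.encode("utf-8", "surrogateescape")
```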
So I then switch to see where things fail on Python 2 with this new Dockerfile:
And I'm able to replicate one of the concerns:
@pombredanne Would you help me test with
I'm looking to drop support for Python 2, so if this technique is viable, I'd like to roll it out first. I'm eager for any feedback you have.
On Linux/Python 2/path.py 10.3.1 I am trying to `walkfiles()` a path that contains a file name which is raw bytes. What is specific about this path is that it is not in the fs.encoding (which is UTF-8) and therefore cannot be decoded to unicode as-is, unless I guess surrogate escapes are used or something else.
With `os.walk` it works when the top is `bytes`, but fails if the top is `unicode`.
With `path.py` it fails both when using a `bytes` or `unicode` top input.
You can see the tests here: https://github.com/nexB/scancode-toolkit/pull/723/files#diff-ada144052a705a1e2fc3c96a033cc425R552
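For reference, the `os.walk` behaviour with a bytes top can be sketched as below. `walk_files_as_bytes` is a hypothetical helper, not part of path.py, and encoding a text top as UTF-8 is an assumption.

```python
import os


def walk_files_as_bytes(top):
    """Yield every file path under `top` as bytes, never decoding names.

    Hypothetical helper; a text `top` is assumed to be UTF-8 encodable.
    """
    if not isinstance(top, bytes):
        top = top.encode("utf-8")
    # With a bytes top, os.walk yields bytes directory and file names,
    # so names that are not valid UTF-8 pass through unchanged.
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Because nothing is ever decoded, this works for any file name the file system accepts, at the cost of pushing bytes handling onto the caller.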
And the test failures are visible here: