-
-
Notifications
You must be signed in to change notification settings - Fork 146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Chokes up on valid filenames #61
Comments
What is the proper thing to do in this case? I was able to replicate your error, but I observed some things.
These findings lead me to believe that the filename in the file system is in fact incorrectly encoded and invalid. In my mind, there are only two reasonable and portable encodings suitable for filenames - sys.getdefaultencoding() and sys.getfilesystemencoding(). The current implementation uses the former (implicitly). It would be reasonable, I think, to instead use sys.getfilesystemencoding to decode the filenames, but that still wouldn't address the issue as reported above. I'm interested to know, though, what you would recommend in this case. Do you see that filename as valid? If so, how should Python know how to decode it? Should path.py use surrogate escapes, the same way as Python 3 does? Let me state up front that it is the opinion of path.py that paths must be text (Unicode). path.py is not interested in dealing with arbitrary byte strings. |
That is a perfectly respectable position, it is why I'm turning to a different library rather than attempting to fix (or get you to fix) yours.
The 'file system encoding' is more of a hint than anything else. Paths on POSIX are bytes, and this is the only representation that is safe. More over, it is user-specific, so another user on the same machine could totally be using latin-1 while you expect UTF-8 (like in my example). While it is probably acceptable to get back weird things for these (like Python 3), being unable to listdir at all seems a bit much (and can lead to DoS attacks).
Yes, Python 3 is able to handle all these paths safely. Rather than using bytes, they chose to roll up an augmented unicode representation that can account for non-character sequences. This allows to have unicode in most situations while still being able to represent every pathological case. FYI, personally I'll be using Unipath, which doesn't encode/decode at all and just uses native representations (bytes on Windows, unicode on POSIX). |
Please give path.py 5.2 a try - I believe it will work, at least for os.listdir. I did briefly scan through the other methods, but most of those won't be subject to the issue because they're already unicode. |
Yes, it works -- it returns a filename with a surrogate in it. Of course os.path doesn't accept these, so:
|
path.path.listdir() can fail with UnicodeDecodeError.
On Linux with Python 2:
I know you're not the only project out there that gets it wrong, but in my quest for something sensible I did test yours, so I thought I'd let you know. My search goes on!
The text was updated successfully, but these errors were encountered: