Chokes up on valid filenames #61

Closed
remram44 opened this issue Jun 5, 2014 · 4 comments

remram44 commented Jun 5, 2014

path.path.listdir() can fail with UnicodeDecodeError.

On Linux with Python 2:

$ echo -e 'r\xE9mi' | xargs touch
$ python
Python 2.7.7rc1 (default, May 21 2014, 12:59:42)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import path
>>> path.path('.').listdir()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/venv/local/lib/python2.7/site-packages/path.py", line 485, in listdir
    if self._next_class(child).fnmatch(pattern)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)

I know you're not the only project out there that gets it wrong, but in my quest for something sensible I did test yours, so I thought I'd let you know. My search goes on!

jaraco (Owner) commented Jun 6, 2014

What is the proper thing to do in this case? I was able to replicate your error, but I observed some things.

  1. The file you created has a file name that's not in the 'file system encoding'. Therefore, even if one attempts to decode the filename using the filesystemencoding, it fails.
>>> [path.decode(sys.getfilesystemencoding()) for path in os.listdir('.')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 1: invalid continuation byte
  2. On Python 3, when doing a listdir on the directory, that filename creates a surrogate (essentially an invalid character).
>>> os.listdir('.')
['.cache', '.ssh', '.bashrc', '.profile', '.bash_logout', 'r\udce9mi']

These findings lead me to believe that the filename in the file system is in fact incorrectly encoded and invalid.
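
For reference, a minimal Python 3 sketch of the two behaviors observed above, with the codec named explicitly instead of being taken from the locale. The UTF-8 assumption and the variable names are illustrative only; this is not path.py code.

```python
# Raw filename bytes, as created by the `touch` command in the report.
raw_name = b"r\xe9mi"

# Strict UTF-8 decoding rejects the lone 0xE9 byte.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The surrogateescape handler maps each undecodable byte 0xNN to the
# lone surrogate U+DCNN instead of raising.
escaped = raw_name.decode("utf-8", "surrogateescape")

print(strict_ok)       # False
print(ascii(escaped))  # 'r\udce9mi'
```

The second result matches the `'r\udce9mi'` entry in the Python 3 `os.listdir` output shown earlier.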

In my mind, there are only two reasonable and portable encodings suitable for filenames - sys.getdefaultencoding() and sys.getfilesystemencoding(). The current implementation uses the former (implicitly). It would be reasonable, I think, to instead use sys.getfilesystemencoding to decode the filenames, but that still wouldn't address the issue as reported above.

I'm interested to know, though, what you would recommend in this case. Do you see that filename as valid? If so, how should Python know how to decode it? Should path.py use surrogate escapes, the same way as Python 3 does?

Let me state up front that it is the opinion of path.py that paths must be text (Unicode). path.py is not interested in dealing with arbitrary byte strings.

remram44 (Author) commented Jun 6, 2014

> Let me state up front that it is the opinion of path.py that paths must be text (Unicode). path.py is not interested in dealing with arbitrary byte strings.

That is a perfectly respectable position, it is why I'm turning to a different library rather than attempting to fix (or get you to fix) yours.

> The file you created has a file name that's not in the 'file system encoding'. Therefore, even if one attempts to decode the filename using the filesystemencoding, it fails.

The 'file system encoding' is more of a hint than anything else. Paths on POSIX are bytes, and this is the only representation that is safe. Moreover, the encoding is user-specific, so another user on the same machine could be using latin-1 while you expect UTF-8 (as in my example). While it is probably acceptable to get back weird things for these names (as Python 3 does), being unable to listdir at all seems a bit much (and can lead to DoS attacks).
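
To illustrate the point about per-user encodings, here is a small Python 3 sketch (the helper name `try_decode` is made up for this example). The same filename bytes are a valid name under Latin-1 and undecodable under strict UTF-8.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


raw_name = b"r\xe9mi"  # the bytes a Latin-1 user produces for "r\u00e9mi"

as_latin1 = try_decode(raw_name, "latin-1")
as_utf8 = try_decode(raw_name, "utf-8")

print(ascii(as_latin1))  # 'r\xe9mi'
print(as_utf8)           # None
```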

> On Python 3, when doing a listdir on the directory, that filename creates a surrogate (essentially an invalid character).

Yes, Python 3 is able to handle all these paths safely. Rather than using bytes, they chose an augmented Unicode representation that can account for byte sequences that do not decode. This makes it possible to use Unicode in most situations while still being able to represent every pathological case.
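
A minimal sketch of that round trip on Python 3, assuming a POSIX system (where `os.fsdecode` and `os.fsencode` use the `surrogateescape` error handler):

```python
import os

raw_name = b"r\xe9mi"

# bytes -> str: bytes that are invalid in the filesystem encoding
# become lone surrogates instead of raising an error.
text_name = os.fsdecode(raw_name)

# str -> bytes: the escape is reversed, so the name handed back to the
# OS is byte-for-byte the original.
round_tripped = os.fsencode(text_name)

print(round_tripped == raw_name)  # True
```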

FYI, personally I'll be using Unipath, which doesn't encode/decode at all and just uses native representations (bytes on POSIX, unicode on Windows).

jaraco (Owner) commented Jun 12, 2014

Please give path.py 5.2 a try - I believe it will work, at least for os.listdir. I did briefly scan through the other methods, but most of those won't be subject to the issue because they're already unicode.

jaraco closed this as completed Jun 12, 2014

remram44 (Author) commented

Yes, it works -- it returns a filename with a surrogate in it. Of course os.path doesn't accept these, so:

>>> p = path.path('.').listdir()[-1]
>>> p
path(u'./r\udce9mi')
>>> p.listdir()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/venv2/local/lib/python2.7/site-packages/path.py", line 502, in listdir
    for child in map(self._always_unicode, os.listdir(self))
OSError: [Errno 2] No such file or directory: './r\xed\xb3\xa9mi'
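
The bytes in that error message are consistent with the surrogate being encoded as an ordinary code point instead of being mapped back to the original byte. Here is a Python 3 sketch of the difference. It uses the `surrogatepass` handler to mimic what Python 2's UTF-8 codec appears to do here, which is an assumption and not something confirmed in this thread.

```python
name = "r\udce9mi"  # the surrogate-escaped name returned by listdir

# Encoding the surrogate as if it were a regular code point yields
# three bytes that never existed on disk.
as_code_point = name.encode("utf-8", "surrogatepass")

# Reversing the escape restores the single original byte.
as_original = name.encode("utf-8", "surrogateescape")

print(as_code_point)  # b'r\xed\xb3\xa9mi'
print(as_original)    # b'r\xe9mi'
```

Only the second form names the file that is actually on disk, which would explain the "No such file or directory" error.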
