Chokes up on valid filenames #61

remram44 · 2014-06-05T22:40:56Z

path.path.listdir() can fail with UnicodeDecodeError.

On Linux with Python 2:

$ echo -e 'r\xE9mi' | xargs touch
$ python
Python 2.7.7rc1 (default, May 21 2014, 12:59:42)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import path
>>> path.path('.').listdir()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/venv/local/lib/python2.7/site-packages/path.py", line 485, in listdir
    if self._next_class(child).fnmatch(pattern)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)

I know you're not the only project out there that gets it wrong, but in my quest for something sensible I did test yours, so I thought I'd let you know. My search goes on!

jaraco · 2014-06-06T08:32:35Z

What is the proper thing to do in this case? I was able to replicate your error, but I observed some things.

The file you created has a file name that's not in the 'file system encoding'. Therefore, even if one attempts to decode the filename using the filesystemencoding, it fails.

>>> import os, sys
>>> [path.decode(sys.getfilesystemencoding()) for path in os.listdir('.')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 1: invalid continuation byte

On Python 3, when doing a listdir on the directory, that filename creates a surrogate (essentially an invalid character).

>>> os.listdir('.')
['.cache', '.ssh', '.bashrc', '.profile', '.bash_logout', 'r\udce9mi']

These findings lead me to believe that the filename in the file system is in fact incorrectly encoded and invalid.

In my mind, there are only two reasonable and portable encodings suitable for filenames - sys.getdefaultencoding() and sys.getfilesystemencoding(). The current implementation uses the former (implicitly). It would be reasonable, I think, to instead use sys.getfilesystemencoding to decode the filenames, but that still wouldn't address the issue as reported above.

I'm interested to know, though, what you would recommend in this case. Do you see that filename as valid? If so, how should Python know how to decode it? Should path.py use surrogate escapes, the same way as Python 3 does?

Let me state up front that it is the opinion of path.py that paths must be text (Unicode). path.py is not interested in dealing with arbitrary byte strings.

remram44 · 2014-06-06T15:05:59Z

> Let me state up front that it is the opinion of path.py that paths must be text (Unicode). path.py is not interested in dealing with arbitrary byte strings.

That is a perfectly respectable position, it is why I'm turning to a different library rather than attempting to fix (or get you to fix) yours.

> The file you created has a file name that's not in the 'file system encoding'. Therefore, even if one attempts to decode the filename using the filesystemencoding, it fails.

The 'file system encoding' is more of a hint than anything else. Paths on POSIX are bytes, and this is the only representation that is safe. Moreover, the encoding is user-specific, so another user on the same machine could well be using latin-1 while you expect UTF-8 (as in my example). While it is probably acceptable to get back weird things for these names (as Python 3 does), being unable to listdir at all seems a bit much (and can lead to DoS attacks).
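
As a minimal sketch of the bytes-only approach described above (Python 3, standard library only; the helper name list_names_as_bytes is made up for illustration and is not part of path.py): when os.listdir is given a bytes path it returns the raw names without decoding them, so no filename can raise UnicodeDecodeError.

```python
import os
import tempfile


def list_names_as_bytes(directory):
    """Return the entries of `directory` as raw bytes, with no decoding step.

    Hypothetical helper for illustration; not part of path.py.
    """
    # os.listdir returns bytes names when it is given a bytes path.
    return sorted(os.listdir(os.fsencode(directory)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # An ASCII name keeps the demo portable across filesystems.
        open(os.path.join(tmp, "plain.txt"), "w").close()
        print(list_names_as_bytes(tmp))
```

The trade-off is that every caller then has to handle bytes, which is exactly what a text-only library wants to avoid.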

> On Python 3, when doing a listdir on the directory, that filename creates a surrogate (essentially an invalid character).

Yes, Python 3 is able to handle all these paths safely. Rather than using bytes, they chose to build an augmented unicode representation that can account for byte sequences that do not decode to characters. This lets you work with unicode in most situations while still being able to represent every pathological case.
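
To illustrate the Python 3 behaviour described above, here is a small sketch (Python 3 only, not path.py code) that assumes UTF-8 as the expected encoding. It shows the bytes from the report failing a strict decode, then surviving a round trip through the surrogateescape error handler.

```python
raw = b"r\xe9mi"  # the latin-1 bytes from the report; not valid UTF-8

# Strict decoding fails, as in the tracebacks earlier in the thread.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")

# ascii() is used because a lone surrogate cannot be written to most terminals.
print(strict_ok, ascii(text), restored)
```

On POSIX, os.fsdecode and os.fsencode apply this same error handler together with the filesystem encoding.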

FYI, personally I'll be using Unipath, which doesn't encode/decode at all and just uses native representations (bytes on POSIX, unicode on Windows).

jaraco · 2014-06-12T20:34:40Z

Please give path.py 5.2 a try - I believe it will work, at least for os.listdir. I did briefly scan through the other methods, but most of those won't be subject to the issue because they're already unicode.

remram44 · 2014-06-12T21:26:28Z

Yes, it works -- it returns a filename with a surrogate in it. Of course os.path doesn't accept these, so:

>>> p = path.path('.').listdir()[-1]
>>> p
path(u'./r\udce9mi')
>>> p.listdir()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/venv2/local/lib/python2.7/site-packages/path.py", line 502, in listdir
    for child in map(self._always_unicode, os.listdir(self))
OSError: [Errno 2] No such file or directory: './r\xed\xb3\xa9mi'
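
A likely explanation for the bytes in that OSError, sketched in Python 3 rather than the Python 2 used in the session above: the surrogate is being encoded to UTF-8 as if it were an ordinary code point, instead of being turned back into the original byte. This is an illustration of the encoding arithmetic only, not a trace through path.py.

```python
name = "r\udce9mi"  # the name returned by listdir, containing a lone surrogate

# Encoding the surrogate as an ordinary code point produces three bytes.
as_code_point = name.encode("utf-8", "surrogatepass")

# Reversing the escape yields the single byte actually stored on disk.
as_original = name.encode("utf-8", "surrogateescape")

print(as_code_point)
print(as_original)
```

The first result matches the name in the OSError; the second is the name that actually exists in the directory.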

remram44 mentioned this issue Jun 6, 2014

List of alternatives to Unipath mikeorr/Unipath#12

Closed

jaraco added a commit that referenced this issue Jun 11, 2014

#61 Adding a test to capture the requirement to support badly-encoded filenames.

3664454

jaraco added a commit that referenced this issue Jun 12, 2014

#61: Encode results from listdir using the new _always_unicode helper.

c859c52

jaraco closed this as completed Jun 12, 2014

gazpachoking mentioned this issue Sep 17, 2014

Crash on walk with errors set to warn #73

Closed

jaraco mentioned this issue Jan 2, 2017

_always_unicode is unnecessary #121

Closed

pombredanne mentioned this issue Aug 27, 2017

Cannot walk path on Linux/Python 2 made of non-unicode/non-fs-decodable bytes #130

Closed

jaraco mentioned this issue Jan 7, 2022

Some tests can't be run on filesystems that have strict utf-8 mode #205

Closed

jaraco pushed a commit that referenced this issue Aug 26, 2022

Update base URL for PEPs (#61)

a4f5b76
