Support Unicode characters in file paths #1426

Friday21 · 2018-03-06T08:15:23Z

when I have unicode directory name under docs, such as '导航', when I change the content under the direcory, and the mkdoc server auto detect the change and make build, it will throw UnicodeDecodeError:

[E 180306 16:13:35 ioloop:638] Exception in callback <bound method type.poll_tasks of <class 'livereload.handlers.LiveReloadHandler'>>
    Traceback (most recent call last):
      File "/usr/lib64/python2.7/site-packages/tornado/ioloop.py", line 1026, in _run
        return self.callback()
      File "/usr/lib/python2.7/site-packages/livereload/handlers.py", line 67, in poll_tasks
        filepath, delay = cls.watcher.examine()
      File "/usr/lib/python2.7/site-packages/livereload/watcher.py", line 73, in examine
        func and func()
      File "/usr/lib/python2.7/site-packages/mkdocs/commands/serve.py", line 112, in builder
        build(config, live_server=live_server, dirty=dirty)
      File "/usr/lib/python2.7/site-packages/mkdocs/commands/build.py", line 259, in build
        utils.clean_directory(config['site_dir'])
      File "/usr/lib/python2.7/site-packages/mkdocs/utils/__init__.py", line 138, in clean_directory
        if entry.startswith('.'):
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

waylan · 2018-03-06T15:08:12Z

Thanks for the patch. However, we should include a test with this. Also, I wonder if we should actually decode entry and pass it on rather than only decoding within the conditional test for "starts with a dot".

waylan · 2018-03-13T12:54:46Z

mkdocs/tests/utils/utils_tests.py

+            utils.clean_directory(utf8_directory)
+            utils.clean_directory(ascii_directory)
+            self.assertTrue(os.path.exists(utf_8_path), False)
+            self.assertTrue(os.path.exists(ascii_path), False)


Shouldn't this be:

self.assertFalse(os.path.exists(utf_8_path)) self.assertFalse(os.path.exists(ascii_path))

Might also want to add:

self.assertFalse(os.path.exists(utf8_utf8_directory)) self.assertFalse(os.path.exists(ascii_ascii_directory))

yeah, you're right, I use assertTrue wrong, I use it as assertEqual.

Friday21 · 2018-03-13T14:13:19Z

Wonder why this failed in appveypr, in Python2.7 it passed both on my local and travis

Traceback (most recent call last):
42  File "C:\projects\mkdocs\mkdocs\tests\utils\utils_tests.py", line 354, in test_unicode_clean_directory
43    shutil.rmtree(temp_dir)
44  File "c:\python27\Lib\shutil.py", line 266, in rmtree
45    onerror(os.remove, fullname, sys.exc_info())
46  File "c:\python27\Lib\shutil.py", line 264, in rmtree
47    os.remove(fullname)
48WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: 'c:\\users\\appveyor\\appdata\\local\\temp\\1\\tmpry7a63\\??'

waylan · 2018-03-13T15:57:13Z

I ran your test in a Windows system (Appveyopr runs the tests on Windows) and I am getting the same error. I believe the issue is related to how Windows encodes filenames and the fact that you are concatenating paths with some parts as Unicode strings and some parts as byte strings.

As I understand it, the path related functions in Python will do the right thing if you pass them all Unicode strings. However, in the mkdocs.utils.clean function and in your test you have some strings which are Unicode and some which are not. For example, I previously suggested decoding the entry item. However, you only decode it for the conditional, but not for later use. In other words, you might need to do this instead (untested):

entry = dec(entry)
if entry.startswith('.'):
    continue

path = os.path.join(directory, entry)

That ensures that the conditional works (as in your fix) and that a Unicode string is passed to os.path.join. Instead, you are passing a byte string to os.path.join. You have similar problems in the test itself.

Finally, I'm not sure if its related, but apparently Windows does not encode its file and directory names with utf-8, so decoding from utf-8 may be part of the problem. See sys.getfilesystemencoding() for more info. Note that the encoding of file and directory names is unrelated to the encoding of file contents. Therefore, even though MkDocs uses utf-8 exclusively for file contents, we still have to use the system native encoding for file names and directories. And that's a hard and complicated problem when supporting multiple different systems (Windows, MacOS, and Linux), which us why MkDocs hasn't fully addressed it yet. Its easier to require all file names and directories to be in ASCII. That said, it is a problem we want to solve so we don't have to require ASCII only paths.

waylan · 2018-03-14T01:37:25Z

mkdocs/tests/utils/utils_tests.py

+
+    def test_unicode_clean_directory(self):
+        temp_dir = tempfile.mkdtemp()
+        utf8_directory = os.path.join(temp_dir, '导航'.decode('utf-8'))


This isn't necessary. Up at the top of the file we have from __future__ import unicode_literals which ensures that all literal strings defined in in the module are Unicode strings. Its basically the same as the u prefix: u'unicode string'.

What we need t check is that temp_dir is Unicode. Or any other path which is not created by us.

Thanks, now I figure out that, when user os.path.join(arg1, arg2), if both arg1 and arg2 is str or unicode, it will work, but if arg1 and arg2 are mixed of str and unicode, python will encode the str arg with system defaultencoding, which on windows is not utf-8, and it throws exceptions.
as tempfile didn't work with utf-8 well , so I make utf-8 dir myself:

temp_dir = tempfile.mkdtemp() utf8_temp_dir = os.path.join(dec('tmp'), dec('tmp_utf8'))

waylan · 2018-04-05T15:14:25Z

I wonder if it would help to use backports.os which provides 3.2+ implementations of os.fsdecode and os.fsencode. That way we would get consistent behavior across Python 2 and 3.

waylan · 2018-04-05T15:48:52Z

Another solution is to use a Path object library, such as path.py or pathlib. By wrapping all paths in the Path object, we get support for all path related operations out-of-the-box. It would allow us to remove much of our current utils methods. Of course, it would require a larger refactor. Perhaps this would make sense as part of the Pages Refactor

Here are some differences between the two libs, in no particular order:

path.py objects are str (unicode in PY2) subclasses and work with standard string operations. pathlib objects require conversion to a string first (str(path)). In PY3.6 support for path like objects was added, which simplifies/automates the conversion to string. Unfortunately, the benefits are only available in PY3.6+ without a backport of all path related libs (such as os.path).
path.py is a third-party library with PY2.7+ support. pathlib is in the standard library for PY3.4+ with some newer features only available in PY3.5 or PY3.6. However, pathlib2 appears to be an up-to-date third-party backport with PY2.7+ support but only returns byte strings (see Unicode issues jazzband/pathlib2#34) so it's useless.
path.py includes features wrapping shlib, etc. for a fuller feature set.
pathlib has different object types for Windows vs. Posix paths. It also has pure path types (no relation to actual files) with limited functionality and concrete path types (connected to actual files) with additional functionality. path.py only has one object type which offers all functionality. Both libraries just-work on supported systems.

waylan · 2018-05-23T15:02:49Z

While researching another issue I came upon this:

The os.listdir() function returns filenames and raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() will do both, depending on whether you provided the directory path as bytes or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem’s encoding and a list of Unicode strings will be returned, while passing a byte path will return the filenames as bytes.

So, it sounds like we don't need to bother with determining the system's encoding. Just be sure to always pass Unicode strings to the various path related functions and we should be okay.

waylan · 2018-06-26T20:09:11Z

Closing this in favor of #1517. This was treating the symptoms, while the changes in #1517 and 92995d4 of #1504 treat the problem.

fix UnicodeDecoder error when docs has unicode directory

7ca1b66

Friday21 force-pushed the master branch 3 times, most recently from 815a7eb to 16dd8c7 Compare March 13, 2018 12:07

use dec and add test

e2f7512

Friday21 force-pushed the master branch from 16dd8c7 to e2f7512 Compare March 13, 2018 12:11

waylan requested changes Mar 13, 2018

View reviewed changes

fix assert bug

2dc860e

Friday21 force-pushed the master branch from 4e0d85b to 2dc860e Compare March 13, 2018 14:12

waylan added this to the 1.0.0 milestone Mar 13, 2018

waylan changed the title ~~fix UnicodeDecoder error when docs has unicode directory~~ Support Unicode characters in file paths Mar 13, 2018

waylan requested changes Mar 14, 2018

View reviewed changes

Friday21 force-pushed the master branch 2 times, most recently from db7e8a7 to b9bc336 Compare March 14, 2018 02:27

fix utf8 direcotory unittest error on windows

c1ab539

Friday21 force-pushed the master branch from b9bc336 to c1ab539 Compare March 14, 2018 02:28

waylan mentioned this pull request May 23, 2018

Include *.md files inside mkdocs/tests/integration in the tarball #1494

Closed

waylan closed this Jun 26, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support Unicode characters in file paths #1426

Support Unicode characters in file paths #1426

Friday21 commented Mar 6, 2018

waylan commented Mar 6, 2018

waylan Mar 13, 2018

Friday21 Mar 13, 2018

Friday21 commented Mar 13, 2018

waylan commented Mar 13, 2018 •

edited

waylan Mar 14, 2018

Friday21 Mar 14, 2018

waylan commented Apr 5, 2018

waylan commented Apr 5, 2018 •

edited

waylan commented May 23, 2018 •

edited

waylan commented Jun 26, 2018

Support Unicode characters in file paths #1426

Support Unicode characters in file paths #1426

Conversation

Friday21 commented Mar 6, 2018

waylan commented Mar 6, 2018

waylan Mar 13, 2018

Choose a reason for hiding this comment

Friday21 Mar 13, 2018

Choose a reason for hiding this comment

Friday21 commented Mar 13, 2018

waylan commented Mar 13, 2018 • edited

waylan Mar 14, 2018

Choose a reason for hiding this comment

Friday21 Mar 14, 2018

Choose a reason for hiding this comment

waylan commented Apr 5, 2018

waylan commented Apr 5, 2018 • edited

waylan commented May 23, 2018 • edited

waylan commented Jun 26, 2018

waylan commented Mar 13, 2018 •

edited

waylan commented Apr 5, 2018 •

edited

waylan commented May 23, 2018 •

edited