Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Unicode characters in file paths #1426

Closed
wants to merge 4 commits into from

Conversation

Friday21
Copy link

@Friday21 Friday21 commented Mar 6, 2018

when I have unicode directory name under docs, such as '导航', when I change the content under the direcory, and the mkdoc server auto detect the change and make build, it will throw UnicodeDecodeError:

[E 180306 16:13:35 ioloop:638] Exception in callback <bound method type.poll_tasks of <class 'livereload.handlers.LiveReloadHandler'>>
    Traceback (most recent call last):
      File "/usr/lib64/python2.7/site-packages/tornado/ioloop.py", line 1026, in _run
        return self.callback()
      File "/usr/lib/python2.7/site-packages/livereload/handlers.py", line 67, in poll_tasks
        filepath, delay = cls.watcher.examine()
      File "/usr/lib/python2.7/site-packages/livereload/watcher.py", line 73, in examine
        func and func()
      File "/usr/lib/python2.7/site-packages/mkdocs/commands/serve.py", line 112, in builder
        build(config, live_server=live_server, dirty=dirty)
      File "/usr/lib/python2.7/site-packages/mkdocs/commands/build.py", line 259, in build
        utils.clean_directory(config['site_dir'])
      File "/usr/lib/python2.7/site-packages/mkdocs/utils/__init__.py", line 138, in clean_directory
        if entry.startswith('.'):
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

@waylan
Copy link
Member

waylan commented Mar 6, 2018

Thanks for the patch. However, we should include a test with this. Also, I wonder if we should actually decode entry and pass it on rather than only decoding within the conditional test for "starts with a dot".

@Friday21 Friday21 force-pushed the master branch 3 times, most recently from 815a7eb to 16dd8c7 Compare March 13, 2018 12:07
utils.clean_directory(utf8_directory)
utils.clean_directory(ascii_directory)
self.assertTrue(os.path.exists(utf_8_path), False)
self.assertTrue(os.path.exists(ascii_path), False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be:

self.assertFalse(os.path.exists(utf_8_path))
self.assertFalse(os.path.exists(ascii_path))

Might also want to add:

self.assertFalse(os.path.exists(utf8_utf8_directory))
self.assertFalse(os.path.exists(ascii_ascii_directory))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, you're right, I use assertTrue wrong, I use it as assertEqual.

@Friday21
Copy link
Author

Wonder why this failed in appveypr, in Python2.7 it passed both on my local and travis

Traceback (most recent call last):
42  File "C:\projects\mkdocs\mkdocs\tests\utils\utils_tests.py", line 354, in test_unicode_clean_directory
43    shutil.rmtree(temp_dir)
44  File "c:\python27\Lib\shutil.py", line 266, in rmtree
45    onerror(os.remove, fullname, sys.exc_info())
46  File "c:\python27\Lib\shutil.py", line 264, in rmtree
47    os.remove(fullname)
48WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: 'c:\\users\\appveyor\\appdata\\local\\temp\\1\\tmpry7a63\\??'

@waylan
Copy link
Member

waylan commented Mar 13, 2018

I ran your test in a Windows system (Appveyopr runs the tests on Windows) and I am getting the same error. I believe the issue is related to how Windows encodes filenames and the fact that you are concatenating paths with some parts as Unicode strings and some parts as byte strings.

As I understand it, the path related functions in Python will do the right thing if you pass them all Unicode strings. However, in the mkdocs.utils.clean function and in your test you have some strings which are Unicode and some which are not. For example, I previously suggested decoding the entry item. However, you only decode it for the conditional, but not for later use. In other words, you might need to do this instead (untested):

entry = dec(entry)
if entry.startswith('.'):
    continue

path = os.path.join(directory, entry)

That ensures that the conditional works (as in your fix) and that a Unicode string is passed to os.path.join. Instead, you are passing a byte string to os.path.join. You have similar problems in the test itself.

Finally, I'm not sure if its related, but apparently Windows does not encode its file and directory names with utf-8, so decoding from utf-8 may be part of the problem. See sys.getfilesystemencoding() for more info. Note that the encoding of file and directory names is unrelated to the encoding of file contents. Therefore, even though MkDocs uses utf-8 exclusively for file contents, we still have to use the system native encoding for file names and directories. And that's a hard and complicated problem when supporting multiple different systems (Windows, MacOS, and Linux), which us why MkDocs hasn't fully addressed it yet. Its easier to require all file names and directories to be in ASCII. That said, it is a problem we want to solve so we don't have to require ASCII only paths.

@waylan waylan added this to the 1.0.0 milestone Mar 13, 2018
@waylan waylan changed the title fix UnicodeDecoder error when docs has unicode directory Support Unicode characters in file paths Mar 13, 2018

def test_unicode_clean_directory(self):
temp_dir = tempfile.mkdtemp()
utf8_directory = os.path.join(temp_dir, '导航'.decode('utf-8'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary. Up at the top of the file we have from __future__ import unicode_literals which ensures that all literal strings defined in in the module are Unicode strings. Its basically the same as the u prefix: u'unicode string'.

What we need t check is that temp_dir is Unicode. Or any other path which is not created by us.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, now I figure out that, when user os.path.join(arg1, arg2), if both arg1 and arg2 is str or unicode, it will work, but if arg1 and arg2 are mixed of str and unicode, python will encode the str arg with system defaultencoding, which on windows is not utf-8, and it throws exceptions.
as tempfile didn't work with utf-8 well , so I make utf-8 dir myself:

temp_dir = tempfile.mkdtemp()
utf8_temp_dir = os.path.join(dec('tmp'), dec('tmp_utf8'))

@Friday21 Friday21 force-pushed the master branch 2 times, most recently from db7e8a7 to b9bc336 Compare March 14, 2018 02:27
@waylan
Copy link
Member

waylan commented Apr 5, 2018

I wonder if it would help to use backports.os which provides 3.2+ implementations of os.fsdecode and os.fsencode. That way we would get consistent behavior across Python 2 and 3.

@waylan
Copy link
Member

waylan commented Apr 5, 2018

Another solution is to use a Path object library, such as path.py or pathlib. By wrapping all paths in the Path object, we get support for all path related operations out-of-the-box. It would allow us to remove much of our current utils methods. Of course, it would require a larger refactor. Perhaps this would make sense as part of the Pages Refactor

Here are some differences between the two libs, in no particular order:

  • path.py objects are str (unicode in PY2) subclasses and work with standard string operations. pathlib objects require conversion to a string first (str(path)). In PY3.6 support for path like objects was added, which simplifies/automates the conversion to string. Unfortunately, the benefits are only available in PY3.6+ without a backport of all path related libs (such as os.path).
  • path.py is a third-party library with PY2.7+ support. pathlib is in the standard library for PY3.4+ with some newer features only available in PY3.5 or PY3.6. However, pathlib2 appears to be an up-to-date third-party backport with PY2.7+ support but only returns byte strings (see Unicode issues jazzband/pathlib2#34) so it's useless.
  • path.py includes features wrapping shlib, etc. for a fuller feature set.
  • pathlib has different object types for Windows vs. Posix paths. It also has pure path types (no relation to actual files) with limited functionality and concrete path types (connected to actual files) with additional functionality. path.py only has one object type which offers all functionality. Both libraries just-work on supported systems.

@waylan
Copy link
Member

waylan commented May 23, 2018

While researching another issue I came upon this:

The os.listdir() function returns filenames and raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() will do both, depending on whether you provided the directory path as bytes or a Unicode string. If you pass a Unicode string as the path, filenames will be decoded using the filesystem’s encoding and a list of Unicode strings will be returned, while passing a byte path will return the filenames as bytes.

So, it sounds like we don't need to bother with determining the system's encoding. Just be sure to always pass Unicode strings to the various path related functions and we should be okay.

@waylan
Copy link
Member

waylan commented Jun 26, 2018

Closing this in favor of #1517. This was treating the symptoms, while the changes in #1517 and 92995d4 of #1504 treat the problem.

@waylan waylan closed this Jun 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants