Support the entire set of filenames on Windows #839

Open
aragilar opened this Issue Jan 29, 2017 · 3 comments

@aragilar
Contributor

aragilar commented Jan 29, 2017

This is related to #818, but is for tracking a complete solution to how filenames are handled on Windows, which we currently don't have. Note that this is going to be a breaking change, though a fairly limited one (it affects users of external links and VDS on Windows). A summary of the issues can be found in https://www.python.org/dev/peps/pep-0528/ and https://www.python.org/dev/peps/pep-0529/, which caused this issue to come up again. A discussion of unicode and its encodings, especially on Windows, can be found at http://utf8everywhere.org/.

Background:
Nearly every version of Windows still in use (including XP), unlike Unix-like systems, has two different interfaces for calling filesystem functions: an ANSI (char) interface and a unicode (wchar) interface. Additionally, Windows has for some time now used (effectively, see https://simonsapin.github.io/wtf-8/ for caveats) the UTF-16 encoding of the unicode path as the canonical path.

The ANSI interface, similar to locales on Unix-like systems, uses the concept of a current encoding (which Python represents as the mbcs encoding). This encoding varies from system to system, but is almost never one of the utf-* encodings (instead being latin-1, or an encoding native to the user's language). This means that most filepaths cannot be represented through the ANSI interface on any single system, which creates interoperability problems that are sometimes hard to diagnose.
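
To make the point about ANSI encodings concrete, here is a minimal sketch in plain Python. The filenames are invented for illustration, and the three code pages are only examples of what a Windows system might use as its current encoding. The codecs are named explicitly, rather than using mbcs, so the snippet also runs on non-Windows systems.

```python
# Hypothetical filenames, chosen only to cover a few scripts.
filenames = ["data.h5", "données.h5", "データ.h5", "данные.h5"]

# Example Windows "ANSI" code pages; the active one depends on the system locale.
code_pages = ["cp1252", "cp932", "cp1251"]


def representable(name, encoding):
    """Return True if `name` survives a round trip through `encoding`."""
    try:
        return name.encode(encoding).decode(encoding) == name
    except UnicodeError:
        return False


# For each code page, record which of the filenames it can represent.
table = {cp: [representable(n, cp) for n in filenames] for cp in code_pages}
for cp, row in table.items():
    print(cp, dict(zip(filenames, row)))

# UTF-8 can encode every unicode string, so all names are representable.
utf8_ok = all(representable(n, "utf-8") for n in filenames)
print("utf-8", utf8_ok)
```

None of the example code pages can represent all four names, whereas UTF-8 can. This is why a file written on one system can end up with a name that cannot be opened through the ANSI interface on another.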

Issue:
Python has supported the unicode interfaces since 2.3 (https://www.python.org/dev/peps/pep-0277/), and from Python 3 until Python 3.6 (see PEP 529) it marked the use of bytes (i.e. the ANSI interfaces) for filepaths as deprecated (https://docs.python.org/3/library/os.path.html).

HDF5, most likely due to its Unix heritage, has only supported the ANSI interfaces (or rather, has not explicitly added support for the unicode interfaces). HDF5 has also previously deemed unicode filesystem support out of scope (https://support.hdfgroup.org/HDF5/doc1.8/Advanced/UsingUnicode/index.html, see Filenames under Caveats, Pitfalls, and Things to Watch For). While this does create problems (see e.g. https://tschoonj.github.io/blog/2014/11/06/hdf5-on-windows-utf-8-filenames-support/ and the other links in #818 (comment)), the major problem comes with external links (and VDS, but that is newer and probably less widely used at the moment), which do not specify what encoding the filename is in.
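
The external link problem can be illustrated without HDF5 at all. The sketch below assumes a link target was stored as raw bytes by a system whose current encoding was cp1252, and shows what a reader gets when it has to guess the encoding. The filename and the helper `decode_guesses` are hypothetical and are not part of h5py or HDF5.

```python
def decode_guesses(raw, encodings):
    """Map each candidate encoding to the decoded name, or None if decoding fails."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Bytes as a cp1252 system would have written them (assumed for illustration).
stored = "données.h5".encode("cp1252")

guesses = decode_guesses(stored, ["cp1252", "cp1251", "utf-8"])
for enc, name in guesses.items():
    print(enc, "->", name if name is not None else "<not decodable>")
```

Only the writer's own code page recovers the intended name. A reader on a cp1251 system silently gets a different filename, and a reader that assumes UTF-8 gets a decoding error. Since the stored bytes carry no record of the encoding, the reader has no way to tell which interpretation is correct.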

Solution:
Add support for the unicode interfaces: given the current API of external links and VDS, this means using UTF-8 for storage and having HDF5 call the unicode interfaces rather than the ANSI interfaces. We also need to deal with old files, which may use different encodings. This means we need to complete the following tasks:

  • Create a working patch for HDF5 which adds support for the unicode interfaces using UTF-8, raising errors where needed
  • Have the external link code in h5py raise an error when there's a problem, and in such a way that working around different encodings is possible, and is documented
  • Document how to deal with filepaths in user code (e.g. unicode over bytes for portability)
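
As a rough sketch of what "working around different encodings" could look like in user code: the hypothetical helper below (it is not an existing h5py function) treats a stored filename as UTF-8 by default, raises an error when that fails, and lets the caller name the legacy encoding explicitly for old files.

```python
def decode_link_filename(raw, legacy_encoding=None):
    """Decode a stored filename: strict UTF-8 first, then an explicit fallback.

    `raw` is the filename as bytes. `legacy_encoding` is the code page the
    caller believes an old file was written with. Both names are hypothetical.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if legacy_encoding is None:
            raise ValueError(
                "filename %r is not valid UTF-8; pass legacy_encoding to "
                "state which encoding it was written with" % (raw,)
            )
        return raw.decode(legacy_encoding)


# New-style file: the name was stored as UTF-8.
print(decode_link_filename("データ.h5".encode("utf-8")))

# Old file written on a cp1252 system: the caller must state the encoding.
print(decode_link_filename(b"donn\xe9es.h5", legacy_encoding="cp1252"))
```

One caveat with this approach: some byte strings written in a legacy encoding also happen to be valid UTF-8, so the UTF-8-first check cannot detect every old file. In user code, passing filenames as unicode strings rather than bytes avoids depending on the current encoding in the first place.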
@aragilar

Contributor

aragilar commented Mar 22, 2017

GitHub autoclosed this, reopening.

@takluyver

Member

takluyver commented Oct 16, 2018

As I understand it, this hinges on getting a patch into HDF5. Did that go anywhere? If not, I'd say this should be bumped to some future milestone, or the milestone removed altogether, to indicate that it's not actively being worked on.

@tacaswell tacaswell modified the milestones: 2.9, 2.10 Oct 16, 2018

@tadeu

tadeu commented Oct 24, 2018

This support could be added as a patch in conda-forge's feedstock, hoping that Anaconda would also follow on using it: conda-forge/hdf5-feedstock#47
