Support the entire set of filename on Windows #839

Closed
aragilar opened this issue Jan 29, 2017 · 12 comments

@aragilar (Member) commented Jan 29, 2017

This is related to #818, but is for tracking a complete solution to how filenames are dealt with on Windows, which we currently don't have. Note that this is going to be a breaking change, but a fairly limited one (affecting users of external links and VDS on Windows). A summary of the issues can be found in https://www.python.org/dev/peps/pep-0528/ and https://www.python.org/dev/peps/pep-0529/, which caused this issue to come up again. A discussion of unicode and its encoding, especially on Windows, can be found at http://utf8everywhere.org/.

Background:
Nearly every Windows version still in use (including XP), unlike Unix-like systems, has two different interfaces for calling filesystem functions: an ANSI (char) interface and a unicode (wchar) interface. Additionally, Windows has for some time now used (effectively, see https://simonsapin.github.io/wtf-8/ for caveats) the UTF-16 encoding of the unicode path as the canonical path.

The ANSI interface, similar to locales on Unix-like systems, uses the concept of a current encoding (which in Python is represented by the mbcs encoding). This encoding varies from system to system, but is almost never one of the utf-* encodings (instead being latin-1, or an encoding native to the user's language). This means that, through the ANSI interface, most possible filepaths cannot be represented on any single system, which creates interoperability problems that are sometimes hard to diagnose.
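To make the limitation concrete, here is a minimal sketch in Python. It uses `cp1252` as a stand-in for the ANSI code page of one particular system (an assumption; the active code page on a real Windows machine may differ), and the filenames are made up for illustration.

```python
# Illustrative only: cp1252 stands in for one system's ANSI code page.
ANSI_CODE_PAGE = "cp1252"

# Made-up filenames in several scripts.
filenames = ["results.h5", "résumé.h5", "данные.h5", "数据.h5"]


def representable(name, encoding):
    """Return True if `name` can be encoded losslessly in `encoding`."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Which names survive the trip through the single-byte code page?
ansi_ok = {name: representable(name, ANSI_CODE_PAGE) for name in filenames}

# The same names through the encodings used by the unicode interfaces.
utf16_ok = {name: representable(name, "utf-16-le") for name in filenames}
utf8_ok = {name: representable(name, "utf-8") for name in filenames}

for name in filenames:
    print(name, ansi_ok[name], utf16_ok[name], utf8_ok[name])
```

Only the names whose characters all exist in the chosen code page can be passed through an ANSI-style interface; UTF-8 and UTF-16 can encode all of them.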

Issue:
Python has supported the unicode interfaces since 2.3 (https://www.python.org/dev/peps/pep-0277/), and from Python 3 until Python 3.6 (see PEP 529) marked the use of bytes (i.e. the ANSI interfaces) for filepaths as deprecated (https://docs.python.org/3/library/os.path.html).
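As a small illustration of the two path types in Python, the sketch below creates a file with a non-ASCII name using a `str` path and then lists the directory both ways. The filename is arbitrary, and the example only uses the standard library; it says nothing about what HDF5 does with the same path.

```python
import os
import tempfile

name = "数据.h5"  # arbitrary non-ASCII filename, chosen for illustration

with tempfile.TemporaryDirectory() as tmp:
    # str paths: Python picks the appropriate OS interface itself.
    path = os.path.join(tmp, name)
    with open(path, "wb") as f:
        f.write(b"")

    # A str argument yields str results ...
    listed = os.listdir(tmp)
    # ... while a bytes argument yields bytes results in the
    # filesystem encoding.
    listed_bytes = os.listdir(os.fsencode(tmp))

print(listed, listed_bytes)
```

Passing `str` throughout keeps the code independent of whichever byte encoding the platform uses for filenames.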

HDF5, most likely due to its Unix heritage, has only supported the ANSI interfaces (or rather, has not explicitly added support for the unicode interfaces). HDF5 has also previously deemed unicode filesystem support out of scope (https://support.hdfgroup.org/HDF5/doc1.8/Advanced/UsingUnicode/index.html, see Filenames under Caveats, Pitfalls, and Things to Watch For). While this does create problems (see e.g. https://tschoonj.github.io/blog/2014/11/06/hdf5-on-windows-utf-8-filenames-support/ and the other links in #818 (comment)), the major problem comes with external links (and VDS, but that is newer and probably less widely used at the moment), which do not specify what encoding the filename is in.
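The ambiguity of a stored filename without a declared encoding can be shown with a few lines of Python. This is a sketch only: the filename and the two code pages (`cp1252` and `cp1251`) are assumptions picked for illustration, and no HDF5 file is involved.

```python
# Bytes as a system using cp1252 would store the filename.
stored = "résumé.h5".encode("cp1252")

# Read back on a system with the same code page: the intended name.
as_cp1252 = stored.decode("cp1252")

# Read back on a system with a Cyrillic code page: a different name,
# and no error to signal that anything went wrong.
as_cp1251 = stored.decode("cp1251")

# Read back as UTF-8: these particular bytes are not valid UTF-8.
try:
    stored.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(as_cp1252, as_cp1251, valid_utf8)
```

The same bytes silently name different files on different systems, which is why the stored encoding has to be fixed (here, UTF-8) rather than left to the reader's locale.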

Solution:
Add support for the unicode interfaces. Given the current API of external links and VDS, this would mean using UTF-8 for storage and having HDF5 call the unicode interfaces rather than the ANSI interfaces. We also need to deal with old files, which may use different encodings. This means we need to complete the following tasks:

  • Create a working patch for HDF5 which adds support for the unicode interfaces using UTF-8, raising errors where needed
  • Have the external link code in h5py raise an error when there's a problem, and in such a way that working around different encodings is possible, and is documented
  • Document how to deal with filepaths in user code (e.g. unicode over bytes for portability)
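
One way the "raise an error, but allow working around different encodings" task could look in user code is sketched below. The helper `decode_link_filename` and its `fallbacks` parameter are hypothetical and are not part of h5py or HDF5; the sketch only assumes that the stored filename is available as bytes.

```python
def decode_link_filename(raw, fallbacks=()):
    """Decode a stored filename, preferring UTF-8.

    `raw` is the byte string read from the file; `fallbacks` lists legacy
    encodings to try, in order, if the bytes are not valid UTF-8.
    Hypothetical helper, not part of h5py or HDF5.
    """
    tried = []
    for encoding in ("utf-8", *fallbacks):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            tried.append(encoding)
    # Fail loudly instead of guessing, so the caller can pick an encoding.
    raise ValueError(
        f"cannot decode filename {raw!r}; tried: {', '.join(tried)}"
    )


# New-style file: the name was stored as UTF-8.
print(decode_link_filename("数据.h5".encode("utf-8")))

# Old file written on a cp1252 system: the caller opts in to a fallback.
print(decode_link_filename(b"r\xe9sum\xe9.h5", fallbacks=("cp1252",)))
```

Trying UTF-8 first is a deliberate choice: text in a legacy single-byte encoding containing non-ASCII characters is rarely valid UTF-8 by accident, so a successful strict UTF-8 decode is a reasonable (though not infallible) signal.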

@aragilar (Member, Author) commented Mar 22, 2017

GitHub autoclosed this; reopening.

@takluyver (Member) commented Oct 16, 2018

As I understand it, this hinges on getting a patch into HDF5. Did that go anywhere? If not, I'd say this should be bumped to some future milestone, or remove the milestone altogether to indicate that it's not actively being worked on.

@tacaswell modified the milestones: 2.9, 2.10 on Oct 16, 2018

@tadeu commented Oct 24, 2018

This support could be added as a patch in conda-forge's feedstock, hoping that Anaconda would also follow on using it: conda-forge/hdf5-feedstock#47

@aparamon (Member) commented Jan 11, 2019

@tadeu commented Jan 28, 2019

@aparamon, thanks for the reference. Sadly that issue is mixing the UTF-8 filename solution with MINGW support, and it seems to be "stuck" because of this. What is needed is only UTF-8 support.

@tadeu commented Feb 1, 2019

Now there is a public issue in HDF's JIRA strictly for this issue: https://jira.hdfgroup.org/browse/HDFFV-10691

@tadeu commented Apr 11, 2019

It looks like this feature is already in upstream and will be out in 1.10.6:

https://github.com/live-clones/hdf5/commit/750b5c293076b6a446088fa3020e4e0787d489d7
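
For code that wants to branch on this, a minimal sketch of a version check follows. The 1.10.6 threshold is taken from this thread rather than from HDF5 documentation, the function name is hypothetical, and the plain tuple comparison ignores any backports to other release series.

```python
# Threshold as reported in this thread; treat it as an assumption.
MIN_HDF5_FOR_UTF8_FILENAMES = (1, 10, 6)


def hdf5_has_utf8_filenames(version_tuple):
    """Return True if the given HDF5 version is at least 1.10.6.

    Hypothetical helper. With h5py installed, the version h5py was built
    against can be passed in as h5py.version.hdf5_version_tuple.
    """
    return tuple(version_tuple) >= MIN_HDF5_FOR_UTF8_FILENAMES


print(hdf5_has_utf8_filenames((1, 10, 5)))
print(hdf5_has_utf8_filenames((1, 10, 6)))
```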

@takluyver (Member) commented Apr 25, 2019

I'm going to drop the milestone for now, because this depends on changes in HDF5 itself. But I trust that someone will remind us when this is released in HDF5 and we can make any necessary changes in h5py.

@emmenlau commented Jan 16, 2020

I've seen some changes in hdf5 1.10.6 that may be related. I did not check in detail, though. What I've seen is that there are new methods in H5win32defs.h:

    H5_DLL const wchar_t *H5_get_utf16_str(const char *s);
    H5_DLL int Wopen_utf8(const char *path, int oflag, ...);
    H5_DLL int Wremove_utf8(const char *path);

Could this be part of an effort to support Unicode and/or utf8?
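
Judging by the names alone (an assumption; the implementation was not inspected here), these functions take a UTF-8 `char` path and convert it to a UTF-16 wide string before calling the wide-character Windows interfaces. The conversion itself can be sketched in Python; `utf8_to_utf16le` is a hypothetical name used only for this illustration.

```python
def utf8_to_utf16le(path_bytes):
    """Re-encode a UTF-8 byte path as UTF-16-LE.

    UTF-16-LE matches the in-memory layout of a wchar_t string on Windows.
    Invalid UTF-8 input raises UnicodeDecodeError.
    """
    return path_bytes.decode("utf-8").encode("utf-16-le")


wide = utf8_to_utf16le("é.h5".encode("utf-8"))
print(wide)
```

Each character in the Basic Multilingual Plane becomes one 16-bit unit; characters outside it become a surrogate pair.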

@tadeu commented Feb 5, 2020

@emmenlau @aragilar @takluyver @aparamon

I've made a quick small test, and HDF5 is now working with files that have UTF-8 filenames ;)
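
A quick test of this kind might look like the following sketch. It is not the test that was actually run: the filename, the attribute name, and the helper `roundtrip_unicode_filename` are made up for illustration, and the check is skipped when h5py is not installed.

```python
import os
import tempfile

try:
    import h5py
except ImportError:  # h5py is optional for this sketch
    h5py = None


def roundtrip_unicode_filename(name="数据.h5"):
    """Write and re-read an HDF5 file with a non-ASCII name.

    Returns "skipped" without h5py, otherwise "ok" if the attribute
    written through the str path is read back unchanged.
    """
    if h5py is None:
        return "skipped"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name)
        # Create the file and store a marker value in it.
        with h5py.File(path, "w") as f:
            f.attrs["marker"] = 42
        # Reopen through the same path and read the marker back.
        with h5py.File(path, "r") as f:
            value = int(f.attrs["marker"])
        return "ok" if value == 42 and os.path.exists(path) else "failed"


outcome = roundtrip_unicode_filename()
print(outcome)
```

On Windows, the result depends on the HDF5 build that h5py links against, so a "failed" outcome or an exception there would point at an HDF5 version without the UTF-8 filename support discussed above.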

@emmenlau commented Feb 5, 2020

Awesome! Did you need to set any special build flags for HDF5?

@tadeu commented Feb 5, 2020

> Did you need to set any special build flags for HDF5?

I've tested it with pre-compiled packages from Anaconda and conda-forge, and didn't see any special build flag for this issue in their recipes.
