Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode support for the Windows runtime: Let's do it! #1200

Merged
merged 72 commits into from Sep 18, 2017

Conversation

@nojb
Copy link
Contributor

@nojb nojb commented Jun 10, 2017

This PR is a follow-up to #153. See MPR#3771 for context.

The original patch (#153) was created by @ygrek and the patch here is the rebase of that made by Clément Franchini (contacted by email, not yet present in GH I think) so that it applies to trunk.

Over the years (the original patch dates from 2012) there has been a lot of interest in getting this code integrated but the considerable amount of work required has meant that each time the effort has run out of steam and the patch has been left to languish.

We (at LexiFi) are interested in getting this patch merged. Also, the consensus in #153 was that the approach taken here (wrapping Windows "wide" functions to translate to- and from- UTF-8) is the right one. So, let's push to get this merged!

As a first step, I integrated the tests provided by Clément to the OCaml testsuite so that they are run with make test (minus the symbolic link tests, which require fiddling with permissions).

I will be keeping this PR in sync with trunk so that it can be tested easily. Below is a list of issues that still need to be worked on. I will update the list as the discussion progresses.

  • Runtime switch rather than configure-time switch: the functionality is in place; all that is left is to discuss how/if to expose it.
  • Adapting functions dealing with environment variables (GetEnvironmentVariable, ...) and process creation (execv, CreateProcess, ...)
  • Fix the UTF-8 validation function (see comment)
  • Compile with UNICODE and _UNICODE defined (see comment)
  • Hande illegal Unicode: Windows file names are not UTF-16, but sequences of 16-byte values (so that unpaired or mismatched surrogates may appear, see comment). In particular, some file names cannot be represented in valid UTF-8. The approach taken in this patch is as follows: 1) invalid UTF-8 is never generated, and 2) four possible settings:
    • disabled
    • non-strict: Unicode translation between Windows (UTF-16) and OCaml (UTF-8) will silently drop illegal characters.
    • strict with fallback: if illegal characters are found when translating to UTF-16, then the argument string is considered to be encoded in the local codepage. This is the key mechanism used for backwards compatibility.
    • strict without fallback. Like strict with fallback except that there is no fallback: it simply fails if faced with illegal characters.
  • Investigate the segfaults in the testsuite: lib-bigarray-file and lib-unix
  • Adapt ocamlrun
  • Update flexlink (see alainfrisch/flexdll#34)

Any and all comments (as well as help reviewing) very much appreciated. Particularly valuable:

  • Reports from people who are able to test this code
  • Guidance from the core developers as to which points need to be addressed in order to get this to a mergeable state

Thanks!

/cc @alainfrisch @ygrek @dra27

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Jun 10, 2017

was that the approach taken here (wrapping Windows "wide" functions to translate to- and from- UTF-8) is the right one. So, let's push to get this merged!

It seems this patch is still using its own, wrong, UTF-8 validation function. These comments are not addressed by this PR.

@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 10, 2017

Hi @dbuenzli, thanks for the reminder! I have added it to the TODO list and will be looking at that soon.

@dra27
Copy link
Contributor

@dra27 dra27 commented Jun 10, 2017

Thanks for taking this one on, @nojb! There is one further thing which should be on the TODO list, but doesn't necessarily have to be fixed in this PR, which is converting ocamlrun to be built using _UNICODE (see MSDN) - i.e. adding -D_UNICODE -DUNICODE to the building of all C files). Note that this has to be done with a certain amount of care - the aim here is to switch the OCaml codebase to build correctly for "modern" Windows (i.e. Windows NT!), but at this stage we wouldn't want -D_UNICODE -DUNICODE to leak to third party C stubs which correctly compile their C files using ocamlopt.

@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 10, 2017

Hi @dra27, thanks for the reminder! It's been added.

@nojb nojb force-pushed the nojb:win_unicode branch from 3bb4f7a to cfa44b7 Jun 10, 2017
@shindere
Copy link
Contributor

@shindere shindere commented Jun 10, 2017

@dra27
Copy link
Contributor

@dra27 dra27 commented Jun 10, 2017

@shindere - thanks for confirming: I had a memory that was something you'd improved, but I didn't check!

@shindere
Copy link
Contributor

@shindere shindere commented Jun 10, 2017

@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 10, 2017

@dbuenzli I switched the UTF-8 validation function to use the Windows API MultiByteToWideChar. There are some warnings in the doc (look under "Remarks") about false positives produced by this function under Windows XP, but if I understand correctly this issue is only present when checking validity of UTF-16, not UTF-8. @dra27, do you agree ?

@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 10, 2017

OK, it turns out I was being too optimistic. It seems that MultiByteToWideChar will, under Windows XP, incorrectly validate surrogate characters (which are not valid UTF-8). So, what to do ?

  • Don't do anything.
  • Use an alternative code path for old versions of Windows, or
  • Fix/clean up the hand-written UTF-8 validator in the original code ?

Opinions welcome.

@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 10, 2017

There is a lot of interesting information on this issue here and especially in the linked MSDN article. It turns out that Windows file names are not UTF-16 after all, but just a sequence of WCHARs (16-bit quantities). This means that Windows file names can contain unpaired or mismatched surrogates. If this is the case, then there are filenames that can not be represented in "strict" UTF-8 (i.e. without surrogate characters).

See also this issue from the Rust community.

@dra27
Copy link
Contributor

@dra27 dra27 commented Jun 10, 2017

Hmm, at first glance this is making my intuition that we might need a wchar type to do this properly a reality.

Quick thoughts: don't worry about XP for now, unless it's critical to your own objectives (better to worry about the port and then we'll worry about XP later). On the UCS-2 names, my instinct is that we should not generate invalid UTF-8, but possibly raise an exception for invalid pairs (in a similar way to having a filesize which is too large).

@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 10, 2017

Personally, introducing any new types is one of the things I would really, really like to avoid (after all, Sys and Unix are precisely useful because they offer a uniform interface to both Linux and Windows).

In any case, agree completely with not worrying about the WinXP situation for now. Just to be clear, for later versions, the current implementation using WideCharToMultiByte and MultiByteToWideChar will only generate and decode "valid" UTF-8. This means that some "valid" Windows file names will not be representable, but we will worry about this later.

I am adding a TODO item to remind us to think about this point and marking the "Fix UTF-8 validator" as done.

@nojb nojb force-pushed the nojb:win_unicode branch from b96a0b2 to be20bd9 Jun 11, 2017
@dra27
Copy link
Contributor

@dra27 dra27 commented Jun 11, 2017

@nojb - I agree about the resistance to adding types, but at the moment we don't have uniformity (because a very common kind of filename breaks the Sys and Unix interfaces) and although supporting valid UTF-8-representable filenames on Windows is a vast leap in the right direction, we still won't have uniformity if there are valid Windows filenames which OCaml will return an exception if it's asked to read! But all that is on top of what you're doing here.

Copy link
Contributor

@dra27 dra27 left a comment

Good progress, @nojb! This is only a brief review for now. The way memory is being allocated concerns me - however largely irrelevant it may be for small strings, it does feel daft that Unix will now copy every single string which refers to a PATH and then free it.

On the Windows side, it's a shame that strings end up being copied twice - once to get the UTF-8 form and then again via caml_copy_string. This point, though, I think may be fixed by sorting the conversion functions - WideCharToMultiByte can be called without a buffer to determine the size of the UTF-8 output. At present, the code will fail on certain strings where really it should reallocate - so you could use heuristic for buffer size and, on failure, call with no buffer to get the actual size and reallocate - but at this point the string will have converted three times. Alternatively, use WideCharToMultiByte with no buffer to get the size and use caml_alloc_string to put the UTF-8 output directly into an OCaml string (there will also then be no need to spend time with memset zeroing the memory).

I was briefly concerned, but didn't look further, about the path checking in Unix - are there any implications for that on the Windows side with a UTF-8 encoded path (I can't remember what it does)?

Changes Outdated
@@ -76,6 +76,10 @@ Working version
- Resurrect tabulation boxes in module Format. Rewrite/extend documentation
of tabulation boxes.

- MPR#3771, GPR#153, GPR#1200: Unicode support for the Windows runtime.
(ygrek, Clement Franchini, Nicolas Ojeda Bar, review by Alain Frisch, David
Allsop, ...)

This comment has been minimized.

@dra27

dra27 Jun 11, 2017
Contributor

Two ps, please!

This comment has been minimized.

@nojb

nojb Jun 11, 2017
Author Contributor

Fixed, thanks!

#ifdef HAS_WINAPI_UTF16

#ifndef WCHAR
typedef unsigned short WCHAR;

This comment has been minimized.

@dra27

dra27 Jun 11, 2017
Contributor

This feels very wrong - the appropriate Windows header should be pulling in. Could possibly then have #ifndef WCHAR #error ...

extern const CRT_CHAR *const crt_dot;
extern const CRT_CHAR *const crt_dot_dot;

#endif /* CAML_U8TOU16_H */

This comment has been minimized.

@dra27

dra27 Jun 11, 2017
Contributor

I think this whole crt renaming thing is a massive reinventing of a wheel. The Microsoft C runtime already has everything necessary to do this rather more elegantly - why not use tchar.h and adapt the Unix side of things to have an emulated tchar.h (the header should be reasonably obvious, but we can also use mingw's for inspiration to avoid licensing concerns): the code will be much shorter, Windows API functions can be referred to by one name. It would mean that, for example, system becomes _tsystem.


#endif /* HAS_WINAPI_UTF16 */

#define Crt_str_free(p) caml_stat_free(p)

This comment has been minimized.

@dra27

dra27 Jun 11, 2017
Contributor

Is this define necessary, now that we have caml_stat_free?


outbuf_size = len*2 + 8;

outp = malloc(outbuf_size + 2);

This comment has been minimized.

@dra27

dra27 Jun 11, 2017
Contributor

Anything from U+0800 to U+FFFF in UCS-2 requires more than 2 bytes of output. This heuristic is dodgy.

This comment has been minimized.

@nojb

nojb Jun 11, 2017
Author Contributor

Indeed. I guess the safe way is to call WideCharToMultiByte twice, the first one to get the length of the output buffer and the second one to actually convert the string.

#define caml_copy_crt_str caml_copy_utf16
#else /* HAS_WINAPI_UTF16 */
#define caml_copy_crt_str caml_copy_string
#endif /* HAS_WINAPI_UTF16 */

This comment has been minimized.

@nojb

nojb Jun 11, 2017
Author Contributor

This should probably be in u8tou16.h (that's where it was in #153).

name = caml_stat_strconcat(2, prefix, ffblk.name);
free(aname);

This comment has been minimized.

@nojb

nojb Jun 11, 2017
Author Contributor

This seems to be another bad rebase: free(aname) should be protected by HAS_WINAPI_UTF16 and the second argument to caml_stat_strconcat should also be aname.

name = utf16_to_utf8(fileinfo.name);
#else
name = caml_stat_strdup(fileinfo.name);
#endif

This comment has been minimized.

@nojb

nojb Jun 11, 2017
Author Contributor

Maybe define a suitable macro in u8tou16.h for this operation as well ?

len = caml_string_length (cmd);
buf = caml_stat_alloc (len + 1);
memmove (buf, String_val (cmd), len + 1);
#endif /* HAS_WINAPI_UTF16 */

This comment has been minimized.

@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 11, 2017

Thanks @dra27 for the review! I will be looking at the points raised. I also did a first quick reading and found a couple of places where the rebase seems to have gone bad, which will be fixed shortly.

@shayne-fletcher
Copy link

@shayne-fletcher shayne-fletcher commented Jun 11, 2017

@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 11, 2017

@dra27 Re Unix path checking: it checks whether the OCaml string has embedded NULLs, so should work OK with UTF-8.

@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 11, 2017

Found another bug in the UTF-16 -> UTF-8 conversion and added wrappers for getenv and command in Sys.

@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 11, 2017

@dra27 Re the crt-naming issue: if I understand correctly you are suggesting to replace HAS_WINAPI_UTF16 by UNICODE and simply use the names defined in <tchar.h>.

However, my understanding is that we want to have a runtime switch for all this functionality, so that we would want to be able to explicitly refer to both versions (Unicode and ANSI), in which case I think using <tchar.h> wouldn't help us much...

@dra27
Copy link
Contributor

@dra27 dra27 commented Jun 12, 2017

@nojb - the present set-up doesn't help us with being able to use both versions at once either! I've just put some thoughts on Mantis about it.

@nojb nojb force-pushed the nojb:win_unicode branch from 80f59f6 to 49766a6 Jun 12, 2017
@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 12, 2017

I can't reproduce the AppVeyor error locally. Any ideas ?

@dra27
Copy link
Contributor

@dra27 dra27 commented Jun 12, 2017

@avsm - I think that's transient (well, it's not - I've seen it very, very occasionally, but we can't debug it after the fact) - please could you restart the AppVeyor build?

@ygrek
Copy link
Contributor

@ygrek ygrek commented Jun 13, 2017

What will happen if environment has some not-valid unicode contents? I am not sure what wgetenv does in this case. Will it be impossible to pass arbitrary bytes with Unix.putenv?

@nojb
Copy link
Contributor Author

@nojb nojb commented Jun 13, 2017

Hi @ygrek ! My understanding is that Windows actually keeps two copies of the environment (one for Unicode and one for "legacy"). These are kept synchronized in general but I think there are situations where they can become out of sync. If you use _wputenv, then I don't think you can pass arbitrary bytes.

@dra27
Copy link
Contributor

@dra27 dra27 commented Jun 13, 2017

@nojb - _wputenv is not Windows, it's MSVCRT. I haven't checked thoroughly (I don't have access to the machine I have the MSVCRT sources on), but I think that on Windows NT (i.e. everywhere) the environment block is UCS-2 and calling GetEnvironmentVariableA converts the parameter to UCS-2 and then queries the environment block using that converted key. The value returned is then (potentially lossily) converted back to ANSI for the return. MSVCRT sits on top of that process - it caches the entire environment, and does indeed maintain two copies if you call both putenv and _wputenv.

@ygrek - I think it depends on your definition of "not-valid unicode" - nothing's invalid in UCS-2 (I think?). However, your arbitrary bytes should be fine - they'll have been converted to wide characters (so every other byte will be null) and this should successfully convert those normal 16-bit code-points to UTF-8.

@nojb nojb force-pushed the nojb:win_unicode branch from 9da5ddd to d051c32 Sep 16, 2017
@dra27
Copy link
Contributor

@dra27 dra27 commented Sep 16, 2017

@nojb
Copy link
Contributor Author

@nojb nojb commented Sep 16, 2017

Indeed you are right, so I put everything back into a CAML_INTERNALS block and added the missing #defines to otherlibs.

@nojb nojb force-pushed the nojb:win_unicode branch from 1280702 to a41931b Sep 16, 2017
Copy link
Member

@damiendoligez damiendoligez left a comment

Took a look at the diff and didn't see anything amiss. I'm going to trust @dra27 here.

@dra27
Copy link
Contributor

@dra27 dra27 commented Sep 18, 2017

Thank you - I expect it will be easier for everyone if we wait until #681 has been merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

You can’t perform that action at this time.