Unicode support for the Windows runtime: Let's do it! #1200

nojb · 2017-06-10T09:16:06Z

This PR is a follow-up to #153. See MPR#3771 for context.

The original patch (#153) was created by @ygrek; the patch here is a rebase of it onto trunk made by Clément Franchini (contacted by email; not yet on GitHub, I think).

Over the years (the original patch dates from 2012) there has been a lot of interest in getting this code integrated but the considerable amount of work required has meant that each time the effort has run out of steam and the patch has been left to languish.

We (at LexiFi) are interested in getting this patch merged. Also, the consensus in #153 was that the approach taken here (wrapping Windows "wide" functions to translate to- and from- UTF-8) is the right one. So, let's push to get this merged!

As a first step, I integrated the tests provided by Clément into the OCaml testsuite so that they are run with make test (minus the symbolic link tests, which require fiddling with permissions).

I will be keeping this PR in sync with trunk so that it can be tested easily. Below is a list of issues that still need to be worked on. I will update the list as the discussion progresses.

- Runtime switch rather than configure-time switch: the functionality is in place; all that is left is to discuss how/if to expose it.
- Adapt functions dealing with environment variables (GetEnvironmentVariable, ...) and process creation (execv, CreateProcess, ...)
- Fix the UTF-8 validation function (see comment)
- Compile with UNICODE and _UNICODE defined (see comment)
- Handle illegal Unicode: Windows file names are not UTF-16, but sequences of 16-bit values (so that unpaired or mismatched surrogates may appear, see comment). In particular, some file names cannot be represented in valid UTF-8. The approach taken in this patch is as follows: 1) invalid UTF-8 is never generated, and 2) there are four possible settings:
  - disabled
  - non-strict: Unicode translation between Windows (UTF-16) and OCaml (UTF-8) will silently drop illegal characters.
  - strict with fallback: if illegal characters are found when translating to UTF-16, then the argument string is considered to be encoded in the local codepage. This is the key mechanism used for backwards compatibility.
  - strict without fallback: like strict with fallback, except that there is no fallback: it simply fails if faced with illegal characters.
- Investigate the segfaults in the testsuite: lib-bigarray-file and lib-unix
- Adapt ocamlrun
- Update flexlink (see flexdll#34, "Add Unicode support to flexdll (was: Add wide-character version of flexdll_dlopen)")

Any and all comments (as well as help reviewing) very much appreciated. Particularly valuable:

- Reports from people who are able to test this code
- Guidance from the core developers as to which points need to be addressed in order to get this to a mergeable state

Thanks!

/cc @alainfrisch @ygrek @dra27

dbuenzli · 2017-06-10T09:39:31Z

> was that the approach taken here (wrapping Windows "wide" functions to translate to- and from- UTF-8) is the right one. So, let's push to get this merged!

It seems this patch is still using its own, wrong, UTF-8 validation function. These comments are not addressed by this PR.

nojb · 2017-06-10T09:41:49Z

Hi @dbuenzli, thanks for the reminder! I have added it to the TODO list and will be looking at that soon.

dra27 · 2017-06-10T10:05:19Z

Thanks for taking this one on, @nojb! There is one further thing which should be on the TODO list, but doesn't necessarily have to be fixed in this PR: converting ocamlrun to be built using _UNICODE (see MSDN), i.e. adding -D_UNICODE -DUNICODE to the build of all C files. Note that this has to be done with a certain amount of care: the aim here is to switch the OCaml codebase to build correctly for "modern" Windows (i.e. Windows NT!), but at this stage we wouldn't want -D_UNICODE -DUNICODE to leak to third party C stubs which correctly compile their C files using ocamlopt.

nojb · 2017-06-10T10:42:39Z

Hi @dra27, thanks for the reminder! It's been added.

shindere · 2017-06-10T16:54:56Z

David Allsopp (2017/06/10 03:05 -0700):

Thanks for taking this one on, @nojb! There is one further thing which should be on the TODO list, but doesn't necessarily have to be fixed in this PR, which is converting ocamlrun to be built using `_UNICODE` (see [MSDN](https://msdn.microsoft.com/en-us/library/windows/desktop/ff381407(v=vs.85).aspx)) - i.e. adding `-D_UNICODE -DUNICODE` to the building of all C files). Note that this has to be done with a certain amount of care - the aim here is to switch the OCaml codebase to build correctly for "modern" Windows (i.e. Windows NT!), but at this stage we wouldn't want `-D_UNICODE -DUNICODE` to leak to third party C stubs which correctly compile their C files using `ocamlopt`.

It shouldn't be too difficult to achieve this because the C compiler flags used to compile files from the OCaml compiler itself are now distinct from those that shall be used when a third-party C source file is compiled by calling ocamlc/ocamlopt. See the OCAMLC_CFLAGS and OCAMLOPT_CFLAGS build variables, IIRC. Flags not added to these variables won't be passed to the C compiler as invoked by ocamlc/ocamlopt to compile third-party C source files.

dra27 · 2017-06-10T17:00:57Z

@shindere - thanks for confirming: I had a memory that was something you'd improved, but I didn't check!

shindere · 2017-06-10T17:07:06Z

Yeah it's far from being perfect but perhaps a bit better than it used to be.

nojb · 2017-06-10T19:26:01Z

@dbuenzli I switched the UTF-8 validation function to use the Windows API MultiByteToWideChar. There are some warnings in the doc (look under "Remarks") about false positives produced by this function under Windows XP, but if I understand correctly this issue is only present when checking validity of UTF-16, not UTF-8. @dra27, do you agree ?

nojb · 2017-06-10T19:47:17Z

OK, it turns out I was being too optimistic. It seems that MultiByteToWideChar will, under Windows XP, incorrectly accept encoded surrogate characters (which are not valid UTF-8). So, what to do?

- Don't do anything,
- Use an alternative code path for old versions of Windows, or
- Fix/clean up the hand-written UTF-8 validator in the original code?

Opinions welcome.
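
For reference, the third option (a hand-written validator) can be sketched in portable C. This is only a minimal sketch of the UTF-8 well-formedness rules from RFC 3629, not the validator from the patch; the name `utf8_is_valid` is hypothetical. It rejects overlong forms, encoded surrogates (U+D800..U+DFFF) and values above U+10FFFF.

```c
#include <stddef.h>

/* Returns 1 if the n bytes at s are well-formed UTF-8, 0 otherwise.
   Hypothetical helper, written for illustration only. */
int utf8_is_valid(const unsigned char *s, size_t n)
{
  size_t i = 0;
  while (i < n) {
    unsigned char b = s[i];
    size_t extra;       /* number of continuation bytes expected */
    unsigned long cp;   /* code point being decoded */
    unsigned long min;  /* smallest code point allowed for this length */

    if (b < 0x80) { i++; continue; }                      /* ASCII */
    else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
    else return 0;      /* stray continuation byte or invalid lead byte */

    if (n - i <= extra) return 0;                         /* truncated */
    for (size_t k = 1; k <= extra; k++) {
      unsigned char c = s[i + k];
      if ((c & 0xC0) != 0x80) return 0;                   /* not 10xxxxxx */
      cp = (cp << 6) | (unsigned long)(c & 0x3F);
    }
    if (cp < min) return 0;                               /* overlong form */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;           /* surrogate */
    if (cp > 0x10FFFF) return 0;                          /* out of range */
    i += extra + 1;
  }
  return 1;
}
```

With this sketch, the three bytes ED A0 80 (an encoded U+D800) are rejected, which is the case that the XP code path is reported to accept.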

nojb · 2017-06-10T20:17:28Z

There is a lot of interesting information on this issue here and especially in the linked MSDN article. It turns out that Windows file names are not UTF-16 after all, but just a sequence of WCHARs (16-bit quantities). This means that Windows file names can contain unpaired or mismatched surrogates. If this is the case, then there are filenames that cannot be represented in "strict" UTF-8 (i.e. without surrogate characters).

See also this issue from the Rust community.
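
To make the problem concrete, here is a small portable sketch that locates the first unpaired surrogate in a sequence of 16-bit units. The name `utf16_first_unpaired` is hypothetical, and `uint16_t` stands in for WCHAR so that the code compiles outside Windows. A name for which the function returns an index smaller than its length is one of the names that has no strict UTF-8 representation.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first unpaired surrogate among the n code
   units at s, or n if the sequence is well-formed UTF-16.
   Hypothetical helper, written for illustration only. */
size_t utf16_first_unpaired(const uint16_t *s, size_t n)
{
  size_t i = 0;
  while (i < n) {
    uint16_t u = s[i];
    if (u >= 0xD800 && u <= 0xDBFF) {            /* high (lead) surrogate */
      if (i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
        i += 2;                                  /* well-formed pair */
        continue;
      }
      return i;                                  /* lead without trail */
    }
    if (u >= 0xDC00 && u <= 0xDFFF)
      return i;                                  /* trail without lead */
    i++;
  }
  return n;
}
```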

dra27 · 2017-06-10T20:37:26Z

Hmm, at first glance this is making my intuition that we might need a wchar type to do this properly a reality.

Quick thoughts: don't worry about XP for now, unless it's critical to your own objectives (better to worry about the port and then we'll worry about XP later). On the UCS-2 names, my instinct is that we should not generate invalid UTF-8, but possibly raise an exception for invalid pairs (in a similar way to having a filesize which is too large).

nojb · 2017-06-10T22:25:35Z

Personally, introducing any new types is one of the things I would really, really like to avoid (after all, Sys and Unix are precisely useful because they offer a uniform interface to both Linux and Windows).

In any case, agree completely with not worrying about the WinXP situation for now. Just to be clear, for later versions, the current implementation using WideCharToMultiByte and MultiByteToWideChar will only generate and decode "valid" UTF-8. This means that some "valid" Windows file names will not be representable, but we will worry about this later.

I am adding a TODO item to remind us to think about this point and marking the "Fix UTF-8 validator" as done.

dra27 · 2017-06-11T07:27:00Z

@nojb - I agree about the resistance to adding types, but at the moment we don't have uniformity (because a very common kind of filename breaks the Sys and Unix interfaces). Although supporting valid UTF-8-representable filenames on Windows is a vast leap in the right direction, we still won't have uniformity if there are valid Windows filenames for which OCaml raises an exception when asked to read them! But all that is on top of what you're doing here.

dra27

Good progress, @nojb! This is only a brief review for now. The way memory is being allocated concerns me - however largely irrelevant it may be for small strings, it does feel daft that Unix will now copy every single string which refers to a PATH and then free it.

On the Windows side, it's a shame that strings end up being copied twice - once to get the UTF-8 form and then again via caml_copy_string. This point, though, I think may be fixed by sorting the conversion functions - WideCharToMultiByte can be called without a buffer to determine the size of the UTF-8 output. At present, the code will fail on certain strings where really it should reallocate - so you could use a heuristic for the buffer size and, on failure, call with no buffer to get the actual size and reallocate - but at this point the string will have been converted three times. Alternatively, use WideCharToMultiByte with no buffer to get the size and use caml_alloc_string to put the UTF-8 output directly into an OCaml string (there will also then be no need to spend time with memset zeroing the memory).
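
The "size first, then convert" pattern described above can be sketched in portable C. The converter below is a hand-written stand-in, not WideCharToMultiByte; it only mimics the convention that a call without an output buffer returns the required size. All names are hypothetical, `uint16_t` stands in for WCHAR, and replacing unpaired surrogates with U+FFFD is a choice made for this sketch. In the runtime, the second pass would write into the string returned by caml_alloc_string instead of a malloc'ed buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Encodes n UTF-16 code units as UTF-8. When out is NULL nothing is
   written; in both cases the number of bytes required is returned. */
size_t utf16_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
{
  size_t len = 0;
  for (size_t i = 0; i < n; i++) {
    uint32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n
        && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
      /* Well-formed surrogate pair: combine into one code point. */
      cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i + 1] - 0xDC00);
      i++;
    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
      cp = 0xFFFD;  /* unpaired surrogate (sketch-specific choice) */
    }
    if (cp < 0x80) {
      if (out) out[len] = (unsigned char)cp;
      len += 1;
    } else if (cp < 0x800) {
      if (out) {
        out[len]     = (unsigned char)(0xC0 | (cp >> 6));
        out[len + 1] = (unsigned char)(0x80 | (cp & 0x3F));
      }
      len += 2;
    } else if (cp < 0x10000) {
      if (out) {
        out[len]     = (unsigned char)(0xE0 | (cp >> 12));
        out[len + 1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[len + 2] = (unsigned char)(0x80 | (cp & 0x3F));
      }
      len += 3;
    } else {
      if (out) {
        out[len]     = (unsigned char)(0xF0 | (cp >> 18));
        out[len + 1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[len + 2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[len + 3] = (unsigned char)(0x80 | (cp & 0x3F));
      }
      len += 4;
    }
  }
  return len;
}

/* Two calls: the first sizes the buffer, the second fills it.
   Returns a NUL-terminated malloc'ed string, or NULL on failure. */
char *utf16_to_utf8_dup(const uint16_t *in, size_t n)
{
  size_t len = utf16_to_utf8(in, n, NULL);
  unsigned char *out = malloc(len + 1);
  if (out == NULL) return NULL;
  utf16_to_utf8(in, n, out);
  out[len] = '\0';
  return (char *)out;
}
```

The input is scanned twice, but the output is allocated exactly once at the right size, so there is no heuristic to get wrong and no zeroing of spare bytes.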

I was briefly concerned, but didn't look further, about the path checking in Unix - are there any implications for that on the Windows side with a UTF-8 encoded path (I can't remember what it does)?

dra27 · 2017-06-11T07:32:51Z

Changes

@@ -76,6 +76,10 @@ Working version
 - Resurrect tabulation boxes in module Format. Rewrite/extend documentation
  of tabulation boxes.

+- MPR#3771, GPR#153, GPR#1200: Unicode support for the Windows runtime.
+  (ygrek, Clement Franchini, Nicolas Ojeda Bar, review by Alain Frisch, David
+  Allsop, ...)


Two ps, please!

Fixed, thanks!

dra27 · 2017-06-11T07:35:06Z

byterun/caml/u8tou16.h

+#ifdef HAS_WINAPI_UTF16
+
+#ifndef WCHAR
+typedef unsigned short WCHAR;


This feels very wrong - the appropriate Windows header should be pulled in. Could possibly then have #ifndef WCHAR #error ...

dra27 · 2017-06-11T07:40:03Z

byterun/caml/u8tou16.h

+extern const CRT_CHAR *const crt_dot;
+extern const CRT_CHAR *const crt_dot_dot;
+
+#endif /* CAML_U8TOU16_H */


I think this whole crt renaming thing is a massive reinventing of a wheel. The Microsoft C runtime already has everything necessary to do this rather more elegantly - why not use tchar.h and adapt the Unix side of things to have an emulated tchar.h (the header should be reasonably obvious, but we can also use mingw's for inspiration to avoid licensing concerns): the code will be much shorter, Windows API functions can be referred to by one name. It would mean that, for example, system becomes _tsystem.
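
A rough sketch of what such an emulated header could look like on the Unix side is shown below. The generic names `_TCHAR`, `_T`, `_tcslen`, `_tgetenv` and `_tsystem` are the ones provided by Microsoft's <tchar.h>; the fallback definitions and the `path_length` function are hypothetical and only cover a handful of names.

```c
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
/* On Windows the real header provides the generic names; when _UNICODE
   is defined they map to the wide-character variants. */
#include <tchar.h>
#else
/* Hypothetical fallback for other systems: the generic names map to the
   ordinary char-based functions. */
typedef char _TCHAR;
#define _T(x)    x
#define _tcslen  strlen
#define _tgetenv getenv
#define _tsystem system
#endif

/* Shared code is then written once against the generic names. */
size_t path_length(const _TCHAR *path)
{
  return _tcslen(path);
}
```

The point of the design is that call sites stay free of #ifdef blocks; only the header differs between platforms.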

dra27 · 2017-06-11T07:41:27Z

byterun/caml/u8tou16.h

+
+#endif /* HAS_WINAPI_UTF16 */
+
+#define Crt_str_free(p) caml_stat_free(p)


Is this define necessary, now that we have caml_stat_free?

dra27 · 2017-06-11T07:55:57Z

byterun/u8tou16.c

+
+    outbuf_size = len*2 + 8;
+
+    outp = malloc(outbuf_size + 2);


Anything from U+0800 to U+FFFF in UCS-2 requires more than 2 bytes of output. This heuristic is dodgy.

Indeed. I guess the safe way is to call WideCharToMultiByte twice, the first one to get the length of the output buffer and the second one to actually convert the string.
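
The arithmetic behind this objection can be checked with a short sketch. It assumes that `len` in the quoted code counts 16-bit code units; the function names are hypothetical. Each unit at or above U+0800 costs 3 bytes in UTF-8 (a surrogate pair costs 4 bytes for 2 units), so 3 bytes per unit is a safe upper bound while 2 bytes per unit plus a constant is not.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes allocated by the quoted code: outbuf_size = len*2 + 8, plus the
   2 extra bytes passed to malloc. Assumes len counts 16-bit units. */
size_t heuristic_alloc(size_t len)
{
  return len * 2 + 8 + 2;
}

/* Upper bound on the UTF-8 size of n UTF-16 code units. The bound is
   exact when the input contains no surrogates. */
size_t utf8_upper_bound(const uint16_t *s, size_t n)
{
  size_t bytes = 0;
  for (size_t i = 0; i < n; i++) {
    if (s[i] < 0x80) bytes += 1;        /* ASCII */
    else if (s[i] < 0x800) bytes += 2;  /* U+0080..U+07FF */
    else bytes += 3;                    /* U+0800..U+FFFF */
  }
  return bytes;
}
```

For example, twelve copies of U+20AC need 36 bytes, while the heuristic allocates 34.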

nojb · 2017-06-11T07:38:05Z

byterun/caml/sys.h

+#define caml_copy_crt_str caml_copy_utf16
+#else /* HAS_WINAPI_UTF16 */
+#define caml_copy_crt_str caml_copy_string
+#endif /* HAS_WINAPI_UTF16 */


This should probably be in u8tou16.h (that's where it was in #153).

nojb · 2017-06-11T08:35:38Z

byterun/win32.c

    name = caml_stat_strconcat(2, prefix, ffblk.name);
+    free(aname);


This seems to be another bad rebase: free(aname) should be protected by HAS_WINAPI_UTF16 and the second argument to caml_stat_strconcat should also be aname.

nojb · 2017-06-11T08:37:59Z

byterun/win32.c

+      name = utf16_to_utf8(fileinfo.name);
+#else
+      name = caml_stat_strdup(fileinfo.name);
+#endif


Maybe define a suitable macro in u8tou16.h for this operation as well ?

nojb · 2017-06-11T08:42:22Z

otherlibs/win32unix/system.c

  len = caml_string_length (cmd);
  buf = caml_stat_alloc (len + 1);
  memmove (buf, String_val (cmd), len + 1);
+#endif /* HAS_WINAPI_UTF16 */


This seems a bad rebase as well, compare https://github.com/ocaml/ocaml/pull/153/files#diff-6456858f78e5820f603ddc0a5c3998fbR35.

nojb · 2017-06-11T08:48:45Z

Thanks @dra27 for the review! I will be looking at the points raised. I also did a first quick reading and found a couple of places where the rebase seems to have gone bad, which will be fixed shortly.

shayne-fletcher · 2017-06-11T19:16:39Z

Nice work Nicolas :)

nojb · 2017-06-11T20:44:48Z

@dra27 Re Unix path checking: it checks whether the OCaml string has embedded NULLs, so should work OK with UTF-8.
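
A minimal sketch of that kind of check is given below; the name `is_c_safe` is hypothetical and this is not the runtime's own code. It is compatible with UTF-8 because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so a zero byte can only come from an actual U+0000.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the len bytes at s contain no NUL byte, i.e. the string
   can be passed to a C function without being silently truncated. */
int is_c_safe(const char *s, size_t len)
{
  return memchr(s, '\0', len) == NULL;
}
```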

nojb · 2017-06-11T22:11:56Z

Found another bug in the UTF-16 -> UTF-8 conversion and added wrappers for getenv and command in Sys.

nojb · 2017-06-11T22:21:30Z

@dra27 Re the crt-naming issue: if I understand correctly you are suggesting to replace HAS_WINAPI_UTF16 by UNICODE and simply use the names defined in <tchar.h>.

However, my understanding is that we want to have a runtime switch for all this functionality, so that we would want to be able to explicitly refer to both versions (Unicode and ANSI), in which case I think using <tchar.h> wouldn't help us much...

dra27 · 2017-06-12T07:58:13Z

@nojb - the present set-up doesn't help us with being able to use both versions at once either! I've just put some thoughts on Mantis about it.

nojb · 2017-06-12T21:32:52Z

I can't reproduce the AppVeyor error locally. Any ideas ?

dra27 · 2017-06-12T21:36:31Z

@avsm - I think that's transient (well, it's not - I've seen it very, very occasionally, but we can't debug it after the fact) - please could you restart the AppVeyor build?

ygrek · 2017-06-13T20:27:38Z

What will happen if the environment has some invalid Unicode contents? I am not sure what wgetenv does in this case. Will it be impossible to pass arbitrary bytes with Unix.putenv?

nojb · 2017-06-13T20:36:06Z

Hi @ygrek ! My understanding is that Windows actually keeps two copies of the environment (one for Unicode and one for "legacy"). These are kept synchronized in general but I think there are situations where they can become out of sync. If you use _wputenv, then I don't think you can pass arbitrary bytes.

dra27 · 2017-06-13T20:50:13Z

@nojb - _wputenv is not Windows, it's MSVCRT. I haven't checked thoroughly (I don't have access to the machine I have the MSVCRT sources on), but I think that on Windows NT (i.e. everywhere) the environment block is UCS-2 and calling GetEnvironmentVariableA converts the parameter to UCS-2 and then queries the environment block using that converted key. The value returned is then (potentially lossily) converted back to ANSI for the return. MSVCRT sits on top of that process - it caches the entire environment, and does indeed maintain two copies if you call both putenv and _wputenv.

@ygrek - I think it depends on your definition of "not-valid unicode" - nothing's invalid in UCS-2 (I think?). However, your arbitrary bytes should be fine - they'll have been converted to wide characters (so every other byte will be null) and this should successfully convert those normal 16-bit code-points to UTF-8.

dra27 · 2017-09-16T10:20:29Z

I think otherlibs C files are allowed to declare CAML_INTERNALS?

nojb · 2017-09-16T22:38:28Z

Indeed you are right, so I put everything back into a CAML_INTERNALS block and added the missing #defines to otherlibs.

damiendoligez

Took a look at the diff and didn't see anything amiss. I'm going to trust @dra27 here.

dra27 · 2017-09-18T15:18:16Z

Thank you - I expect it will be easier for everyone if we wait until #681 has been merged?

nojb force-pushed the win_unicode branch from 3bb4f7a to cfa44b7 Compare June 10, 2017 11:38

nojb force-pushed the win_unicode branch from b96a0b2 to be20bd9 Compare June 11, 2017 00:03

dra27 reviewed Jun 11, 2017

nojb commented Jun 11, 2017

nojb force-pushed the win_unicode branch from 80f59f6 to 49766a6 Compare June 12, 2017 21:26

gasche mentioned this pull request Jun 13, 2017

Should we have a function to copy files ? ocaml-batteries-team/batteries-included#758

Closed

nojb added 5 commits September 16, 2017 11:04

Fix Changes entry

6f95234

shell32.lib is no longer necessary

6330b16

Free allocated string

a55dc9f

Changes: signal breaking change

f3fa4d9

Disable exec_tests

d051c32

nojb force-pushed the win_unicode branch from 9da5ddd to d051c32 Compare September 16, 2017 09:04

Protect with CAML_INTERNALS

a41931b

nojb force-pushed the win_unicode branch from 1280702 to a41931b Compare September 16, 2017 22:51

damiendoligez approved these changes Sep 18, 2017

damiendoligez merged commit 9fe6d0e into ocaml:trunk Sep 18, 2017

This was referenced Sep 19, 2017

ocamltest #681

Merged

Fix ocamltest / Windows / Unicode issues #1357

Merged

Fix headernt.c & -output-obj Unicode support #1362

Merged

Fix naming of shared Unicode stubs #1363

Merged

Unix.environment on Windows: use _wenviron #1369

Merged

This was referenced Oct 4, 2017

One more Windows Unicode PR: do not use %S #1398

Merged

caml_sys_isatty: detect Cygwin/MSYS for better -color heuristic #1406

Merged

This was referenced Oct 24, 2017

Enable UTF-8 on Windows 10 Console (Mark #2) #1444

Merged

Update FlexDLL to 0.37 #1447

Merged

dra27 mentioned this pull request Nov 26, 2017

Use native Windows API for Unix.{getenv,putenv,environment}, Sys.getenv #1479

Merged

nojb mentioned this pull request Feb 27, 2019

configure: add --disable-windows-unicode option #2264

Merged

dra27 mentioned this pull request Apr 16, 2019

Windows Unicode support for ocamlyacc #8621

Merged

EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024

Create rss feed planet folder if missing (ocaml#1200)

a3663a4

Co-authored-by: Cuihtlauac ALVARADO <cuihtmlauac@tarides.com>

