-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode support for the Windows runtime: Let's do it! #1200
Conversation
It seems this patch is still using its own, wrong, UTF-8 validation function. These comments are not addressed by this PR. |
Hi @dbuenzli, thanks for the reminder! I have added it to the TODO list and will be looking at that soon. |
Thanks for taking this one on, @nojb! There is one further thing which should be on the TODO list, but doesn't necessarily have to be fixed in this PR, which is converting ocamlrun to be built using |
Hi @dra27, thanks for the reminder! It's been added. |
David Allsopp (2017/06/10 03:05 -0700):
Thanks for taking this one on, @nojb! There is one further thing which
should be on the TODO list, but doesn't necessarily have to be fixed
in this PR, which is converting ocamlrun to be built using `_UNICODE`
(see
[MSDN](https://msdn.microsoft.com/en-us/library/windows/desktop/ff381407(v=vs.85).aspx)) - i.e. adding `-D_UNICODE -DUNICODE` to the building of all C files). Note that this has to be done with a certain amount of care - the aim here is to switch the OCaml codebase to build correctly for "modern" Windows (i.e. Windows NT!), but at this stage we wouldn't want `-D_UNICODE -DUNICODE` to leak to third party C stubs which correctly compile their C files using `ocamlopt`.
It shouldn't be too difficult to achieve this because the C compiler
flags used to compile files from the OCaml compiler itself are now
distinct from those that shall be used when a third-party C source file
is compiled by calling ocamlc/ocamlopt. See the OCAMLC_CFLAGS and
OCAMLOPT_CFLAGS, IIRC. Flags not added to these build variables won't be
passed to the C compiler as invoked by ocamlc/ocamlopt to compile
third-party C source files.
|
@shindere - thanks for confirming: I had a memory that was something you'd improved, but I didn't check! |
Yeah it's far from being perfect but perhaps a bit better than it used
to be.
|
@dbuenzli I switched the UTF-8 validation function to use the Windows API |
OK, it turns out I was being too optimistic. It seems that
Opinions welcome. |
There is a lot of interesting information on this issue here and especially in the linked MSDN article. It turns out that Windows file names are not UTF-16 after all, but just a sequence of See also this issue from the Rust community. |
Hmm, at first glance this is making my intuition that we might need a wchar type to do this properly a reality. Quick thoughts: don't worry about XP for now, unless it's critical to your own objectives (better to worry about the port and then we'll worry about XP later). On the UCS-2 names, my instinct is that we should not generate invalid UTF-8, but possibly raise an exception for invalid pairs (in a similar way to having a filesize which is too large). |
Personally, introducing any new types is one of the things I would really, really like to avoid (after all, In any case, agree completely with not worrying about the WinXP situation for now. Just to be clear, for later versions, the current implementation using I am adding a TODO item to remind us to think about this point and marking the "Fix UTF-8 validator" as done. |
@nojb - I agree about the resistance to adding types, but at the moment we don't have uniformity (because a very common kind of filename breaks the Sys and Unix interfaces) and although supporting valid UTF-8-representable filenames on Windows is a vast leap in the right direction, we still won't have uniformity if there are valid Windows filenames which OCaml will return an exception if it's asked to read! But all that is on top of what you're doing here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good progress, @nojb! This is only a brief review for now. The way memory is being allocated concerns me - however largely irrelevant it may be for small strings, it does feel daft that Unix will now copy every single string which refers to a PATH and then free it.
On the Windows side, it's a shame that strings end up being copied twice - once to get the UTF-8 form and then again via caml_copy_string. This point, though, I think may be fixed by sorting the conversion functions - WideCharToMultiByte can be called without a buffer to determine the size of the UTF-8 output. At present, the code will fail on certain strings where really it should reallocate - so you could use heuristic for buffer size and, on failure, call with no buffer to get the actual size and reallocate - but at this point the string will have converted three times. Alternatively, use WideCharToMultiByte with no buffer to get the size and use caml_alloc_string to put the UTF-8 output directly into an OCaml string (there will also then be no need to spend time with memset zeroing the memory).
I was briefly concerned, but didn't look further, about the path checking in Unix - are there any implications for that on the Windows side with a UTF-8 encoded path (I can't remember what it does)?
Changes
Outdated
@@ -76,6 +76,10 @@ Working version | |||
- Resurrect tabulation boxes in module Format. Rewrite/extend documentation | |||
of tabulation boxes. | |||
|
|||
- MPR#3771, GPR#153, GPR#1200: Unicode support for the Windows runtime. | |||
(ygrek, Clement Franchini, Nicolas Ojeda Bar, review by Alain Frisch, David | |||
Allsop, ...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two p
s, please!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks!
byterun/caml/u8tou16.h
Outdated
#ifdef HAS_WINAPI_UTF16 | ||
|
||
#ifndef WCHAR | ||
typedef unsigned short WCHAR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels very wrong - the appropriate Windows header should be pulling in. Could possibly then have #ifndef WCHAR #error ...
byterun/caml/u8tou16.h
Outdated
extern const CRT_CHAR *const crt_dot; | ||
extern const CRT_CHAR *const crt_dot_dot; | ||
|
||
#endif /* CAML_U8TOU16_H */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this whole crt renaming thing is a massive reinventing of a wheel. The Microsoft C runtime already has everything necessary to do this rather more elegantly - why not use tchar.h
and adapt the Unix side of things to have an emulated tchar.h (the header should be reasonably obvious, but we can also use mingw's for inspiration to avoid licensing concerns): the code will be much shorter, Windows API functions can be referred to by one name. It would mean that, for example, system
becomes _tsystem
.
byterun/caml/u8tou16.h
Outdated
|
||
#endif /* HAS_WINAPI_UTF16 */ | ||
|
||
#define Crt_str_free(p) caml_stat_free(p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this define necessary, now that we have caml_stat_free?
byterun/u8tou16.c
Outdated
|
||
outbuf_size = len*2 + 8; | ||
|
||
outp = malloc(outbuf_size + 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anything from U+0800 to U+FFFF in UCS-2 requires more than 2 bytes of output. This heuristic is dodgy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. I guess the safe way is to call WideCharToMultiByte
twice, the first one to get the length of the output buffer and the second one to actually convert the string.
byterun/caml/sys.h
Outdated
#define caml_copy_crt_str caml_copy_utf16 | ||
#else /* HAS_WINAPI_UTF16 */ | ||
#define caml_copy_crt_str caml_copy_string | ||
#endif /* HAS_WINAPI_UTF16 */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be in u8tou16.h
(that's where it was in #153).
byterun/win32.c
Outdated
name = caml_stat_strconcat(2, prefix, ffblk.name); | ||
free(aname); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be another bad rebase: free(aname)
should be protected by HAS_WINAPI_UTF16
and the second argument to caml_stat_strconcat
should also be aname
.
byterun/win32.c
Outdated
name = utf16_to_utf8(fileinfo.name); | ||
#else | ||
name = caml_stat_strdup(fileinfo.name); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe define a suitable macro in u8tou16.h
for this operation as well ?
otherlibs/win32unix/system.c
Outdated
len = caml_string_length (cmd); | ||
buf = caml_stat_alloc (len + 1); | ||
memmove (buf, String_val (cmd), len + 1); | ||
#endif /* HAS_WINAPI_UTF16 */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bad rebase as well, compare https://github.com/ocaml/ocaml/pull/153/files#diff-6456858f78e5820f603ddc0a5c3998fbR35.
Thanks @dra27 for the review! I will be looking at the points raised. I also did a first quick reading and found a couple of places where the rebase seems to have gone bad, which will be fixed shortly. |
Nice work Nicolas :)
…On Jun 11, 2017 04:48, "Nicolás Ojeda Bär" ***@***.***> wrote:
Thanks @dra27 <https://github.com/dra27> for the review! I will be
looking at the points raised. I also did a first quick reading and found a
couple of places where the rebase seems to have gone bad, which will be
fixed shortly.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#1200 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABbkBzVBD7vJkiC79og0Q5DMO8IK05L4ks5sC6nvgaJpZM4N2D3V>
.
|
@dra27 Re Unix path checking: it checks whether the OCaml string has embedded NULLs, so should work OK with UTF-8. |
Found another bug in the |
@dra27 Re the However, my understanding is that we want to have a runtime switch for all this functionality, so that we would want to be able to explicitly refer to both versions (Unicode and ANSI), in which case I think using |
@nojb - the present set-up doesn't help us with being able to use both versions at once either! I've just put some thoughts on Mantis about it. |
I can't reproduce the AppVeyor error locally. Any ideas ? |
@avsm - I think that's transient (well, it's not - I've seen it very, very occasionally, but we can't debug it after the fact) - please could you restart the AppVeyor build? |
What will happen if environment has some not-valid unicode contents? I am not sure what wgetenv does in this case. Will it be impossible to pass arbitrary bytes with |
Hi @ygrek ! My understanding is that Windows actually keeps two copies of the environment (one for Unicode and one for "legacy"). These are kept synchronized in general but I think there are situations where they can become out of sync. If you use |
@nojb - _wputenv is not Windows, it's MSVCRT. I haven't checked thoroughly (I don't have access to the machine I have the MSVCRT sources on), but I think that on Windows NT (i.e. everywhere) the environment block is UCS-2 and calling GetEnvironmentVariableA converts the parameter to UCS-2 and then queries the environment block using that converted key. The value returned is then (potentially lossily) converted back to ANSI for the return. MSVCRT sits on top of that process - it caches the entire environment, and does indeed maintain two copies if you call both putenv and _wputenv. @ygrek - I think it depends on your definition of "not-valid unicode" - nothing's invalid in UCS-2 (I think?). However, your arbitrary bytes should be fine - they'll have been converted to wide characters (so every other byte will be null) and this should successfully convert those normal 16-bit code-points to UTF-8. |
I think otherlibs C files are allowed to declare CAML_INTERNALS?
|
Indeed you are right, so I put everything back into a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look at the diff and didn't see anything amiss. I'm going to trust @dra27 here.
Thank you - I expect it will be easier for everyone if we wait until #681 has been merged? |
Co-authored-by: Cuihtlauac ALVARADO <cuihtmlauac@tarides.com>
This PR is a follow-up to #153. See MPR#3771 for context.
The original patch (#153) was created by @ygrek and the patch here is the rebase of that made by Clément Franchini (contacted by email, not yet present in GH I think) so that it applies to
trunk
.Over the years (the original patch dates from 2012) there has been a lot of interest in getting this code integrated but the considerable amount of work required has meant that each time the effort has run out of steam and the patch has been left to languish.
We (at LexiFi) are interested in getting this patch merged. Also, the consensus in #153 was that the approach taken here (wrapping Windows "wide" functions to translate to- and from- UTF-8) is the right one. So, let's push to get this merged!
As a first step, I integrated the tests provided by Clément to the OCaml testsuite so that they are run with
make test
(minus the symbolic link tests, which require fiddling with permissions).I will be keeping this PR in sync with
trunk
so that it can be tested easily. Below is a list of issues that still need to be worked on. I will update the list as the discussion progresses.configure
-time switch: the functionality is in place; all that is left is to discuss how/if to expose it.GetEnvironmentVariable
, ...) and process creation (execv
,CreateProcess
, ...)UNICODE
and_UNICODE
defined (see comment)disabled
non-strict
: Unicode translation between Windows (UTF-16) and OCaml (UTF-8) will silently drop illegal characters.strict with fallback
: if illegal characters are found when translating to UTF-16, then the argument string is considered to be encoded in the local codepage. This is the key mechanism used for backwards compatibility.strict without fallback
. Likestrict with fallback
except that there is no fallback: it simply fails if faced with illegal characters.lib-bigarray-file
andlib-unix
ocamlrun
flexlink
(see Add Unicode support to flexdll (was: Add wide-character version of flexdll_dlopen) flexdll#34)Any and all comments (as well as help reviewing) very much appreciated. Particularly valuable:
Thanks!
/cc @alainfrisch @ygrek @dra27