Unicode support for windows runtime #153
Do you plan on adding the todos you spoke about in the bts "exposing runtime switch and use winapi for utf8 validation" ? |
Yeap, just wanted to get early feedback |
FWIW, it's high on our priority list to review and test this proposal. |
@alainfrisch any update on this? |
Still high on our priority list :-/ |
We discussed it at the last developer meeting, but the consensus was that it seemed too risky to include before more testing is done. |
Conflicts: asmrun/Makefile.nt byterun/Makefile.nt byterun/sys.c byterun/win32.c config/Makefile.mingw config/Makefile.msvc otherlibs/unix/chdir.c otherlibs/unix/chmod.c otherlibs/unix/getcwd.c otherlibs/unix/rmdir.c otherlibs/unix/unlink.c otherlibs/unix/utimes.c otherlibs/win32unix/Makefile.nt otherlibs/win32unix/link.c otherlibs/win32unix/rename.c utils/ccomp.ml
FYI: I visited a serious industrial OCaml user yesterday, and they spontaneously mentioned this issue (Unicode file name under Windows) as a significant problem for them. |
Hi, I implemented this patch to do some testing because our users with Unicode paths have the same problem. I found that some functions still do not work, like "caml_sys_open". I'll keep posting any new problems I find. |
@modlfo Any update on the problem you found? I was wondering if something should be done for interacting with environment variables which are often used to store filenames. Should this be addressed at the same time? (There are specific "wchar" APIs for environment variables.) |
@alainfrisch This patch broke when we moved our product from 4.02.3 to 4.03.0. We still had strange problems with functions that call processes like |
Thanks, this is already useful information. Indeed CreateProcess does not seem to have switched to the @ygrek : is this just an oversight, or would it break anything else? |
@dra27 Could you look at this and decide what to do with it? |
Unless someone else wishes to push this patch further this month (January), I'd just like to revisit this in 4.06 (it's important to me, but at the moment there are only so many days available!) - I've been musing on a possible alternate approach for a while, but I think discussion of it will be better with some code. My concerns at present are in two areas:
|
I'm interested to understand exactly what this means :-) Does it mean in particular that syscalls will fail for ill-formed UTF-16 sequences? Does it mean that no Unicode normalization happens (so that e.g. two different UTF-16 sequences representing both "é" after proper collation would produce two different files whose name would be printed in the same way by all graphical tools)? |
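To make the normalization question concrete, here is a minimal C sketch. The array names and the `same_code_units` helper are illustrative and do not come from the patch. It shows that the precomposed "é" (U+00E9) and its decomposed form ("e" followed by U+0301) are different UTF-16 code unit sequences, so any layer that compares names code unit by code unit, without normalizing first, would treat them as two distinct names:

```c
#include <stddef.h>
#include <stdint.h>

/* "é" in NFC: a single precomposed code point, U+00E9. */
static const uint16_t e_acute_nfc[] = { 0x00E9 };

/* "é" in NFD: "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301). */
static const uint16_t e_acute_nfd[] = { 0x0065, 0x0301 };

/* Hypothetical helper: compare two UTF-16 names code unit by code unit,
   with no normalization. Returns 1 if they are identical, 0 otherwise. */
static int same_code_units(const uint16_t *a, size_t alen,
                           const uint16_t *b, size_t blen)
{
    size_t i;
    if (alen != blen) return 0;
    for (i = 0; i < alen; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

Whether a given Windows API or file system actually behaves this way is exactly the open question in this comment; the sketch only shows why the two spellings are not interchangeable at the code unit level.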
So am I! The docs are remarkably unclear - in fact, my incorrect understanding was that it was all still UCS-2. The key relevance for me is that Windows is natively 16-bit code points, not 8. My hunches for how UTF-16 it is are: a) It will vary by version of Windows (with a certain amount of associated sighing and gnashing of teeth, but Unicode has varied in the same time, so not so much Microsoft's fault) |
A glimmer of light in that otherwise bleak picture is that Microsoft needs
to have some basic sanity in their Unicode support in order to sell Windows
in China. The reason is that the Chinese government requires that all
software sold in China be certified as being GB-18030 compliant. The way
many Western companies do GB-18030 certification is to support some version
of Unicode, and then make the case that they _also_ support GB-18030
because they support Unicode.
That said, my guess would be that calls to, e.g. CreateFile() will do no
normalization, leading to nonsensical situations (e.g., the same string
normalized to NFK and NFC will produce two "different" objects in the file
system). I would also guess that blatantly invalid UTF-16 strings (e.g., a
lone low/high surrogate) would end up having the invalid characters
replaced by the Unicode Replacement Character.
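As a point of reference for the "lone low/high surrogate" case, a well-formedness check for UTF-16 can be sketched in a few lines of C. The function name `utf16_is_well_formed` is hypothetical, and the sketch says nothing about what Windows itself does with such input; it only pins down which sequences count as ill-formed:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if s[0..len) is well-formed UTF-16, 0 otherwise.
   Well-formed means: every high surrogate (0xD800-0xDBFF) is immediately
   followed by a low surrogate (0xDC00-0xDFFF), and no low surrogate
   appears without a preceding high surrogate. */
static int utf16_is_well_formed(const uint16_t *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            if (i + 1 >= len) return 0;
            if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) return 0;
            i += 2;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            /* Low surrogate with no high surrogate before it. */
            return 0;
        } else {
            i += 1;
        }
    }
    return 1;
}
```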
…On Wed, Jan 4, 2017 at 4:28 AM, David Allsopp ***@***.***> wrote:
So am I! The docs are remarkably unclear - in fact, my incorrect
understanding was that it was all still UCS-2. The key relevance for me is
that Windows is natively 16-bit code points, not 8. My hunches for how
UTF-16 it is are:
a) It will vary by version of Windows (with a certain amount of associated
sighing and gnashing of teeth, but Unicode has varied in the same time, so
not so much Microsoft's fault)
b) Any "clever" parts will be missing (so I'm expecting that surrogates
will be the only thing which works, not normalisation). The key ominous
sentence here
<https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069(v=vs.85).aspx>
is "Windows 2000 introduces support for basic input, output, and simple
sorting of supplementary characters. However, not all system components are
compatible with supplementary characters."!
|
I don't think you'll be able to achieve this without introducing a lot of API noise in the system. OCaml's That said I don't understand why this PR is still using a half broken UTF-8 decoder written in C. What is the problem with simply assuming that OCaml strings are UTF-8 encoded and use |
@dbuenzli - indeed that may well end up being so, but if that proves the case then it would still be worth having a central (i.e. within OCaml) way of addressing them (a la UChar). Eliminating any C beyond calls into the Windows API is certainly something I'd be doing! Either way, I think it is worth exploring what changes it would entail - I just haven't given it enough detailed thought yet. It's difficult to argue that Unix is a compatibility layer given both its name and its design ... but that's also on my Windows-platform-radar |
Addressing what ? |
16-bit Windows wchar strings |
Please don't. As I already said I don't see how you'd like to introduce this without significant API noise. Having a type for that would be relatively useless, you are constantly transforming file paths when you deal with file systems, and the natural way to do this within the existing ocaml file system apis is via |
But we're not just talking about file system paths, we're talking about the entire API. Anyway, whichever way it ends up being, it's not yet done and, at least from me, it's not going to get that much more thought for a few months... |
There is a lot of remaining work here but I think it's worth a collective effort. Could we tentatively target release 4.06 in 6 months from now? Concerning the overall design, I think we should keep filenames on the Caml side as UTF8-encoded strings, and convert to whatever internal representation Win32 wants just around the Win32 system calls. Conversions are cheap compared to the cost of system calls. Concerning the current prototype implementation, I tried to review it but balked at the number of #ifdefs |
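The "convert just around the system call" design can be sketched as follows. This is an illustration under stated assumptions, not the code of this PR: `utf8_to_utf16` is a hypothetical helper written in portable C so that it compiles anywhere. In the Windows runtime the same step would more likely be a call to `MultiByteToWideChar` with `CP_UTF8` (and `MB_ERR_INVALID_CHARS` to reject ill-formed input), with the resulting buffer passed to the wide-character variant of the Win32 function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: strictly decode the UTF-8 bytes in[0..inlen) into
   UTF-16 code units stored in out[0..outcap).
   Returns the number of code units written, or -1 if the input is
   ill-formed UTF-8 or the output buffer is too small. */
static long utf8_to_utf16(const unsigned char *in, size_t inlen,
                          uint16_t *out, size_t outcap)
{
    size_t i = 0, o = 0;
    while (i < inlen) {
        uint32_t cp, min;
        size_t need, k;
        unsigned char b = in[i++];
        if (b < 0x80)                { cp = b;        need = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; need = 3; min = 0x10000; }
        else return -1;                  /* stray continuation or invalid lead byte */
        if (need > inlen - i) return -1; /* truncated sequence */
        for (k = 0; k < need; k++) {
            unsigned char c = in[i++];
            if ((c & 0xC0) != 0x80) return -1; /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return -1;                     /* overlong encoding */
        if (cp > 0x10FFFF) return -1;                /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1; /* encoded surrogate */
        if (cp < 0x10000) {
            if (o + 1 > outcap) return -1;
            out[o++] = (uint16_t)cp;
        } else {
            /* Supplementary code point: emit a surrogate pair. */
            if (o + 2 > outcap) return -1;
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return (long)o;
}
```

A wrapper in this style keeps the UTF-8 convention on the OCaml side and confines the 16-bit representation to the few lines surrounding each system call, which is the property argued for above.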
Seen on another forum, more explanations on UTF8 and why it's probably the better alternative even for Windows programming: http://utf8everywhere.org/ |
While I certainly do not share all of the points made by this document and find it sometimes a bit shortsighted (at least in the version I read when it was brought to my attention a few months ago), it was recently pointed out to me that the POSIX file system API actually doesn't care about/mention filename encodings at all and treats filenames simply as byte sequences. The encoding you provide/get depends on the underlying file system format (@dsheets's deeper knowledge on the matter could confirm that this is So |
Work has continued as part of #1200, closing this one. |
http://caml.inria.fr/mantis/view.php?id=3771
This is a preliminary patch, without runtime switch and still using manual utf8 checking.
Tested both on msvc and mingw versions, but somehow cannot bootstrap with unicode disabled at configure time.