Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode support for windows runtime #153

Closed
wants to merge 8 commits into from
Closed

Conversation

@ygrek
Copy link
Contributor

@ygrek ygrek commented Mar 10, 2015

http://caml.inria.fr/mantis/view.php?id=3771

This is a preliminary patch, without runtime switch and still using manual utf8 checking.
Tested both on msvc and mingw versions, but somehow cannot bootstrap with unicode disabled at configure time.

@bobot
Copy link
Contributor

@bobot bobot commented Jun 17, 2015

Do you plan on adding the todos you spoke about in the bts "exposing runtime switch and use winapi for utf8 validation" ?

@ygrek
Copy link
Contributor Author

@ygrek ygrek commented Jun 17, 2015

Yeap, just wanted to get early feedback

@alainfrisch
Copy link
Contributor

@alainfrisch alainfrisch commented Jul 1, 2015

FWIW, it's high on our priority list to review and test this proposal.

@let-def
Copy link
Contributor

@let-def let-def commented Nov 15, 2015

@alainfrisch any update on this?

@alainfrisch
Copy link
Contributor

@alainfrisch alainfrisch commented Nov 16, 2015

Still high on our priority list :-/

@gasche
Copy link
Member

@gasche gasche commented Nov 19, 2015

We discussed it at the last developer meeting, but the consensus was that it seemed to risky to include before more testing is done.

ygrek added 8 commits Feb 10, 2015
Conflicts:
	asmrun/Makefile.nt
	byterun/Makefile.nt
	byterun/sys.c
	byterun/win32.c
	config/Makefile.mingw
	config/Makefile.msvc
	otherlibs/unix/chdir.c
	otherlibs/unix/chmod.c
	otherlibs/unix/getcwd.c
	otherlibs/unix/rmdir.c
	otherlibs/unix/unlink.c
	otherlibs/unix/utimes.c
	otherlibs/win32unix/Makefile.nt
	otherlibs/win32unix/link.c
	otherlibs/win32unix/rename.c
	utils/ccomp.ml
@ygrek ygrek force-pushed the ygrek:win_unicode branch from 9370d95 to c82f9c8 Nov 23, 2015
@xavierleroy
Copy link
Contributor

@xavierleroy xavierleroy commented Dec 11, 2015

FYI: I visited a serious industrial OCaml user yesterday, and they spontaneously mentioned this issue (Unicode file name under Windows) as a significant problem for them.

@damiendoligez damiendoligez added this to the 4.04-or-later milestone Dec 23, 2015
@modlfo
Copy link

@modlfo modlfo commented Jan 8, 2016

Hi, I implemented this patch to do some testing because our users with unicode paths have the same problem.

I found that some functions still not work, like "caml_sys_open". I'll keep posting any new problem I find.

@alainfrisch
Copy link
Contributor

@alainfrisch alainfrisch commented Jul 13, 2016

@modlfo Any update on the problem you found?

I was wondering if something should be done for interacting with environment variables which are often used to store filenames. Should this be addressed at the same time? (There are specific "wchar" APIs for environment variables.)

@modlfo
Copy link

@modlfo modlfo commented Jul 13, 2016

@alainfrisch This patch broke when we moved our product from 4.02.3 to 4.03.0. We still had strange problems with functions that call processes like Unix.create_process in unicode paths or with spaces. Our solution was to stop using all the OCaml runtime functions and create bindings to Qt functions (our product already depends on Qt). So far that was the only way we could provide good support on Windows.

@alainfrisch
Copy link
Contributor

@alainfrisch alainfrisch commented Jul 13, 2016

Thanks, this is already useful information. Indeed CreateProcess does not seem to have switched to the W variant.

@ygrek : is this just an oversight, or would it break anything else?

@damiendoligez damiendoligez removed this from the 4.04 milestone Aug 3, 2016
@mshinwell
Copy link
Contributor

@mshinwell mshinwell commented Dec 27, 2016

@dra27 Could you look at this and decide what to do with it?

@dra27
Copy link
Contributor

@dra27 dra27 commented Dec 30, 2016

Unless someone else wishes to push this patch further this month (January), I'd just like to revisit this in 4.06 (it's important to me, but at the moment there are only so many days available!) - I've been musing on a possible alternate approach for a while, but I think discussion of it will be better with some code. My concerns at present are in two areas:

  1. I think that introducing Windows Unicode piecemeal will lead to serious design gotchas later - so I think the entire build system should be using -D_UNICODE and -DUNICODE (+ #include <tchar.h> as appropriate) rather than explicitly selecting A and W variants in certain places only.
  2. Windows is UTF-16 (all other calls are a conversion) apart from some annoying corners which remain UCS-2 (such as the Console), and I'm keen to see if there is a good solution which does not convert between UTF-8 and UTF-16 all the time (it is obvious that portability will require this option, so I'm wondering if there is a case for something rather like the LargeFile module in Unix). This could open a hornets nest of exposing too much Unicode in the standard library, but we'll see...
@dra27 dra27 added this to the 4.06-or-later milestone Dec 30, 2016
@alainfrisch
Copy link
Contributor

@alainfrisch alainfrisch commented Jan 3, 2017

Windows is UTF-16

I'm interested to understand exactly what this means :-) Does it mean in particular that syscalls will fail for ill-formed UTF-16 sequences? Does it mean that no Unicode normalization happens (so that e.g. two different UTF-16 sequences representing both "é" after proper collation would produce two different files whose name would be printed in the same way by all graphical tools)?

@dra27
Copy link
Contributor

@dra27 dra27 commented Jan 4, 2017

So am I! The docs are remarkably unclear - in fact, my incorrect understanding was that it was all still UCS-2. The key relevance for me is that Windows is natively 16-bit code points, not 8. My hunches for how UTF-16 it is are:

a) It will vary by version of Windows (with a certain amount of associated sighing and gnashing of teeth, but Unicode has varied in the same time, so not so much Microsoft's fault)
b) Any "clever" parts will be missing (so I'm expecting that surrogates will be the only thing which works, not normalisation). The key ominous sentence here is "Windows 2000 introduces support for basic input, output, and simple sorting of supplementary characters. However, not all system components are compatible with supplementary characters."!

@zhenya1007
Copy link

@zhenya1007 zhenya1007 commented Jan 5, 2017

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Jan 5, 2017

I'm keen to see if there is a good solution which does not convert between UTF-8 and UTF-16 all the time

I don't think you'll be able to achieve this without introducing a lot of API noise in the system. OCaml's Unix and Sys are portability layer, there is always a cost in portability layers. Efficient platform specifics can be done by external projects.

That said I don't understand why this PR is still using a half broken UTF-8 decoder written in C.

What is the problem with simply assuming that OCaml strings are UTF-8 encoded and use MultiByteToWideChar/WideCharToMultiByte and the wide version of the Win API when you get to/return from the syscalls ?

@dra27
Copy link
Contributor

@dra27 dra27 commented Jan 5, 2017

@dbuenzli - indeed that may well end up being so, but if that proves the case then it would still be worth having a central (i.e. within OCaml) way of addressing them (a la UChar). Eliminating any C beyond calls into the Windows API is certainly something I'd be doing! Either way, I think it is worth exploring what changes it would entail - I just haven given it enough detailed thought yet.

It's difficult to argue that Unix is a compatibility layer given both its name and its design ... but that's also on my Windows-platform-radar 😄

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Jan 5, 2017

but if that proves the case then it would still be worth having a central (i.e. within OCaml) way of addressing them (a la UChar).

Addressing what ?

@dra27
Copy link
Contributor

@dra27 dra27 commented Jan 5, 2017

16-bit Windows wchar strings

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Jan 5, 2017

16-bit Windows wchar strings

Please don't. As I already said I don't see how you'd like to introduce this without significant API noise. Having a type for that would be relatively useless, you are constantly transforming file paths when you deal with file systems, and the natural way to do this within the existing ocaml file system apis is via strings.

@dra27
Copy link
Contributor

@dra27 dra27 commented Jan 5, 2017

But we're not just talking about file system paths, we're talking about the entire API. Anyway, whichever way it ends up being, it's not yet done and, at least from me, it's not going to get that much more thought for a few months...

@xavierleroy
Copy link
Contributor

@xavierleroy xavierleroy commented Feb 17, 2017

There is a lot of remaining work here but I think it's worth a collective effort. Could we tentatively target release 4.06 in 6 months from now?

Concerning the overall design, I think we should keep filenames on the Caml side as UTF8-encoded strings, and convert to whatever internal representation Win32 wants just around the Win32 system calls. Conversions are cheap compared to the cost of system calls.

Concerning the current prototype implementation, I tried to review it but balked at the number of #ifdefs

@xavierleroy
Copy link
Contributor

@xavierleroy xavierleroy commented Mar 9, 2017

Seen on another forum, more explanations on UTF8 and why it's probably the better alternative even for Windows programming: http://utf8everywhere.org/

@dbuenzli
Copy link
Contributor

@dbuenzli dbuenzli commented Mar 9, 2017

While I certainly do not share all of the points made by this document and find it sometimes a bit shortsighted (at least in the version I read when it was brought to my attention a few months ago), it was recently pointed to me that the POSIX file system API actually doesn't care about/mention filename encodings at all an treats filenames simply as byte sequences. The encoding your provide/get depends on the underlying file system format (@dsheets's deeper knowledge on the matter could confirm that this is true).

So string, i.e. byte sequences, seems to be the right data structure to have here and UTF-8, which is compatible with string, the right encoding to assume is provided by the given strings for the win compatibility file system layer. I still think that this is the way to go.

@nojb nojb mentioned this pull request Jun 10, 2017
8 of 8 tasks complete
@alainfrisch
Copy link
Contributor

@alainfrisch alainfrisch commented Jul 20, 2017

Work as continued as part of #1200, closing this one.

lthls pushed a commit to lthls/ocaml that referenced this pull request Jun 10, 2020
lthls added a commit to lthls/ocaml that referenced this pull request Sep 23, 2020
lthls added a commit to lthls/ocaml that referenced this pull request Sep 23, 2020
lthls added a commit to lthls/ocaml that referenced this pull request Sep 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

You can’t perform that action at this time.