New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode support for windows runtime #153

Closed
wants to merge 8 commits into
base: trunk
from

Conversation

Projects
None yet
@ygrek
Contributor

ygrek commented Mar 10, 2015

http://caml.inria.fr/mantis/view.php?id=3771

This is a preliminary patch, without runtime switch and still using manual utf8 checking.
Tested both on msvc and mingw versions, but somehow cannot bootstrap with unicode disabled at configure time.

@bobot

This comment has been minimized.

Show comment
Hide comment
@bobot

bobot Jun 17, 2015

Contributor

Do you plan on adding the todos you spoke about in the bts "exposing runtime switch and use winapi for utf8 validation" ?

Contributor

bobot commented Jun 17, 2015

Do you plan on adding the todos you spoke about in the bts "exposing runtime switch and use winapi for utf8 validation" ?

@ygrek

This comment has been minimized.

Show comment
Hide comment
@ygrek

ygrek Jun 17, 2015

Contributor

Yeap, just wanted to get early feedback

Contributor

ygrek commented Jun 17, 2015

Yeap, just wanted to get early feedback

@alainfrisch

This comment has been minimized.

Show comment
Hide comment
@alainfrisch

alainfrisch Jul 1, 2015

Contributor

FWIW, it's high on our priority list to review and test this proposal.

Contributor

alainfrisch commented Jul 1, 2015

FWIW, it's high on our priority list to review and test this proposal.

@let-def

This comment has been minimized.

Show comment
Hide comment
@let-def

let-def Nov 15, 2015

Contributor

@alainfrisch any update on this?

Contributor

let-def commented Nov 15, 2015

@alainfrisch any update on this?

@alainfrisch

This comment has been minimized.

Show comment
Hide comment
@alainfrisch

alainfrisch Nov 16, 2015

Contributor

Still high on our priority list :-/

Contributor

alainfrisch commented Nov 16, 2015

Still high on our priority list :-/

@gasche

This comment has been minimized.

Show comment
Hide comment
@gasche

gasche Nov 19, 2015

Member

We discussed it at the last developer meeting, but the consensus was that it seemed to risky to include before more testing is done.

Member

gasche commented Nov 19, 2015

We discussed it at the last developer meeting, but the consensus was that it seemed to risky to include before more testing is done.

ygrek added some commits Feb 10, 2015

original patch rebased on trunk (with UTF16_TODO)
Conflicts:
	asmrun/Makefile.nt
	byterun/Makefile.nt
	byterun/sys.c
	byterun/win32.c
	config/Makefile.mingw
	config/Makefile.msvc
	otherlibs/unix/chdir.c
	otherlibs/unix/chmod.c
	otherlibs/unix/getcwd.c
	otherlibs/unix/rmdir.c
	otherlibs/unix/unlink.c
	otherlibs/unix/utimes.c
	otherlibs/win32unix/Makefile.nt
	otherlibs/win32unix/link.c
	otherlibs/win32unix/rename.c
	utils/ccomp.ml
@xavierleroy

This comment has been minimized.

Show comment
Hide comment
@xavierleroy

xavierleroy Dec 11, 2015

Contributor

FYI: I visited a serious industrial OCaml user yesterday, and they spontaneously mentioned this issue (Unicode file name under Windows) as a significant problem for them.

Contributor

xavierleroy commented Dec 11, 2015

FYI: I visited a serious industrial OCaml user yesterday, and they spontaneously mentioned this issue (Unicode file name under Windows) as a significant problem for them.

@damiendoligez damiendoligez added this to the 4.04-or-later milestone Dec 23, 2015

@modlfo

This comment has been minimized.

Show comment
Hide comment
@modlfo

modlfo Jan 8, 2016

Hi, I implemented this patch to do some testing because our users with unicode paths have the same problem.

I found that some functions still not work, like "caml_sys_open". I'll keep posting any new problem I find.

modlfo commented Jan 8, 2016

Hi, I implemented this patch to do some testing because our users with unicode paths have the same problem.

I found that some functions still not work, like "caml_sys_open". I'll keep posting any new problem I find.

@alainfrisch

This comment has been minimized.

Show comment
Hide comment
@alainfrisch

alainfrisch Jul 13, 2016

Contributor

@modlfo Any update on the problem you found?

I was wondering if something should be done for interacting with environment variables which are often used to store filenames. Should this be addressed at the same time? (There are specific "wchar" APIs for environment variables.)

Contributor

alainfrisch commented Jul 13, 2016

@modlfo Any update on the problem you found?

I was wondering if something should be done for interacting with environment variables which are often used to store filenames. Should this be addressed at the same time? (There are specific "wchar" APIs for environment variables.)

@modlfo

This comment has been minimized.

Show comment
Hide comment
@modlfo

modlfo Jul 13, 2016

@alainfrisch This patch broke when we moved our product from 4.02.3 to 4.03.0. We still had strange problems with functions that call processes like Unix.create_process in unicode paths or with spaces. Our solution was to stop using all the OCaml runtime functions and create bindings to Qt functions (our product already depends on Qt). So far that was the only way we could provide good support on Windows.

modlfo commented Jul 13, 2016

@alainfrisch This patch broke when we moved our product from 4.02.3 to 4.03.0. We still had strange problems with functions that call processes like Unix.create_process in unicode paths or with spaces. Our solution was to stop using all the OCaml runtime functions and create bindings to Qt functions (our product already depends on Qt). So far that was the only way we could provide good support on Windows.

@alainfrisch

This comment has been minimized.

Show comment
Hide comment
@alainfrisch

alainfrisch Jul 13, 2016

Contributor

Thanks, this is already useful information. Indeed CreateProcess does not seem to have switched to the W variant.

@ygrek : is this just an oversight, or would it break anything else?

Contributor

alainfrisch commented Jul 13, 2016

Thanks, this is already useful information. Indeed CreateProcess does not seem to have switched to the W variant.

@ygrek : is this just an oversight, or would it break anything else?

@damiendoligez damiendoligez removed this from the 4.04 milestone Aug 3, 2016

@mshinwell

This comment has been minimized.

Show comment
Hide comment
@mshinwell

mshinwell Dec 27, 2016

Contributor

@dra27 Could you look at this and decide what to do with it?

Contributor

mshinwell commented Dec 27, 2016

@dra27 Could you look at this and decide what to do with it?

@dra27

This comment has been minimized.

Show comment
Hide comment
@dra27

dra27 Dec 30, 2016

Contributor

Unless someone else wishes to push this patch further this month (January), I'd just like to revisit this in 4.06 (it's important to me, but at the moment there are only so many days available!) - I've been musing on a possible alternate approach for a while, but I think discussion of it will be better with some code. My concerns at present are in two areas:

  1. I think that introducing Windows Unicode piecemeal will lead to serious design gotchas later - so I think the entire build system should be using -D_UNICODE and -DUNICODE (+ #include <tchar.h> as appropriate) rather than explicitly selecting A and W variants in certain places only.
  2. Windows is UTF-16 (all other calls are a conversion) apart from some annoying corners which remain UCS-2 (such as the Console), and I'm keen to see if there is a good solution which does not convert between UTF-8 and UTF-16 all the time (it is obvious that portability will require this option, so I'm wondering if there is a case for something rather like the LargeFile module in Unix). This could open a hornets nest of exposing too much Unicode in the standard library, but we'll see...
Contributor

dra27 commented Dec 30, 2016

Unless someone else wishes to push this patch further this month (January), I'd just like to revisit this in 4.06 (it's important to me, but at the moment there are only so many days available!) - I've been musing on a possible alternate approach for a while, but I think discussion of it will be better with some code. My concerns at present are in two areas:

  1. I think that introducing Windows Unicode piecemeal will lead to serious design gotchas later - so I think the entire build system should be using -D_UNICODE and -DUNICODE (+ #include <tchar.h> as appropriate) rather than explicitly selecting A and W variants in certain places only.
  2. Windows is UTF-16 (all other calls are a conversion) apart from some annoying corners which remain UCS-2 (such as the Console), and I'm keen to see if there is a good solution which does not convert between UTF-8 and UTF-16 all the time (it is obvious that portability will require this option, so I'm wondering if there is a case for something rather like the LargeFile module in Unix). This could open a hornets nest of exposing too much Unicode in the standard library, but we'll see...

@dra27 dra27 added this to the 4.06-or-later milestone Dec 30, 2016

@alainfrisch

This comment has been minimized.

Show comment
Hide comment
@alainfrisch

alainfrisch Jan 3, 2017

Contributor

Windows is UTF-16

I'm interested to understand exactly what this means :-) Does it mean in particular that syscalls will fail for ill-formed UTF-16 sequences? Does it mean that no Unicode normalization happens (so that e.g. two different UTF-16 sequences representing both "é" after proper collation would produce two different files whose name would be printed in the same way by all graphical tools)?

Contributor

alainfrisch commented Jan 3, 2017

Windows is UTF-16

I'm interested to understand exactly what this means :-) Does it mean in particular that syscalls will fail for ill-formed UTF-16 sequences? Does it mean that no Unicode normalization happens (so that e.g. two different UTF-16 sequences representing both "é" after proper collation would produce two different files whose name would be printed in the same way by all graphical tools)?

@dra27

This comment has been minimized.

Show comment
Hide comment
@dra27

dra27 Jan 4, 2017

Contributor

So am I! The docs are remarkably unclear - in fact, my incorrect understanding was that it was all still UCS-2. The key relevance for me is that Windows is natively 16-bit code points, not 8. My hunches for how UTF-16 it is are:

a) It will vary by version of Windows (with a certain amount of associated sighing and gnashing of teeth, but Unicode has varied in the same time, so not so much Microsoft's fault)
b) Any "clever" parts will be missing (so I'm expecting that surrogates will be the only thing which works, not normalisation). The key ominous sentence here is "Windows 2000 introduces support for basic input, output, and simple sorting of supplementary characters. However, not all system components are compatible with supplementary characters."!

Contributor

dra27 commented Jan 4, 2017

So am I! The docs are remarkably unclear - in fact, my incorrect understanding was that it was all still UCS-2. The key relevance for me is that Windows is natively 16-bit code points, not 8. My hunches for how UTF-16 it is are:

a) It will vary by version of Windows (with a certain amount of associated sighing and gnashing of teeth, but Unicode has varied in the same time, so not so much Microsoft's fault)
b) Any "clever" parts will be missing (so I'm expecting that surrogates will be the only thing which works, not normalisation). The key ominous sentence here is "Windows 2000 introduces support for basic input, output, and simple sorting of supplementary characters. However, not all system components are compatible with supplementary characters."!

@zhenya1007

This comment has been minimized.

Show comment
Hide comment
@zhenya1007

zhenya1007 Jan 5, 2017

zhenya1007 commented Jan 5, 2017

@dbuenzli

This comment has been minimized.

Show comment
Hide comment
@dbuenzli

dbuenzli Jan 5, 2017

Contributor

I'm keen to see if there is a good solution which does not convert between UTF-8 and UTF-16 all the time

I don't think you'll be able to achieve this without introducing a lot of API noise in the system. OCaml's Unix and Sys are portability layer, there is always a cost in portability layers. Efficient platform specifics can be done by external projects.

That said I don't understand why this PR is still using a half broken UTF-8 decoder written in C.

What is the problem with simply assuming that OCaml strings are UTF-8 encoded and use MultiByteToWideChar/WideCharToMultiByte and the wide version of the Win API when you get to/return from the syscalls ?

Contributor

dbuenzli commented Jan 5, 2017

I'm keen to see if there is a good solution which does not convert between UTF-8 and UTF-16 all the time

I don't think you'll be able to achieve this without introducing a lot of API noise in the system. OCaml's Unix and Sys are portability layer, there is always a cost in portability layers. Efficient platform specifics can be done by external projects.

That said I don't understand why this PR is still using a half broken UTF-8 decoder written in C.

What is the problem with simply assuming that OCaml strings are UTF-8 encoded and use MultiByteToWideChar/WideCharToMultiByte and the wide version of the Win API when you get to/return from the syscalls ?

@dra27

This comment has been minimized.

Show comment
Hide comment
@dra27

dra27 Jan 5, 2017

Contributor

@dbuenzli - indeed that may well end up being so, but if that proves the case then it would still be worth having a central (i.e. within OCaml) way of addressing them (a la UChar). Eliminating any C beyond calls into the Windows API is certainly something I'd be doing! Either way, I think it is worth exploring what changes it would entail - I just haven given it enough detailed thought yet.

It's difficult to argue that Unix is a compatibility layer given both its name and its design ... but that's also on my Windows-platform-radar 😄

Contributor

dra27 commented Jan 5, 2017

@dbuenzli - indeed that may well end up being so, but if that proves the case then it would still be worth having a central (i.e. within OCaml) way of addressing them (a la UChar). Eliminating any C beyond calls into the Windows API is certainly something I'd be doing! Either way, I think it is worth exploring what changes it would entail - I just haven given it enough detailed thought yet.

It's difficult to argue that Unix is a compatibility layer given both its name and its design ... but that's also on my Windows-platform-radar 😄

@dbuenzli

This comment has been minimized.

Show comment
Hide comment
@dbuenzli

dbuenzli Jan 5, 2017

Contributor

but if that proves the case then it would still be worth having a central (i.e. within OCaml) way of addressing them (a la UChar).

Addressing what ?

Contributor

dbuenzli commented Jan 5, 2017

but if that proves the case then it would still be worth having a central (i.e. within OCaml) way of addressing them (a la UChar).

Addressing what ?

@dra27

This comment has been minimized.

Show comment
Hide comment
@dra27

dra27 Jan 5, 2017

Contributor

16-bit Windows wchar strings

Contributor

dra27 commented Jan 5, 2017

16-bit Windows wchar strings

@dbuenzli

This comment has been minimized.

Show comment
Hide comment
@dbuenzli

dbuenzli Jan 5, 2017

Contributor

16-bit Windows wchar strings

Please don't. As I already said I don't see how you'd like to introduce this without significant API noise. Having a type for that would be relatively useless, you are constantly transforming file paths when you deal with file systems, and the natural way to do this within the existing ocaml file system apis is via strings.

Contributor

dbuenzli commented Jan 5, 2017

16-bit Windows wchar strings

Please don't. As I already said I don't see how you'd like to introduce this without significant API noise. Having a type for that would be relatively useless, you are constantly transforming file paths when you deal with file systems, and the natural way to do this within the existing ocaml file system apis is via strings.

@dra27

This comment has been minimized.

Show comment
Hide comment
@dra27

dra27 Jan 5, 2017

Contributor

But we're not just talking about file system paths, we're talking about the entire API. Anyway, whichever way it ends up being, it's not yet done and, at least from me, it's not going to get that much more thought for a few months...

Contributor

dra27 commented Jan 5, 2017

But we're not just talking about file system paths, we're talking about the entire API. Anyway, whichever way it ends up being, it's not yet done and, at least from me, it's not going to get that much more thought for a few months...

@xavierleroy

This comment has been minimized.

Show comment
Hide comment
@xavierleroy

xavierleroy Feb 17, 2017

Contributor

There is a lot of remaining work here but I think it's worth a collective effort. Could we tentatively target release 4.06 in 6 months from now?

Concerning the overall design, I think we should keep filenames on the Caml side as UTF8-encoded strings, and convert to whatever internal representation Win32 wants just around the Win32 system calls. Conversions are cheap compared to the cost of system calls.

Concerning the current prototype implementation, I tried to review it but balked at the number of #ifdefs

Contributor

xavierleroy commented Feb 17, 2017

There is a lot of remaining work here but I think it's worth a collective effort. Could we tentatively target release 4.06 in 6 months from now?

Concerning the overall design, I think we should keep filenames on the Caml side as UTF8-encoded strings, and convert to whatever internal representation Win32 wants just around the Win32 system calls. Conversions are cheap compared to the cost of system calls.

Concerning the current prototype implementation, I tried to review it but balked at the number of #ifdefs

@xavierleroy

This comment has been minimized.

Show comment
Hide comment
@xavierleroy

xavierleroy Mar 9, 2017

Contributor

Seen on another forum, more explanations on UTF8 and why it's probably the better alternative even for Windows programming: http://utf8everywhere.org/

Contributor

xavierleroy commented Mar 9, 2017

Seen on another forum, more explanations on UTF8 and why it's probably the better alternative even for Windows programming: http://utf8everywhere.org/

@dbuenzli

This comment has been minimized.

Show comment
Hide comment
@dbuenzli

dbuenzli Mar 9, 2017

Contributor

While I certainly do not share all of the points made by this document and find it sometimes a bit shortsighted (at least in the version I read when it was brought to my attention a few months ago), it was recently pointed to me that the POSIX file system API actually doesn't care about/mention filename encodings at all an treats filenames simply as byte sequences. The encoding your provide/get depends on the underlying file system format (@dsheets's deeper knowledge on the matter could confirm that this is true).

So string, i.e. byte sequences, seems to be the right data structure to have here and UTF-8, which is compatible with string, the right encoding to assume is provided by the given strings for the win compatibility file system layer. I still think that this is the way to go.

Contributor

dbuenzli commented Mar 9, 2017

While I certainly do not share all of the points made by this document and find it sometimes a bit shortsighted (at least in the version I read when it was brought to my attention a few months ago), it was recently pointed to me that the POSIX file system API actually doesn't care about/mention filename encodings at all an treats filenames simply as byte sequences. The encoding your provide/get depends on the underlying file system format (@dsheets's deeper knowledge on the matter could confirm that this is true).

So string, i.e. byte sequences, seems to be the right data structure to have here and UTF-8, which is compatible with string, the right encoding to assume is provided by the given strings for the win compatibility file system layer. I still think that this is the way to go.

@nojb nojb referenced this pull request Jun 10, 2017

Merged

Unicode support for the Windows runtime: Let's do it! #1200

8 of 8 tasks complete
@alainfrisch

This comment has been minimized.

Show comment
Hide comment
@alainfrisch

alainfrisch Jul 20, 2017

Contributor

Work as continued as part of #1200, closing this one.

Contributor

alainfrisch commented Jul 20, 2017

Work as continued as part of #1200, closing this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment