Windows Console Unicode Support #1408

Open
wants to merge 3 commits into
base: trunk
from

6 participants
@dra27
Contributor

dra27 commented Oct 6, 2017

The explanation of this is slightly picky, so please bear with me. This is a continuation of #1398, but it picks up a few other bugs on the way.

On normal OSes, you don't have to worry so much about your terminal - the caller tells you about the terminal, and you just send appropriate output to it. Not quite the case on Windows, especially if you want Unicode characters to display correctly.

At a minimum, I would like the change in byterun/sys.c to call SetConsoleOutputCP(CP_UTF8); to be included in 4.06.0 (if the change in byterun/win32.c is not included, then it is necessary to guard that call with a check that we're running on Windows 10). I think the entire change is safe, especially as it only affects input/output from/to keyboards/screens, and not redirections.

It seems to me that, having gone to the (serious) trouble of adding proper handling for UTF-8 through the runtime on Windows, it's a PR (in the "public relations" sense) disaster to have the toplevel displaying raw UTF-8 sequences when the Unix toplevel now gets to display proper strings.

The primary goal is to ensure that the Windows console is set to interpret UTF-8 text and display the correct Unicode characters. This can be achieved from the console by executing chcp 65001 but it has some important caveats:

  • Prior to Windows 10, the underlying WriteConsole API call contains a bug. This is tracked in MPR#6925. I have included a fix in caml_write_fd which detects that the fd is a console and on the appropriate version of Windows, manually writes the output.
  • Prior to Windows 10 1607, there is a bug when running the UTF-8 codepage which crashes applications which attempt to read Unicode characters entered by the user. This can be seen in the toplevel of OCaml by trying to put UTF-8 directly into a string - it simply segfaults if you're running chcp 65001. The fact that OCaml now supports Unicode filenames, and so forth, means this is something users are more likely to do (run in chcp 65001, I mean). I therefore include a fix in caml_read_fd which similarly upon detection that fd is a console, reads Windows wide-characters directly and then calls the Windows functions to translate those sequences to UTF-8 as the official result of read.
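To make the last step of the input shim concrete, here is a minimal, portable sketch of turning UTF-16 code units into UTF-8. It is not the code from this GPR: the description above says the patch calls the Windows conversion functions, whereas `utf16_to_utf8` below is a hypothetical hand-written stand-in, so that the logic can be read and compiled on any platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the function used in the patch): encodes
   src_len UTF-16 code units as UTF-8 into dst. Returns the number of
   bytes written, or 0 if dst is too small, if src contains an unpaired
   surrogate, or if src is empty. */
size_t utf16_to_utf8(const uint16_t *src, size_t src_len,
                     unsigned char *dst, size_t dst_cap)
{
  size_t i = 0, out = 0;
  assert(src != NULL && dst != NULL);
  while (i < src_len) {
    uint32_t cp = src[i++];
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      /* High surrogate: it must be followed by a low surrogate. */
      if (i >= src_len || src[i] < 0xDC00 || src[i] > 0xDFFF) return 0;
      cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(src[i++] - 0xDC00);
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      return 0;  /* low surrogate without a preceding high surrogate */
    }
    if (cp < 0x80) {
      if (out + 1 > dst_cap) return 0;
      dst[out++] = (unsigned char)cp;
    } else if (cp < 0x800) {
      if (out + 2 > dst_cap) return 0;
      dst[out++] = (unsigned char)(0xC0 | (cp >> 6));
      dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      if (out + 3 > dst_cap) return 0;
      dst[out++] = (unsigned char)(0xE0 | (cp >> 12));
      dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
      dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
    } else {
      if (out + 4 > dst_cap) return 0;
      dst[out++] = (unsigned char)(0xF0 | (cp >> 18));
      dst[out++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
      dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
      dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

For example, the single code unit 0x00E9 (é) becomes the two bytes C3 A9, while the surrogate pair D83D DE08 becomes the four bytes F0 9F 98 88, so a caller has to leave room for up to 4 bytes per character.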

There is also some weirdness in that prior to Windows 10 1607, the "raster fonts" mode of the console (which was the default prior to Windows 10 1507/RTM) cannot ever display Unicode characters and can cause all kinds of crashes for any process trying to do so. For this reason, all the shims back off completely if the console has not been set to a truetype font. This means in the specific case of raster fonts selected and code page 65001 selected that ocamlrun will crash if presented with Unicode input from the console (but that's the same as anything - even cmd.exe displays an error prior to 1607 if you try to echo a string containing extended characters).

Obviously, this has the potential to alter the way both input and output are processed. I will stress here: this only happens when reading from or writing to consoles; it will never affect redirection to files. This is therefore a much lower-risk change than it may appear.

The behaviour, as with many of these things, depends on WINDOWS_UNICODE in config/Makefile.

If WINDOWS_UNICODE is 1, then:

  • The runtime will always select CP_UTF8 for output
  • For versions prior to Windows 10, it will use the output shim so that UTF-8 is processed correctly
  • For all versions, it uses the input shim to read UTF-8 correctly

If WINDOWS_UNICODE is 0, then:

  • The runtime does not alter the codepage for output
  • For versions prior to Windows 10, the output shim is used only if the user has already selected CP_UTF8 (e.g. by running chcp 65001). This fixes MPR#6925
  • If the user has selected CP_UTF8, the input shim is used. I don't know if the crash associated with this is recorded in Mantis, but it fixes that too.

There are some semantic changes to caml_read_fd and caml_write_fd when dealing with console handles, but I believe these are either within spec, or not relevant:

  1. caml_write_fd will not write more than 1KiB to the console at once (that's within spec)
  2. caml_read_fd will often read considerably less than requested (and never more than 4KiB), because it must allow for converting the characters read to UTF-8 sequences
  3. caml_read_fd will fail if asked to read less than 4 bytes from a console

The caml_write_fd change should have no particular effect, unless a program is very badly coded to assume the write always succeeds. The caml_read_fd change results in a subtle change to caml_ml_input_line to prevent it from ever trying to request too few bytes - all the other uses of caml_read_fd are to fill channel buffers completely, so will work.
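A caller that copes with the 1KiB cap simply loops until everything has been written. The following is a minimal sketch under stated assumptions: `write_all`, `write_fn` and `capped_write` are hypothetical names, and `capped_write` only imitates a console write limited to 1024 bytes per call; it is not caml_write_fd.

```c
#include <assert.h>
#include <stddef.h>

/* A write primitive that may accept fewer bytes than requested.
   Returns the number of bytes written, or -1 on error. */
typedef int (*write_fn)(void *ctx, const char *buf, int n);

/* Calls w repeatedly until all n bytes are written or an error occurs.
   Returns 0 on success and -1 on error or lack of progress. */
int write_all(write_fn w, void *ctx, const char *buf, int n)
{
  assert(w != NULL && n >= 0);
  while (n > 0) {
    int written = w(ctx, buf, n);
    if (written <= 0) return -1;
    buf += written;
    n -= written;
  }
  return 0;
}

/* Test double standing in for a console write capped at 1KiB per call. */
struct capped_sink { int calls; int total; };

int capped_write(void *ctx, const char *buf, int n)
{
  struct capped_sink *s = ctx;
  int chunk = n > 1024 ? 1024 : n;
  (void)buf;  /* the data itself is discarded in this sketch */
  s->calls++;
  s->total += chunk;
  return chunk;
}
```

Writing 2500 bytes through `capped_write` takes three calls (1024 + 1024 + 452), which a loop of this shape handles without the caller noticing the cap.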

@gasche

I had a look on the surface with mostly code-readability comments, but of course we would need a review from someone that knows about windows. ( @nojb, @alainfrisch ? )

To be honest I find the amount of changes and their invasiveness a bit scary. I'm not convinced that it would be a "PR disaster" to display characters incorrectly under Windows (the rationale for #1200 was to not break down on unicode-using file paths, not to print nice things in the console), and that seems way better than overlooking something and screwing up I/O. Don't count on me to push this into 4.06 :-)

byterun/io.c
byterun/sys.c
int caml_read_fd(int fd, int flags, void * buf, int n)
{
  int retcode;
  if ((flags & CHANNEL_FLAG_FROM_SOCKET) == 0) {
    caml_enter_blocking_section();
    retcode = read(fd, buf, n);
#if WINDOWS_UNICODE
    if ((flags & CHANNEL_FLAG_CONSOLE) && console_supports_unicode()) {

@gasche

gasche Oct 6, 2017

Member

console_supports_unicode makes what I suppose are three system calls, and you would call that each time we read from the console? Is there a performance cost to this repeated test? (If so, it could make sense to store the information in the channel flags at channel-creation time instead.)

@dra27

dra27 Oct 6, 2017

Contributor

This is reading user input, so I think we could do largely anything, especially with the speed of the Windows console!

I debated caching, but the user is able to change these settings, so the cached value could realistically be out of date. I think it's possible to use SetWinEventHook to be told when console settings may have changed, but that seemed overkill for a function which shouldn't be speed critical.

@gasche

gasche Oct 6, 2017

Member

Ok, so the idea is that all calls to this function in the I/O system are guarded by the more efficient CHANNEL_FLAG_CONSOLE check. That sounds reasonable.

    if ((flags & CHANNEL_FLAG_CONSOLE) && GetConsoleCP() == CP_UTF8
        && console_supports_unicode()) {
#endif
      /* Cannot perform a UTF-8 read unless the buffer is at least 4 bytes */

@gasche

gasche Oct 6, 2017

Member

Is there a risk that some applications would be written to read bytes one at a time (so with n == 1 always), and happen to be run on unicode-supporting systems? Would it be possible to fall back to just a read call in that case (I guess that means designed-for-unicode code could sometimes behave incorrectly?)?

@dra27

dra27 Oct 6, 2017

Contributor

Falling back to read would mean that you might get garbage (if you're lucky - often it's a segfault). I wasn't certain, but it seemed unlikely that you'd ever use this in C stubs - it's largely used internally in a context where it's refilling the channel's buffer.

I'm not sure whether the support of these two is worth extending to Unix.read and Unix.write (they will both exhibit the same problems if you use Unix.stdin and Unix.stdout/Unix.stderr)
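The 4-byte minimum follows from UTF-8 itself: one character typed at the console can need up to 4 bytes, so a 1-byte read has nowhere to put the rest. As a rough sketch of the sizing arithmetic (an assumption about how such a shim could be written, not the code in this GPR; `utf16_units_for_buffer` and both constants are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>

#define MIN_UTF8_BUFFER 4      /* longest UTF-8 encoding of one code point */
#define MAX_CONSOLE_READ 4096  /* assumed upper bound on bytes per call */

/* Number of UTF-16 code units that can be requested so that their UTF-8
   encoding is guaranteed to fit in n bytes. A code unit outside a
   surrogate pair encodes to at most 3 bytes and a pair (2 units) to 4,
   so 3 bytes per unit is a safe bound. Returns 0 when the buffer cannot
   hold an arbitrary character. */
size_t utf16_units_for_buffer(size_t n)
{
  size_t units;
  if (n < MIN_UTF8_BUFFER) return 0;
  if (n > MAX_CONSOLE_READ) n = MAX_CONSOLE_READ;
  units = n / 3;
  assert(units >= 1 && units * 3 <= n);
  return units;
}
```

This also shows why such a read often returns considerably less than requested: a 12-byte buffer only allows 4 code units to be requested, even if the user typed plain ASCII. A real implementation additionally has to cope with a surrogate pair split across two reads.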

@dra27

dra27 Oct 6, 2017

Contributor

@gasche - thanks for the review, and noted regarding 4.06. How amenable would you be to the smaller version mentioned at the start (i.e. call SetConsoleOutputCP(CP_UTF8) on Windows 10 only)?

I can split that off to a separate branch if you'd like to see what it looks like separately, but it's essentially the changes to the Makefiles (to link with version.dll) and all the changes in byterun/sys.c with the addition of if (caml_win32_major >= 10) before the SetConsoleOutputCP(CP_UTF8); call.

@gasche

gasche Oct 6, 2017

Member

I think I was too grumpy in formulating my review -- it seems that late Friday is fine for bisection work, but maybe not for actual communication. I'm not trying to say that I would veto the change, but that right now I wouldn't want to merge it.

Re. the simpler version: it does sound less invasive, but I can't tell how risky the .dll-related changes are. Clearly this needs the opinion of someone that knows about Windows builds.

@xavierleroy

xavierleroy Oct 8, 2017

Contributor

Let me echo @gasche's "abstraction boundary" concern in more brutal terms:

All code that needs #include <windows.h> should be in byterun/win32.c.

Having to #include <windows.h> in sys.c and io.c is a bad smell.

Now, let's work together to find the right "abstraction" functions to put into win32.c (and possibly in unix.c as well for symmetry) so that sys.c and io.c remain mostly Unix/Windows-agnostic.

@xavierleroy

xavierleroy Oct 8, 2017

Contributor

Also: I wonder whether this material is specific to the Windows console (i.e. running an OCaml program or the OCaml toplevel directly from cmd.exe) or whether it is also relevant to using OCaml from a Cygwin or MSYS console, or under Emacs, etc.

@dra27

dra27 Oct 8, 2017

Contributor

@gasche - hah, I didn't think you were being particularly grumpy, and I was grateful for the review before the weekend! I will, with my tongue moderately in my cheek, observe that abstraction seems to be co-opted to mean "not Windows", given the free use of unistd.h where needed, but I also appreciate that where I see the joy of Windows ports behaving similarly to their Unix siblings, others will see a horror of #ifdefs and hacks 😈 I have, apropos of both yours and @xavierleroy's comments, refactored the changes in sys.c and io.c to move code to win32.c and have a little bit of #ifdef-ery in osdeps.h.

There is a version of this branch only making the SetConsoleOutputCP change for Windows 10 at https://github.com/dra27/ocaml/tree/windows10-unicode (I haven't amended the commit messages on that branch, if it is decided to merge that separately).

@dra27

dra27 Oct 8, 2017

Contributor

@xavierleroy - this code might be relevant in all those cases, but only when strictly using the Windows Console. So, when using whatever you Emacs people (!!) call gVim, this code would definitely not apply. Similarly, when using mintty (which is the bundled terminal emulator in Cygwin) for either MSYS2 or Cygwin, this wouldn't apply - standard handles then are pipes and the UTF-8 stuff "just works" because, well, mintty is a properly implemented terminal!

This code does apply if you run, say, Cygwin's bash in Cmd and then start native OCaml - it is possible that the code would be needed if you start a Cygwin-compiled OCaml from Cmd similarly. That said, trying to do any Cygwin/MSYS bash stuff through Cmd seems a moderately strange thing to do to me.

It's also possible that the Cygwin layer does this too - @nojb has noted, for example, that what I do here is remarkably similar to how Git-for-Windows (which is a native Windows application containing a wrapped MSYS2 environment for running scripts) handles the UTF-8 console problems.

@dra27

dra27 Oct 8, 2017

Contributor

@gasche - I should have added the version.dll change is entirely safe... it's just how Windows syscalls "work".

@gasche

gasche Oct 8, 2017

Member

I think it would make sense to have two PRs, one with the ConsoleOutputCP (including the version stuff etc.), and one with the I/O shims built on top of the first one. This would give interested people a natural place to discuss whether the first part should be merged in 4.06, without interference from the discussion about the second part (which can happen in parallel, although of course it would depend on the first one).

(This is the strategy I proposed for %S+ConsoleOutputCP, and I think we were better off discussing the two parts separately, as otherwise the %S part would still be waiting.)

@gasche

gasche Oct 8, 2017

Member

(The reader may have noticed that I wrote the parenthesis above as I was convinced that the %S part, #1398, had already been merged, but in fact it had not. I'll leave it there for comic value, and go merge.)

@dra27

dra27 Oct 8, 2017

Contributor

This GPR is now based on #1416, so only the last 3 commits "belong" here.

@damiendoligez damiendoligez added this to the discuss-for-4.06.0 milestone Oct 11, 2017

@dra27

dra27 Oct 12, 2017

Contributor

Rebased (still suspended until #1416 is merged), but with the change to byterun/unix.c now in the correct GPR 😊

@dra27 dra27 removed the suspended label Oct 12, 2017

@dra27

dra27 Oct 12, 2017

Contributor

Rebased now that #1416 is merged.

@gasche

gasche Oct 14, 2017

Member

cc @xavierleroy, @nojb: if you have further opinions to give on this more advanced part of the "Windows console and Unicode" mystery novel, they are very welcome.

@xavierleroy

xavierleroy Oct 15, 2017

Contributor

Hoping I'm not too pushy, I'd like to see @nojb and @dra27 work on #1406 first, because #1406 looks to me like a lower hanging fruit than the present PR, and has greater benefits. (Nobody runs OCaml in a cmd.exe console; everyone uses a Cygwin or MSys console.) In particular I don't understand why #1406 has already been rescheduled for 4.07-or-later and the present PR is still marked consider-for-release.

@dra27

dra27 Oct 15, 2017

Contributor

Erm, I use OCaml in a Cmd console and have done for the last 14 years - the Cygwin console is an unstable catastrophe for using native applications on a regular basis (it's trivially easy to end up with the terminal freezing - examples include trying to terminate long output of git commands, breaking in the toplevel, ...).

I'm happy to focus effort on #1406 first, but to me the priority is the wrong way around. #1406 improves colour support in error messages, whereas this GPR in one configuration fixes a segfault...

@xavierleroy

xavierleroy Oct 15, 2017

Contributor

My sense of priorities could very well be wrong here; if that's the case, I apologize. Still, for me, the Windows console is this awful 1985 design with the rectangular text selection that everyone reimplements better and nicer. (That's 5 different links.)

@dra27

dra27 Oct 15, 2017

Contributor

Quick survey of them:

ConEmu already supports all this anyway (including coloured output - you just have to trick OCaml by setting the TERM variable) because it hooks all the console functions, and it does it very well.

ColorConsole (you included twice) is a disaster, at least from my quick trial - that appears to be implemented using entirely naïve mechanisms, even the basic prompt doesn't work properly.

mintty is addressed in #1406, but for using the native ports, I really question whether that's an improvement over the Windows 10 Console, given the instabilities.

ConsoleZ, like ConEmu, does this "properly" (you should see how hooking the Console API works, you think I've written some Windows hacks before.......), but doesn't translate ANSI escape sequences, so coloured output doesn't work. However, Unix.isatty correctly returns true (which ColorConsole and mintty do not).

The proper alternative consoles on Windows do a lot of heavy work to act like the Windows Console host, so the priority should be handling that correctly in our code. I don't think we should generally worry about terminal emulators which are just acting like Unix terminals - except that mintty is a very common alternative one, so for me it's an OK exception.

However, all that said, my personal wish was for #1416 to be in 4.06.0, which is merged - I don't mind both this and #1406 being pushed to 4.07, and we can try and sort all the Windows Consoles at the same time. Fundamentally, I'm uneasy with the priority that something works if you install and use an optional piece of software, but it's broken if you use the operating system's official way of doing it - it feels the wrong way round.

@xavierleroy

xavierleroy Oct 15, 2017

Contributor

The 5th link should have been http://www.powercmd.com/ . (I'm fixing my previous comment, for posterity.) I haven't used any of these alternatives except mintty under Cygwin, it's just what I found in 5 minutes of Googling.

@gasche

gasche Oct 15, 2017

Member

Well, @dra27 and @nojb, you have all the cards in hand to overthrow the priority ladder by just being very efficient at cross-reviewing both of these PRs. Let a thousand reviews bloom!

@xavierleroy

xavierleroy Oct 15, 2017

Contributor

I think one data point for not neglecting the Windows Console is the large amount of attention it is receiving from Microsoft lately

Colors do receive attention. Wake me up when they implement proper, non-rectangular text selections. I remember joking about it with a Microsoft Research colleague back in 1997...

@dra27

dra27 Oct 15, 2017

Contributor

Do you mean something like this, @xavierleroy?
image

byterun/win32.c
#if WINDOWS_UNICODE
if ((flags & CHANNEL_FLAG_CONSOLE) && console_supports_unicode()) {
#else
if ((flags & CHANNEL_FLAG_CONSOLE) && GetConsoleCP() == CP_UTF8

@nojb

nojb Oct 15, 2017

Contributor

In this case, WINDOWS_UNICODE=0, shouldn't we translate whatever we get from ReadConsole into the current code page instead of handling only the case of CP_UTF8?

@dra27

dra27 Oct 26, 2017

Contributor

That's what will happen - if the console isn't set to CP_UTF8, then we defer to the C read function which should do exactly that?

}
#undef STATIC_BUFFER_SIZE
int console_supports_unicode (void) {

@nojb

nojb Oct 15, 2017

Contributor

Sorry if you explained this before, but could you say a few words about why this check (namely, that a non-raster Console font is selected) is necessary, especially for input (cf. the check below before win32_utf8_read)?

@nojb

nojb Oct 15, 2017

Contributor

Sorry if you explained this before, but could you say a few words about why this check (namely, that a non-raster console font is selected) is necessary, especially for input (cf. the check below, before win32_utf8_read)?

@dra27

dra27 Oct 26, 2017

Contributor

If a raster font is selected, then setting CP_UTF8 doesn't work - the console behaves differently. It's vaguely mentioned in SetConsoleOutputCP (and various gotchas on StackOverflow, etc.). One of the batteries of C tests I ran on this verified the difference in behaviour, but I can't remember which GPR I commented on (or, slightly embarrassingly, which server I left the C file on!)

@dra27

dra27 Oct 26, 2017

Contributor

(@nojb - I realise that's an unsatisfactory answer to your question - I'm looking for the test case and will add a proper comment to this code when I find it, but I'm virtually certain that the test is necessary!)

if ((flags & CHANNEL_FLAG_CONSOLE) && caml_win32_major < 10 && n > 0
&& console_supports_unicode())
#else
if ((flags & CHANNEL_FLAG_CONSOLE) && GetConsoleOutputCP() == CP_UTF8

@nojb

nojb Oct 15, 2017

Contributor

Same question here as for the case of win32_utf8_read above: shouldn't we be using the currently selected code page instead of handling just the case of CP_UTF8?

@dra27

dra27 Oct 26, 2017

Contributor

This shim is only necessary if the output code page is CP_UTF8 - otherwise the C write function works correctly. The unique problem when CP_UTF8 is selected is that WriteConsole (via WriteFile) returns the wrong answer to the write function, causing the correct Unicode character to be written, followed by some of the characters of the UTF-8 sequence themselves.
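
For readers following along: the commit message for this PR says the shim converts the string to UCS-2 and writes it directly to the console. The conversion step can be sketched in portable C as below. This is only an illustration, not the code from the patch: the name `utf8_to_utf16` is hypothetical, and on Windows the same job would more typically be done with MultiByteToWideChar before handing the result to WriteConsoleW. Like the original version of the PR, the sketch rejects the whole string if any part of it is invalid.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the patch: decode `len` bytes of UTF-8
   from `src` into UTF-16 code units in `dst` (capacity `cap` units).
   Returns the number of units written, or -1 if the input is malformed,
   cut short, or does not fit in `dst`. */
static long utf8_to_utf16(const unsigned char *src, size_t len,
                          uint16_t *dst, size_t cap)
{
  size_t i = 0, out = 0;
  while (i < len) {
    unsigned char b = src[i];
    uint32_t cp, min;
    size_t extra;
    if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
    else return -1;              /* stray continuation or invalid lead byte */
    if (extra > len - i - 1) return -1;   /* sequence cut short */
    for (size_t k = 1; k <= extra; k++) {
      unsigned char c = src[i + k];
      if ((c & 0xC0) != 0x80) return -1;  /* not a continuation byte */
      cp = (cp << 6) | (c & 0x3F);
    }
    i += extra + 1;
    /* Reject overlong forms, surrogates and values beyond U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return -1;
    if (cp < 0x10000) {
      if (out + 1 > cap) return -1;
      dst[out++] = (uint16_t)cp;
    } else {
      /* Code points above the BMP become a surrogate pair. */
      if (out + 2 > cap) return -1;
      cp -= 0x10000;
      dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
      dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    }
  }
  return (long)out;
}
```

Because the caller would then be told that all of the original bytes were written, the mismatch in the value reported by the underlying API never reaches the C write function.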

Contributor

alainfrisch commented Oct 16, 2017

Wake me up when they implement proper, non-rectangular text selections.

Driiiiiiiiing. In my French-speaking Windows 10 Creators Update, this is enabled through "Propriétés / Options / Sélection de texte / Activer la sélection du retour automatique à la ligne" (roughly "Properties / Options / Text Selection / Enable line wrapping selection" in English).

Contributor

dra27 commented Oct 26, 2017

Rebased, and memory mismanagement bug fixed.

@gasche gasche modified the milestones: consider-for-4.06.0, 4.07-or-later Oct 27, 2017

dra27 added some commits Oct 6, 2017

Tweak Windows caml_write_fd spacetime version

    Eliminates the duplicate call to read.

Display Unicode correctly on the Windows console

    For WINDOWS_UNICODE=1, enable UTF-8 output on the Windows Console. In
    Windows 10 and later, this simply works. However, prior to Windows 10 we
    hit MPR#6925 because there is an error in the implementation of
    WriteConsole when CP_UTF8 is selected.

    This is worked around for any code writing to a console via
    caml_write_fd in that caml_write_fd now converts the string to UCS-2
    and writes it directly to the console.

    The workaround is enabled even when WINDOWS_UNICODE=0 if the user has
    manually selected CP_UTF8 (typically by running chcp 65001).

Read Unicode correctly from the Windows console

    Previously, reading Unicode code points > U+00FF would simply result in
    "?" symbols. Even worse, in the UTF-8 code page (chcp 65001), OCaml
    would crash when reading any UTF-8 multi-byte sequences. This commit
    alters caml_read_fd to use ReadConsole to get UCS-2 characters which are
    then converted to UTF-8.

    A side-effect of this is that calls to caml_read_fd will typically
    return much less data than they did before, since ReadConsole must be
    called with a character buffer 1/4 of the size of the supplied byte
    buffer in order to allow for worst-case conversion to UTF-8.

    As with the output changes, this change in behaviour is automatically
    enabled when WINDOWS_UNICODE=1, but it is only enabled otherwise if the
    user has manually selected CP_UTF8 (typically by running chcp 65001).

    This change has the small side-effect that caml_read_fd on Windows must
    be asked to read at least 4 bytes or it will return an error.
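
The sizing rule in the last commit message can be checked with a little arithmetic: a single UTF-16 code unit never needs more than 3 bytes of UTF-8, and a surrogate pair is 2 units and 4 bytes, so asking for a quarter as many characters as there are bytes available always leaves room. Below is a minimal portable sketch of that reasoning. The helper names (`console_units_for`, `utf16_to_utf8`) are hypothetical and not taken from the patch, which presumably calls ReadConsole and a Win32 conversion routine instead; replacing an unpaired surrogate with U+FFFD is likewise an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many UTF-16 code units to request for a byte
   buffer of `nbytes`, using the 1/4 ratio from the commit message.
   The result is 0 when the buffer is under 4 bytes. */
static size_t console_units_for(size_t nbytes) { return nbytes / 4; }

/* Hypothetical helper: encode `n` UTF-16 code units as UTF-8 into `dst`.
   The caller guarantees room for 4 * n bytes. An unpaired surrogate is
   replaced by U+FFFD (an assumption of this sketch).
   Returns the number of bytes written. */
static size_t utf16_to_utf8(const uint16_t *src, size_t n, unsigned char *dst)
{
  size_t out = 0;
  for (size_t i = 0; i < n; i++) {
    uint32_t cp = src[i];
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n
        && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
      /* High surrogate followed by low surrogate: combine the pair. */
      cp = 0x10000 + ((cp - 0xD800) << 10) + (src[++i] - 0xDC00);
    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
      cp = 0xFFFD;               /* unpaired surrogate */
    }
    if (cp < 0x80) {
      dst[out++] = (unsigned char)cp;
    } else if (cp < 0x800) {
      dst[out++] = (unsigned char)(0xC0 | (cp >> 6));
      dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      dst[out++] = (unsigned char)(0xE0 | (cp >> 12));
      dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
      dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
    } else {
      dst[out++] = (unsigned char)(0xF0 | (cp >> 18));
      dst[out++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
      dst[out++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
      dst[out++] = (unsigned char)(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

A buffer under 4 bytes yields a request for 0 characters, which matches the stated restriction that such reads return an error. A surrogate pair split across two reads would need carry-over state that this sketch does not attempt.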

Contributor

xavierleroy commented Nov 18, 2017

Glad to learn that Microsoft understood text selections, 30 years too late.

As mentioned at the latest developer meeting, I'm worried about UTF-8 sequences being "cut in the middle" when presented to caml_write_fd. This can happen if the user maliciously flushes an out_channel after every byte of a multi-byte UTF-8 encoding, but also if the buffer of an out_channel happens to be full and needs flushing at the wrong time. What happens in this case? Should the Win32-specific console output code maintain its own buffers for this purpose? Should caml_write_fd grow a mechanism to say "I couldn't flush those last N bytes, please keep them in the buffer" ?

@xavierleroy xavierleroy modified the milestones: 4.07-or-later, 4.07 Nov 18, 2017

Contributor

dra27 commented May 28, 2018

Apologies for my slowness on a GPR marked high priority.

I've done some work on this addressing @xavierleroy's point. Originally, I thought this was a corner case, but on further reflection and comparison with Unix terminal emulators, it's important. I have an implementation which correctly holds on to up to 3 bytes at the end of a write (while still claiming to the caller that they were written), which allows a multi-byte UTF-8 sequence written character-by-character to succeed.

Having done this, it's fairly clear that rejecting the whole string if any part was invalid UTF-8 is a mistake, which I shall work on later. Also, this shim is necessary for all versions of Windows, so it calls into question whether it was worth changing the console to UTF-8 mode at all (eliminating this would get rid of a couple of weird bug reports I've seen elsewhere relating to setting UTF-8 mode on some Far Eastern editions of Windows).

TL;DR this is going to have a lot more code, so I'm not sure this is suitable for 4.07 at this stage.
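
The "hold on to up to 3 bytes" idea described above can be sketched in portable C as follows. This is an illustration under stated assumptions, not the implementation mentioned in the comment: the names `utf8_carry`, `utf8_incomplete_tail` and `carry_write` are hypothetical, the output goes to a plain byte buffer rather than the console, and no validation of the UTF-8 is attempted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-channel state: up to 3 bytes of an unfinished UTF-8
   sequence carried over from the previous write. */
struct utf8_carry {
  unsigned char pending[3];
  size_t npending;
};

/* Length of the sequence introduced by lead byte `b` (1 for anything that
   is not a multi-byte lead; validation is out of scope here). */
static size_t utf8_seq_len(unsigned char b)
{
  if ((b & 0xE0) == 0xC0) return 2;
  if ((b & 0xF0) == 0xE0) return 3;
  if ((b & 0xF8) == 0xF0) return 4;
  return 1;
}

/* Number of bytes at the end of buf[0..len) that start a multi-byte
   sequence but do not finish it (0 to 3). */
static size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
  size_t back = 0;
  /* Walk back over at most 3 trailing continuation bytes (10xxxxxx). */
  while (back < len && back < 3 && (buf[len - 1 - back] & 0xC0) == 0x80)
    back++;
  if (back == len) return 0;     /* no lead byte in view: nothing to hold */
  size_t have = back + 1;        /* lead byte plus continuations seen */
  size_t need = utf8_seq_len(buf[len - 1 - back]);
  return have < need ? have : 0;
}

/* Accept `n` bytes from the caller. Complete characters (including any
   carried-over prefix) are placed in `out`, which must have room for
   n + 3 bytes; an unfinished trailing sequence is kept in `st`.
   Returns n, i.e. claims every byte was written; *emitted receives the
   number of bytes placed in `out`. */
static size_t carry_write(struct utf8_carry *st, const unsigned char *buf,
                          size_t n, unsigned char *out, size_t *emitted)
{
  size_t total = st->npending + n;
  memcpy(out, st->pending, st->npending);
  memcpy(out + st->npending, buf, n);
  size_t tail = utf8_incomplete_tail(out, total);
  memcpy(st->pending, out + total - tail, tail);
  st->npending = tail;
  *emitted = total - tail;
  return n;
}
```

With this shape, writing the three bytes of U+20AC one at a time emits nothing for the first two calls and all three bytes on the third. One design question the sketch leaves open is what to do with bytes still pending when the channel is closed.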

Member

damiendoligez commented May 31, 2018

I don't think it's reasonable to try to push this into 4.07 at this point. Let's take the time to do things right.

@damiendoligez damiendoligez removed this from the 4.07 milestone May 31, 2018
