Windows Console Unicode Support #1408

Open
wants to merge 3 commits into
base: trunk

dra27
Member

@dra27 dra27 commented Oct 6, 2017

The explanation of this is slightly picky, so please bear with me. This is a continuation of #1398, but it picks up a few other bugs on the way.

On normal OSes, you don't have to worry so much about your terminal - the caller tells you about the terminal, and you just send appropriate output to it. Not quite the case on Windows, especially if you want Unicode characters to display correctly.

At a minimum, I would like the change in byterun/sys.c to call SetConsoleOutputCP(CP_UTF8); to be included in 4.06.0 (if the change in byterun/win32.c is not included, then it is necessary to guard that call with a check that we're running on Windows 10). I think the entire change is safe, especially as it only affects input/output from/to keyboards/screens, and not redirections.

It seems to me that, having gone to the (serious) trouble of adding proper handling for UTF-8 through the runtime on Windows, it's a PR (in the "public relations" sense) disaster to have the Windows toplevel displaying raw UTF-8 sequences when the Unix toplevel now gets to display proper strings.

The primary goal is to ensure that the Windows console is set to interpret UTF-8 text and display the correct Unicode characters. This can be achieved from the console by executing chcp 65001 but it has some important caveats:

  • Prior to Windows 10, the underlying WriteConsole API call contains a bug. This is tracked in MPR#6925. I have included a fix in caml_write_fd which detects that the fd is a console and on the appropriate version of Windows, manually writes the output.
  • Prior to Windows 10 1607, there is a bug when running the UTF-8 codepage which crashes applications which attempt to read Unicode characters entered by the user. This can be seen in the toplevel of OCaml by trying to put UTF-8 directly into a string - it simply segfaults if you're running chcp 65001. The fact that OCaml now supports Unicode filenames, and so forth, means this is something users are more likely to do (run in chcp 65001, I mean). I therefore include a fix in caml_read_fd which similarly upon detection that fd is a console, reads Windows wide-characters directly and then calls the Windows functions to translate those sequences to UTF-8 as the official result of read.

There is also some weirdness in that prior to Windows 10 1607, the "raster fonts" mode of the console (which was the default prior to Windows 10 1507/RTM) cannot ever display Unicode characters and can cause all kinds of crashes for any process trying to do so. For this reason, all the shims back off completely if the console has not been set to a truetype font. This means in the specific case of raster fonts selected and code page 65001 selected that ocamlrun will crash if presented with Unicode input from the console (but that's the same as anything - even cmd.exe displays an error prior to 1607 if you try to echo a string containing extended characters).

Obviously, this has the potential to alter the way both input and output are processed. I will stress here: this only happens when reading from or writing to consoles; it will never affect redirection to files. This is therefore a much lower-risk change than it may appear.

The behaviour, as with many of these things, depends on WINDOWS_UNICODE in config/Makefile.

If WINDOWS_UNICODE is 1, then:

  • The runtime will always select CP_UTF8 for output
  • For versions prior to Windows 10, it will use the output shim so that UTF-8 is processed correctly
  • For all versions, it uses the input shim to read UTF-8 correctly

If WINDOWS_UNICODE is 0, then:

  • The runtime does not alter the codepage for output
  • For versions prior to Windows 10, the output shim is used only if the user has already selected CP_UTF8 (e.g. by running chcp 65001). This fixes MPR#6925
  • If the user has selected CP_UTF8, the input shim is used. I don't know if the crash associated with this is recorded in Mantis, but it fixes that too.
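
The two configurations above can be summarised as a small decision table. The following is only a sketch of that logic, not the code in the PR: the `console_state` struct and all function names are hypothetical, and the TrueType condition reflects the statement above that the shims back off when raster fonts are selected.

```c
#include <stdbool.h>

#define CP_UTF8_ID 65001u  /* numeric value of the Windows CP_UTF8 code page */

/* Hypothetical snapshot of the state the runtime would inspect. */
struct console_state {
  bool windows_unicode;  /* WINDOWS_UNICODE from config/Makefile */
  unsigned win_major;    /* Windows major version, e.g. 6 or 10 */
  unsigned code_page;    /* code page currently selected by the user */
  bool truetype_font;    /* false when "raster fonts" are selected */
};

/* Does the runtime itself switch console output to CP_UTF8? */
static bool selects_utf8_output(const struct console_state *s)
{
  return s->windows_unicode;
}

/* UTF-8 is in effect if the runtime forces it or the user ran chcp 65001. */
static bool utf8_active(const struct console_state *s)
{
  return s->windows_unicode || s->code_page == CP_UTF8_ID;
}

/* Output shim: only needed before Windows 10, and only with TrueType fonts. */
static bool uses_output_shim(const struct console_state *s)
{
  return s->truetype_font && s->win_major < 10 && utf8_active(s);
}

/* Input shim: used on all versions whenever UTF-8 is in effect. */
static bool uses_input_shim(const struct console_state *s)
{
  return s->truetype_font && utf8_active(s);
}
```

With `WINDOWS_UNICODE=0` and an unchanged code page, both shim predicates are false, so the existing behaviour is untouched.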

There are some semantic changes to caml_read_fd and caml_write_fd when dealing with console handles, but I believe these are either within spec, or not relevant:

  1. caml_write_fd will not write more than 1KiB to the console at once (that's within spec)
  2. caml_read_fd will often read considerably less than requested (and never more than 4KiB), because it must allow for converting the characters read to UTF-8 sequences
  3. caml_read_fd will fail if asked to read less than 4 bytes from a console

The caml_write_fd change should have no particular effect, unless a program is very badly coded to assume the write always succeeds. The caml_read_fd change results in a subtle change to caml_ml_input_line to prevent it from ever trying to request too few bytes - all the other uses of caml_read_fd are to fill channel buffers completely, so will work.
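
The size limits listed above amount to a little arithmetic. This sketch uses hypothetical helper and macro names, and it assumes the 4KiB read cap applies to the caller's byte buffer; the PR's actual code may place the cap differently.

```c
#include <stddef.h>

#define CONSOLE_WRITE_MAX 1024  /* 1KiB cap on a single console write */
#define CONSOLE_READ_MAX  4096  /* 4KiB cap on a single console read (assumed to be in bytes) */
#define UTF8_MAX_BYTES    4     /* worst-case bytes reserved per character read */

/* Bytes to attempt in one console write; the caller loops over the rest. */
static size_t console_write_chunk(size_t n)
{
  return n < CONSOLE_WRITE_MAX ? n : CONSOLE_WRITE_MAX;
}

/* Number of wide characters to request for a byte buffer of size n,
   or -1 if the buffer cannot hold even one worst-case character. */
static int console_read_wchars(int n)
{
  if (n < UTF8_MAX_BYTES) return -1;
  if (n > CONSOLE_READ_MAX) n = CONSOLE_READ_MAX;
  return n / UTF8_MAX_BYTES;
}
```

Under these assumptions a 64KiB channel buffer yields a request for 1024 wide characters, which is why a console read returns far less than was asked for.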

Member

@gasche gasche left a comment

I had a look on the surface with mostly code-readability comments, but of course we would need a review from someone that knows about Windows. ( @nojb, @alainfrisch ? )

To be honest I find the amount of change and their invasiveness a bit scary. I'm not convinced that it would be a "PR disaster" to display characters incorrectly under Windows (the rationale for #1200 was to not break down on unicode-using file paths, not to print nice things in the console), and that seems way better than overlooking something and screwing up I/O. Don't count on me to push this into 4.06 :-)

byterun/io.c Outdated
if (GetFileType(h) == FILE_TYPE_CHAR && GetConsoleMode(h, &mode))
channel->flags = CHANNEL_FLAG_CONSOLE;
else
#endif
channel->flags = 0;
Member

I'm not fond of the way the windows-specific logic pollutes the function code (I think it breaks abstraction levels). Also, having an else at the end of an optional block is evil. Could you encapsulate this whole logic in a separate function?

(You could either have an ifdef _WIN32 here with channel->flags = ... in one case and channel->flags = 0; in the other, or have the function declared to always return 0 outside Windows, no strong preference here.)

Member Author

Meh - OK, I agree that ending the block with else was perhaps a little too towards the dark side.

I'm not sure that this breaks abstraction levels - handed an fd, the job of the function is to initialise the channel, which is what it's doing... it's a slightly unfair advantage to say that Unix can maintain the abstraction at an fd level, given that both C and the C runtime were invented for Unix's benefit...!

In this particular instance, I'm not certain about using a function, because it makes it tempting to use it elsewhere and the logic could become less clear. For example, one might be tempted to call this something caml_sys_isatty, but then #1406 might inadvertently affect it. I guess I could make it channel->flags = initialise_channel_flags(fd);, but I'm not that convinced by single-use functions.
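
For illustration only, the single-use helper mentioned above might look roughly like this. The value given to `CHANNEL_FLAG_CONSOLE` is a placeholder rather than the runtime's definition, and the non-Windows branch always returns 0, as suggested earlier in this thread.

```c
#ifdef _WIN32
#include <windows.h>
#include <io.h>      /* _get_osfhandle */
#endif

#define CHANNEL_FLAG_CONSOLE 2  /* placeholder value, not the runtime's */

/* Returns the initial flags for a channel opened on fd.
   Outside Windows there is nothing to detect, so the result is 0. */
static int initialise_channel_flags(int fd)
{
#ifdef _WIN32
  HANDLE h = (HANDLE)_get_osfhandle(fd);
  DWORD mode;
  /* A character device that also answers GetConsoleMode is a console. */
  if (h != INVALID_HANDLE_VALUE
      && GetFileType(h) == FILE_TYPE_CHAR
      && GetConsoleMode(h, &mode))
    return CHANNEL_FLAG_CONSOLE;
#else
  (void)fd;  /* unused outside Windows */
#endif
  return 0;
}
```

The call site would then reduce to `channel->flags = initialise_channel_flags(fd);` with no `#ifdef` in the channel-opening function itself.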

Member

I apologize for being unclear; what I meant here and above is not that you are breaking abstraction boundaries, but that I think you are not following the "don't mix different levels of abstraction" principle (incidentally, it's the only idea in Clean Code-style books that I found both reasonable and not completely obvious). What this function was doing before was very simple and I could grok it in one pass: it's initializing each flag with a value that sort of looks right, adding the channel to a linked list (this could be a function call imho) and returning it. Now the code is mixed with non-trivial Win32-only logic, with two ifdef blocks that are to be understood together (as the variables bound in one are used in the other), and I find the code much harder to read.

byterun/sys.c Outdated
#if WINDOWS_UNICODE
SetConsoleOutputCP(CP_UTF8);
#endif
#endif
Member

Again, this looks like a break of abstraction levels to me. Could this be moved to its own function?

Member Author

Initialising the system in a function called caml_sys_init seemed vaguely appropriate to me?! I freely admit that I selected on the basis that it is called from all of the various entry points for starting an OCaml application!

Member

I was thinking that the logic to get the win32 version could be moved into a function called something like caml_win32_init_version that could be called from caml_sys_init (the OutputCP thing is of course a separate concern). The idea is, from the point of view of someone looking at caml_sys_init and interested about a high-level overview of what is happening here, to not have to read through 19 lines of low-level Windows-specific code.

int caml_read_fd(int fd, int flags, void * buf, int n)
{
int retcode;
if ((flags & CHANNEL_FLAG_FROM_SOCKET) == 0) {
caml_enter_blocking_section();
retcode = read(fd, buf, n);
#if WINDOWS_UNICODE
if ((flags & CHANNEL_FLAG_CONSOLE) && console_supports_unicode()) {
Member

console_supports_unicode makes what I suppose are three system calls, and you would call that each time we read from the console? Is there a performance cost to this repeated test? (If so, it could make sense to store the information in the channel flags at channel-creation time instead.)

Member Author

This is reading user input, so I think we could do largely anything, especially with the speed of the Windows console!

I debated caching, but the user is able to change these settings, so the cached value could realistically be out of date. I think it's possible to use SetWinEventHook to be told when console settings may have changed, but that seemed overkill for a function which shouldn't be speed critical.

Member

Ok, so the idea is that all calls to this function in the I/O system are guarded by the more efficient CHANNEL_FLAG_CONSOLE check. That sounds reasonable.

if ((flags & CHANNEL_FLAG_CONSOLE) && GetConsoleCP() == CP_UTF8
&& console_supports_unicode()) {
#endif
/* Cannot perform a UTF-8 read unless the buffer is at least 4 bytes */
Member

Is there a risk that some applications would be written to read bytes one at a time (so with n == 1 always), and happen to be run on Unicode-supporting systems? Would it be possible to fall back to just a read call in that case (I guess that means designed-for-Unicode code could sometimes behave incorrectly?)?

Member Author

Falling back to read would mean that you might get garbage (if you're lucky - often it's a segfault). I wasn't certain, but it seemed unlikely that you'd ever use this in C stubs - it's largely used internally in a context where it's refilling the channel's buffer.

I'm not sure whether the support of these two is worth extending to Unix.read and Unix.write (they will both exhibit the same problems if you use Unix.stdin and Unix.stdout/Unix.stderr)

@dra27
Member Author

dra27 commented Oct 6, 2017

@gasche - thanks for the review, and noted regarding 4.06. How amenable would you be to the smaller version mentioned at the start (i.e. call SetConsoleOutputCP(CP_UTF8) on Windows 10 only).

I can split that off to a separate branch if you'd like to see what it looks like separately, but it's essentially the changes to the Makefiles (to link with version.dll) and all the changes in byterun/sys.c with the addition of if (caml_win32_major >= 10) before the SetConsoleOutputCP(CP_UTF8); call.
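
A rough sketch of what such a Windows 10 guard could look like follows. It is an assumption that the version is read from the file version of kernel32.dll; the text above only says that the change links with version.dll and tests `caml_win32_major`. The function names here are hypothetical, and on Windows the code would need to be linked against version.lib.

```c
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#include <winver.h>  /* GetFileVersionInfoW, VerQueryValueW */
#endif

/* Major Windows version, or 0 when unknown or not running on Windows. */
static unsigned win32_major_version(void)
{
  unsigned major = 0;
#ifdef _WIN32
  DWORD unused = 0;
  DWORD size = GetFileVersionInfoSizeW(L"kernel32.dll", &unused);
  void *data = size ? malloc(size) : NULL;
  if (data != NULL) {
    VS_FIXEDFILEINFO *info = NULL;
    UINT len = 0;
    /* The root block "\\" holds the fixed version information. */
    if (GetFileVersionInfoW(L"kernel32.dll", 0, size, data)
        && VerQueryValueW(data, L"\\", (LPVOID *)&info, &len)
        && info != NULL)
      major = HIWORD(info->dwProductVersionMS);
    free(data);
  }
#endif
  return major;
}

/* Returns 1 if the console output code page was switched to UTF-8. */
static int select_utf8_output_on_win10(void)
{
#ifdef _WIN32
  if (win32_major_version() >= 10)
    return SetConsoleOutputCP(CP_UTF8) != 0;
#endif
  return 0;
}
```

On anything older than Windows 10, and on non-Windows builds, the helper leaves the console untouched and returns 0.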

@gasche
Member

gasche commented Oct 6, 2017

I think I was too grumpy in formulating my review -- it seems that late Friday is fine for bisection work, but maybe not for actual communication. I'm not trying to say that I would veto the change, but that right now I wouldn't want to merge it.

Re. the simpler version: it does sound less invasive, but I can't tell how risky the .dll-related changes are. Clearly this needs the opinion of someone that knows about Windows builds.

@xavierleroy
Contributor

Let me echo @gasche 's "abstraction boundary" concern in more brutal terms:

All code that needs #include <windows.h> should be in byterun/win32.c.

Having to #include <windows.h> in sys.c and io.c is a bad smell.

Now, let's work together to find the right "abstraction" functions to put into win32.c (and possibly in unix.c as well for symmetry) so that sys.c and io.c remain mostly Unix/Windows-agnostic.

@xavierleroy
Contributor

Also: I wonder whether this material is specific to the Windows console (i.e. running an OCaml program or the OCaml toplevel directly from cmd.exe) or whether it is also relevant to using OCaml from a Cygwin or MSYS console, or under Emacs, etc.

@dra27
Member Author

dra27 commented Oct 8, 2017

@gasche - hah, I didn't think you were being particularly grumpy, and I was grateful for the review before the weekend! I will, with my tongue moderately in my cheek, observe that abstraction seems to be co-opted to mean "not Windows", given the free use of unistd.h where needed, but I also appreciate that where I see the joy of Windows ports behaving similarly to their Unix siblings, others will see a horror of #ifdefs and hacks 😈 I have, apropos of both yours and @xavierleroy's comments, refactored the changes in sys.c and io.c to move code to win32.c and have a little bit of #ifdef-ery in osdeps.h.

There is a version of this branch only making the SetConsoleOutputCP change for Windows 10 at https://github.com/dra27/ocaml/tree/windows10-unicode (I haven't amended the commit messages on that branch, if it is decided to merge that separately).

@dra27
Member Author

dra27 commented Oct 8, 2017

@xavierleroy - this code might be relevant in all those cases, but only when strictly using the Windows Console. So, when using whatever you Emacs people (!!) call gVim, this code would definitely not apply. Similarly, when using mintty (which is the bundled terminal emulator in Cygwin) for either MSYS2 or Cygwin, this wouldn't apply - standard handles then are pipes and the UTF-8 stuff "just works" because, well, mintty is a properly implemented terminal!

This code does apply if you run, say, Cygwin's bash in Cmd and then start native OCaml - it is possible that the code would be needed if you start a Cygwin-compiled OCaml from Cmd similarly. That said, trying to do any Cygwin/MSYS bash stuff through Cmd seems a moderately strange thing to do to me.

It's also possible that the Cygwin layer does this too - @nojb has noted, for example, that what I do here is remarkably similar to how Git-for-Windows (which is a native Windows application containing a wrapped MSYS2 environment for running scripts) handles the UTF-8 console problems.

@@ -115,8 +115,8 @@ LDFLAGS=/ENTRY:wmainCRTStartup
### Libraries needed
#EXTRALIBS=bufferoverflowu.lib # for the old PSDK compiler only
EXTRALIBS=
BYTECCLIBS=advapi32.lib ws2_32.lib $(EXTRALIBS)
NATIVECCLIBS=advapi32.lib ws2_32.lib $(EXTRALIBS)
Member

If you removed the use of EXTRALIBS, maybe you could remove the definition as well? (Or if you think the definition, commented, has value, then I'd suggest keeping the use: if later it turns out it needs to be revived it will be easier.)

Member Author

It is used in byterun/Makefile (this was the inconsistency - msvc64 would link EXTRALIBS twice)

There is a real case for removing it entirely, as it was only added back in the dark ages when bufferoverflowu.lib had to be linked explicitly for amd64 (one old 64-bit Windows SDK needed this, as noted). At the moment, OCaml does still build with that particular SDK, so I don't really want to remove it entirely :)

@dra27
Member Author

dra27 commented Oct 8, 2017

@gasche - I should have added the version.dll change is entirely safe... it's just how Windows syscalls "work".

@gasche
Member

gasche commented Oct 8, 2017

I think it would make sense to have two PRs, one with the ConsoleOutputCP (including the version stuff etc.), and one with the I/O shims built on top of the first one. This would give interested people a natural place to discuss whether the first part should be merged in 4.06, without interference from the discussion about the second part (which can happen in parallel, although of course it would depend on the first one).

(This is the strategy I proposed for %S+ConsoleOutputCP, and I think we were better off discussing the two parts separately, as otherwise the %S part would still be waiting.)

@gasche
Member

gasche commented Oct 8, 2017

(The reader may have noticed that I wrote the parenthesis above as I was convinced that the %S part, #1398, had already been merged, but in fact it's not. I'll leave it there for comic value, and go merge.)

@dra27
Member Author

dra27 commented Oct 8, 2017

This GPR is now based on #1416, so only the last 3 commits "belong" here.

@damiendoligez damiendoligez added this to the discuss-for-4.06.0 milestone Oct 11, 2017
@dra27
Member Author

dra27 commented Oct 12, 2017

Rebased (still suspended until #1416 is merged), but with the change to byterun/unix.c now in the correct GPR 😊

@dra27
Member Author

dra27 commented Oct 12, 2017

Rebased now that #1416 is merged.

@gasche
Member

gasche commented Oct 14, 2017

cc @xavierleroy, @nojb: if you have further opinions to give on this more advanced part of the "Windows console and Unicode" mystery novel, they are very welcome.

@xavierleroy
Contributor

Hoping I'm not too pushy, I'd like to see @nojb and @dra27 work on #1406 first, because #1406 looks to me like lower-hanging fruit than the present PR, and has greater benefits. (Nobody runs OCaml in a cmd.exe console; everyone uses a Cygwin or MSYS console.) In particular I don't understand why #1406 has already been rescheduled for 4.07-or-later and the present PR is still marked consider-for-release.

@dra27
Member Author

dra27 commented Oct 15, 2017

Erm, I use OCaml in a Cmd console and have done for the last 14 years - the Cygwin console is an unstable catastrophe for using native applications on a regular basis (it's trivially easy to end up with the terminal freezing - examples including trying to terminate long output of git commands, breaking in the toplevel, ...).

I'm happy to focus effort on #1406 first, but to me the priority is the wrong way around. #1406 improves colour support in error messages, where this GPR in one configuration fixes a segfault...

@xavierleroy
Contributor

xavierleroy commented Oct 15, 2017

My sense of priorities could very well be wrong here; if that's the case, I apologize. Still, for me, the Windows console is this awful 1985 design with the rectangular text selection that everyone reimplements better and nicer. (That's 5 different links.)

@dra27
Member Author

dra27 commented Oct 15, 2017

Quick survey of them:

ConEmu already supports all this anyway (including coloured output - you just have to trick OCaml by setting the TERM variable) because it hooks all the console functions, and it does it very well.

ColorConsole (you included twice) is a disaster, at least from my quick trial - that appears to be implemented using entirely naïve mechanisms, even the basic prompt doesn't work properly.

mintty is addressed in #1406, but for using the native ports, I really question whether that's an improvement over the Windows 10 Console, given the instabilities.

ConsoleZ, like ConEmu, does this "properly" (you should see how hooking the Console API works, you think I've written some Windows hacks before.......), but doesn't translate ANSI escape sequences, so coloured output doesn't work. However, Unix.isatty correctly returns true (which ColorConsole and mintty do not).

The proper alternative consoles on Windows do a lot of heavy work to act like the Windows Console host, so the priority should be handling that correctly in our code. I don't think we should generally worry about terminal emulators which are just acting like Unix terminals - except that mintty is a very common alternative one, so for me it's an OK exception.

However, all that said, my personal wish was for #1416 to be in 4.06.0, which is merged - I don't mind both this and #1406 being pushed to 4.07, and we can try and sort all the Windows Consoles at the same time. Fundamentally, I'm uneasy with the priority that something works if you install and use an optional piece of software, but it's broken if you use the operating system's official way of doing it - it feels the wrong way round.

@xavierleroy
Contributor

The 5th link should have been http://www.powercmd.com/ . (I'm fixing my previous comment, for posterity.) I haven't used any of these alternatives except mintty under Cygwin, it's just what I found in 5 minutes of Googling.

Previously, reading Unicode code points > U+00FF would simply result in
"?" symbols. Even worse, in the UTF-8 code page (chcp 65001), OCaml
would crash when reading any UTF-8 multi-byte sequences. This commit
alters caml_read_fd to use ReadConsole to get UCS-2 characters which are
then converted to UTF-8.

A side-effect of this is that calls to caml_read_fd will typically
return much less data than they did before, since ReadConsole must be
called with a character buffer 1/4 of the size of the supplied byte
buffer in order to allow for worst-case conversion to UTF-8.

As with the output changes, this change in behaviour is automatically
enabled when WINDOWS_UNICODE=1, but it is only enabled otherwise if the
user has manually selected CP_UTF8 (typically by running chcp 65001).

This change has the small side-effect that caml_read_fd on Windows must
be asked to read at least 4 bytes or it will return an error.
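
The conversion step described in this commit message can be illustrated with a portable sketch. Note that this is a hand-written converter substituted for the Windows conversion functions the PR actually calls, and the function names are hypothetical. Each UTF-16 code unit below U+0800 needs at most 2 bytes, any other single unit needs 3, and a surrogate pair (2 units) needs 4, so reserving 4 bytes per unit is more than enough.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (already known not to be a surrogate, and at
   most U+10FFFF) as UTF-8. Returns the number of bytes written (1 to 4). */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
  if (cp < 0x80) {
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  out[0] = (unsigned char)(0xF0 | (cp >> 18));
  out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char)(0x80 | (cp & 0x3F));
  return 4;
}

/* Convert n UTF-16 code units to UTF-8. dst must have room for 3*n
   bytes. Unpaired surrogates are replaced by U+FFFD. Returns the
   number of bytes written. */
static size_t utf16_to_utf8(const uint16_t *src, size_t n, unsigned char *dst)
{
  size_t i = 0, len = 0;
  while (i < n) {
    uint32_t cp = src[i++];
    if (cp >= 0xD800 && cp <= 0xDBFF && i < n
        && src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
      /* High surrogate followed by low surrogate: combine the pair. */
      cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i++] - 0xDC00u);
    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
      cp = 0xFFFD;  /* unpaired surrogate */
    }
    len += utf8_encode(cp, dst + len);
  }
  return len;
}
```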
@xavierleroy
Contributor

xavierleroy commented Nov 18, 2017

Glad to learn that Microsoft understood text selections, 30 years too late.

As mentioned at the latest developer meeting, I'm worried about UTF-8 sequences being "cut in the middle" when presented to caml_write_fd. This can happen if the user maliciously flushes an out_channel after every byte of a multi-byte UTF-8 encoding, but also if the buffer of an out_channel happens to be full and needs flushing at the wrong time. What happens in this case? Should the Win32-specific console output code maintain its own buffers for this purpose? Should caml_write_fd grow a mechanism to say "I couldn't flush those last N bytes, please keep them in the buffer" ?

@xavierleroy xavierleroy modified the milestones: 4.07-or-later, 4.07 Nov 18, 2017
@dra27
Member Author

dra27 commented May 28, 2018

Apologies for my slowness on a GPR marked high priority.

I've done some work on this addressing @xavierleroy's point. Originally, I thought this was a corner case but in fact on further reflection and comparison with Unix terminal emulators, it's important. I have an implementation which correctly holds on to up to 3 bytes at the end of a write (while still claiming to the caller that they were written) which allows a multi-byte UTF-8 sequence written character-by-character to succeed.
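
The "hold on to up to 3 bytes" idea can be sketched as a function that measures how much of the end of a buffer is an unfinished UTF-8 sequence. This is only an illustration under the assumption that the held-back bytes are exactly the incomplete trailing sequence; the function name is hypothetical and the unpublished implementation may differ.

```c
#include <stddef.h>

/* Returns how many bytes at the end of buf[0..n) form the start of a
   multi-byte UTF-8 sequence that is not yet complete (0 to 3). Those
   bytes would be kept back and prepended to the next write. */
static size_t utf8_incomplete_tail(const unsigned char *buf, size_t n)
{
  size_t back;
  /* Look back at most 3 bytes for the lead byte of the last sequence. */
  for (back = 1; back <= 3 && back <= n; back++) {
    unsigned char c = buf[n - back];
    size_t need;
    if ((c & 0xC0) == 0x80) continue;          /* continuation byte */
    if ((c & 0x80) == 0x00) need = 1;          /* ASCII */
    else if ((c & 0xE0) == 0xC0) need = 2;
    else if ((c & 0xF0) == 0xE0) need = 3;
    else if ((c & 0xF8) == 0xF0) need = 4;
    else return 0;                             /* invalid lead: hold nothing */
    return need > back ? back : 0;
  }
  return 0;  /* no lead byte within reach: nothing worth holding */
}
```

A writer built on this would emit `n - utf8_incomplete_tail(buf, n)` bytes now and still report `n` bytes as written, which matches the behaviour described above.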

Having done this, it's fairly clear that rejecting the whole string if any part was invalid UTF-8 is a mistake which I shall work on later. Also, this shim is necessary for all versions of Windows, so it brings into question whether it was worth changing the console to UTF-8 mode at all (eliminating this would get rid of a couple of weird bug reports I've seen elsewhere relating to setting UTF-8 mode on some far eastern editions of Windows).

TL;DR this is going to have a lot more code, so I'm not sure this is suitable for 4.07 at this stage.

@damiendoligez
Member

I don't think it's reasonable to try to push this into 4.07 at this point. Let's take the time to do things right.

@damiendoligez damiendoligez removed this from the 4.07 milestone May 31, 2018
@XVilka
Contributor

XVilka commented Dec 14, 2018

Sorry for the intrusion, but what are the chances it will make it into 4.08?

@gasche
Member

gasche commented Dec 14, 2018

@XVilka we would need someone to complete a full review of this PR soon. If you were interested, I think it could happen.

@XVilka
Contributor

XVilka commented Dec 14, 2018

@gasche sure, how can I help?

@dra27
Member Author

dra27 commented Dec 14, 2018

@gasche, @XVilka - there's some work which I need to finalise and push (I'm afraid this has slipped down my spike yet again). I can try to do this in the next few days, if there's some available reviewing effort prior to the 4.08 freeze.

@gasche
Member

gasche commented Dec 14, 2018

@XVilka basically, a review? The idea is to read the code carefully, asking about everything you don't understand from the information given (PR message and commit messages), and try to confirm that the behavior changes are all improvements, not bugs, and that there are no other issues introduced by the PR (Code inconsistency, etc.). If you get to that conclusion, then it's great, you have an "approving review", otherwise the effort spent and questions asked are still useful.

@XVilka
Contributor

XVilka commented Dec 14, 2018

@gasche OK, will do once @dra27 gives a green light.

@XVilka
Contributor

XVilka commented Jun 10, 2019

@dra27 since you are active in other PRs right now, maybe it is a good time to revive this one as well?

@damiendoligez
Member

@dra27 is this still on track for 4.10?

@damiendoligez
Member

ping @dra27

@dra27
Member Author

dra27 commented Jul 22, 2020

Indeed - this has become more important as it turns out it's necessary for @stedolan's signal handling PR as well. I'll get there - presently fixing stat (again)

@stedolan stedolan mentioned this pull request Jul 28, 2020
1 task
@gasche
Member

gasche commented Apr 18, 2021

I apologize for putting yet more weight on @dra27's shoulders, but I think this issue deserves a ping.
