-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Windows Console Unicode Support #1408
base: trunk
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a look on the surface with mostly code-readability comments, but of course we would need a review from someone that knows about windows. ( @nojb, @alainfrisch ? )
To be honest I find the amount of change and their invasiveness a bit scary. I'm not convinced that it would be a "PR disaster" to display characters incorrectly under Windows (the rationale for #1200 was to not break down on unicode-using file paths, not to print nice things in the console), and that seems way better than overlooking something and screwing up I/O. Don't count on me to push this into 4.06 :-)
byterun/io.c
Outdated
if (GetFileType(h) == FILE_TYPE_CHAR && GetConsoleMode(h, &mode)) | ||
channel->flags = CHANNEL_FLAG_CONSOLE; | ||
else | ||
#endif | ||
channel->flags = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not fond of the way the windows-specific logic pollutes the function code (I think it breaks abstraction levels). Also, having an else
at the end of an optional block is evil. Could you encapsulate this whole logic in a separate function?
(You could either have an ifdef _WIN32 here with channel->flags = ...
in one case and channel->flags = 0;
in the other, or have the function declared to always return 0 outside Windows, no strong preference here.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meh - OK, I agree that ending the block with else
was perhaps a little too towards the dark side.
I'm not sure that this breaks abstraction levels - handed an fd, the job of the function is to initialise the channel, which is what it's doing... it's a slightly unfair advantage to say that Unix can maintain the abstraction at an fd level, given that both C and the C runtime were invented for Unix's benefit...!
In this particular instance, I'm not certain about using a function, because it makes it tempting to use it elsewhere and the logic could become less clear. For example, one might be tempted to call this something caml_sys_isatty, but then #1406 might inadvertently affect it. I guess I could make it channel->flags = initialise_channel_flags(fd);
, but I'm not that convinced by single-use functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apologize for being unclear; what I meant here and above is not that you are breaking abstraction boundaries, but that I think you are not following the "don't mix different levels of abstraction" principle (incidentally, it's the only idea in Clean Code-style books that I found both reasonable and not completely obvious). What this function was doing before was very simple and I could grok it in one pass: it's initializing each flag with a value that sort of looks right, adds it in a linked list (this could be a function call imho) and returns it. Now the code is mixed with non-trivial Win32-only logic, with two ifdef
block that are to be understood together (as the variables bound in one are used in the other), I find the code much harder to read.
byterun/sys.c
Outdated
#if WINDOWS_UNICODE | ||
SetConsoleOutputCP(CP_UTF8); | ||
#endif | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this looks like a break of abstraction levels to me. Could this be moved to its own function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialising the system in a function called caml_sys_init
seemed vaguely appropriate to me?! I freely admit that I selected on the basis that it is called from all of the various entry points for starting an OCaml application!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that the logic to get the win32 version could be moved into a function called something like caml_win32_init_version
that could be called from caml_sys_init
(the OutputCP thing is of course a separate concern). The idea is, from the point of view of someone looking at caml_sys_init
and interested about a high-level overview of what is happening here, to not have to read through 19 lines of low-level Windows-specific code.
int caml_read_fd(int fd, int flags, void * buf, int n) | ||
{ | ||
int retcode; | ||
if ((flags & CHANNEL_FLAG_FROM_SOCKET) == 0) { | ||
caml_enter_blocking_section(); | ||
retcode = read(fd, buf, n); | ||
#if WINDOWS_UNICODE | ||
if ((flags & CHANNEL_FLAG_CONSOLE) && console_supports_unicode()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
console_supports_unicode
makes what I suppose are three system calls, and you would call that each time we read from the console? Is there a performance cost to this repeated test? (It so, it could make sense to store the information in the channel flags at channel-creation time instead.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is reading user input, so I think we could do largely anything, especially with the speed of the Windows console!
I debated caching, but the user is able to change these settings, so the cached value could realistically be out of date. I think it's possible to use SetWinEventHook to be told when console settings may have changed, but that seemed overkill for a function which shouldn't be speed critical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so the idea is that all calls to this function in the I/O system are guarded by the more efficient CHANNEL_FLAG_CONSOLE
check. That sounds reasonable.
if ((flags & CHANNEL_FLAG_CONSOLE) && GetConsoleCP() == CP_UTF8 | ||
&& console_supports_unicode()) { | ||
#endif | ||
/* Cannot perform a UTF-8 read unless the buffer is at least 4 bytes */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a risk that some applications would be written to read bytes one at a time (so with n == 1
always), and happen to be run on unicode-supporting systems? Would it be possible to fallback to just a read
call in that case (I guess that means designed-for-unicode code could sometimes behave incorrectly?)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Falling back to read
would mean that you might get garbage (if you're lucky - often it's a segfault). I wasn't certain, but it seemed unlikely that you'd ever use this in C stubs - it's largely used internally in a context where it's refilling the channel's buffer.
I'm not sure whether the support of these two is worth extending to Unix.read and Unix.write (they will both exhibit the same problems if you use Unix.stdin and Unix.stdout/Unix.stderr)
@gasche - thanks for the review, and noted regarding 4.06. How amenable would you be to the smaller version mentioned at the start (i.e. call I can split that off to a separate branch if you'd like to see what it looks like separately, but it's essentially the changes to the Makefiles (to link with version.dll) and all the changes in byterun/sys.c with the addition of |
I think I was too grumpy in formulating my review -- it seems that late Friday is fine for bisection work, but maybe not for actual communication. I'm not trying to say that I would veto the change, but that I right now I wouldn't want to merge it. Re. the simpler version: it does sound less invasive, but I don't realize how risky the |
Let me echo @gasche 's "abstraction boundary" concern in more brutal terms: All code that needs Having to Now, let's work together to find the right "abstraction" functions to put into win32.c (and possibly in unix.c as well for symmetry) so that sys.c and io.c remain mostly Unix/Windows-agnostic. |
Also: I wonder whether this material is specific to the Windows console (i.e. running an OCaml program or the OCaml toplevel directly from cmd.exe) or whether it is also relevant to using OCaml from a Cygwin or MSYS console, or under Emacs, etc. |
@gasche - hah, I didn't think you were being particularly grumpy, and I was grateful for the review before the weekend! I will, with my tongue moderately in my cheek, observe that abstraction seems to be co-opted to mean "not Windows", given the free use of unistd.h where needed, but I also appreciate that where I see the joy of Windows ports behaving similarly to their Unix siblings, others will see a horror of There is a version of this branch only making the |
@xavierleroy - this code might be relevant in all those cases, but only when strictly using the Windows Console. So, when using whatever you Emacs people (!!) call gVim, this code would definitely not apply. Similarly, when using mintty (which is the bundled terminal emulator in Cygwin) for either MSYS2 or Cygwin, this wouldn't apply - standard handles then are pipes and the UTF-8 stuff "just works" because, well, mintty is a properly implemented terminal! This code does apply if you run, say, Cygwin's bash in Cmd and then start native OCaml - it is possible that the code would be needed if you start a Cygwin-compiled OCaml from Cmd similarly. That said, trying to do any Cygwin/MSYS bash stuff through Cmd seems a moderately strange to do to me. It's also possible that the Cygwin layer does this too - @nojb has noted, for example, that what I do hear is remarkably similar to how Git-for-Windows (which is a native Windows application containing a wrapped MSYS2 environment for running scripts) handles the UTF-8 console problems. |
config/Makefile.msvc64
Outdated
@@ -115,8 +115,8 @@ LDFLAGS=/ENTRY:wmainCRTStartup | |||
### Libraries needed | |||
#EXTRALIBS=bufferoverflowu.lib # for the old PSDK compiler only | |||
EXTRALIBS= | |||
BYTECCLIBS=advapi32.lib ws2_32.lib $(EXTRALIBS) | |||
NATIVECCLIBS=advapi32.lib ws2_32.lib $(EXTRALIBS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you removed the use of EXTRALIBS, maybe you could remove the definition as well? (Or if you think the definition, commented, has value, then I'd suggest keeping the use: if later it turns out it needs to be revived it will be easier.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used in byterun/Makefile (this was the inconsistency - msvc64 would link EXTRALIBS twice)
There is a real case for removing it entirely, as it was only added back in the dark ages when bufferoverflowu.lib had to be linked explicitly for amd64 (one old 64-bit Windows SDK needed this, as noted). At the moment, OCaml does still build with that particular SDK, so I don't really want to remove it entirely :)
@gasche - I should have added the version.dll change is entirely safe... it's just how Windows syscalls "work". |
I think it would make sense to have two PRs, one with the ConsoleOutputCP (including the version stuff etc.), and one with the I/O shims build on top of the first one. This would give interested people a natural place to discuss whether the first part should be merged in 4.06, without interference from the discussion about the second part (which can happen in parallel, although of course it would depend on the first one). (This is the strategy I proposed for %S+ConsoleOutputCP, and I think we were better off discussing the two parts separately, as otherwise the %S part would still be waiting.) |
(The reader may have noticed that I wrote the parenthesis above as I was conviced that the %S part, #1398, had already been merged, but in fact it's not. I'll leave it there for comic value, and go merge.) |
This GPR is now based on #1416, so only the last 3 commits "belong" here. |
f2bd882
to
0d26032
Compare
Rebased (still suspended until #1416 is merged), but with the change to byterun/unix.c now in the correct GPR 😊 |
0d26032
to
b808c75
Compare
Rebased now that #1416 is merged. |
b808c75
to
4c78162
Compare
cc @xavierleroy, @nojb: if you have further opinions to give on this more advanced part of the "Windows console and Unicode" mystery novel, they are very welcome. |
hoping I'm not too pushy, I'd like to see @nojb and @dra27 work on #1406 first, because #1406 looks to me like a lower hanging fruit than the present PR, and has greater benefits. (Nobody runs OCaml in a cmd.exe console; everyone uses a Cygwin or MSys console.) In particular I don't understand why #1406 has already been rescheduled for 4.07-or-later and the present PR is still marked consider-for-release. |
Erm, I use OCaml in a Cmd console and have done for the last 14 years - the Cygwin console is an unstable catastrophe for using native applications on a regular basis (it's trivially easy to end up with the terminal freezing - examples including trying to terminate long output of git commands, breaking in the toplevel, ...). I'm happy to focus effort on #1406 first, but to me the priority is the wrong way around. #1406 improves colour support in error messages, where this GPR in one configuration fixes a segfault... |
My sense of priorities could very well be wrong here; if that's the case, I apologize. Still, for me, the Windows console is this awful 1985 design with the rectangular text selection that everyone reimplements better and nicer. (That's 5 different links.) |
Quick survey of them: ConEmu already supports all this anyway (including coloured output - you just have to trick OCaml by setting the TERM variable) because it hooks all the console functions, and it does it very well. ColorConsole (you included twice) is a disaster, at least from my quick trial - that appears to be implemented using entirely naïve mechanisms, even the basic prompt doesn't work properly. mintty is addressed in #1406, but for using the native ports, I really question whether that's an improvement over the Windows 10 Console, given the instabilities. ConsoleZ, like ConEmu, does this "properly" (you should see how hooking the Console API works, you think I've written some Windows hacks before.......), but doesn't translate ANSI escape sequences, so coloured output doesn't work. However, Unix.isatty correctly returns true (which ColorConsole and mintty do not). The proper alternative consoles on Windows do a lot of heavy work to act like the Windows Console host, so the priority should be handling that correctly in our code. I don't think we should generally worry about terminal emulators which are just acting like Unix terminals - except that mintty is a very common alternative one, so for me it's an OK exception. However, all that said, my personal wish was for #1416 to be in 4.06.0, which is merged - I don't mind both this and #1406 being pushed to 4.07, and we can try and sort all the Windows Consoles at the same time. Fundamentally, I'm uneasy with the priority that something works if you install and use an optional piece of software, but it's broken if you use the operating system's official way of doing it - it feels the wrong way round. |
The 5th link should have been http://www.powercmd.com/ . (I'm fixing my previous comment, for posterity.) I haven't used any of these alternatives except mintty under Cygwin, it's just what I found in 5 minutes of Googling. |
Previouly, reading Unicode code points > U+00FF would simply result in "?" symbols. Even worse, in the UTF-8 code page (chcp 65001), OCaml would crash when reading any UTF-8 multi-byte sequences. This commit alters caml_read_fd to use ReadConsole to get UCS-2 characters which are then converted to UTF-8. A side-effect of this is that calls to caml_read_fd will typically return much less data than they did before, since ReadConsole must be called with a character buffer 1/4 of the size of the supplied byte buffer in order to allow for worst-case conversion to UTF-8. As with the output changes, this change in behaviour is automatically enabled when WINDOWS_UNICODE=1, but it is only enabled otherwise if the user has manually selected CP_UTF8 (typically by running chcp 65001). This change has the small side-effect that caml_read_fd on Windows must be asked to read at least 4 bytes or it will return an error.
4760114
to
24a5a8e
Compare
Glad to learn that Microsoft understood text selections, 30 years too late. As mentioned at the latest developer meeting, I'm worried about UTF-8 sequences being "cut in the middle" when presented to |
Apologies for my slowness on a GPR marked high priority. I've done some work on this addressing @xavierleroy's point. Originally, I thought this was a corner case but in fact on further reflection and comparison with Unix terminal emulators, it's important. I have an implementation which correctly holds on to up to 3 bytes of at the end of a write (while still claiming to the caller that they were written) which allows a multi-byte UTF-8 sequence written character-by-character to succeed. Having done this, it's fairly clear that rejecting the whole string if any part was invalid UTF-8 is a mistake which I shall work on later. Also, this shim is necessary for all versions of Windows, so it brings into question whether it was worth changing the console to UTF-8 mode at all (eliminating this would get rid of a couple of weird bug reports I've seen elsewhere relating to setting UTF-8 mode on some far eastern editions of Windows). TL;DR this is going to have a lot more code, so I'm not sure this is suitable for 4.07 at this stage. |
I don't think it's reasonable to try to push this into 4.07 at this point. Let's take the time to do things right. |
Sorry for the intrusion, but what are the chances it will make in 4.08? |
@XVilka we would need someone to complete a full review of this PR soon. If you were interested, I think it could happen. |
@gasche sure, how I can help? |
@XVilka basically, a review? The idea is to read the code carefully, asking about everything you don't understand from the information given (PR message and commit messages), and try to confirm that the behavior changes are all improvements, not bugs, and that there are no other issues introduced by the PR (Code inconsistency, etc.). If you get to that conclusion, then it's great, you have an "approving review", otherwise the effort spent and questions asked are still useful. |
@dra27 since you are active in another PRs right now, maybe it is a good time to revive this one as well? |
@dra27 is this still on track for 4.10? |
ping @dra27 |
Indeed - this has become more important as it turns out it's necessary for @stedolan's signal handling PR as well. I'll get there - presently fixing stat (again) |
I apologize for putting yet more weight on @dra27's shoulders, but I think this issue deserves a ping. |
The explanation of this is slightly picky, so please bear with me. This is a continuation of #1398, but it picks up a few other bugs on the way.
On normal OSes, you don't have to worry so much about your terminal - the caller tells you about the terminal, and you just send appropriate output to it. Not quite the case on Windows, especially if you want Unicode characters to display correctly.
At a minimum, I would like the change in byterun/sys.c to call
SetConsoleOutputCP(CP_UTF8);
to be included in 4.06.0 (if the change in byterun/win32.c is not included, then it is necessary to guard that call with a check that we're running on Windows 10). I think the entire change is safe, especially as it only affects input/output from/to keyboards/screens, and not redirections.It seems to me that having gone to the (serious) trouble of adding proper handling for UTF-8 through the runtime on Windows, that it's a PR (in the "public relations") disaster to have the toplevel displaying raw UTF-8 sequences when it the Unix toplevel now gets to display proper strings.
The primary goal is to ensure that the Windows console is set to interpret UTF-8 text and display the correct Unicode characters. This can be achieved from the console by executing
chcp 65001
but it has some important caveats:caml_write_fd
which detects that the fd is a console and on the appropriate version of Windows, manually writes the output.chcp 65001
. The fact that OCaml now supports Unicode filenames, and so forth, means this is something users are more likely to do (run inchcp 65001
, I mean). I therefore include a fix incaml_read_fd
which similarly upon detection that fd is a console, reads Windows wide-characters directly and then calls the Windows functions to translate those sequences to UTF-8 as the official result of read.There is also some weirdness in that prior to Windows 10 1607, the "raster fonts" mode of the console (which was the default prior to Windows 10 1507/RTM) cannot ever display Unicode characters and can cause all kinds of crashes for any process trying to do so. For this reason, all the shims back off completely if the console has not been set to a truetype font. This means in the specific case of raster fonts selected and code page 65001 selected that ocamlrun will crash if presented with Unicode input from the console (but that's the same as anything - even cmd.exe displays an error prior to 1607 if you try to echo a string containing extended characters).
Obviously, this has the potential to alter both the way input and output are processed. I will stress here: this only happens when writing to consoles, it will never affect redirection to files. This is therefore a much lower-risk change than it may appear.
The behaviour, as with many of these things, depends on
WINDOWS_UNICODE
inconfig/Makefile
.If
WINDOWS_UNICODE
is 1, then:If
WINDOWS_UNICODE
is 0, then:chcp 65001
). This fixes MPR#6925There are some semantic changes to
caml_read_fd
andcaml_write_fd
when dealing with console handles, but I believe these are either within spec, or not relevant:The caml_write_fd change should have no particular effect, unless a program is very badly coded to assume the write always succeeds. The caml_read_fd change results in a subtle change to caml_ml_input_line to prevent it from every trying to request too few bytes - all the other uses of caml_read_fd are to fill channel buffers completely, so will work.