Garbage console output on Windows with UTF-8 console in caml_partial_flush and caml_putblock #6925

@vicuna

Original bug ID: 6925
Reporter: @dra27
Assigned to: @dra27
Status: assigned (set by @mshinwell on 2016-12-08T09:28:42Z)
Resolution: open
Priority: normal
Severity: minor
Version: 4.02.2
Target version: later
Category: runtime system and C interface
Related to: #6521
Monitored by: @nojb @ygrek

Bug description

Roll your eyes and prepare for another bug in the Microsoft C runtime!

The Windows API function WriteConsole (see https://msdn.microsoft.com/en-us/library/windows/desktop/ms687401) uses the word "characters" confusingly in its description of nNumberOfCharsToWrite and lpNumberOfCharsWritten. In Windows API speak, "chars" typically means "bytes" for the ANSI version (WriteConsoleA) and UCS2-ish characters (i.e. byte-length / 2) for the Unicode-ish version (WriteConsoleW).

However, it appears (at least on Windows 7, Windows Server 2012 and the latest public build of Windows 10) that lpNumberOfCharsWritten takes encoding into account too. So for the call WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), "\xe2\x86\x95", 3, &dwWritten, NULL), dwWritten will be 1 if the console is set to UTF-8 encoding (it will be 3 for the default cp850).

Contrary to popular opinion, the Windows Console has actually supported UTF-8 for 15 years, so this isn't anything new!

Where this comes back to the C runtime and thus to OCaml is that it means that the C runtime function write returns the wrong number when writing to a console (it clearly returns the effective dwWritten from WriteConsole)... which means that Printf.printf and related functions using output_string (and thus eventually caml_putblock) keep repeating characters from the string.
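The failure mode described above can be modelled without a Windows console. The following is only a sketch, not the attached broken-write.c and not the actual runtime code: fake_console_write, count_chars, naive_flush and sink are hypothetical names, and the way stray continuation bytes are counted (one character each) is an assumption about the console's behaviour.

```c
#include <stddef.h>

/* Hypothetical capture buffer standing in for what reaches the console. */
static char sink[64];
static size_t sink_len = 0;

/* Assumed counting rule: each UTF-8 sequence is one character, and a stray
   or truncated byte sequence also counts as one character. */
static int count_chars(const char *buf, int len)
{
    int chars = 0, i = 0;
    while (i < len) {
        unsigned char b = (unsigned char)buf[i];
        int n = 1;                          /* ASCII or stray byte */
        if ((b & 0xE0) == 0xC0) n = 2;      /* 2-byte lead */
        else if ((b & 0xF0) == 0xE0) n = 3; /* 3-byte lead */
        else if ((b & 0xF8) == 0xF0) n = 4; /* 4-byte lead */
        if (n > len - i) n = len - i;       /* truncated sequence */
        i += n;
        chars++;
    }
    return chars;
}

/* Hypothetical stand-in for write() on a UTF-8 console: it emits every
   byte it is given but reports characters rather than bytes. */
static int fake_console_write(const char *buf, int len)
{
    for (int i = 0; i < len && sink_len < sizeof sink; i++)
        sink[sink_len++] = buf[i];
    return count_chars(buf, len);
}

/* Simplified flush loop that trusts the return value as a byte count.
   Returns the number of write calls it needed. */
static int naive_flush(const char *buf, int len)
{
    int calls = 0;
    while (len > 0) {
        int written = fake_console_write(buf, len);
        if (written <= 0) break;
        buf += written;   /* wrong when written counts characters */
        len -= written;
        calls++;
    }
    return calls;
}
```

In this model, naive_flush("\xe2\x86\x95", 3) needs two write calls and leaves five bytes in sink: the complete three-byte sequence followed by its last two bytes again, which matches the repeated output described above.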

I don't think this issue should affect other kinds of I/O (e.g. file I/O) because OCaml doesn't expose the Windows extensions which allow you to enable UTF-8 and UTF-16 encoding on file handles. It's possible that a C stub which enabled them could cause the same effect, but I haven't investigated that.

Steps to reproduce

In order to see the issue, you must be using a UTF-8 enabled Command Prompt. This is achieved by starting cmd and running chcp 65001. You must also select a Unicode font from the Font tab of the Properties dialog for the Command Prompt - either Consolas or Lucida Console. If you leave the default "Raster Fonts" option, you won't see the problem.

From an OCaml top-level, simply execute Printf.printf "\xe2\x86\x95" and you will see three characters (↕) rather than just the one expected.

Curiously, C's printf function is not affected by the issue. If you compile the attached broken-write.c using i686-w64-mingw32-gcc -o broken-write.exe broken-write.c and run it in a Unicode-enabled console, you'll see printf correctly output just ?, followed by a C demonstration of what's going wrong in caml_putblock, which outputs ???

Additional information

There is something else going on in the runtime which affects the character encoding, because if the program [Printf.printf "\xe2\x86\x95"] is instead compiled using ocamlc/ocamlopt then the output is the more expected ???, but I haven't managed to trace precisely what's going on in the runtime to cause the mistranslation to ↕. Some kind of code page translation is going on in write but I can't see why or where it gets set. In opam (which is where I've hit this), I was seeing this change in behaviour a long way through execution - i.e. I was getting ??? at the start of the program and then suddenly (but consistently) Printf.printf started displaying ↕.

It's possible to detect the code page using GetConsoleCP and GetConsoleOutputCP and so at least realise that the problem may occur.
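A minimal sketch of that detection might look like the following. The helper names is_utf8_code_page and console_output_is_utf8 are hypothetical; GetConsoleOutputCP is the real Windows API call, and 65001 is the UTF-8 code page number (the one selected by chcp 65001). The non-Windows branch exists only so the fragment compiles elsewhere.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Code page number for UTF-8, as selected by "chcp 65001". */
#define UTF8_CODE_PAGE 65001u

/* Returns 1 if the given code page number is the UTF-8 code page. */
static int is_utf8_code_page(unsigned int cp)
{
    return cp == UTF8_CODE_PAGE;
}

/* Returns 1 if the console output code page is UTF-8, i.e. the return
   value of write may not be a byte count. Always 0 off Windows. */
static int console_output_is_utf8(void)
{
#ifdef _WIN32
    /* GetConsoleOutputCP returns 0 on failure, which is treated as
       "not UTF-8" here. */
    return is_utf8_code_page(GetConsoleOutputCP());
#else
    return 0;
#endif
}
```

This only detects that the condition holds; it does not by itself correct the count returned by write.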

File attachments
