
Output to windows console in Unicode mode crashes #2348

Closed
axbender opened this issue Mar 16, 2015 · 25 comments

Comments

@axbender

The following program crashes with the Windows console set to Unicode output (chcp 65001):

   let OUT_PAT = "■"   # U+25A0 BLACK SQUARE

   echo(OUT_PAT)   # ok
   write(stdout, OUT_PAT)   # Crash
 ■
 ■ Traceback (most recent call last)
test.nim(4)              test
system.nim(2260)         raiseEIO
Error: unhandled exception: cannot write string to file [IOError]
Error: execution of an external program failed

With code page 1252 the program correctly outputs:

 â–
 â–
@Araq
Member

Araq commented Mar 16, 2015

So? Is this really a Nim bug and not just a limitation of Windows?

@axbender
Author

To quote Sheldon Cooper: "Is that sarcasm?" ;-)

I know about the "peculiarities" of the UTF-8 support in Windows' console. But, given the fact that this (resulting from looking at Nim's generated C code)

#include <stdio.h>

char *OUT = "[■]";   // U+25A0, file stored in UTF-8 format

int main(void) {
   printf("%s\n", OUT);
   fwrite(OUT, 1, 5, stdout);   // 5 bytes: '[' + the 3 UTF-8 bytes of U+25A0 + ']'
   return 0;
}

works, and Nim's echo() delivers the correct results, I'd think that there's something wrong in Nim.
Please also see this post.

@Araq
Member

Araq commented Mar 17, 2015

Works for me, win32, windows 8.1, used chcp 65001 command to set the codepage.

@axbender
Author

Sorry, which program works for you, the first one in Nim? I've got Windows 7, 64-bit and MinGW64 here, where it definitely shows the error described in the initial post.

Would like to test that in a VM, but have never cross-compiled with Nim before: How would I cross-compile the source for win32 (my attempt with nim c --cpu:i386 --os:windows test.nim fails)?
I guess the standard nim.cfg (as well as mine) is missing the settings for MinGW64 (supposedly something like i386.windows.gcc = ...).

@Araq
Member

Araq commented Mar 17, 2015

The Nim program works for me.

How would I cross-compile the source for win32 (my attempt with nim c --cpu:i386 --os:windows test.nim fails)

Something like

i386.windows.gcc.exe = "gcc.exe -m32"
i386.windows.gcc.linkerexe = "gcc.exe -m32"

@axbender
Author

Doing so gives me:

In file included from g:\src\nim\lib\console\unused\nimcache\printbox.c:8:0:
D:\tools\nim\lib/nimbase.h:393:13: error: size of array 'assert_numbits' is negative
 typedef int assert_numbits[sizeof(NI) == sizeof(void*) && NIM_INTBITS == sizeof(NI)*8 ? 1 : -1];
             ^
In file included from g:\src\nim\lib\console\unused\nimcache\stdlib_system.c:8:0:
D:\tools\nim\lib/nimbase.h:393:13: error: size of array 'assert_numbits' is negative
 typedef int assert_numbits[sizeof(NI) == sizeof(void*) && NIM_INTBITS == sizeof(NI)*8 ? 1 : -1];
             ^
Error:  execution of an external program failed; rerun with --parallelBuild:1 to see the error message

That's the same result that I got before changing nim.cfg.
Which exact compiler version do you use (here: gcc (x86_64-win32-sjlj-rev0, Built by MinGW-W64 project) 4.9.2)?

@axbender
Author

I finally got it to compile with another version of the compiler (x32-4.8.1-release-posix-dwarf-rev5.7z) using nim c --cpu:i386, so this was (or rather is, as Dwarf and 4.8.1 are old releases) a C compiler problem. Executing the exe both in win32 and in win64 results in

 ■
 ■ Traceback (most recent call last)
printbox.nim(4)          printbox
system.nim(2260)         raiseEIO
Error: unhandled exception: cannot write string to file [IOError]

@Araq
Member

Araq commented Mar 17, 2015

Well if fwrite lies with its return value there is little we can do except to disable this check for your particular machine. Note that you simply ignore the return value in your C program.

@axbender
Author

Hm, here fwrite returns 3 (which would be correct if it counted UTF-8 code points, which aren't bytes, I know...), but aside from that, how does echo do it?

Btw. just tested this (using the same EXEs) on Windows 8.1: it completely ignored the chcp 65001 setting and just output in Windows-1252.
Windows 10: Both EXEs work as intended (no distortion, no traceback).

So, it's going to become better with Windows 10, yet it's inconsistent within Nim (echo vs write).

Again, as you said it worked for you in Windows 8.1, which compiler toolchain do you use? Is there a way to bootstrap with VC?

@Araq
Member

Araq commented Mar 17, 2015

--cc:vcc. I tested it with gcc version 4.8.1 (rev5, Built by MinGW-W64 project)

@Varriount
Contributor

Just confirming, you're using the command prompt, right? Not Mingw's own console?

@axbender
Author

@Varriount: Yes, Windows's console.

@Araq
Member

Araq commented Mar 22, 2015

So, it's going to become better with Windows 10, yet it's inconsistent within Nim (echo vs write).

Well echo doesn't check the return value, write does. I don't want to disable this additional check. So ... any ideas?

@axbender
Author

I can fully understand that you vote against less checking, but if the output is the console (and not a "real" file), I think one could go without these extra checks (especially since the count fails for UTF-8 characters). The output would usually be visible immediately, so the user knows that something went wrong.

@dom96
Contributor

dom96 commented Oct 17, 2015

There are some serious unicode issues with write.

import terminal
let OUT_PAT = "■"

echo(OUT_PAT)   # ok
write(stdout, OUT_PAT) # just a box

Other characters like "┨" don't appear at all. It's even worse in Git bash, not sure what differences there are between cmd.exe and it.

@johnnovak
Contributor

Confirmed, write produces incorrect output with Unicode strings, while echo seems to work fine. I attempted to fix this but it turned out echo is defined as a special built-in in system.nim:

when defined(nimvarargstyped):
  proc echo*(x: varargs[typed, `$`]) {.magic: "Echo", tags: [WriteIOEffect],
    benign, sideEffect.}
  ...
else:
  proc echo*(x: varargs[expr, `$`]) {.magic: "Echo", tags: [WriteIOEffect],
    benign, sideEffect.}

Any ideas how to proceed from here? Getting Unicode to work correctly is a pretty standard thing in 2016. We want to avoid any association with PHP and the likes of it, don't we? :)

@Araq
Member

Araq commented Nov 27, 2016

As far as I'm concerned we should use the Win API directly for our IO layers and get rid of the libc dependency.

To answer your question, look at ccgexprs.nim, search for "mEcho" or "genEcho".

@dom96 dom96 added the Severe label Aug 23, 2017
@dom96
Contributor

dom96 commented Aug 23, 2017

Switching to high priority as it seems there is demand for a fix here.

@johnnovak
Contributor

Regarding that HN comment, the original poster shared his workaround which raises a few interesting issues:

import encodings
var hello1 = convert("Hellø, wørld!", "850", "UTF-8")

# Doesn't work - seems to think current codepage is utf8.
var hello2 = convert("Hellø, wørld!", getCurrentEncoding(), "UTF-8")

# Outputs correct text:
echo hello1
# Outputs corrupted text:
echo hello2

The above code works on my machine as indicated by the comments because my Windows console defaults to codepage 850 (this can be checked with the chcp DOS command).

The reason why getCurrentEncoding() produces the wrong output is because Windows differentiates between the system code page (GetACP), console input code page (GetConsoleCP) and console output code page (GetConsoleOutputCP). In the current discussion, we're only concerned with the system and console output code pages.

The following code prints out the values of all these three code pages, plus the value returned by getCurrentEncoding():

import encodings
import windows

var inCP = GetConsoleCP()
echo "Console input codepage:  " & $inCP

var outCP = GetConsoleOutputCP()
echo "Console output codepage: " & $outCP

var sysCP = GetACP()
echo "System codepage:         " & $sysCP

echo "getCurrentEncoding()     " & getCurrentEncoding()

On my system this gives:

Console input codepage:  850
Console output codepage: 850
System codepage:         1252
getCurrentEncoding()     windows-1252

Now, the problem is that getCurrentEncoding calls getACP() internally, which is just wrong when used in conjunction with console output for the above reasons:

proc getCurrentEncoding*(): string =
  ## retrieves the current encoding. On Unix, always "UTF-8" is returned.
  when defined(windows):
    result = codePageToName(getACP())
  else:
    result = "UTF-8"

I would say getCurrentEncoding is an ill-defined function because it's unclear whether it queries the system default encoding or the encoding of the current console. Renaming it to getDefaultEncoding would probably be a good idea if we wanted to keep it.

As for our problem, users nowadays expect to be able to print a UTF-8 string to the console without problems. I'm not sure that converting to the console output codepage on Windows behind the scenes is a good idea, because if you want to use the same echo and write commands for both file and terminal I/O you could run into complications. E.g. the conversion should happen when the output goes to the console, but it should not when the output is redirected to a file...
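The console-vs-redirect distinction could be probed with the usual tty check before deciding whether to convert; a portable C sketch (the is_console helper is hypothetical, not an existing Nim or Win32 API):

```c
#include <stdio.h>

#ifdef _WIN32
  #include <io.h>
  #define IS_TTY(f) _isatty(_fileno(f))
#else
  #include <unistd.h>
  #define IS_TTY(f) isatty(fileno(f))
#endif

/* 1 when the stream is an interactive console (a codepage conversion would
   make sense), 0 when it is redirected to a file or pipe (the raw UTF-8
   bytes should be written unchanged). */
int is_console(FILE *f) {
    return IS_TTY(f) ? 1 : 0;
}
```

echo/write could perform a check like this once per stream and skip any behind-the-scenes conversion when it returns 0.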

Maybe simply setting the console output codepage to UTF-8 would be the best way to keep things simple? I guess this would make the vast majority of users happy.

The only problem with this approach is that stdout.write crashes when setting the output codepage to UTF-8 for unknown reasons...

import windows

discard SetConsoleOutputCP(65001)

echo "Hellø, wørld!"   # works OK
stdout.write("Hellø, wørld!")  # fails with an exception

Also, see this SO for further explanation:
https://stackoverflow.com/questions/43189210/why-ansi-code-page-and-console-code-page-are-different

And it seems like the Go folks have encountered the exact same problem a while ago:
golang/go#16857

@Araq
Member

Araq commented Sep 2, 2017

I would say getCurrentEncoding is an ill-defined function because it's unclear whether it queries the system default encoding or the encoding of the current console. Renaming it to getDefaultEncoding would probably be a good idea if we wanted to keep it.

It's the system encoding. I fail to see the advantage in renaming it to getDefaultEncoding, we can just improve the docs. We also need getConsoleEncoding for Windows, on Posix we can map it to getCurrentEncoding.
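A sketch of the proposed getConsoleEncoding split, in C for illustration (the function name follows the proposal; the "cpNNN" formatting of the result is my assumption, not an agreed interface):

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Proposed split: query the *console output* codepage on Windows
   (GetConsoleOutputCP, not GetACP), and fall through to the Posix
   behaviour of getCurrentEncoding everywhere else. */
const char *get_console_encoding(void) {
#ifdef _WIN32
    static char buf[16];
    snprintf(buf, sizeof buf, "cp%u", (unsigned)GetConsoleOutputCP());
    return buf;
#else
    return "UTF-8";  /* matches getCurrentEncoding on Posix */
#endif
}
```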

As for our problem, users expect to be able to print an UTF8 string to the console without problems nowadays.

No, some programmers might expect that, actual users do not use terminal apps at all on Windows. ;-)

The same programmers who use "Git bash" (see above), which does not care about GetConsoleOutputCP() afaik.

I think we should patch start.bat so that it sets the code page to UTF-8.

@dom96
Contributor

dom96 commented Sep 2, 2017

It's the system encoding. I fail to see the advantage in renaming it to getDefaultEncoding, we can just improve the docs. We also need getConsoleEncoding for Windows, on Posix we can map it to getCurrentEncoding.

👍

No, some programmers might expect that, actual users do not use terminal apps at all on Windows. ;-)

Sure, but we should still support it.

I think we should patch start.bat so that it sets the code page to UTF-8.

How many people actually use this file on Windows?

@Araq
Member

Araq commented Sep 2, 2017

How many people actually use this file on Windows?

It's what the installer puts into the start menu fwiw.

@johnnovak
Contributor

Okay, I got maybe a little carried away with my proposal... I agree that the UTF-8 situation is kinda crap on the Windows console and we won't be able to handle it in a 100% satisfactory way (e.g. even if Nim handled UTF-8 perfectly, the default Lucida Console font only supports a very small subset of Unicode, and most users don't bother installing a better font...).

Setting the console output code page to 65001 is probably the best option, I agree. Doing it in start.bat is not a bad idea; that takes care of the dev environment and if someone creates a command line tool, they should make it clear in the documentation that Windows users should do a chcp 65001 before running the tool, or even provide a wrapper batch script (and maybe also suggest to use a better font).

As for @dom96 's question, I'm always using start.bat when I'm using Nim on Windows.

Alternatively, there could be a built-in mechanism in the runtime to set the code page to 65001 at startup and then restore it to the old value at exit. I'm not totally convinced this is a good idea, though. Something like this (tested it and works fine):

import windows

var oldCP = GetConsoleOutputCP()
discard SetConsoleOutputCP(65001)

var s = "iÄäÜüß ЯБГДЖЙ"
echo s # ok
stdout.write(s) # crashes

discard SetConsoleOutputCP(oldCP)

Oh, and the write crash should be fixed regardless of what we do.

@Araq
Member

Araq commented Sep 4, 2017

Oh, and the write crash should be fixed regardless of what we do.

I think that's a mingw/libc bug, not easy to fix.

@Araq
Member

Araq commented Oct 16, 2017

The original problem is still not reproducible. Closing.

@Araq Araq closed this as completed Oct 16, 2017