
Output to windows console in Unicode mode crashes #2348

Closed
axbender opened this issue Mar 16, 2015 · 25 comments

Comments

@axbender

The following program crashes with the Windows console set to Unicode output (chcp 65001):

   let OUT_PAT = "■"   # U+25A0 BLACK SQUARE

   echo(OUT_PAT)   # ok
   write(stdout, OUT_PAT)   # Crash
 ■
 ■ Traceback (most recent call last)
test.nim(4)              test
system.nim(2260)         raiseEIO
Error: unhandled exception: cannot write string to file [IOError]
Error: execution of an external program failed

With code page 1252 the program correctly outputs:

 â–
 â–
@Araq
Member

Araq commented Mar 16, 2015

So? Is this really a Nim bug and not just a limitation of Windows?

@axbender
Author

To quote Sheldon Cooper: "Is that sarcasm?" ;-)

I know about the "peculiarities" of the UTF-8 support in Windows' console. But, given the fact that this (resulting from looking at Nim's generated C code)

#include <stdio.h>

char *OUT = "[■]";   // U+25A0, file stored in UTF-8 format

int main(void) {
   printf("%s\n", OUT);
   fwrite(OUT, 1, 5, stdout);   // 5 bytes: '[' + the 3 UTF-8 bytes of U+25A0 + ']'
   return 0;
}

works, and Nim's echo() delivers the correct results, I'd think that there's something wrong in Nim.
Please also see this post.

@Araq
Member

Araq commented Mar 17, 2015

Works for me, win32, windows 8.1, used chcp 65001 command to set the codepage.

@axbender
Author

Sorry, which program works for you, the first one in Nim? I've got Windows 7, 64-bit and MinGW64 here, where it definitely shows the error described in the initial post.

Would like to test that in a VM, but have never cross-compiled with Nim before: How would I cross-compile the source for win32 (my attempt with nim c --cpu:i386 --os:windows test.nim fails)?
I guess the standard nim.cfg (as well as mine) is missing the settings for MinGW64 (supposedly something like i386.windows.gcc = ...).

@Araq
Member

Araq commented Mar 17, 2015

The Nim program works for me.

How would I cross-compile the source for win32 (my attempt with nim c --cpu:i386 --os:windows test.nim fails)

Something like

i386.windows.gcc.exe = "gcc.exe -m32"
i386.windows.gcc.linkerexe = "gcc.exe -m32"

@axbender
Author

Doing so gives me:

In file included from g:\src\nim\lib\console\unused\nimcache\printbox.c:8:0:
D:\tools\nim\lib/nimbase.h:393:13: error: size of array 'assert_numbits' is negative
 typedef int assert_numbits[sizeof(NI) == sizeof(void*) && NIM_INTBITS == sizeof(NI)*8 ? 1 : -1];
             ^
In file included from g:\src\nim\lib\console\unused\nimcache\stdlib_system.c:8:0:
D:\tools\nim\lib/nimbase.h:393:13: error: size of array 'assert_numbits' is negative
 typedef int assert_numbits[sizeof(NI) == sizeof(void*) && NIM_INTBITS == sizeof(NI)*8 ? 1 : -1];
             ^
Error:  execution of an external program failed; rerun with --parallelBuild:1 to see the error message

That's the same result that I got before changing nim.cfg.
Which exact compiler version do you use (here: gcc (x86_64-win32-sjlj-rev0, Built by MinGW-W64 project) 4.9.2)?

@axbender
Author

I finally got it to compile with another version of the compiler (x32-4.8.1-release-posix-dwarf-rev5.7z) using nim c --cpu:i386, so this was (or rather is, as Dwarf and 4.8.1 are old releases) a C compiler problem. Executing the exe both in win32 and in win64 results in

 ■
 ■ Traceback (most recent call last)
printbox.nim(4)          printbox
system.nim(2260)         raiseEIO
Error: unhandled exception: cannot write string to file [IOError]

@Araq
Member

Araq commented Mar 17, 2015

Well if fwrite lies with its return value there is little we can do except to disable this check for your particular machine. Note that you simply ignore the return value in your C program.

@axbender
Author

Hm, here fwrite returns 3 (which would be correct if it counted UTF-8 code points, which aren't bytes, I know...), but aside from that, how does echo do it?

Btw. just tested this (using the same EXEs) on Windows 8.1: it completely ignored the chcp 65001 setting and just output in Windows-1252.
Windows 10: Both EXEs work as intended (no distortion, no traceback).

So, it's going to become better with Windows 10, yet it's inconsistent within Nim (echo vs write).

Again, as you said it worked for you in Windows 8.1, which compiler toolchain do you use? Is there a way to bootstrap with VC?

@Araq
Member

Araq commented Mar 17, 2015

--cc:vcc. I tested it with gcc version 4.8.1 (rev5, Built by MinGW-W64 project)

@Varriount
Contributor

Just confirming, you're using the command prompt, right? Not Mingw's own console?

@axbender
Author

@Varriount: Yes, Windows's console.

@Araq
Member

Araq commented Mar 22, 2015

So, it's going to become better with Windows 10, yet it's inconsistent within Nim (echo vs write).

Well echo doesn't check the return value, write does. I don't want to disable this additional check. So ... any ideas?

@axbender
Author

I can fully understand that you vote against less checking, but if the output is the console (and not a "real" file), I think one could go without these extra checks (especially since the count fails for UTF-8 characters). The output would usually be visible immediately, so the user knows that something went wrong.

@dom96
Contributor

dom96 commented Oct 17, 2015

There are some serious unicode issues with write.

import terminal
let OUT_PAT = "■"

echo(OUT_PAT)   # ok
write(stdout, OUT_PAT) # just a box

Other characters like "┨" don't appear at all. It's even worse in Git bash, not sure what differences there are between cmd.exe and it.

@johnnovak
Contributor

Confirmed, write produces incorrect output with Unicode strings, while echo seems to work fine. I attempted to fix this but it turned out echo is defined as a special built-in in system.nim:

when defined(nimvarargstyped):
  proc echo*(x: varargs[typed, `$`]) {.magic: "Echo", tags: [WriteIOEffect],
    benign, sideEffect.}
  ...
else:
  proc echo*(x: varargs[expr, `$`]) {.magic: "Echo", tags: [WriteIOEffect],
    benign, sideEffect.}

Any ideas how to proceed from here? Getting Unicode to work correctly is a pretty standard thing in 2016. We want to avoid any association with PHP and the likes of it, don't we? :)

@Araq
Member

Araq commented Nov 27, 2016

As far as I'm concerned we should use the Win API directly for our IO layers and get rid of the libc dependency.

To answer your question, look at ccgexprs.nim, search for "mEcho" or "genEcho".

@dom96 dom96 added the Severe label Aug 23, 2017
@dom96
Contributor

dom96 commented Aug 23, 2017

Switching to high priority as it seems there is demand for a fix here.

@johnnovak
Contributor

Regarding that HN comment, the original poster shared his workaround which raises a few interesting issues:

import encodings
var hello1 = convert("Hellø, wørld!", "850", "UTF-8")

# Doesn't work - seems to think current codepage is utf8.
var hello2 = convert("Hellø, wørld!", getCurrentEncoding(), "UTF-8")

# Outputs correct text:
echo hello1
# Outputs corrupted text:
echo hello2

The above code works on my machine as indicated by the comments because my Windows console defaults to codepage 850 (this can be checked with the chcp DOS command).

The reason why getCurrentEncoding() produces the wrong output is because Windows differentiates between the system code page (GetACP), console input code page (GetConsoleCP) and console output code page (GetConsoleOutputCP). In the current discussion, we're only concerned with the system and console output code pages.

The following code prints out the values of all these three code pages, plus the value returned by getCurrentEncoding():

import encodings
import windows

var inCP = GetConsoleCP()
echo "Console input codepage:  " & $inCP

var outCP = GetConsoleOutputCP()
echo "Console output codepage: " & $outCP

var sysCP = GetACP()
echo "System codepage:         " & $sysCP

echo "getCurrentEncoding()     " & getCurrentEncoding()

On my system this gives:

Console input codepage:  850
Console output codepage: 850
System codepage:         1252
getCurrentEncoding()     windows-1252

Now, the problem is that getCurrentEncoding calls getACP() internally, which is just wrong when used in conjunction with console output for the above reasons:

proc getCurrentEncoding*(): string =
  ## retrieves the current encoding. On Unix, always "UTF-8" is returned.
  when defined(windows):
    result = codePageToName(getACP())
  else:
    result = "UTF-8"

I would say getCurrentEncoding is an ill-defined function because it's unclear whether it queries the system default encoding or the encoding of the current console. Renaming it to getDefaultEncoding would probably be a good idea if we wanted to keep it.

As for our problem, users nowadays expect to be able to print a UTF-8 string to the console without problems. I'm not sure that converting to the console output codepage on Windows behind the scenes is a good idea, because if you want to use the same echo and write commands for both file and terminal I/O you could run into complications. E.g. the conversion should happen when the output goes to the console, but it should not when the output is redirected to a file...
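The console-vs-redirect distinction could be probed with the usual tty check before deciding whether to convert; a portable C sketch (the is_console helper is hypothetical, not an existing Nim or Win32 API):

```c
#include <stdio.h>

#ifdef _WIN32
  #include <io.h>
  #define IS_TTY(f) _isatty(_fileno(f))
#else
  #include <unistd.h>
  #define IS_TTY(f) isatty(fileno(f))
#endif

/* 1 when the stream is an interactive console (a codepage conversion would
   make sense), 0 when it is redirected to a file or pipe (the raw UTF-8
   bytes should be written unchanged). */
int is_console(FILE *f) {
    return IS_TTY(f) ? 1 : 0;
}
```

echo/write could perform a check like this once per stream and skip any behind-the-scenes conversion when it returns 0.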

Maybe simply setting the console output codepage to UTF-8 would be the best way to keep things simple? I guess this would make the vast majority of users happy.

The only problem with this approach is that stdout.write crashes when setting the output codepage to UTF-8 for unknown reasons...

import windows

discard SetConsoleOutputCP(65001)

echo "Hellø, wørld!"   # works OK
stdout.write("Hellø, wørld!")  # fails with an exception

Also, see this SO for further explanation:
https://stackoverflow.com/questions/43189210/why-ansi-code-page-and-console-code-page-are-different

And it seems like the Go folks have encountered the exact same problem a while ago:
golang/go#16857

@Araq
Member

Araq commented Sep 2, 2017

I would say getCurrentEncoding is an ill-defined function because it's unclear whether it queries the system default encoding or the encoding of the current console. Renaming it to getDefaultEncoding would probably be a good idea if we wanted to keep it.

It's the system encoding. I fail to see the advantage in renaming it to getDefaultEncoding, we can just improve the docs. We also need getConsoleEncoding for Windows, on Posix we can map it to getCurrentEncoding.
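A sketch of the proposed getConsoleEncoding split, in C for illustration (the function name follows the proposal; the "cpNNN" formatting of the result is my assumption, not an agreed interface):

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Proposed split: query the *console output* codepage on Windows
   (GetConsoleOutputCP, not GetACP), and fall through to the Posix
   behaviour of getCurrentEncoding everywhere else. */
const char *get_console_encoding(void) {
#ifdef _WIN32
    static char buf[16];
    snprintf(buf, sizeof buf, "cp%u", (unsigned)GetConsoleOutputCP());
    return buf;
#else
    return "UTF-8";  /* matches getCurrentEncoding on Posix */
#endif
}
```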

As for our problem, users expect to be able to print an UTF8 string to the console without problems nowadays.

No, some programmers might expect that, actual users do not use terminal apps at all on Windows. ;-)

The same programmers who use "Git bash" (see above), which does not care about GetConsoleOutputCP() afaik.

I think we should patch start.bat so that it sets the code page to UTF-8.

@dom96
Contributor

dom96 commented Sep 2, 2017

It's the system encoding. I fail to see the advantage in renaming it to getDefaultEncoding, we can just improve the docs. We also need getConsoleEncoding for Windows, on Posix we can map it to getCurrentEncoding.

👍

No, some programmers might expect that, actual users do not use terminal apps at all on Windows. ;-)

Sure, but we should still support it.

I think we should patch start.bat so that it sets the code page to UTF-8.

How many people actually use this file on Windows?

@Araq
Member

Araq commented Sep 2, 2017

How many people actually use this file on Windows?

It's what the installer puts into the start menu fwiw.

@johnnovak
Contributor

Okay, I got maybe a little carried away with my proposal... I agree that the UTF-8 situation is kinda crap on the Windows console and we won't be able to handle it in a 100% satisfactory way (e.g. even if Nim handled UTF-8 perfectly, the default Lucida Console font only supports a very small subset of Unicode, and most users don't bother installing a better font...).

Setting the console output code page to 65001 is probably the best option, I agree. Doing it in start.bat is not a bad idea; that takes care of the dev environment and if someone creates a command line tool, they should make it clear in the documentation that Windows users should do a chcp 65001 before running the tool, or even provide a wrapper batch script (and maybe also suggest to use a better font).

As for @dom96 's question, I'm always using start.bat when I'm using Nim on Windows.

Alternatively, there could be a built-in mechanism in the runtime to set the code page to 65001 at startup and then restore it to the old value at exit. I'm not totally convinced this is a good idea, though. Something like this (tested it and works fine):

import windows

var oldCP = GetConsoleOutputCP()
discard SetConsoleOutputCP(65001)

var s = "iÄäÜüß ЯБГДЖЙ"
echo s # ok
stdout.write(s) # crashes

discard SetConsoleOutputCP(oldCP)

Oh, and the write crash should be fixed regardless of what we do.

@Araq
Member

Araq commented Sep 4, 2017

Oh, and the write crash should be fixed regardless of what we do.

I think that's a mingw/libc bug, not easy to fix.

@Araq
Member

Araq commented Oct 16, 2017

The original problem is still not reproducible. Closing.

@Araq Araq closed this as completed Oct 16, 2017