Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix support for response files & switch the compiler fully to UTF-8 on Windows #3086

Merged
merged 11 commits into from Aug 9, 2019

Conversation

kinke
Copy link
Member

@kinke kinke commented Jun 1, 2019

Previously, -lowmem and, on Windows, all --DRT-* options needed to be direct cmdline args (for DMD too), occurrences in response files were ignored. That's a serious issue for dub, as AFAICT it always uses a response file with all actual args.

-lowmem was an easy fix, as we can expand the response files already in C main and unlike DMD don't need an initialized druntime.

The --DRT-* options on Windows are more problematic, as _d_run_main ignored the passed args and instead asked for the process' real arguments (as it needs them in original UTF-16 for proper conversion to UTF-8, as expected for D strings...).

@kinke
Copy link
Member Author

kinke commented Jun 1, 2019

@rainers: Please have a quick look at the 2nd commit (incl. druntime changes); that was necessary to make --DRT-* work in a response file (!). Do you think this (main => wmain, different _d_run_main signature) has a chance to be upstreamed?

@kinke kinke force-pushed the rspFiles branch 2 times, most recently from fa7e4b5 to cb0f95e Compare June 2, 2019 02:16
@rainers
Copy link
Contributor

rainers commented Jun 2, 2019

I suspect _d_run_main is used explicitly by a number of applications.

Wouldn't it be easier and less disruptive if _d_run_main is split into two functions, one that does the argument processing and calls the other with UTF8 args?

extern (C) int _d_run_main(int argc, char **argv, MainFunc mainFunc)
{
   version (Windows)
   {
        auto wargs = CommandLineToArgvW(wCommandLine, &wargc);
        // ...
   }
    else version (Posix)
    {
        // Allocate args[] on the stack
        // ...
    }
    return _d_run_main2(nargc, nargv, mainFunc);
}

extern (C) int _d_run_main2(int argc, char **argv, MainFunc mainFunc)
{
    // FPU init
    // ...
    tryExec(&runAll);
    // ...
}

so that you could call _d_run_main2 with modified arguments.

@kinke
Copy link
Member Author

kinke commented Jun 2, 2019

The problem with that approach is that core.runtime.Runtime.cArgs().argv would be a char** (encoded in current code page; status quo) if calling the old _d_run_main on Windows, and a UTF-16 wchar** if calling a hypothetical _d_run_wmain with a wchar** argv. To make it a UTF-8 char** in the latter case, I'd need to allocate yet another array on the stack (and null-terminate the individual args), just for that one pointer.

@rainers
Copy link
Contributor

rainers commented Jun 2, 2019

I don't think you need to change Runtime.cArgs (but might have a hard time setting it?). This change https://github.com/ldc-developers/ldc/pull/3086/files#diff-d56ab8d5a02d7348ab604507ca14063cR1122 seems to suggest that you want UTF8 arguments anyway, so the frontend shouldn't use cArgs, too. So it's probably even ok to not set it at all as it is unused.

With my suggestion you might have to copy some of the argument handling in dmain2, but I don't think that is too bad. Changing _d_run_main to accept wchar is unlikely to be accepted as it breaks code.

@kinke
Copy link
Member Author

kinke commented Jun 3, 2019

I could take a step back and only invoke a new _d_run_wmain for LDC and LDMD themselves for now, not in implicit __entrypoint.d for all D executables. Then I could really get away with not setting Runtime.cArgs().argv for now [although null-terminating the individual UTF-8 args and malloc'ing the argv array doesn't seem that bad].

@kinke
Copy link
Member Author

kinke commented Jun 3, 2019

Wrt. UTF-8, I checked what LLVM does when we use it to start a new child process, such as LDC from LDMD, with narrow args - it uses CreateProcessW and converts the supplied args from UTF-8 to UTF-16. So that's a potential issue with current LDMD/LDC (#611) when forwarding command-line args to a child process, as the source encoding is the current code page. That encoding is expected for the regular narrow MSCRT and Windows APIs (filenames etc.) though, not sure if we (incl. DMD front-end) use those anywhere directly.

@rainers
Copy link
Contributor

rainers commented Jun 5, 2019

Changes to druntime look much better now. _d_run_wmain doesn't have to be in dmain2.d, maybe just keep it in ldc?

BTW: does this handle DFLAGS, too?

AFAICT this uses UTF8 filenames internally on Windows, now. Does this work correctly with console output (without telling the user to change the codepage)?

@kinke
Copy link
Member Author

kinke commented Jun 6, 2019

Thx Rainer, very valid points - reading/setting env variables would need to happen via the wide API + converting to/from UTF8.

Wrt. console, I was shocked to see that I've apparently never tried to write non-ASCII stuff to the console (in D). How on earth can a SetConsoleOutputCP(CP_UTF8) (not sure if SetConsoleCP() for stdin should be used too) not be one of the first things done when initializing druntime?!

@kinke kinke force-pushed the rspFiles branch 3 times, most recently from 757cdac to c2cf661 Compare June 9, 2019 11:19
@kinke
Copy link
Member Author

kinke commented Jun 9, 2019

Seems to work as intended (LDC install dir can contain umlauts, source files in cmdline too etc.); console is switched to UTF8 and then reset before exiting. The only thing I noticed is that fprintf(stderr, ...) to the console stops at the first umlaut (working fine if redirected to file => UTF-8), while outputting the exact same string via fprintf(stdout, ...) works fine. The stream modes (_setmode()) are the same (_O_TEXT).

@rainers
Copy link
Contributor

rainers commented Jun 13, 2019

How on earth can a SetConsoleOutputCP(CP_UTF8) (not sure if SetConsoleCP() for stdin should be used too) not be one of the first things done when initializing druntime?!

IIRC switching the console codepage had impact on the full display rendering text shown via the ANSI code page to unreadable characters. I don't see that now, though. I suspect the console has improved with Windows 10.

@rainers
Copy link
Contributor

rainers commented Jun 13, 2019

The only thing I noticed is that fprintf(stderr, ...) to the console stops at the first umlaut (working fine if redirected to file => UTF-8), while outputting the exact same string via fprintf(stdout, ...) works fine.

Probably related: https://issues.dlang.org/show_bug.cgi?id=1448, but I cannot reproduce on Win10.

@kinke
Copy link
Member Author

kinke commented Jun 13, 2019

IIRC switching the console codepage had impact on the full display rendering text shown via the ANSI code page to unreadable characters. I don't see that now, though. I suspect the console has improved with Windows 10.

The only negative thing I've noticed is that the whole console switches to some extremely ugly font if the previous one was a raster font, incl. a different window size. Once a proper font is set as default, that's not an issue anymore, but sure, some people will probably wonder.

Probably related: https://issues.dlang.org/show_bug.cgi?id=1448, but I cannot reproduce on Win10.

Thx for digging that out. This should be reproducible:

import core.stdc.stdio;
import core.sys.windows.wincon, core.sys.windows.winnls;

void main()
{
    const oldCP = SetConsoleOutputCP(CP_UTF8);
    scope(exit) SetConsoleOutputCP(oldCP);

    fprintf(stdout, "HellöѬ LDC\n");
    fflush(stdout);

    fprintf(stderr, "HellöѬ LDC\n");
    fflush(stderr);
}

=>

HellöѬ LDC
Hell

@rainers
Copy link
Contributor

rainers commented Jun 15, 2019

The only negative thing I've noticed is that the whole console switches to some extremely ugly font if the previous one was a raster font, incl. a different window size. Once a proper font is set as default, that's not an issue anymore, but sure, some people will probably wonder.

That's what I remember but don't see now. There is a single raster font installed, though.

Probably related: https://issues.dlang.org/show_bug.cgi?id=1448, but I cannot reproduce on Win10.

Thx for digging that out. This should be reproducible:

Works here, Win 10 1809

@kinke
Copy link
Member Author

kinke commented Jun 15, 2019

Works here, Win 10 1809

Well that's interesting. I'm at v1803. Such recent improvements in this area after decades of maintaining the crappy status quo would be really surprising.

@rainers
Copy link
Contributor

rainers commented Jun 15, 2019

Well that's interesting. I'm at v1803. Such recent improvements in this area after decades of maintaining the crappy status quo would be really surprising.

Seems like it's happening: https://devblogs.microsoft.com/commandline/windows-command-line-unicode-and-utf-8-output-text-buffer/

@kinke
Copy link
Member Author

kinke commented Jun 15, 2019

Indeed. Running v1903 now (how do they manage to make an OS update that slow?!), and the previously built binary now outputs to stderr correctly.

@kinke
Copy link
Member Author

kinke commented Jul 27, 2019

druntime addition proposed upstream: dlang/druntime#2701

@kinke kinke changed the title WIP: Fix support for response files Fix support for response files & switch the compiler fully to UTF-8 on Windows Aug 7, 2019
@kinke
Copy link
Member Author

kinke commented Aug 7, 2019

Front-end fixes proposed upstream: dlang/dmd#10288

This is required for dub, which appears to always use a response file
containing all command-line arguments.
only)

This makes _d_wrun_main (cherry-picked from dlang/druntime#2701) use the
provided args directly instead of the process's real arguments (on
Windows) - if the host D compiler supports it.
E.g., this is required when passing --DRT-* options from a response file
to _d_wrun_main.

As a major change, the encoding of the Windows cmdline arguments is
switched from the current codepage to UTF-8.

Note that the MinGW-based libs currently only provide narrow CRT entry
points.
Conforming to its documentation, although the previous implementation
assumed filenames in the current code page.
E.g., this disables it for __entrypoint.d (the C main function(s) in
there), which is required.

A wmain on Windows is not detected as FuncDeclaration.isCMain() yet,
that should be fixed too.
For proper in/output of narrow UTF-8 strings.
There might be more occurrences.
And use the wide API for pure is-env-variable-set checks too, as the
first call to a narrow env API function would lead to the C runtime
preparing and maintaining both narrow and wide environments.
Thereby preventing to use fopen() on Windows, which expects a narrow
string in the current code page, not necessarily UTF-8.
@kinke
Copy link
Member Author

kinke commented Aug 9, 2019

I tried adding a Unicode lit-test, but lit/Python 2.7 (?) doesn't support testfile names with Unicode chars here on my Windows box, so not pursuing that brittle path any further (there was another issue, an LLVM encoding bug when converting llvm.used symbols to MS linker directives...).

@kinke kinke merged commit 20a11c2 into ldc-developers:master Aug 9, 2019
@kinke kinke deleted the rspFiles branch August 9, 2019 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants