Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: exit status 0xC0000005 from test process on windows-amd64-longtest builder #38440

Closed
bcmills opened this issue Apr 14, 2020 · 81 comments
Closed

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Apr 14, 2020

Observed in 2020-04-10T16:24:46-ea7126f/windows-amd64-longtest:

windows-amd64-longtest …
…
--- FAIL: TestTestCache (4.15s)
    go_test.go:2405: 
        
        INITIAL
        
        
    go_test.go:2413: running testgo [test -x -v -short t/...]
    go_test.go:2413: standard output:
    go_test.go:2413: exit status 3221225477
        FAIL	t/t1	0.486s
        exit status 3221225477
        FAIL	t/t2	0.481s
        === RUN   Test3
            Test3: t3_test.go:6: 1
        --- PASS: Test3 (0.00s)
        PASS
        ok  	t/t3	0.462s
        === RUN   Test4
            Test4: t4_test.go:6: 1
        --- PASS: Test4 (0.00s)
        PASS
        ok  	t/t4	0.467s
        FAIL
        
    go_test.go:2413: standard error:

The FAIL in t/t1 is unexpected in this test.

Exit status 3221225477 is 0xC0000005, which some cursory searching seems to suggest is a generic “access violation” error. That suggests possible memory corruption.

I haven't seen any repeats of this error so far.

CC @alexbrainman @zx2c4

@bcmills
Copy link
Member Author

@bcmills bcmills commented Apr 14, 2020

2020-04-13T22:38:56-ca017a6/windows-amd64-longtest

--- FAIL: TestScript (0.01s)
    --- FAIL: TestScript/test_go111module_cache (1.22s)
        script_test.go:205: 
            > env GO111MODULE=on
            > go mod init foo
            [stderr]
            go: creating new go.mod: module foo
            > go test
            [stdout]
            exit status 3221225477
            FAIL	foo	0.267s
            [exit status 1]
            FAIL: testdata\script\test_go111module_cache.txt:3: unexpected command failure
            
FAIL
FAIL	cmd/go	95.006s
@bcmills
Copy link
Member Author

@bcmills bcmills commented Apr 14, 2020

That makes two in the last month, and none from the entire year before. Absent a specific theory of what's wrong, I think we need to treat this as a potential 1.15 compiler regression.

CC @randall77 @josharian

@bcmills bcmills modified the milestones: Unplanned, Go1.15 Apr 14, 2020
@josharian
Copy link
Contributor

@josharian josharian commented Apr 15, 2020

Or a runtime regression. The other repeating mystery failure is on windows/386, which has windows in common. I took a quick peek through git history around 1/9/2020 and CL 213837 might be a candidate(?).

@bcmills
Copy link
Member Author

@bcmills bcmills commented Apr 15, 2020

Ooh, CL 213837 is an interesting hypothesis. (CC @aclements @cherrymui @dr2chase @ianlancetaylor).

(For reference, the windows/386 issue is #37360.)

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Apr 16, 2020

There is also mysterious #36492 bug.

Alex

@bcmills
Copy link
Member Author

@bcmills bcmills commented Apr 22, 2020

@bcmills
Copy link
Member Author

@bcmills bcmills commented Apr 30, 2020

2020-04-30T02:29:55-eda6fe3/windows-amd64-longtest
2020-04-29T21:29:17-faafdf5/windows-amd64-longtest
2020-04-29T21:29:08-b3863fb/windows-amd64-longtest
2020-04-29T20:33:31-197a2a3/windows-amd64-longtest
2020-04-29T19:03:10-35dce1d/windows-amd64-longtest
2020-04-29T18:56:19-b13ce66/windows-amd64-longtest
2020-04-29T18:50:32-202c43b/windows-amd64-longtest
2020-04-29T08:57:33-e1a96b8/windows-amd64-longtest
2020-04-28T18:41:07-39380e8/windows-amd64-longtest
2020-04-27T20:06:57-d2f5e4e/windows-amd64-longtest
2020-04-27T18:19:26-a136919/windows-amd64-longtest
2020-04-27T17:49:51-8af8fa9/windows-amd64-longtest
2020-04-27T17:20:00-1c4e9b2/windows-amd64-longtest
2020-04-27T17:19:27-f4e13b8/windows-amd64-2016
2020-04-27T17:19:16-5029c36/windows-amd64-longtest
2020-04-27T15:53:46-9b9556f/windows-amd64-2016
2020-04-27T10:23:01-925d6b3/windows-amd64-longtest
2020-04-25T18:40:45-7ef28cb/windows-amd64-longtest
2020-04-25T02:19:12-49f10f3/windows-amd64-longtest
2020-04-24T23:58:38-396833c/windows-amd64-longtest
2020-04-24T23:12:43-b0a8754/windows-amd64-longtest
2020-04-24T23:12:23-6b5ab20/windows-amd64-longtest
2020-04-24T23:06:59-1c54535/windows-amd64-longtest
2020-04-24T23:06:48-47b5efa/windows-amd64-longtest
2020-04-24T22:01:49-d8d3542/windows-amd64-longtest
2020-04-24T21:12:21-943a0d0/windows-amd64-longtest
2020-04-24T17:13:12-da33f9c/windows-amd64-longtest
2020-04-24T13:26:34-ee8972c/windows-amd64-longtest
2020-04-23T19:40:07-6303c34/windows-amd64-longtest
2020-04-22T16:19:35-1257900/windows-amd64-2016

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 1, 2020

The same exit status on a 32-bit system in a trybot run as exit status -1073741819.

https://storage.googleapis.com/go-build-log/bbdccf35/windows-386-2008_b817f391.log

I don't see any in the saved logs, though, so that might have really been corruption of some sort.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 1, 2020

It's strange to see this exit status. It should be handled as an exception by the runtime package (code in runtime/signal_windows.go). Even if the runtime package doesn't handle the exception, it should still cause a stack trace and an exit with status 2. I don't know under what conditions Windows can still have a 0xc0000005 exit status.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 2, 2020

The error seems to tend to happen in cmd/go TestScript or when running the tests in the test directory, which are both cases where we are executing a large number of child processes in parallel.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 2, 2020

I don't know what is going on here; not looking at it any further at present.

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented May 3, 2020

It's strange to see this exit status.

The process returned ERROR_ACCESS_DENIED. ERROR_ACCESS_DENIED is just another Windows error. There is no restriction on what error process can and cannot return.

It should be handled as an exception by the runtime package (code in runtime/signal_windows.go).

I agree,

Mind you, if something goes wrong in exception handle, it won't exit the process properly, and Windows itself will exit process.

Also, maybe we have a bug and exception handler does not run for some reason.

Even if the runtime package doesn't handle the exception, it should still cause a stack trace and an exit with status 2.

The exception handler can fail as it is printing stack trace.

I don't know under what conditions Windows can still have a 0xc0000005 exit status.

If I run this program

https://play.golang.org/p/r_g4sD76FOh

it prints

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x9853ca]

goroutine 1 [running]:
main.main()
        /home/a/src/issues/go/38440/a.go:9 +0x2a

on Windows. I reckon (I did not test) the program will return 0xc0000005 exit code, if I disable exception handler.

Do we know which program is failing? I still don't see which program failed from test output. Maybe, if we know which program fails, we can add some temporary debugging code to it.

Alex

Alex

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 4, 2020

It's a range of different programs that are failing. The only consistency I've seen so far is that it happens in the cmd/go testsuite and in the top-level test testsuite, both of which run many different Go programs in parallel. Some of the Go programs that they run in parallel fail with exit status 0xc0000005 and no other output, implying, as you say, that the exception handler did not take effect for some reason.

I'm wondering if it's possible that when memory allocation fails while calling into Windows, such as when starting a new thread, that this exit status might occur. That is pure speculation, though; I have no evidence.

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented May 4, 2020

I'm wondering if it's possible that when memory allocation fails while calling into Windows, such as when starting a new thread, that this exit status might occur.

The way threads are created by Go runtime is by calling CreateThread Windows API (in runtime.newosproc). So, if CreateThread fails and returns an error, it should be printed by runtime. Same when Go allocates memory.

I also considered bug in how we implemented exception handling. Windows exception handling on 386 is very different from amd64. The fact that both 368 and amd64 builders fail with this error points that the problem is not exception handler related.

Maybe crashing program does print at least some stack trace, but its output is lost by the parent process. Maybe stdout is not read for some reason by parent process. The fact that exit code is 0xc0000005 does not supports that theory, but ...

Alex

@aclements
Copy link
Member

@aclements aclements commented May 14, 2020

It seems this became much more common in early April.

$ greplogs -dashboard -E "exit status 3221225477" -l | findflakes -paths
First observed ea7126f 10 Apr 16:24 2020 (455 commits ago)
Last observed  364a05e 14 May 15:18 2020 (2 commits ago)
76% chance failure is still happening
13% failure probability (59 of 454 commits)
Likely culprits:
   13% ea7126f cmd/compile: use a Sym type instead of interface{} for symbolic offsets
   11% 782fcb4 regexp: add (*Regexp).SubexpIndex
   10% 245409e runtime: replace the type of netpollWakeSig from a uintptr to a uint32
    8% 7b90c1c os/exec: extend grace period in TestExtraFiles to 20% of overall deadline
    7% 28157b3 cmd/compile: start implementing strongly typed aux and auxint fields
    6% a1550d3 cmd/compile: use isel with variable shifts on ppc64x
    6% 7c0ee11 test: deflaking measures for runtime gdb test
    5% b191c60 cmd/oldlink: delete darwin/386 host object loading support
    4% a3ef804 cmd/internal/obj/ppc64: leverage AND operation to calculate remainder
    4% c1f0eda net: convert many Close tests to use parallel subtests
Past failures:
  8a958bb 29 Apr 16:24 2018 to ad4e637 04 Jul 00:15 2018
    0.14% failure probability (2 of 695 commits)
  08c4348 10 Apr 18:40 2015 (isolated failure)
@toothrot
Copy link
Contributor

@toothrot toothrot commented May 26, 2020

Hello! This is one of the few remaining issues blocking the Beta release of Go 1.15. We'll need to make a decision on this in the next week in order to keep our release on schedule.

@aclements
Copy link
Member

@aclements aclements commented May 27, 2020

I'm working on bisecting this (which is a very slow process; given the 13% failure probability, it takes 34 successful runs to drive the chance of a missed repro under 1%). I have been able to reproduce it as far back as da8591b, which is 32 commits earlier than the first observation on the dashboard (ea7126f).

@aclements
Copy link
Member

@aclements aclements commented May 27, 2020

Reproduced at e31d741, which is 64 commits before the first observation on the dashboard.

Given the failure probability and the clear cut-off on the dashboard, I'm starting to wonder if something changed on the builder, rather than in the tree. Assuming it had the same 13% failure probability before the first dashboard observation, there's only a (1-0.13)**64 = 0.01% chance it wouldn't have appeared somewhere in those 64 commits.

@josharian
Copy link
Contributor

@josharian josharian commented May 27, 2020

For some reason (I don’t recall why), when I first looked at this in #38440 (comment), I thought the first occurrence was on 1/9/2020. Have you looked around that date range?

@aclements
Copy link
Member

@aclements aclements commented May 27, 2020

This is definitely not happening at tag go1.14.3.

@josharian, I haven't looked around that range yet. Do you have any recollection where 1/9/2020 come from? :)

@aclements
Copy link
Member

@aclements aclements commented May 27, 2020

I should note that go1.14.3 included CL 213837, which was your hypothesized culprit, but I wasn't able to reproduce at that tag (I even accidentally ran a lot more iterations than I meant to!)

@josharian
Copy link
Contributor

@josharian josharian commented May 27, 2020

Ah, I think it's because of #37360 (comment), which I mentally transposed into this issue. They may be related, in that they are both weird exit statuses on Windows.

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented May 31, 2020

I don't use windbg or crash dumps, so don't assume I am an expert. @zx2c4 is your man.

I'm trying to get a user crash dump captured so I can windbg it, but I'm not sure if that's even a reasonable thing to do ...

From what I understand, you will be able to see process state at the time of the crash. That assumes that the program exists due to Windows exception. And you will need the executable file itself too to use with your crash dump.

I am not sure what that buys you. You won't be able to use any Go symbols - windbg does not understand DWARF. You will see all Windows DLL functions.

... and I'm just not getting any files in my crash dump directory.

Assuming our program raises Windows exception, you need to adjust Go runtime to let that exception out of Go process. Currently Go does not let exceptions out of the process. Go prints stack trace, if it cannot handle exception.

There is #20498 to adjust Go to allow for exceptions to be handled externally.

https://go-review.googlesource.com/c/go/+/195577/ is latest attempt to solve that problem.

Once you adjust runtime, then you will need to configure your system to handle exception as crash dump. I don't know how to do it, so your guess is as good as mine. Perhaps your instructions above are fine.

Hopefully it is any help.

Alex

@zx2c4
Copy link
Contributor

@zx2c4 zx2c4 commented May 31, 2020

ZwFlushVirtualMemory/MmFlushVirtualMemory uses FsRtlAcquireFileForCcFlushEx, which takes a reference to underlying filesystem objects and pokes the normal rootkit^Wfiltering hooks, and then calls MiFlushSectionInternal, which walks the page tables, looks at the dirty flags, and calls MiWaitForPageWriteCompletion.

But so far as I can tell, ZwUnmapViewOfSectionEx calls MiUnmapVad, which then doesn't appear to call into any of the same infrastructure. It's possible that unmapping the pages somewhere deeper down results in a flush, but I'm not seeing it. And judging by @aclements' comment, it seems like empirically there's a difference between calling unmap and calling flush+unmap.

This is counterintuitive for me, and I assume for the rest of you too, because on Linux, munmap usually propagates the dirty bits into the pagecache, so that kswapd can get around to flushing it later, and so that there's some consistency if the file is mapped subsequently. That doesn't mean that munmap will wait for anything, but it does mean that Linux won't forget about those modifications and that they'll eventually be written to disk.

Option 1: It's possible that on Nt, unmapping a file without flushing won't actually propagate the dirty bits anywhere, and so the file essentially doesn't get completely written. That'd be weird, but who knows. Option 2: Another possibility would be that failing to flush doesn't make those file pages incoherent, but it does make calls to traditional file I/O return inconsistent values, and some other debug info parser in the toolchain is using traditional I/O to read things and crashing; this seems unlikely though, as if it were a parser error, you'd expect Go to just go out of bounds. Option 3: Like option 2, but is it possible that the file is copied after the linker is done with it, and it's that copy that is executed? In that case, the bug and evident fix would be easily explainable.

Of these, (1) seems possible, precisely for the reasons that (2) is at least something to consider: apparently normal I/O paths aren't using the same page cache as MapViewOfFile; MSDN says "A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile." That makes me think cache coherence needs to be done explicitly, by way of calls to FlushViewOfFile. Unless things actually aren't that complicated with regard to dirty pages, and we're just running up against the edge case MSDN warns about in the case of (3).

Now the above is pretty speculative, based on some pretty brief IDAing around ntoskrnl.exe. If you demand more assurance, I guess I could actually spend some time learning about how the NT page cache and file I/O work. But it sounds like CL 235639 is working, so I'll +2 that shortly.

@zx2c4
Copy link
Contributor

@zx2c4 zx2c4 commented May 31, 2020

I don't use windbg or crash dumps, so don't assume I am an expert. @zx2c4 is your man.

Similar disclaimer. But the bootleg way of doing this would be to just add a breakpoint to ExRaiseAccessViolation and to the invocation of KiExceptionDispatch from KiPageFault, and then you'll be able to inspect the entire system at that moment.

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented May 31, 2020

Another thing that caught my attention is that when memory mapped file are used, it is possible for Windows exception to be raised during simple memory access.

https://docs.microsoft.com/en-us/windows/win32/memory/reading-and-writing-from-a-file-view

If something like that happens, Go exception handler would be called. But I expect to see Go stack trace.

Alex

@zx2c4
Copy link
Contributor

@zx2c4 zx2c4 commented May 31, 2020

https://docs.microsoft.com/en-us/windows/win32/memory/reading-and-writing-from-a-file-view

That article is about 0xC0000006, but we're dealing with 0xC0000005.

@zx2c4
Copy link
Contributor

@zx2c4 zx2c4 commented May 31, 2020

Okay I figured it out, I think. It's likely option 3, no fancy business. About time I read the actual Go source where this issue is happening.

src/cmd/link/internal/ld/main.go has this sequence:

        bench.Start("Munmap")
        ctxt.Out.Close() // Close handles Munmapping if necessary.

        bench.Start("undef")
        ctxt.undef()
        bench.Start("hostlink")
        ctxt.hostlink()

ctxt.Out.Close() calls UnmapViewOfFile, without flushing. Then ctxt.hostlink()is called, which makes a copy by way of calling hostobjCopy() and appending the results as arguments to the next stage:

        argv = append(argv, hostobjCopy()...)

Diving into that function, we see:

                        f, err := os.Open(h.file)
                        if err != nil {
                                Exitf("cannot reopen %s: %v", h.pn, err)
                        }
                        defer f.Close()
                        if _, err := f.Seek(h.off, 0); err != nil {
                                Exitf("cannot seek %s: %v", h.pn, err)
                        }

                        w, err := os.Create(dst)
                        if err != nil {
                                Exitf("cannot create %s: %v", dst, err)
                        }
                        if _, err := io.CopyN(w, f, h.length); err != nil {
                                Exitf("cannot write %s: %v", dst, err)
                        }
                        if err := w.Close(); err != nil {
                                Exitf("cannot close %s: %v", dst, err)
                        }

This is normal boring file I/O, which means it's not synchronized with the mapped file, unless flush was called prior to unmapping. MSDN makes that property abundantly clear. So, not all of the data written to the mapping is included in the copy. Kaboom.

CL 235639 is the correct fix.

@aclements
Copy link
Member

@aclements aclements commented May 31, 2020

Thank you for taking a close look at this @zx2c4 !

Diving into that function, we see:

The one thing that's still bothering me is, if this is where the crash is, why don't we see a stack trace?

The fact that we don't see a stack trace is why I was trying so hard to get a user-mode crash dump. Also, Go's exception handling mechanism never exits with the exception code; it would usually exit with status 2, or some small integer if things go really wrong. This suggests Go isn't handling the exception at all, and that it's in some Windows stack. So I was hoping a crash dump would at least tell me what process it was in, and maybe what Windows stack it was in.

This is normal boring file I/O, which means it's not synchronized with the mapped file, unless flush was called prior to unmapping. MSDN makes that property abundantly clear.

Is this true even when the file has been unmapped and reopened with a different handle? In UNIX, if you're going to interleave mmap'ed IO and read/write IO, you're advised to msync between them, but once you've closed, unmapped, and reopened the file, it's safe to do read/write IO. (msync is also a giant hammer that you don't want to overuse!)

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented May 31, 2020

I don't think hostobjCopy does anything with the output file we wrote using mmap. hostobjCopy copies "host object files", which are non-Go object files we are linking in.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented May 31, 2020

Just a guess (completely unsure): the linker has written the output executable, and another process tries to execute it but failed (to even start running code), so we don't get any stack traces. (It might be similar to on UNIX if we somehow screw the ELF header or the dynamic loader fails.)

@zx2c4
Copy link
Contributor

@zx2c4 zx2c4 commented May 31, 2020

I don't think hostobjCopy does anything with the output file we wrote using mmap. hostobjCopy copies "host object files", which are non-Go object files we are linking in.

You're right that hostobjCopy was barking up the wrong tree. So I went to see how go build does things and found the -x flag, and saw that it seemed, on Linux at least, to mv /path/to/temporary/exe /path/to/final/exe:

/usr/lib/go/pkg/tool/linux_amd64/link -o $WORK/b001/exe/a.out -importcfg $WORK/b001/importcfg.link -buildmode=exe -buildid=58H7_5kYPqii0uAGLa38/sqTE1WDirs7yz2m_Gl_O/J64dQnzG1o1jnUgl1kbu/58H7_5kYPqii0uAGLa38 -extld=x86_64-pc-linux-gnu-gcc $WORK/b001/_pkg_.a
mv $WORK/b001/exe/a.out a

That would seem fine according to what we know of the interaction between Windows IO and the page mapper so far. But then I opened up the function that emitted that, and look what I found:

// moveOrCopyFile is like 'mv src dst' or 'cp src dst'.
func (b *Builder) moveOrCopyFile(dst, src string, perm os.FileMode, force bool) error {
        if cfg.BuildN {
                b.Showcmd("", "mv %s %s", src, dst)
                return nil
        }

        // If we can update the mode and rename to the dst, do it.
        // Otherwise fall back to standard copy.

        // If the source is in the build cache, we need to copy it.
        if strings.HasPrefix(src, cache.DefaultDir()) {
                return b.copyFile(dst, src, perm, force)
        }

        // On Windows, always copy the file, so that we respect the NTFS
        // permissions of the parent folder. https://golang.org/issue/22343.
        // What matters here is not cfg.Goos (the system we are building
        // for) but runtime.GOOS (the system we are building on).
        if runtime.GOOS == "windows" {
                return b.copyFile(dst, src, perm, force)
        }

If we look at b.copyFile, we see another boring file I/O copy operation:

        df, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
        if err != nil && base.ToolIsWindows {
                // Windows does not allow deletion of a binary file
                // while it is executing. Try to move it out of the way.
                // If the move fails, which is likely, we'll try again the
                // next time we do an install of this binary.
                if err := os.Rename(dst, dst+"~"); err == nil {
                        os.Remove(dst + "~")
                }
                df, err = os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
        }
        if err != nil {
                return fmt.Errorf("copying %s: %w", src, err) // err should already refer to dst
        }

        _, err = io.Copy(df, sf)
        df.Close()

So again, voila. We have the pattern of: write using MapViewOfFile, forget to flush it, and then read that file using normal boring file I/O. MSDN makes it very clear that if you forget to flush, the contents you made with mapping might be out of sync with that of normal file I/O. That means the final executable that gets executed has some random parts of the file as zeros instead of as the actual file contents. This is, again, option 3 above.

Actually though, I deleted a step from that -x trace. Here's the real one:

/usr/lib/go/pkg/tool/linux_amd64/link -o $WORK/b001/exe/a.out -importcfg $WORK/b001/importcfg.link -buildmode=exe -buildid=58H7_5kYPqii0uAGLa38/sqTE1WDirs7yz2m_Gl_O/J64dQnzG1o1jnUgl1kbu/58H7_5kYPqii0uAGLa38 -extld=x86_64-pc-linux-gnu-gcc $WORK/b001/_pkg_.a
/usr/lib/go/pkg/tool/linux_amd64/buildid -w $WORK/b001/exe/a.out # internal
mv $WORK/b001/exe/a.out a

The missing step is the call to buildid. Guess what this does? More normal boring file I/O:

        f, err = os.OpenFile(file, os.O_WRONLY, 0)
        if err != nil {
                log.Fatal(err)
        }
        if err := buildid.Rewrite(f, matches, newID); err != nil {
                log.Fatal(err)
        }
        if err := f.Close(); err != nil {
                log.Fatal(err)
        }

So that means that either buildid is writing things back inconsistently, or that the mv that is actually a filecopy is writing things back inconsistently. Either way, getting an inconsistent executable is to be expected. With link using mapped I/O, the rest of the toolchain that runs after and doesn't use mapped I/O but rather uses normal boring file I/O is bound to create bogus files some percentage of the time, because its reads won't be consistent with link's writes.

I remain convinced that CL 235639 is correct and necessary if we take heed of MSDN's warnings.

@zx2c4
Copy link
Contributor

@zx2c4 zx2c4 commented May 31, 2020

The one thing that's still bothering me is, if this is where the crash is, why don't we see a stack trace?

Take a normal working .exe, and then write zeros to various swaths of the file. Try to execute it. I would bet that there's at least a few places where writing pages of zeros causes the resultant executable to crash.

Is this true even when the file has been unmapped and reopened with a different handle? In UNIX, if you're going to interleave mmap'ed IO and read/write IO, you're advised to msync between them, but once you've closed, unmapped, and reopened the file, it's safe to do read/write IO. (msync is also a giant hammer that you don't want to overuse!)

The option 3 theory says: Mapping, unmapping, mapping will give a consistent file when reading from the second mapping. But mapping, unmapping, ReadFile-ing will not give a consistent file when using the ReadFile. MSDN makes this latter point very clear.

The option 1 theory says: Mapping, unmapping, mapping will not give a consistent file when reading from the second mapping, because the dirty bits from those pages are just dropped entirely when unmapping, and there's no nice pagecache and kswapd keeping things consistent like on Linux.

At first, option 1 seemed a tad bit more likely to me, because that would help explain the ReadFile quirk in a deeper way, as well as explain why we're getting these crashes: that when the PE loader maps the executable file, it's after those pages have been dropped entirely from before, and so its missing previously written but not flushed chunks.

But now, seeing how many times Go object files are copied around (see this latest #38440 (comment) ), it seems like option 3 would be the most likely. Option 1 would be annoying and surprising behavior, after all, and option 3 is consistent with what's been expressed clearly on MSDN. And now we have Go code that precisely matches the bad edge case that MSDN says to avoid.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented May 31, 2020

To be clear, I agree that CL 235639 is probably what we should do.

I also think that it is not safe to mix mmap and file I/O in the same process. That is why in the past, when the linker itself used both mmap and file I/O for the output binary, we did have a msync (FlushViewOfFile on windows, if mapping were used, but it was not due to permission bug) between the two. In this cycle, as we changed to use mmap only, the msync was removed (CL https://golang.org/cl/228385).

But I'm not clear whether it is safe if another process, after waiting for the linker process terminated (which unmaps the file and closes the FD before exiting), do anything with the file, read or execute. @aclements 's comment #38440 (comment) seems to indicate this is safe on UNIX systems. The MSDN document doesn't make this explicit. But it is a possibility.

(Anything could happen after the linker process terminated. Besides what go command does, the user may open the file, execute it, or read it for inspection. We need to make sure that is safe.)

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented May 31, 2020

I'm also not really sure what go build does is really the cause. I looked at some of the failures, and they failed at go run or go test, which don't seem to copy the output file or write buildid, but just execute the binary. So it seems only executing the output after the linker process terminated is still not safe.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented May 31, 2020

@aclements If I'm reading you right, you're looking for a crash dump from cmd/link or cmd/go. But from this discussion it seems more likely to me that the executable is corrupt on disk, so running it crashes early, and the Go signal handling is never installed.

@aclements
Copy link
Member

@aclements aclements commented Jun 1, 2020

@ianlancetaylor , I'm looking for a crash dump from whatever's crashing. If it were cmd/link or cmd/go, I would expect a Go traceback, and since we don't see that, I suspect it's not cmd/link or cmd/go. If it's the corrupt binary, I would expect the crash dump to implicate some binary loader or something very early in process start up.

@zx2c4
Copy link
Contributor

@zx2c4 zx2c4 commented Jun 1, 2020

But I'm not clear whether it is safe if another process, after waiting for the linker process terminated (which unmaps the file and closes the FD before exiting), do anything with the file, read or execute. @aclements 's comment #38440 (comment) seems to indicate this is safe on UNIX systems. The MSDN document doesn't make this explicit. But it is a possibility.

We know that another process using normal boring file I/O isn't safe. But what's unclear is whether remapping after dropping those pages is coherent. That's my option (1) earlier. This seems definitely possible, though it'd be a bit wild. The issue is essentially whether dirty pages are stored with their dirty bit in tact in some global page cache. Linux does this. The question is: does Windows?

@jstarks
Copy link

@jstarks jstarks commented Jun 1, 2020

I'm not certain about the coherence question (based on the kernel data structures at play, I believe map+unmap+map (even with intervening close+open), will be coherent, but I can't speak from a position of authority).

But to @aclements's point, there are several code paths in the Windows loader that handle an access violation (or other exception) and call NtTerminateProcess without raising again. So it's certainly possible that a corrupt binary will exit without producing a crash dump. Unfortunately.

@aclements
Copy link
Member

@aclements aclements commented Jun 1, 2020

But to @aclements's point, there are several code paths in the Windows loader that handle an access violation (or other exception) and call NtTerminateProcess without raising again. So it's certainly possible that a corrupt binary will exit without producing a crash dump. Unfortunately.

Ah, that's interesting.

It seems like the most likely theory at this point is that the Windows loader uses read/write IO for at least some things (even if it memory maps the segments themselves), and that omitting the FlushViewOfFile causes that to be incoherent, which causes the loader to crash.

If it were a problem in the linker itself, we would expect a Go traceback (and possibly user-mode crash dumps). If the main contents of the binary were corrupted (e.g., incoherence led to zeroed pages being mapped in), we would expect a much wider variety of crashes.

@jstarks
Copy link

@jstarks jstarks commented Jun 1, 2020

It seems like the most likely theory at this point is that the Windows loader uses read/write IO for at least some things (even if it memory maps the segments themselves), and that omitting the FlushViewOfFile causes that to be incoherent, which causes the loader to crash.

That would be surprising to me. I would expect that CreateProcess would construct an image section for the file, which I would expect to implicitly flush any remaining data section (and fail if there were any existing writable mappings). I'm not an expert in this area, and I did not read the code, so this is just my intuition.

However, this is not a problem. From @zx2c4's analysis, it sounded like the corruption was introduced earlier in the chain. If the Go linker is calling CopyFile (which calls NtReadFile) without calling FlushViewOfFile, then the copied binary could be corrupted due to incoherence well before CreateProcess is called on it.

@aclements
Copy link
Member

@aclements aclements commented Jun 1, 2020

That would be surprising to me. I would expect that CreateProcess would construct an image section for the file, which I would expect to implicitly flush any remaining data section (and fail if there were any existing writable mappings). I'm not an expert in this area, and I did not read the code, so this is just my intuition.

I'm not sure I understand what you're saying. What do you mean by "image section"? I wasn't saying there are any writable mappings at that point.

My theory is that CreateProcess (or the loader behind it) uses ReadFile to get a few things like headers before creating memory mappings for the bulk of the binary. Those memory mappings are probably fine because they're coherent with the data written via memory mappings in the linker. But ReadFile could be returning bad data since the linker didn't FlushViewOfFile, and that could cause loading the binary to fail.

However, this is not a problem. From @zx2c4's analysis, it sounded like the corruption was introduced earlier in the chain. If the Go linker is calling CopyFile (which calls NtReadFile) without calling FlushViewOfFile, then the copied binary could be corrupted due to incoherence well before CreateProcess is called on it.

We've seen failures in situations where the binary does not get copied.

@gopherbot gopherbot closed this in 13bc6d4 Jun 1, 2020
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Jun 1, 2020

My theory is that CreateProcess (or the loader behind it) uses ReadFile to get a few things like headers before creating memory mappings for the bulk of the binary. Those memory mappings are probably fine because they're coherent with the data written via memory mappings in the linker. But ReadFile could be returning bad data since the linker didn't FlushViewOfFile, and that could cause loading the binary to fail.

This is my guess as well. It probably needs to read the header to know the addresses and ranges of the mappings. I guess it is the header that gets corrupted, so it fails very early so that it doesn't print much information.

@zx2c4
Copy link
Contributor

@zx2c4 zx2c4 commented Jun 1, 2020

I'm not seeing anywhere in ntoskrnl where the process image is being read through means other than its mapping. It looks like NtCreateUserProcess calls createfile and then maps it pretty soon after. I didn't see much in userspace CreateProcessInternalW either, besides reading file metadata. But @jstarks can probably consult the source and find something more definitive than me poking around for 5 minutes in the disassembler.

@zx2c4
Copy link
Contributor

@zx2c4 zx2c4 commented Jun 1, 2020

Actually, NtCreateUserProcess->MmCreateSpecialImageSection->MiCreateSection->MiCreateImageOrDataSection->MiCreateNewSection->MiCreateImageFileMap->MiReadImageHeaders->MiPageRead->IoPageReadEx->an irp gets queued up. So that might give credence to what @cherrymui just suggested.

@jstarks
Copy link

@jstarks jstarks commented Jun 1, 2020

It looks like MiCreateImageFileMap flushes the data section (i.e. waits for modified pages to be written to the backing file) before doing any reads. So if the MiReadImageHeaders read is the first one in the CreateProcess path, then I wouldn't expect any coherence issues in that path.

Let me see if I can get someone from the Mm team to chime in.

@zx2c4
Copy link
Contributor

@zx2c4 zx2c4 commented Jun 1, 2020

Let me see if I can get someone from the Mm team to chime in.

Would be very interested. I'd also like to know if the above option (1) is a real possibility. Specifically, will Windows just discard unflushed pages and ignore the dirty bit if you unmap without flushing first? I'd expect the answer to be "no, that would be insane", but ya never know. The whole caveat of fscache/normal I/O being somehow separate from mmap'd I/O is sufficiently weird that anything sort of seems possible there.

@jstarks
Copy link

@jstarks jstarks commented Jun 2, 2020

Interestingly, according to the Mm team, mappings and ReadFile should be coherent as long as the file was opened cached, i.e. without FILE_NO_INTERMEDIATE_BUFFERING. The fact that they appear not to be in this case may be an OS bug.

We (well, not me, but others here in Windows) will try to get a local repro under stress. In the meantime, it sounds like flushing the mapping explicitly is a reasonable workaround.

Thanks for looping me in. @aclements, on which OS build are you reproducing this?

@zx2c4
Copy link
Contributor

@zx2c4 zx2c4 commented Jun 2, 2020

Interestingly, according to the Mm team, mappings and ReadFile should be coherent as long as the file was opened cached

Oh, good, that's reassuring. I was worried that these being not coherent implied some very odd IO cache architecture.

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Jun 2, 2020

I tried simulating corrupted file on Windows

https://play.golang.org/p/5aZQqi7pS8o

and that is what I see

C:\>test
--- FAIL: TestCorruptedExecutable (0.33s)
    a_test.go:76: Child: exit status 3221225477
        ""
FAIL

C:\>

Exactly what we see on the builder.

Child process has no output (stack trace or otherwise), because it has not run yet.

And parent process prints child's exit code of 0xC0000005, and empty output.

Alex

@aclements
Copy link
Member

@aclements aclements commented Jun 2, 2020

@aclements, on which OS build are you reproducing this?

This is on the windows-amd64-longtest builder, which reports as:

$ VM=$(gomote create windows-amd64-longtest); gomote run -system $VM systeminfo
Host Name:                 SERVER-2016-V7-
OS Name:                   Microsoft Windows Server 2016 Datacenter
OS Version:                10.0.14393 N/A Build 14393
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Server
OS Build Type:             Multiprocessor Free
Registered Owner:          N/A
Registered Organization:   N/A
Product ID:                -
Original Install Date:     7/2/2018, 5:50:15 PM
System Boot Time:          6/2/2020, 1:10:45 PM
System Manufacturer:       Google
System Model:              Google Compute Engine
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 63 Stepping 0 GenuineIntel ~2300 Mhz
BIOS Version:              Google Google, 1/1/2011
Windows Directory:         C:\Windows
System Directory:          C:\Windows\system32
Boot Device:               \Device\HarddiskVolume1
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)
Time Zone:                 (UTC+00:00) Monrovia, Reykjavik
Total Physical Memory:     14,746 MB
Available Physical Memory: 13,750 MB
Virtual Memory: Max Size:  15,770 MB
Virtual Memory: Available: 14,800 MB
Virtual Memory: In Use:    970 MB
Page File Location(s):     C:\pagefile.sys
Domain:                    WORKGROUP
Logon Server:              \\SERVER-2016-V7-
Hotfix(s):                 5 Hotfix(s) Installed.
                           [01]: KB3186568
                           [02]: KB3199986
                           [03]: KB4049065
                           [04]: KB4132216
                           [05]: KB4284880
Network Card(s):           1 NIC(s) Installed.
                           [01]: Red Hat VirtIO Ethernet Adapter
                                 Connection Name: Ethernet
                                 DHCP Enabled:    Yes
                                 DHCP Server:     169.254.169.254
                                 IP address(es)
                                 [01]: 10.240.0.19
Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.

I tried simulating corrupted file on Windows

@alexbrainman, fascinating.

Y'know, if the text is zeroes, it disassembles to add %al,(%rax), which will indeed cause an access violation for most values of rax.

But surely the text section is also being memory mapped by the loader, so it wouldn't be incoherent, right? Or maybe this is the OS bug @jstarks is sensing. And why would be text be incoherent but not the executable headers (which we write out after the text)? So many questions...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
You can’t perform that action at this time.