Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: Windows binaries built with -race occasionally deadlock #35775

Closed
bcmills opened this issue Nov 22, 2019 · 17 comments
Closed

runtime: Windows binaries built with -race occasionally deadlock #35775

bcmills opened this issue Nov 22, 2019 · 17 comments

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Nov 22, 2019

Starting in November, there appears to be a dramatic uptick in the number of test timeouts on the windows-* builders.

Many of these are for tests that normally run nearly instantaneously, such as archive/tar and bufio.

2019-11-22T03:06:22-0e02cfb/windows-amd64-race
2019-11-21T22:20:17-94e9a5e/windows-amd64-race
2019-11-21T19:27:16-f4a8bf1/windows-amd64-longtest
2019-11-21T19:09:24-2434869/windows-amd64-longtest
2019-11-21T19:09:24-2434869/windows-amd64-race
2019-11-21T16:56:47-37715cc/windows-amd64-longtest
2019-11-21T16:56:47-37715cc/windows-amd64-race
2019-11-21T16:01:14-c7e73ef/windows-amd64-race
2019-11-20T20:51:13-9852b4b/windows-amd64-race
2019-11-13T19:15:27-7ad2748/windows-amd64-longtest
2019-11-12T22:09:05-a56d755/windows-amd64-2016
2019-11-12T01:07:15-ec73263/windows-amd64-2012
2019-11-08T17:01:05-a5a6f61/windows-amd64-2012
2019-11-07T19:18:12-1b0b980/windows-amd64-2012
2019-11-07T05:52:34-3eabdd2/windows-amd64-longtest
2019-11-06T09:09:59-0c5d545/windows-amd64-2008
2019-11-06T02:52:51-f71bd51/windows-amd64-2016
2019-11-05T16:31:48-414c1d4/windows-amd64-2016
2019-11-05T14:44:56-e457cc3/windows-amd64-race
2019-11-05T05:19:08-d51f7f3/windows-amd64-longtest
2019-11-05T03:50:54-979d65d/windows-amd64-2016
2019-11-05T00:19:10-6cbd737/windows-amd64-race
2019-11-01T14:48:28-a570fcf/windows-amd64-2012
2019-03-19T08:30:50-451a2eb/windows-amd64-2008
2018-12-05T21:54:54-6454a09/windows-amd64-race

CC @ianlancetaylor @aclements @alexbrainman @zx2c4

@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 22, 2019

I suspect that the same root cause is responsible for apparent deadlocks in the os/exec test.
2019-11-22T04:26:34-95be9b7/windows-amd64-2008
2019-11-21T22:20:17-94e9a5e/windows-amd64-2012

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 2, 2019

2019-11-22T14:32:02-ea18a1c/windows-amd64-race
2019-11-22T15:32:23-c02f3b8/windows-amd64-longtest
2019-11-22T15:32:23-c02f3b8/windows-amd64-race
2019-11-22T19:01:42-c931f1b/windows-amd64-longtest
2019-11-22T19:01:42-c931f1b/windows-amd64-race
2019-11-22T19:03:26-28314cf/windows-amd64-race
2019-11-22T19:03:50-941ac9c/windows-amd64-race
2019-11-22T19:17:28-8324aca/windows-amd64-race
2019-11-22T21:09:43-9f3c2b6/windows-amd64-race
2019-11-22T22:24:38-980f0c0/windows-amd64-race
2019-11-22T23:23:13-6ba1bf3/windows-amd64-race
2019-11-23T00:06:41-f29e53d/windows-amd64-race
2019-11-23T11:00:41-6f7b96f/windows-amd64-longtest
2019-11-23T11:00:41-6f7b96f/windows-amd64-race
2019-11-24T15:06:38-b81dd1c/windows-amd64-longtest
2019-11-24T15:06:38-b81dd1c/windows-amd64-race
2019-11-25T15:08:18-01f15b6/windows-amd64-race
2019-11-25T16:26:15-476395c/windows-amd64-race
2019-11-25T16:43:27-e16f64c/windows-amd64-2016
2019-11-25T16:43:27-e16f64c/windows-amd64-race
2019-11-25T19:02:50-7adafc9/windows-amd64-2016
2019-11-25T21:15:11-5de3b97/windows-amd64-race
2019-11-25T23:37:03-87805c9/windows-amd64-race
2019-11-26T16:55:36-67f0f83/windows-amd64-race
2019-11-26T20:21:04-4a378d7/windows-amd64-2008
2019-11-27T01:29:00-c3f1492/windows-amd64-race
2019-11-27T01:30:19-b2482e4/windows-amd64-2008
2019-11-27T01:30:32-0f25102/windows-amd64-2008
2019-11-27T15:06:55-4e3d580/windows-amd64-race
2019-11-27T18:42:33-f1c8e2c/windows-amd64-race
2019-12-02T16:11:24-3edd1d8/windows-amd64-race
2019-12-02T16:45:37-bf3ee57/windows-amd64-race
2019-12-02T19:05:01-e4c01f0/windows-amd64-race
2019-12-02T20:12:54-a18608a/windows-amd64-2012
2019-12-02T20:12:54-a18608a/windows-amd64-longtest

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 2, 2019

These failures are all either on the race builder or for tests with the race detector enabled. Seems like something makes deadlocks possible with the race detector.

The findflakes programs reports that this might be due to https://golang.org/cl/204340 "runtime: support preemption on windows/{386,amd64}".

CC @aclements @cherrymui

@bcmills
Copy link
Member Author

@bcmills bcmills commented Dec 4, 2019

Seems like something makes deadlocks possible with the race detector.

That could be related to #18717 — does the runtime assume that its preemption signals are eventually delivered?

@aclements
Copy link
Member

@aclements aclements commented Dec 5, 2019

does the runtime assume that its preemption signals are eventually delivered?

It does not. It will always try both a cooperative and a non-cooperative preemption and take whichever one it can get. If the signal goes missing, it will wait for a cooperative preemption just like before.

@bradfitz
Copy link
Contributor

@bradfitz bradfitz commented Dec 10, 2019

I keep seeing these as trybot flakes too:
https://storage.googleapis.com/go-build-log/dd36d1f3/windows-amd64-2016_767dbeaf.log

It'd be nice to see stacks, though. Maybe we could work on that first and it might yield a clue.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Dec 18, 2019

I've had a tough time trying to reproduce this in a gomote to get a stack trace, but I was able to reproduce at least once (unfortunately, I didn't get a stack trace that time...). If it's not too hard, it might be easier to reproduce in CI and the trybots @bradfitz, though I'm not sure how best to do that.

In particular, I've been targeting the "Testing race detector" tests that tend to be the source of failure as @ianlancetaylor mentioned and haven't been able to reproduce this failure on a gomote.

Meanwhile, I'll keep trying.

@bcmills bcmills pinned this issue Dec 20, 2019
@bcmills bcmills changed the title runtime: apparent deadlocks on Windows builders runtime: Windows binaries build with -race occasionally deadlock Dec 20, 2019
@bcmills bcmills changed the title runtime: Windows binaries build with -race occasionally deadlock runtime: Windows binaries built with -race occasionally deadlock Dec 20, 2019
@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Dec 20, 2019

Alrighty thanks to @aclements and a Windows laptop we have a reproducer, a theory, and a partial fix.

The problem is a race between SuspendThread and ExitProcess on Windows. The order of events is as follows:

Thread 1: Suspend (asynchronously) Thread 2.
Thread 2: Call ExitProcess, which terminates all threads except Thread 2.
Thread 2: In ExitProcess, receives asynchronous notification to suspend, and stops.

This race is already handled in the runtime for the usual exits by putting a lock around suspending a thread (and effectively disallowing it in certain cases, like exit), but in race mode __tsan_fini (called by racefini) calls ExitProcess instead. The fix is to just grab this lock before calling into __tsan_fini.

Unfortunately this raises a bigger issue: what if C code, called from Go, calls ExitProcess on Windows? We have no way to synchronize asynchronous preemption with that like we do with exits we can actually control. One thought is that ExitProcess already calls a bunch of DLL hooks; could we throw in our own to side-step this issue maybe? More thought on this problem is required.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 21, 2019

Put a lock on each M? Lock the lock when the M is entering cgo call, and before SuspendThread on that M. So we won't suspend a thread running C code. This is fine as we won't preempt C code anyway.

In fact, if we want to SuspendThread on an M for which the lock is locked, meaning that it is running C code, we don't need to suspend it anyway, so we don't need to wait on the lock. If the M is about to enter C, and the lock is locked, meaning that there is a pending suspend, it can just preempt itself.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Dec 26, 2019

@cherrymui that solution makes sense, and would solve other issues such as sending signals to threads running C code resulting in surprising behavior like in #36281 (IIUC) if we used it on all platforms, not just Windows.

@aclements
Copy link
Member

@aclements aclements commented Jan 6, 2020

Unfortunately this raises a bigger issue: what if C code, called from Go, calls ExitProcess on Windows?

I took a stab at fixing this using @cherrymui's suggestion (varied slightly after chatting in person so there's no actual blocking behavior, just a one-shot CAS). Unfortunately, in Windows, any "system call" can potentially call ExitProcess, so we need to defend against this in reentersyscall, not just cgo calls. That's an extra atomic CAS on every system call.

@mknyszek mknyszek added NeedsFix and removed NeedsInvestigation labels Jan 7, 2020
@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Jan 8, 2020

I also have make.bat reliably hanging on a Windows 7 computer here. I run make.bat (I use GOARCH=amd64) on top of

commit f376b8510ed7884c69a09fbcf61418f7285f2787 (origin/master, origin/HEAD, mas
Author: Emmanuel T Odeke <emmanuel@orijtech.com>
Date:   Thu Jan 2 02:00:55 2020 -0800

    all: fix invalid invocations of Fatalf in goroutines

    Found by running the go vet pass 'testinggoroutine' that
    I started in CL 212920.

    Change-Id: Ic9462fac85dbafc437fe4a323b886755a67a1efa
    Reviewed-on: https://go-review.googlesource.com/c/go/+/213097
    Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Ian Lance Taylor <iant@golang.org>

and this is how far it gets

image

As you can see, there is stuck go_bootstrap.exe there. I used Delve to see what the process does, and this is what I see:

(dlv) threads
  Thread 2004 at :0
  Thread 3912 at :0
  Thread 4064 at :0
  Thread 4368 at :0
  Thread 4580 at :0
  Thread 4852 at :0
  Thread 7620 at :0
  Thread 7680 at :0
  Thread 8592 at :0
* Thread 8712 at :0
  Thread 8852 at :0
  Thread 11816 at :0
  Thread 13452 at :0
  Thread 13632 at :0
  Thread 13692 at :0
  Thread 13792 at :0
  Thread 13956 at :0
  Thread 14180 at :0
  Thread 14192 at 0x5b6351 c:/Users/alexb/dev/go/src/crypto/sha256/sha256block_a
md64.s:805 crypto/sha256.block
(dlv) goroutines -t
  Goroutine 1 - User: c:/Users/alexb/dev/go/src/runtime/sema.go:56 sync.runtime_
Semacquire (0x447179)
         0  0x00000000004385e5 in runtime.gopark
             at c:/Users/alexb/dev/go/src/runtime/proc.go:305
         1  0x000000000044754e in runtime.goparkunlock
             at c:/Users/alexb/dev/go/src/runtime/proc.go:310
         2  0x000000000044754e in runtime.semacquire1
             at c:/Users/alexb/dev/go/src/runtime/sema.go:144
         3  0x0000000000447179 in sync.runtime_Semacquire
             at c:/Users/alexb/dev/go/src/runtime/sema.go:56
         4  0x000000000047e4bb in sync.(*WaitGroup).Wait
             at c:/Users/alexb/dev/go/src/sync/waitgroup.go:130
         5  0x00000000006a6735 in cmd/go/internal/work.(*Builder).Do
             at c:/Users/alexb/dev/go/src/cmd/go/internal/work/exec.go:187
         6  0x00000000006a08c6 in cmd/go/internal/work.InstallPackages
             at c:/Users/alexb/dev/go/src/cmd/go/internal/work/build.go:605
         7  0x000000000069fc3d in cmd/go/internal/work.runInstall
             at c:/Users/alexb/dev/go/src/cmd/go/internal/work/build.go:516
         8  0x000000000077d0d6 in main.main
             at c:/Users/alexb/dev/go/src/cmd/go/main.go:189
         9  0x0000000000438212 in runtime.main
             at c:/Users/alexb/dev/go/src/runtime/proc.go:203
        10  0x0000000000462c91 in runtime.goexit
             at c:/Users/alexb/dev/go/src/runtime/asm_amd64.s:1375
  Goroutine 2 - User: c:/Users/alexb/dev/go/src/runtime/proc.go:305 runtime.gopa
rk (0x4385e5)
        0  0x00000000004385e5 in runtime.gopark
            at c:/Users/alexb/dev/go/src/runtime/proc.go:305
        1  0x0000000000438485 in runtime.goparkunlock
            at c:/Users/alexb/dev/go/src/runtime/proc.go:310
        2  0x0000000000438485 in runtime.forcegchelper
            at c:/Users/alexb/dev/go/src/runtime/proc.go:253
        3  0x0000000000462c91 in runtime.goexit
            at c:/Users/alexb/dev/go/src/runtime/asm_amd64.s:1375
  Goroutine 3 - User: c:/Users/alexb/dev/go/src/runtime/proc.go:305 runtime.gopa
rk (0x4385e5)
        0  0x00000000004385e5 in runtime.gopark
            at c:/Users/alexb/dev/go/src/runtime/proc.go:305
        1  0x00000000004232bf in runtime.goparkunlock
            at c:/Users/alexb/dev/go/src/runtime/proc.go:310
        2  0x00000000004232bf in runtime.bgsweep
            at c:/Users/alexb/dev/go/src/runtime/mgcsweep.go:89
        3  0x0000000000462c91 in runtime.goexit
            at c:/Users/alexb/dev/go/src/runtime/asm_amd64.s:1375
  Goroutine 4 - User: c:/Users/alexb/dev/go/src/runtime/proc.go:305 runtime.gopa
rk (0x4385e5)
        0  0x00000000004385e5 in runtime.gopark
            at c:/Users/alexb/dev/go/src/runtime/proc.go:305
        1  0x0000000000421718 in runtime.goparkunlock
            at c:/Users/alexb/dev/go/src/runtime/proc.go:310
        2  0x0000000000421718 in runtime.scavengeSleep
            at c:/Users/alexb/dev/go/src/runtime/mgcscavenge.go:203
        3  0x0000000000421a32 in runtime.bgscavenge
            at c:/Users/alexb/dev/go/src/runtime/mgcscavenge.go:302
        4  0x0000000000462c91 in runtime.goexit
            at c:/Users/alexb/dev/go/src/runtime/asm_amd64.s:1375
  Goroutine 5 - User: c:/Users/alexb/dev/go/src/runtime/proc.go:305 runtime.gopa
rk (0x4385e5)
        0  0x00000000004385e5 in runtime.gopark
            at c:/Users/alexb/dev/go/src/runtime/proc.go:305
        1  0x0000000000418b91 in runtime.goparkunlock
            at c:/Users/alexb/dev/go/src/runtime/proc.go:310
        2  0x0000000000418b91 in runtime.runfinq
            at c:/Users/alexb/dev/go/src/runtime/mfinal.go:175
        3  0x0000000000462c91 in runtime.goexit
            at c:/Users/alexb/dev/go/src/runtime/asm_amd64.s:1375
  Goroutine 39 - User: :0 ??? (0x77ace306) (thread 13956)
        0  0x0000000077ace306 in ???
            at ?:-1
            error: Only part of a ReadProcessMemory or WriteProcessMemory reques
t was completed.
        (truncated)
  Goroutine 54 - User: c:/Users/alexb/dev/go/src/runtime/proc.go:305 runtime.gop
ark (0x4385e5)
        0  0x00000000004385e5 in runtime.gopark
            at c:/Users/alexb/dev/go/src/runtime/proc.go:305
        1  0x000000000041c6f4 in runtime.gcBgMarkWorker
            at c:/Users/alexb/dev/go/src/runtime/mgc.go:1880
        2  0x0000000000462c91 in runtime.goexit
            at c:/Users/alexb/dev/go/src/runtime/asm_amd64.s:1375
  Goroutine 55 - User: :0 ??? (0x77acbd7a) (thread 3912)
        0  0x0000000077acbd7a in ???
            at ?:-1
            error: Only part of a ReadProcessMemory or WriteProcessMemory reques
t was completed.
        (truncated)
  Goroutine 56 - User: :0 ??? (0x77acbd7a) (thread 7680)
        0  0x0000000077acbd7a in ???
            at ?:-1
            error: Only part of a ReadProcessMemory or WriteProcessMemory reques
t was completed.
        (truncated)
  Goroutine 991 - User: c:/Users/alexb/dev/go/src/cmd/go/internal/work/exec.go:1
68 cmd/go/internal/work.(*Builder).Do.func3 (0x6dc154)
        0  0x00000000004385e5 in runtime.gopark
            at c:/Users/alexb/dev/go/src/runtime/proc.go:305
        1  0x00000000004469e8 in runtime.selectgo
            at c:/Users/alexb/dev/go/src/runtime/select.go:316
        2  0x00000000006dc154 in cmd/go/internal/work.(*Builder).Do.func3
            at c:/Users/alexb/dev/go/src/cmd/go/internal/work/exec.go:168
        3  0x0000000000462c91 in runtime.goexit
            at c:/Users/alexb/dev/go/src/runtime/asm_amd64.s:1375
  Goroutine 992 - User: c:/Users/alexb/dev/go/src/cmd/go/internal/work/exec.go:1
68 cmd/go/internal/work.(*Builder).Do.func3 (0x6dc154)
        0  0x00000000004385e5 in runtime.gopark
            at c:/Users/alexb/dev/go/src/runtime/proc.go:305
        1  0x00000000004469e8 in runtime.selectgo
            at c:/Users/alexb/dev/go/src/runtime/select.go:316
        2  0x00000000006dc154 in cmd/go/internal/work.(*Builder).Do.func3
            at c:/Users/alexb/dev/go/src/cmd/go/internal/work/exec.go:168
        3  0x0000000000462c91 in runtime.goexit
            at c:/Users/alexb/dev/go/src/runtime/asm_amd64.s:1375
  Goroutine 993 - User: c:/Users/alexb/dev/go/src/cmd/go/internal/work/exec.go:1
68 cmd/go/internal/work.(*Builder).Do.func3 (0x6dc154)
        0  0x00000000004385e5 in runtime.gopark
            at c:/Users/alexb/dev/go/src/runtime/proc.go:305
        1  0x00000000004469e8 in runtime.selectgo
            at c:/Users/alexb/dev/go/src/runtime/select.go:316
        2  0x00000000006dc154 in cmd/go/internal/work.(*Builder).Do.func3
            at c:/Users/alexb/dev/go/src/cmd/go/internal/work/exec.go:168
        3  0x0000000000462c91 in runtime.goexit
            at c:/Users/alexb/dev/go/src/runtime/asm_amd64.s:1375
  Goroutine 1042 - User: c:/Users/alexb/dev/go/src/crypto/sha256/sha256block_amd
64.s:805 crypto/sha256.block (0x5b6351) (thread 14192)
         0  0x00000000005b6351 in crypto/sha256.block
             at c:/Users/alexb/dev/go/src/crypto/sha256/sha256block_amd64.s:805
         1  0x00000000005b25f2 in crypto/sha256.(*digest).Write
             at c:/Users/alexb/dev/go/src/crypto/sha256/sha256.go:198
         2  0x00000000004a3f2e in io.(*multiWriter).Write
             at c:/Users/alexb/dev/go/src/io/multi.go:60
         3  0x00000000004a38c2 in io.copyBuffer
             at c:/Users/alexb/dev/go/src/io/io.go:407
         4  0x00000000004a3621 in io.Copy
             at c:/Users/alexb/dev/go/src/io/io.go:364
         5  0x00000000004a3621 in io.CopyN
             at c:/Users/alexb/dev/go/src/io/io.go:340
         6  0x00000000005be33a in cmd/go/internal/cache.(*Cache).copyFile
             at c:/Users/alexb/dev/go/src/cmd/go/internal/cache/cache.go:490
         7  0x00000000005bdd43 in cmd/go/internal/cache.(*Cache).put
             at c:/Users/alexb/dev/go/src/cmd/go/internal/cache/cache.go:428
         8  0x00000000006a5897 in cmd/go/internal/cache.(*Cache).Put
             at c:/Users/alexb/dev/go/src/cmd/go/internal/cache/cache.go:403
         9  0x00000000006a5897 in cmd/go/internal/work.(*Builder).updateBuildID
             at c:/Users/alexb/dev/go/src/cmd/go/internal/work/buildid.go:706
        10  0x00000000006aaf34 in cmd/go/internal/work.(*Builder).build
             at c:/Users/alexb/dev/go/src/cmd/go/internal/work/exec.go:783
        (truncated)
[13 goroutines]
(dlv)

I also tried reproducing this on another Windows 7 (386 this time) running in VM, and I could not reproduce it. I am not sure what is special about this computer that it fails. But, I guess, if it is a race, then anything is possible. Anything can push it over the age.

Also I can make.bat pass, if I set runtime.preemptMSupported to false.

Let me know, if you want me to debug something. I can reproduce it pretty much every time.

Sorry all this details are vague. That is only free time I had.

Alex

@aclements
Copy link
Member

@aclements aclements commented Jan 8, 2020

@alexbrainman, thanks for sharing the details you had time to collect. From the Delve output it looks like the process has a lot of threads, which would mean this is a different issue. Could you confirm that through some other process explorer and paste your report into a new issue if there's more than one thread in the hung process?

@aclements
Copy link
Member

@aclements aclements commented Jan 8, 2020

For the record, this is what @mknyszek was using as a repro (based on dist's test):

stress -p 1 -timeout 10s -- go test -run 'TestParse|TestEcho|TestStdinCloseRace|TestClosedPipeRace|TestTypeRace|TestFdRace|TestFdReadRace|TestFileCloseRace' -race -count 1 - timeout 0 flag net os os/exec encoding/gob

We whittled this down to the following, which fails after a few minutes

go test -c -race flag
stress -p 4 -timeout 10s -- 'flag.test.exe' '-test.run' TestParse '-test.timeout' 0
@gopherbot
Copy link

@gopherbot gopherbot commented Jan 8, 2020

Change https://golang.org/cl/213837 mentions this issue: runtime: protect against external code calling ExitProcess

@gopherbot gopherbot closed this in 957259b Jan 9, 2020
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Jan 9, 2020

I think we can unpin this now.

@alexbrainman
Copy link
Member

@alexbrainman alexbrainman commented Jan 10, 2020

Could you confirm that through some other process explorer and paste your report into a new issue if there's more than one thread in the hung process?

Done. See #36492.

Alex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
8 participants
You can’t perform that action at this time.