
cmd/compile: apparent deadlocks on builders #45885

Closed
bcmills opened this issue Apr 30, 2021 · 4 comments
bcmills (Member) commented Apr 30, 2021

I started looking into build failures with stuck subprocesses due to an unexpected TryBot flake in https://storage.googleapis.com/go-build-log/385d1fe8/freebsd-amd64-12_2_0bbe1706.log.

I checked the builder logs for similar failures and found many processes stuck on go build, but for the most part little clue about what go build itself was stuck on.

However, one of the logs seems to point to a very long hang in an invocation of cmd/compile:

2021-04-29T16:29:52-f7c6f62/linux-amd64-buster (stuck 7 minutes on compile, if I'm reading this code properly)

A couple of others (including the aforementioned TryBot run) were stuck in go build or go install, but didn't terminate the go command with quite the right signal to figure out where it was stuck.
2021-04-29T23:41:22-a9705e1/freebsd-amd64-12_2
2021-04-23T21:48:41-a6d3dc4/freebsd-386-12_2

We do occasionally see hangs in other cmd/go invocations, but they are generally very specific (go list on MIPS, #45884, and occasionally something on darwin, #38768) and in particular don't account for the linux-amd64-buster failure.


gopherbot commented May 10, 2021

Change https://golang.org/cl/318569 mentions this issue: runtime: hold sched.lock across atomic pidleget/pidleput



prattmic (Member) commented May 10, 2021

It is hard to be certain without more complete stacks, but I think http://golang.org/cl/318569 will fix this; that is, #45975, #45916, #45885, and #45884 all have the same cause.


gopherbot pushed a commit that referenced this issue May 11, 2021
The cleanup in golang.org/cl/307914 unintentionally caused the idle GC
work recheck to drop sched.lock between acquiring a P and committing to
keep it (once a worker G was found).

This is unsafe, as releasing a P requires extra checks once sched.lock
is taken (such as for runSafePointFn). Since checkIdleGCNoP does not
perform these extra checks, we can now race with other users.

In the case of #45975, we may hang with this sequence:

1. M1: checkIdleGCNoP takes sched.lock, gets P1, releases sched.lock.
2. M2: forEachP takes sched.lock, iterates over sched.pidle without
   finding P1, releases sched.lock.
3. M1: checkIdleGCNoP puts P1 back in sched.pidle.
4. M2: forEachP waits forever for P1 to run the safePointFn.

Change back to the old behavior of releasing sched.lock only after we
are certain we will keep the P. Thus, if we put the P back, its removal
from sched.pidle was never visible.

Fixes #45975
For #45916
For #45885
For #45884

Change-Id: I191a1800923b206ccaf96bdcdd0bfdad17b532e9
Reviewed-on: https://go-review.googlesource.com/c/go/+/318569
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
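The pattern the commit describes can be sketched in plain Go. This is a toy model, not the actual runtime code: the sched struct, getAndCommit, and snapshotIdle names here are hypothetical. The point it illustrates is the fix's invariant: the idle P is removed and the keep/put-back decision is made under one critical section, so a concurrent forEachP-style walk of the idle list never observes a P that has been removed but not committed.

```go
package main

import (
	"fmt"
	"sync"
)

// sched is a toy model of the scheduler state: lock guards pidle,
// the list of idle P IDs (stand-ins for runtime sched.lock/sched.pidle).
type sched struct {
	lock  sync.Mutex
	pidle []int
}

// getAndCommit models the fixed checkIdleGCNoP: take an idle P and
// decide whether to keep it without releasing the lock in between.
// If we decline, the P goes back before the lock is released, so its
// removal from pidle was never visible to other lock holders.
func (s *sched) getAndCommit(keep func(p int) bool) (int, bool) {
	s.lock.Lock()
	defer s.lock.Unlock()
	if len(s.pidle) == 0 {
		return 0, false
	}
	p := s.pidle[len(s.pidle)-1]
	s.pidle = s.pidle[:len(s.pidle)-1]
	if !keep(p) {
		s.pidle = append(s.pidle, p) // put back under the same lock
		return 0, false
	}
	return p, true
}

// snapshotIdle models a forEachP-style walk over the idle list,
// done under the same lock.
func (s *sched) snapshotIdle() []int {
	s.lock.Lock()
	defer s.lock.Unlock()
	return append([]int(nil), s.pidle...)
}

func main() {
	s := &sched{pidle: []int{1, 2, 3}}

	// Decline the P: because the put-back happens under the lock,
	// no other lock holder ever saw it missing.
	s.getAndCommit(func(int) bool { return false })
	fmt.Println(s.snapshotIdle()) // all Ps still idle: [1 2 3]

	// Keep the P: it leaves pidle atomically with the decision.
	if p, ok := s.getAndCommit(func(int) bool { return true }); ok {
		fmt.Println("kept", p)
	}
	fmt.Println(s.snapshotIdle())
}
```

In the buggy version, the equivalent of getAndCommit released the lock after removing the P and reacquired it later to put the P back; in that window, a forEachP walk could miss the P entirely and then wait forever for it, which is the hang sequence described above.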

prattmic (Member) commented May 11, 2021

I think this bug should now be fixed, but can't be sure. Please note here if you see more occurrences.



mknyszek (Contributor) commented May 24, 2021

Just checked again and I haven't seen any of these deadlocks (searching for SIGQUIT) since May 8th on a first-class port (May 9th on non-first-class ports). Closing the issue.


mknyszek closed this May 24, 2021