x/build: reverse pool locking problem in the coordinator #10750
Closed
In the reverse buildlet healthcheck, this channel receive is blocking for 33+ minutes while holding the mutex:

// reverseHealthCheck requests the status page of each idle buildlet.
// If the buildlet fails to respond promptly, it is removed from the pool.
func (p *reverseBuildletPool) reverseHealthCheck() {
	p.mu.Lock()
	responses := make(map[*reverseBuildlet]chan error)
	for _, b := range p.buildlets {
		if b.inUseAs == "health" { // sanity check
			panic("previous health check still running")
		}
		if b.inUseAs != "" {
			continue // skip busy buildlets
		}
		b.inUseAs = "health"
		res := make(chan error, 1)
		responses[b] = res
		client := b.client
		go func() {
			_, err := client.Status()
			res <- err
		}()
	}
	p.mu.Unlock()
	time.Sleep(5 * time.Second) // give buildlets time to respond
	p.mu.Lock()

	var buildlets []*reverseBuildlet
	for _, b := range p.buildlets {
		res := responses[b]
		if b.inUseAs != "health" || res == nil {
			// buildlet skipped or registered after health check
			buildlets = append(buildlets, b)
			continue
		}
		b.inUseAs = ""
		err, done := <-res // <------------ HERE (that final line)
CL https://golang.org/cl/9851 mentions this issue.
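For context on why that receive can hang: a plain `<-res` blocks until a value is sent or the channel is closed, even in its two-value form, where the second value only reports whether the value came from a send rather than from a closed channel. If `client.Status()` never returns, nothing is ever sent, and the receive waits forever with `p.mu` held. One common way to avoid this is a non-blocking receive using `select` with a `default` case. The sketch below is only an illustration and is not necessarily what the CL above does; `pollResult` and `errHealthTimeout` are hypothetical names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errHealthTimeout is a hypothetical sentinel error used only in this sketch.
var errHealthTimeout = errors.New("health check timeout")

// pollResult performs a non-blocking receive on res.
// It returns the received error (possibly nil) if a result is already
// available, or errHealthTimeout if nothing has been sent yet.
// Because it never blocks, it is safe to call while holding a mutex.
func pollResult(res <-chan error) error {
	select {
	case err := <-res:
		return err
	default:
		return errHealthTimeout
	}
}

func main() {
	// A buildlet that answered in time: the result is already buffered.
	ready := make(chan error, 1)
	ready <- nil

	// A buildlet whose status request never returns: nothing is ever sent.
	stuck := make(chan error, 1)

	fmt.Println(pollResult(ready)) // prints "<nil>"
	fmt.Println(pollResult(stuck)) // prints "health check timeout"
}
```

Since each `res` channel in the health check has a buffer of one, a slow goroutine can still complete its send later without blocking, so polling this way would not leak a goroutine stuck on the send itself.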
@crawshaw,
farmer.golang.org is hanging. Interesting stack goroutine: