-
Notifications
You must be signed in to change notification settings - Fork 18.8k
Description
What version of Go are you using (go version)?
go version go1.10 linux/amd64
Does this issue reproduce with the latest release?
Not sure. This is the first time we've encountered it.
What operating system and processor architecture are you using (go env)?
Linux (amd64)
What did you do?
CockroachDB has various time based loops. A customer reported a problem on a cluster which we eventually traced down to some of these timer loops being wedged for excessively long periods of time. The cleanest example is:
func (s *Store) raftTickLoop(ctx context.Context) {
ticker := time.NewTicker(s.cfg.RaftTickInterval)
defer ticker.Stop()
var rangeIDs []roachpb.RangeID
for {
select {
case <-ticker.C:
...
case <-s.stopper.ShouldStop():
return
}
}
}
s.cfg.RaftTickInterval is 200 * time.Milliseconds (there is no way to change this without recompiling). The omitted ... code isn't doing anything with the ticker.
Here is a goroutine stack showing the impossible:
goroutine 487 [select, 53810 minutes]:
github.com/cockroachdb/cockroach/pkg/storage.(*Store).raftTickLoop(0xc4204b1000, 0x2779600, 0xc425516810)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3979 +0x1e7
github.com/cockroachdb/cockroach/pkg/storage.(*Store).(github.com/cockroachdb/cockroach/pkg/storage.raftTickLoop)-fm(0x2779600, 0xc425516810)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3965 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc425cd7ed0, 0xc420796000, 0xc425cd7ec0)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:185 +0xad
We have examples of badness with time.Ticker loops and time.Timer loops. This problem has occurred on multiple nodes within the same cluster, though we're unaware of it ever occurring on another cluster. Also somewhat interesting is that we have evidence that not all timer loops on a node blocked at the same time. For example, from the same node there is this stack:
goroutine 300 [select, 3661 minutes]:
github.com/cockroachdb/cockroach/pkg/storage.(*replicaScanner).waitAndProcess(0xc424e7a000, 0x2779600, 0xc4206174d0, 0xc420796000, 0x6dfee00, 0xed369038f, 0x0, 0xc429488a80, 0x400)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/scanner.go:212 +0x212
github.com/cockroachdb/cockroach/pkg/storage.(*replicaScanner).scanLoop.func1.1(0xc429488a80, 0xc42079a301)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/scanner.go:274 +0x81
github.com/cockroachdb/cockroach/pkg/storage.(*storeReplicaVisitor).Visit(0xc42079a390, 0xc42717e500)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:324 +0x1a6
github.com/cockroachdb/cockroach/pkg/storage.(*replicaScanner).scanLoop.func1(0x2779600, 0xc4206174d0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/scanner.go:272 +0x1c0
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4202cbba0, 0xc420796000, 0xc4203b2ec0)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:185 +0xad
The loop in replicaScanner.scanLoop is supposed to iterate over all replicas on a node every 10min. That's what we always see, yet here we see that goroutine blocked on a select for 2.5 days.
Have there every been other similar reports to this (we found nothing looking through the open and closed issues)? Is it conceivable that we have a bug in cockroach that corrupted internal runtime data structures? We're scratching our heads over here about what could be going on.