Skip to content

time: time.Ticker(200ms) hasn't fired in 37 days #29123

@petermattis

Description

@petermattis

What version of Go are you using (go version)?

go version go1.10 linux/amd64

Does this issue reproduce with the latest release?

Not sure. This is the first time we've encountered it.

What operating system and processor architecture are you using (go env)?

Linux (amd64)

What did you do?

CockroachDB has various time based loops. A customer reported a problem on a cluster which we eventually traced down to some of these timer loops being wedged for excessively long periods of time. The cleanest example is:

func (s *Store) raftTickLoop(ctx context.Context) {
	ticker := time.NewTicker(s.cfg.RaftTickInterval)
	defer ticker.Stop()

	var rangeIDs []roachpb.RangeID

	for {
		select {
		case <-ticker.C:
...
		case <-s.stopper.ShouldStop():
			return
		}
	}
}

s.cfg.RaftTickInterval is 200 * time.Milliseconds (there is no way to change this without recompiling). The omitted ... code isn't doing anything with the ticker.

Here is a goroutine stack showing the impossible:

goroutine 487 [select, 53810 minutes]:
github.com/cockroachdb/cockroach/pkg/storage.(*Store).raftTickLoop(0xc4204b1000, 0x2779600, 0xc425516810)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3979 +0x1e7
github.com/cockroachdb/cockroach/pkg/storage.(*Store).(github.com/cockroachdb/cockroach/pkg/storage.raftTickLoop)-fm(0x2779600, 0xc425516810)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3965 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc425cd7ed0, 0xc420796000, 0xc425cd7ec0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:185 +0xad

We have examples of badness with time.Ticker loops and time.Timer loops. This problem has occurred on multiple nodes within the same cluster, though we're unaware of it ever occurring on another cluster. Also somewhat interesting is that we have evidence that not all timer loops on a node blocked at the same time. For example, from the same node there is this stack:

goroutine 300 [select, 3661 minutes]:
github.com/cockroachdb/cockroach/pkg/storage.(*replicaScanner).waitAndProcess(0xc424e7a000, 0x2779600, 0xc4206174d0, 0xc420796000, 0x6dfee00, 0xed369038f, 0x0, 0xc429488a80, 0x400)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scanner.go:212 +0x212
github.com/cockroachdb/cockroach/pkg/storage.(*replicaScanner).scanLoop.func1.1(0xc429488a80, 0xc42079a301)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scanner.go:274 +0x81
github.com/cockroachdb/cockroach/pkg/storage.(*storeReplicaVisitor).Visit(0xc42079a390, 0xc42717e500)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:324 +0x1a6
github.com/cockroachdb/cockroach/pkg/storage.(*replicaScanner).scanLoop.func1(0x2779600, 0xc4206174d0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scanner.go:272 +0x1c0
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4202cbba0, 0xc420796000, 0xc4203b2ec0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:185 +0xad

The loop in replicaScanner.scanLoop is supposed to iterate over all replicas on a node every 10min. That's what we always see, yet here we see that goroutine blocked on a select for 2.5 days.

Have there every been other similar reports to this (we found nothing looking through the open and closed issues)? Is it conceivable that we have a bug in cockroach that corrupted internal runtime data structures? We're scratching our heads over here about what could be going on.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions