Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time: time.Ticker(200ms) hasn't fired in 37 days #29123

Closed
petermattis opened this issue Dec 6, 2018 · 1 comment

Comments

Projects
None yet
1 participant
@petermattis
Copy link

commented Dec 6, 2018

What version of Go are you using (go version)?

go version go1.10 linux/amd64

Does this issue reproduce with the latest release?

Not sure. This is the first time we've encountered it.

What operating system and processor architecture are you using (go env)?

Linux (amd64)

What did you do?

CockroachDB has various time based loops. A customer reported a problem on a cluster which we eventually traced down to some of these timer loops being wedged for excessively long periods of time. The cleanest example is:

func (s *Store) raftTickLoop(ctx context.Context) {
	ticker := time.NewTicker(s.cfg.RaftTickInterval)
	defer ticker.Stop()

	var rangeIDs []roachpb.RangeID

	for {
		select {
		case <-ticker.C:
...
		case <-s.stopper.ShouldStop():
			return
		}
	}
}

s.cfg.RaftTickInterval is 200 * time.Milliseconds (there is no way to change this without recompiling). The omitted ... code isn't doing anything with the ticker.

Here is a goroutine stack showing the impossible:

goroutine 487 [select, 53810 minutes]:
github.com/cockroachdb/cockroach/pkg/storage.(*Store).raftTickLoop(0xc4204b1000, 0x2779600, 0xc425516810)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3979 +0x1e7
github.com/cockroachdb/cockroach/pkg/storage.(*Store).(github.com/cockroachdb/cockroach/pkg/storage.raftTickLoop)-fm(0x2779600, 0xc425516810)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3965 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc425cd7ed0, 0xc420796000, 0xc425cd7ec0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:185 +0xad

We have examples of badness with time.Ticker loops and time.Timer loops. This problem has occurred on multiple nodes within the same cluster, though we're unaware of it ever occurring on another cluster. Also somewhat interesting is that we have evidence that not all timer loops on a node blocked at the same time. For example, from the same node there is this stack:

goroutine 300 [select, 3661 minutes]:
github.com/cockroachdb/cockroach/pkg/storage.(*replicaScanner).waitAndProcess(0xc424e7a000, 0x2779600, 0xc4206174d0, 0xc420796000, 0x6dfee00, 0xed369038f, 0x0, 0xc429488a80, 0x400)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scanner.go:212 +0x212
github.com/cockroachdb/cockroach/pkg/storage.(*replicaScanner).scanLoop.func1.1(0xc429488a80, 0xc42079a301)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scanner.go:274 +0x81
github.com/cockroachdb/cockroach/pkg/storage.(*storeReplicaVisitor).Visit(0xc42079a390, 0xc42717e500)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:324 +0x1a6
github.com/cockroachdb/cockroach/pkg/storage.(*replicaScanner).scanLoop.func1(0x2779600, 0xc4206174d0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scanner.go:272 +0x1c0
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4202cbba0, 0xc420796000, 0xc4203b2ec0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:185 +0xad

The loop in replicaScanner.scanLoop is supposed to iterate over all replicas on a node every 10min. That's what we always see, yet here we see that goroutine blocked on a select for 2.5 days.

Have there every been other similar reports to this (we found nothing looking through the open and closed issues)? Is it conceivable that we have a bug in cockroach that corrupted internal runtime data structures? We're scratching our heads over here about what could be going on.

@petermattis

This comment has been minimized.

Copy link
Author

commented Dec 6, 2018

Well, it appears that the investigation above triggered a memory from the customer experiencing this problem about a timer related issue (not Go related) which they had previously traced to kernel timer corruption and sleeps getting stuck. Apparently https://lore.kernel.org/lkml/tip-1f71addd34f4c442bec7d7c749acc1beb58126f2@git.kernel.org/ fixed the problem.

Fun times. Not a CockroachDB bug. Not a Go runtime bug.

@petermattis petermattis closed this Dec 6, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.