New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time: wall and monotonic clocks get out of sync #27090

Open
dshearer opened this Issue Aug 20, 2018 · 11 comments

Comments

Projects
None yet
3 participants
@dshearer

dshearer commented Aug 20, 2018

What version of Go are you using (go version)?

go1.10.3 linux/amd64

Does this issue reproduce with the latest release?

Indeed.

What operating system and processor architecture are you using (go env)?

Running in a Docker container. Here's the output of go env on the container:

GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/go"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build937894194=/tmp/go-build -gno-record-gcc-switches"

Strangely, I have not seen, or heard reports of, this happening outside of Docker.

What did you do?

My program must do some task on a schedule. So it uses the "time" lib to compute the next time to do the task, and to wait till that time. Here's an example:

package main

import (
	"fmt"
	"time"
)

func main() {
	now := time.Now()

	for {
		fmt.Println("")
		nrt := now.Add(time.Second * 5)
		fmt.Printf("Now: %v; Next run time: %v\n", now.String(), nrt.String())

		for now.Before(nrt) {
			sleepDur := nrt.Sub(now)
			fmt.Printf("Sleeping for %v\n", sleepDur)

			afterChan := time.After(sleepDur)
			now = <-afterChan
			fmt.Printf("Awoke at %v\n", now.String())
		}

		// do task
		fmt.Printf("Doing task at %v (next run time: %v)\n", now.String(), nrt.String())
	}
}

I ran it in a Docker container (version 18.06.0-ce-mac70):

FROM golang:1.10-alpine

WORKDIR /
COPY main.go /
RUN go build /main.go

ENTRYPOINT ["/main"]

What did you expect to see?

For every "Doing task at X (next run time: Y)" line, X should be >= Y.

What did you see instead?

After a few iterations, I see "Doing task at X (next run time: Y)" lines where X < Y. Example:

Doing task at 2018-08-20 00:09:51.9754029 +0000 UTC m=+60.021665301 (next run time: 2018-08-20 00:09:52.0083237 +0000 UTC m=+60.021022301)

Analysis

This does not always happen, and usually only after a few iterations. As I mentioned above, I have only seen this in Docker containers. With this example program, the times will only be off by tens of milliseconds.

Here's a longer output sample, with 3 iterations:

Now: 2018-08-20 00:49:33.8226258 +0000 UTC m=+550.358479801; Next run time: 2018-08-20 00:49:38.8226258 +0000 UTC m=+555.358479801
Sleeping for 5s
Awoke at 2018-08-20 00:49:38.8275073 +0000 UTC m=+555.363361401
Doing task at 2018-08-20 00:49:38.8275073 +0000 UTC m=+555.363361401 (next run time: 2018-08-20 00:49:38.8226258 +0000 UTC m=+555.358479801)

Now: 2018-08-20 00:49:38.8275073 +0000 UTC m=+555.363361401; Next run time: 2018-08-20 00:49:43.8275073 +0000 UTC m=+560.363361401
Sleeping for 5s
Awoke at 2018-08-20 00:49:43.8283399 +0000 UTC m=+560.364194401
Doing task at 2018-08-20 00:49:43.8283399 +0000 UTC m=+560.364194401 (next run time: 2018-08-20 00:49:43.8275073 +0000 UTC m=+560.363361401)

Now: 2018-08-20 00:49:43.8283399 +0000 UTC m=+560.364194401; Next run time: 2018-08-20 00:49:48.8283399 +0000 UTC m=+565.364194401
Sleeping for 5s
Awoke at 2018-08-20 00:49:48.799623 +0000 UTC m=+565.368983701
Doing task at 2018-08-20 00:49:48.799623 +0000 UTC m=+565.368983701 (next run time: 2018-08-20 00:49:48.8283399 +0000 UTC m=+565.364194401)

The bug shows up in the last iteration, in which Before claims that 00:49:48.799623 is not before 00:49:48.8283399. Interestingly, while Before is incorrect in terms of the wall-clock times, it is correct in terms of the monotonic times.

The last iteration began with now == 00:49:43.8283399 (m=+560.364194401). It then slept, and woke when the After channel passed it a new now value of 00:49:48.799623 (m=+565.368983701). Note that the difference in wall-clock time is 4.971283099999994 sec, while the difference in monotonic time is 5.004789300000084 sec. So, it seems that the time lib is returning time values in which the relation between the monotonic and wall clocks changes a bit. IOW, one of these clocks is not properly keeping time.

Cc. @rsc

Background

I run the Jobber project, which is an enhanced cron that can be run in Docker. This bug caused some users' jobs to run twice: once a second or two before the scheduled time, and once at the scheduled time. Confer dshearer/jobber#192

@dshearer dshearer changed the title from Wall and monotonic clocks get out of sync to time: Wall and monotonic clocks get out of sync Aug 20, 2018

@ianlancetaylor ianlancetaylor added this to the Go1.12 milestone Aug 20, 2018

@ianlancetaylor ianlancetaylor changed the title from time: Wall and monotonic clocks get out of sync to time: wall and monotonic clocks get out of sync Aug 20, 2018

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Aug 20, 2018

Both the wall clock time and the monotonic time come from calling clock_gettime VDSO call. For the wall time the Go runtime passes CLOCK_REALTIME, and for the monotonic time it passes CLOCK_MONOTONIC. It would be interesting to see whether a C program calling clock_gettime in a loop would show the same behavior in your container.

@dshearer

This comment has been minimized.

dshearer commented Sep 3, 2018

I've reproduced this in C by reimplementing much of the Go time API. It's in https://github.com/dshearer/golang-issue-27090 . This repo also includes an equivalent program in Go.

@dshearer

This comment has been minimized.

dshearer commented Sep 3, 2018

I think the problem is the fact that the realtime clock and the monotonic clock cannot be checked simultaneously --- rather, you must make two calls to clock_gettime. And of course some amount of time passes between these calls. In a container, this amount of time can be large enough to cause this bug.

@dshearer

This comment has been minimized.

dshearer commented Sep 3, 2018

Here's the output of the Go program (https://github.com/dshearer/golang-issue-27090/blob/master/go/main.go) running in a container:

Sleeping for 5s
Woke at 2018-09-03 21:09:56.0803929 +0000 UTC m=+5.001737701 (should be >= 2018-09-03 21:09:56.0791786 +0000 UTC m=+5.000523001)
Wall diff: 5.001214504241943        Mono diff: 5.001214593590703

Sleeping for 5s
Woke at 2018-09-03 21:10:01.0818136 +0000 UTC m=+10.003160201 (should be >= 2018-09-03 21:10:01.0803929 +0000 UTC m=+10.001737701)
Wall diff: 5.001420497894287        Mono diff: 5.001422882080078

Sleeping for 5s
Woke at 2018-09-03 21:10:06.0847409 +0000 UTC m=+15.006088101 (should be >= 2018-09-03 21:10:06.0818136 +0000 UTC m=+15.003160201)
Wall diff: 5.002927303314209        Mono diff: 5.002927780151367

Sleeping for 5s
Woke at 2018-09-03 21:10:11.0863005 +0000 UTC m=+20.007650001 (should be >= 2018-09-03 21:10:11.0847409 +0000 UTC m=+20.006088101)
Wall diff: 5.001559495925903        Mono diff: 5.001562118530273

Sleeping for 5s
Woke at 2018-09-03 21:10:16.0874598 +0000 UTC m=+25.008804601 (should be >= 2018-09-03 21:10:16.0863005 +0000 UTC m=+25.007650001)
Wall diff: 5.001159429550171        Mono diff: 5.001153945922852

Sleeping for 5s
Woke at 2018-09-03 21:10:21.0564017 +0000 UTC m=+30.011997801 (should be >= 2018-09-03 21:10:21.0874598 +0000 UTC m=+30.008804601)
Wall diff: 4.968941926956177        Mono diff: 5.003192901611328
BUG ENCOUNTERED!!!

In each iteration, it sleeps for 5 seconds. Those two diffs ("Wall diff" and "Mono diff") show how much time has passed according to the wall and mono components of the time structs. (I made a hack to get the mono component.)

@dshearer

This comment has been minimized.

dshearer commented Sep 3, 2018

And here's the output of the C program (https://github.com/dshearer/golang-issue-27090/blob/master/c/main.c) running in a similar container:

Woke at 21:15:18.712391 m=5.001187 (should be >= 21:15:18.711209 m=5.000000)
Wall diff: 5.001181        Mono diff: 5.001187

Sleeping for 5 sec
Woke at 21:15:23.683796 m=10.006417 (should be >= 21:15:23.712687 m=10.001482)
Wall diff: 4.971109        Mono diff: 5.004935
BUG ENCOUNTERED!!!
@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Sep 3, 2018

Although some time passes between the calls to clock_gettime, we always make the calls in the same order: we call with CLOCK_REALTIME first and with CLOCK_MONOTONIC second. So if the kernel maintains those timers in a consistent manner, it's not clear to me how this can happen.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Sep 3, 2018

By the way, thanks for recreating the problem in C. That is very helpful and suggests that there may not be much we do about it in Go.

@dshearer

This comment has been minimized.

dshearer commented Sep 3, 2018

The clocks are read in the same order, but the amount of time between the reads varies. That means that times returned by time.Now present the two clocks in an inconsistent relationship with each other.

For illustration, let's pretend that the time struct just has two numbers for each clock. I first call time.Now and get start = {wall: A, mono: B}. I then compute the time I want to wake up by adding 5 units to start: wakeup = {wall: A+5, mono: B+5}. I then sleep, and then call time.Now. But now, I get curr = {wall: A', mono: B'} such that A' - A != B' - B. And eventually, after several iterations, I'll have a curr such that B' > B+5 but A' < A+5 --- or, in other words:

!curr.Before(wake) && curr.UnixNano() < wake.UnixNano()

I reproduced the problem in C by reproducing how Go uses clock_gettime. So the bug (assuming we agree this is a bug) is in how Go uses clock_gettime. In fact, it seems this problem was anticipated in the design doc (https://github.com/golang/proposal/blob/master/design/12914-monotonic.md):

Precision: In general, operating systems provide different system operations to read the wall clock
and the monotonic clock, so the implementation of time.Now must read both in sequence. Time
will advance between the calls, with the effect that even in the absence of clock resets, t.Sub(u)
(using monotonic clock readings) and t.AddDate(0,0,0).Sub(u) (using wall clock readings) will
differ slightly. Since both cases are subtracting times obtained time.Now, both results are arguably
correct: any discrepancy is necessarily less than the overhead of the calls to time.Now. This
discrepancy only arises if code actively looks for it, by doing the subtraction or comparison both
ways. In the survey of extant Go code (see Appendix below), we found no such code that would
detect this discrepancy.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Sep 7, 2018

Thanks for the explanation.

I don't see how Go could use clock_gettime in any other way, so I don't see any way to fix this. Really the only thing that comes to mind is a kernel call that returns both kinds of time with consistent values.

@dshearer dshearer closed this Sep 9, 2018

@dshearer dshearer reopened this Sep 9, 2018

@dshearer

This comment has been minimized.

dshearer commented Sep 9, 2018

Oops. Didn't mean to close it.

FYI, here's the next paragraph in the design doc:

On x86 systems, Linux, macOS, and Windows convey clock information to user processes by
publishing a page of memory containing the coefficients for a formula converting the processor's
time stamp counter to monotonic clock and to wall clock readings. A perfectly synchronized read of
both clocks could be obtained in this case by doing a single read of the time stamp counter and
applying both formulas to the same input. This is an option if we decide it is important to eliminate
the discrepancy on commonly used systems. This would improve precision but again it is false
precision beyond the actual accuracy of the calls.

I can't figure out how to implement this fix without doing strange hacks to find these coefficients.

I hate to say it, but I think the root problem is the decision to put both clocks in one datatype :(

dshearer added a commit to dshearer/jobber that referenced this issue Sep 10, 2018

(#192) Bugfix in queue.go: Job execed multiple times
- A bug in Go's time package (golang/go#27090) was causing jobs
to be executed multiple times.  In this commit, we provide a workaround.
@wsc1

This comment has been minimized.

wsc1 commented Oct 18, 2018

Thanks for the explanation.

I don't see how Go could use clock_gettime in any other way, so I don't see any way to fix this. Really the only thing that comes to mind is a kernel call that returns both kinds of time with consistent values.

I wouldn't have expected the times to agree to begin with in any container/virtualized situation. To me, this is "working as expected".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment