New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time: clock drift on Windows 2008r2 w/ version >= 1.9 #24489

Open
dennisdupont opened this Issue Mar 22, 2018 · 8 comments

Comments

Projects
None yet
9 participants
@dennisdupont

dennisdupont commented Mar 22, 2018

What version of Go are you using (go version)?

Occurs on v1.9 and above

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

windows/amd64
Issue occurs on win2008r2, but not on win2012 (tested) or win2016 (according to consul forum comments)
Issue occurs on domain attached or standalone server.

What did you do?

Clock drift was noticed on a deployed consul cluster (see hashicorp/consul#3925 for excruciating details). Determined it started with consul v0.9.3 and existed in the latest. That was when they switched to go v1.9. So we downgraded to v0.9.2 and problem disappeared.

The major applicable change in go 1.9 seemed to be the monotonic clock changes, so I experimented with the go version. If consul v0.9.3 is built with go 1.8 the problem also does not exist.

With help from a consul contributor we were able to create a small snippet to reproduce the issue:
https://play.golang.org/p/4y79262HSrJ

The clock drift is measured the same way as with the production servers, running w32tm:
>w32tm /stripchart /computer:10.60.1.25 /dataonly /samples:100
As soon as the test starts running you can see drift.

What did you expect to see?

A stable clock

What did you see instead?

Significant clock drift
Here is an example run:

C:\Users\Administrator>w32tm /stripchart /computer:208.88.126.235 /dataonly /samples:100
Tracking 208.88.126.235 [208.88.126.235:123].
Collecting 100 samples.
The current time is 3/22/2018 9:45:25 AM.
09:45:25, +00.0104974s
09:45:27, +00.0085572s
09:45:29, +00.0080007s
09:45:31, +00.0022288s
09:45:33, +00.0070934s
09:45:35, -00.0778244s <== test started
09:45:38, -00.1392391s
09:45:40, -00.3150037s
09:45:42, -00.4225186s
09:45:44, -00.4935759s
09:45:46, -00.6112448s
09:45:48, -00.7180814s
09:45:50, -00.8264958s
09:45:52, -00.9447071s
09:45:54, -01.0553810s <== over 1 second offset in ~20 seconds
09:45:56, -01.1570893s
09:45:58, -01.2324556s

This keeps growing until (S)NTP starts fighting the drift, but we have seen it as high as ~180 seconds, enough to cause kerberos auth failures.

@ALTree ALTree changed the title from Clock drift on Windows 2008r2 w/ version >= 1.9 to time: clock drift on Windows 2008r2 w/ version >= 1.9 Mar 22, 2018

@dgryski

This comment has been minimized.

Contributor

dgryski commented Mar 22, 2018

@dennisdupont

This comment has been minimized.

dennisdupont commented Mar 22, 2018

There are a couple of things I tried against a 1.9 version (in order):

  • commented out osRelax and changed osinit to set _timeBeginPeriod to 1 (as in v1.8)
  • altered os_windows.go to use systime(), nanotime() and unixnano() implementation from 1.8
  • commented out the set _timeBeginPeriod in osinit

None of these fixed the issue (although I am a neophyte, so take it with a grain of salt).

@alexbrainman

This comment has been minimized.

Member

alexbrainman commented Mar 24, 2018

09:45:54, -01.0553810s <== over 1 second offset in ~20 seconds

That's terrible Muriel.

Reading

hashicorp/consul#3925 (comment)

https://stackoverflow.com/questions/102064/clock-drift-on-windows

https://bugs.java.com/view_bug.do?bug_id=5005837

https://support.microsoft.com/en-us/help/821893/the-system-clock-may-run-fast-when-you-use-the-acpi-power-management-t

the only suggestion that comes to my mind is that staring from go1.9 we call timeBeginPeriod / timeEndPeriod much more often. Perhaps that makes your computer time drift. You can easily test that theory by changing osRelax function in runtime to do nothing.

commented out osRelax and changed osinit to set _timeBeginPeriod to 1 (as in v1.8)

Have you tried to do something like that:

diff --git a/src/runtime/os_windows.go b/src/runtime/os_windows.go
index 415ec0c..4295947 100644
--- a/src/runtime/os_windows.go
+++ b/src/runtime/os_windows.go
@@ -284,6 +284,8 @@ const osRelaxMinNS = 60 * 1e6
 // if we're already using the CPU, but if all Ps are idle there's no
 // need to consume extra power to drive the high-res timer.
 func osRelax(relax bool) uint32 {
+       return
+
        if relax {
                return uint32(stdcall1(_timeEndPeriod, 1))
        } else {

?

But I am not a real doctor. Perhaps @aclements will be more helpful.

Alex

@andybons andybons added this to the Go1.11 milestone Mar 26, 2018

@Carrotman42

This comment has been minimized.

Carrotman42 commented Mar 28, 2018

I haven't run the example playground, but looking at the code there are two places the code comments don't match the implementation (afaict). I can't speak to what that means for this issue; I just wanted to mention what I noticed.

There's a "default" branch in the select where you're supposed to be waiting for that timer to send something down the channel. This causes the select to not block, and you won't wait for the tick.

At the end of the for loop it says you're trying to clear the timers. I suggest setting the slice to nil instead: setting it to timers[:] unfortunately doesn't do anything useful.

@as

This comment has been minimized.

Contributor

as commented Mar 28, 2018

Comment doesn't match the code.

// Clear timers slice
timers = timers[:]

Should be

timers = timers[:0]

Additionally, this program seems to create an ever-increasing amount of timers that it iterates through checking for done-ness. As I now see is already stated by @Carrotman42, the default case is triggered if the uncleared value is checked or the timer hasn't fired yet, if the timer hasn't fired yet, it likely gets read on the next iteration and contains the stale timestamp.

@alexbrainman

This comment has been minimized.

Member

alexbrainman commented Apr 2, 2018

Comment doesn't match the code.

True, but that still does not explain computer clock drift.

@dennisdupont could you change your program to remove lines 30-31 and replace line 38 with timers = timers[:0], and tell us if that makes any difference to computer clock.

Thank you.

A;ex

@aclements

This comment has been minimized.

Member

aclements commented Apr 3, 2018

I may be confused here, but surely this can't be anything but a Windows kernel bug (not saying that Go isn't tickling it)? No user process should be capable of causing the system clock to skew.

If I were to guess what would trigger a kernel timekeeping issue, I would definitely start with the timeBeginPeriod/timeEndPeriod. But Go definitely isn't the only thing that's constantly switching the time period. Simply retrieving the time would be far down my list of suspects, since that doesn't even enter the kernel except on Wine.

How loaded is the system? Could it be thrashing so bad that it causes huge delays? Do we trust w32tm's report?

@dennisdupont

This comment has been minimized.

dennisdupont commented Apr 16, 2018

@aclements - the system I test on is not loaded at all, very idle. I have been using w32tm on quite a few servers around the center and pretty sure it is accurate. Also the drift's are consistent with the visible clocks (remote windows vs my desktop vs other servers, etc.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment