
runtime: problems with rdtsc in VMs moving backward #8976

Closed
randall77 opened this issue Oct 21, 2014 · 26 comments

Comments

@randall77
Contributor

I added the following patch:

https://golang.org/cl/159150043

It seems to fix failures that we were having on one of two Solaris build machines.  The
fix involves making sure that times returned by cputicks() are strictly increasing.

cputicks just runs RDTSC, so OS-dependent behavior is odd.  My only suspicion is that
Solaris doesn't keep the CPU counter synchronized between processors, so when a
goroutine moves between processors the clock can appear to go backwards.

Any idea if this is the case?  There's not much to fix here; I'm just curious
to understand what is going on.
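
For illustration only, the "strictly increasing" guarantee can be enforced roughly as in the sketch below. This is not the actual CL; rawTicks is a stand-in for the runtime's cputicks so the example is self-contained.

    // Minimal sketch of the clamping idea: never hand out a tick value that is
    // <= the last one handed out, even if the underlying counter steps backwards.
    package tickclamp

    import (
        "sync/atomic"
        "time"
    )

    var lastTick int64 // last value returned by MonotonicTicks

    // rawTicks stands in for cputicks (RDTSC) so the sketch is self-contained.
    func rawTicks() int64 { return time.Now().UnixNano() }

    // MonotonicTicks returns a strictly increasing value.
    func MonotonicTicks() int64 {
        for {
            last := atomic.LoadInt64(&lastTick)
            now := rawTicks()
            if now <= last {
                now = last + 1 // counter went backwards (or stood still); force progress
            }
            if atomic.CompareAndSwapInt64(&lastTick, last, now) {
                return now
            }
        }
    }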
@4ad
Member

4ad commented Oct 21, 2014

Comment 1:

The timestamp counter is most certainly not synchronized between CPUs. gethrvtime(3c)
keeps track of everything to present near-ns information, but if you use the raw
rdtsc it will not be synchronized.

@ianlancetaylor
Contributor

Comment 2:

On older Intel (or possibly only AMD, I can't remember) multi-core processors RDTSC is
not kept synchronized between different cores.  The different cores increment their
counters independently, and they are only periodically synchronized.  My understanding is
that the Linux kernel has some moderately complex code to ensure that RDTSC always gives
monotonically increasing results even when called from different cores.  Presumably the
Solaris kernel does not have that code.

@randall77
Contributor Author

Comment 3:

Sounds like we should probably audit our code to make sure that we're not depending on
monotonic time.  I just fixed the blocking profile, but I'm sure there are others.
Can we move solaris to gethrvtime instead of rdtsc?
Do any of our other OSes have this problem?  I would imagine we can cope with RDTSC that
goes backwards a little bit, but if it is totally unsynchronized we're in trouble.

@davecheney
Contributor

Comment 4:

In a previous job I spent a lot of time on timekeeping stuff. I'm of the mind that for
newer Intel CPUs (basically all of the 64-bit CPUs) RDTSC _can_ be guaranteed to at least
be monotonically increasing at a given rate, irrespective of changes in power
management and clock speed. This is a good thing.
However, when you have multiple cores on a die, or multiple CPU packages in a machine,
RDTSC is not guaranteed to report a monotonically increasing value. That is, if you call
RDTSC and then your thread is rescheduled onto another core, the second RDTSC
instruction _may_ report a value which is slightly behind the first (where slightly is
on the scale of a small percentage of a counter running in the gigahertz range).
My knowledge of this is about three years out of date, but when I looked into the Linux
kernel code, it was not possible to set the TSC counter accurately, so the kernel
can only zero the counter on each CPU as it comes online, which gets the
counters close (again, within a small percentage of something free-running in the GHz range).
At the time, the PPTP software I was using ran in kernel mode and pinned itself to one
CPU core, thereby eliminating the possibility that TSC counters might drift between cores or
not be monotonic with respect to other cores.
This also explains why the build is failing on the SmartOS Solaris builder but not on
the sol11 builder: the former runs on real hardware, a big machine with multiple
sockets, whereas the latter runs inside VMware, which goes to some lengths to 'fix'
RDTSC by virtualising the instruction.
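
As an aside, for user-space code on Linux the "pin to one core" workaround Dave describes can be approximated roughly as below. This is only a sketch; the function names are invented, it uses golang.org/x/sys/unix rather than anything in the runtime, and the PPTP case above was kernel-mode, so this is an analogy only.

    // Sketch (Linux only): keep this goroutine on one OS thread and pin that
    // thread to a single CPU, so all TSC readings come from one core's counter.
    package main

    import (
        "fmt"
        "runtime"

        "golang.org/x/sys/unix"
    )

    func pinToCPU(cpu int) error {
        runtime.LockOSThread() // don't let this goroutine migrate to another OS thread
        var set unix.CPUSet
        set.Zero()
        set.Set(cpu)
        // pid 0 means "the calling thread" for sched_setaffinity(2).
        return unix.SchedSetaffinity(0, &set)
    }

    func main() {
        if err := pinToCPU(0); err != nil {
            fmt.Println("pin failed:", err)
            return
        }
        // ... timing-sensitive work that relies on RDTSC would go here ...
    }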

@4ad
Member

4ad commented Oct 21, 2014

Comment 5:

> Can we move solaris to gethrvtime instead of rdtsc?
I have to check all instances where we do this but I see no reason why we couldn't do
this, in principle.

@alexbrainman
Member

Comment 6:

I think our Windows runtime.nanotime is monotonic. See the source code for details. It even
has some links to external documentation.
Alex

@gopherbot
Contributor

Comment 7 by shawn.walker@binarycrusader.com:

The other posters are correct; on Solaris, you should use gethrvtime() or gethrtime() as
appropriate.  It is guaranteed monotonic, and is optimised (along with gettimeofday() in
newer releases) to be done entirely in userspace avoiding the kernel context switch.
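
For illustration, calling gethrtime() from Go via cgo looks roughly like the sketch below. The runtime itself would go through its own syscall path rather than cgo, and the package and function names here are invented.

    // Sketch: Solaris-only wrapper around gethrtime(3C), which is documented
    // as monotonic across CPUs.
    //go:build solaris

    package hrt

    /*
    #include <sys/time.h>
    */
    import "C"

    // monotonicNanos returns nanoseconds from a monotonic high-resolution clock.
    func monotonicNanos() int64 {
        return int64(C.gethrtime())
    }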

@randall77 randall77 added the new label Nov 9, 2014
@bradfitz bradfitz removed the new label Dec 18, 2014
@dvyukov
Member

dvyukov commented Jan 29, 2015

This also badly affects the runtime tracer.
I've disabled the tracer tests for now (see src/runtime/pprof/trace_test.go).
We need monotonic ticks; otherwise the traces are useless.

@kardianos
Contributor

@alexbrainman On my 6-core AMD box, I tried running the following:
Windows XP 64-bit (SP2) and Windows XP 32-bit (SP3)

I tried running these on the following VMs:
VMware and VirtualBox.

In every combination, the supposedly monotonic time was not monotonic when the VM had 2 or more cores. If I allocated only a single core to the VM, the pprof tests pass on VirtualBox, but on VMware they still fail.

I have not observed this issue running on bare-metal 4-core AMD Windows 7 boxes.

@davecheney
Contributor

Virtual machines really cannot tell time. I don't expect RDTSC to work
reliably in virtual environments.


@kardianos
Contributor

@davecheney If that's the expectation, should we define an IN_VM or NO_MONOTONIC_TICKS environment variable for builders that are running in VMs?

@bradfitz
Contributor

bradfitz commented Apr 3, 2015

FWIW, we already use IN_QEMU to skip some tests in QEMU. Generalizing that might be good.
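
A skip helper along those lines might look like the sketch below. IN_QEMU is the variable mentioned above; NO_MONOTONIC_TICKS is only kardianos's suggestion, not something the builders actually set.

    package clocktest

    import (
        "os"
        "testing"
    )

    // skipIfUnreliableClock skips the calling test on builders that mark their
    // clock as unreliable. NO_MONOTONIC_TICKS is hypothetical (see above).
    func skipIfUnreliableClock(t *testing.T) {
        if os.Getenv("IN_QEMU") == "1" || os.Getenv("NO_MONOTONIC_TICKS") == "1" {
            t.Skip("skipping: clock on this builder is not reliably monotonic")
        }
    }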

@minux
Member

minux commented Apr 3, 2015 via email

@kardianos
Contributor

@minux When I run Windows XP SP3 on the 6-core box physically, I observe monotonic time; all tests pass repeatedly.

The Windows VM detects the make and model of my host AMD CPUs.

@rsc
Contributor

rsc commented Apr 10, 2015

Can someone explain why this is important?

@rsc rsc changed the title runtime: solaris timer can go backwards? runtime: problems with rdtsc in VMs moving backward Apr 10, 2015
@rsc rsc added this to the Go1.5Maybe milestone Apr 10, 2015
@dvyukov
Member

dvyukov commented Apr 10, 2015

The block profile and runtime tracer use RDTSC.

@4ad
Member

4ad commented Apr 10, 2015

If the block profile and runtime tracer require monotonic time, why don't they use the OS-specific APIs that return monotonic time?

@dvyukov
Member

dvyukov commented Apr 10, 2015

OS timers have too low precision.
RDTSC ought to be monotonic.

@bradfitz
Contributor

From my comment on https://go-review.googlesource.com/#/c/8736/2 :

I think this is too specific. It's not just IN_QEMU. And it's not all VMs.

We probably should have per-builder lists of tests to skip or something.
Or just make the runtime itself detect things going backwards and set a
flag we can check here. e.g.:

    func TestFoo(t *testing.T) {
        if err := testFoo(t); err != nil {
            if runtime.TimerEverWentBackwards() {
                t.Skip("test failed, but ignoring, since runtime reports timer sucks on this machine")
            } else {
                t.Fatal(err)
            }
        }
    }

Obviously not exactly like that, but I'm wondering if we can ask the runtime if
it ever detected a problem with the clocks, if the clocks aren't fixable.

Then we can also make "go test -trace" and friends print a nicer message saying, 

    "Sorry, tracing is not supported in this environment. Your clocks don't work."

@davecheney
Contributor

The TSC counter is not monotonic unless your process is pinned to a CPU
core.


@rsc rsc modified the milestones: Unplanned, Go1.5Maybe Jul 21, 2015
@tv42

tv42 commented Aug 21, 2015

It's not just VMs, either. go tool trace consistently fails with "time stamps out of order" here on linux/amd64, no VM involved. What Dave said: RDTSC is per core.

Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz

@kardianos
Contributor

I think this issue is old and can be closed. The profiler issues have largely been solved on Solaris and Windows with a combination of a stable sort and manual sequence ordering to work around this.
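
To illustrate the kind of workaround meant here (not the actual runtime code): order events primarily by an explicit sequence number, falling back to the raw timestamp, with a stable sort so equal keys keep their emission order. The Event type below is invented for the example.

    package traceorder

    import "sort"

    // Event is a made-up stand-in for a trace/profile event.
    type Event struct {
        Seq  uint64 // sequence number assigned when the event is emitted
        Tick int64  // raw cputicks timestamp; may step backwards across cores
    }

    // orderEvents sorts by sequence first and only falls back to the raw
    // timestamp, so a backwards-stepping TSC cannot reorder events.
    func orderEvents(events []Event) {
        sort.SliceStable(events, func(i, j int) bool {
            if events[i].Seq != events[j].Seq {
                return events[i].Seq < events[j].Seq
            }
            return events[i].Tick < events[j].Tick
        })
    }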

@tv42

tv42 commented Nov 14, 2016

@kardianos @bradfitz This bug is absolutely not fixed; I can trigger it just as before with Go 2f76c19. Details for that specific scenario are still at #16755

@tv42

tv42 commented Nov 14, 2016

(Well, I guess that depends on your definition of "this bug". Over-reliance on RDTSC to be monotonic across cores still definitely exists.)

@rsc
Contributor

rsc commented Nov 14, 2016

@tv42, OK; since #16755 is still open, I think we don't need this bug to track that too.

@4ad
Member

4ad commented Nov 14, 2016

Yes, Solaris is still affected. Let's continue in #16755.

@golang golang locked and limited conversation to collaborators Nov 14, 2017