
Clock sources are slow to read and imprecise #77

Closed
tycho opened this issue Apr 8, 2016 · 18 comments


tycho commented Apr 8, 2016

On a normal bare metal Linux system:

Name          Cost(ns)      +/-    Resol  Mono  Fail  Warp  Stal  Regr
tsc              10.97    0.21%     ----   Yes     0     0     0     0
gettimeofday     17.60    0.01%  1000KHz    No     0     0  1000     0
                 17.60    0.01%
realtime         17.60    0.07%     ----   Yes     0     0     0     0
realtime_crs      4.66    0.06%    100Hz    No   999     0     0     0
                  3.33  142.09%
monotonic        17.20    0.06%     ----   Yes     0     0     0     0
monotonic_crs     4.52    0.02%    100Hz    No   999     0     0     0
                  3.34  142.09%
monotonic_raw    69.76    0.10%     ----   Yes     0     0     0     0
boottime         73.53    0.05%     ----   Yes     0     0     0     0
process         115.67    0.08%     ----   Yes     0     0     0     0
thread          109.64    0.07%     ----   Yes     0     0     0     0
clock           116.93    0.04%  1000KHz    No     0     0  1000     0
getrusage       181.32    0.05%    100Hz    No   996     3     3     0
ftime            20.78    0.04%   1000Hz    No   995     0     4     0
time              2.84    0.03%      1Hz    No  1000     0     0     0

On "Bash on Windows", the clock behavior isn't as nice:

Name          Cost(ns)      +/-    Resol  Mono  Fail  Warp  Stal  Regr
tsc              12.53   86.77%     ----   Yes     0     0     0     0
gettimeofday    166.10    0.72%   1032Hz    No   943     5    55     0
realtime        182.39    1.02%  3333KHz    No     0     0   131     0
realtime_crs    168.97    0.83%   2144Hz    No   934     0    65     0
monotonic       186.90    0.38%  1000KHz    No     0     0   998     0
monotonic_crs   188.76    0.33%  1000KHz    No     0     0   998     0
Failed to read from monotonic_raw clock (1, 4)
boottime        186.66    0.37%  1000KHz    No     0     0   998     0
process         272.19    0.81%     64Hz    No   996     3     3     0
thread          229.25    0.84%     64Hz    No   997     2     2     0
clock           278.50    1.23%     64Hz    No   996     3     3     0
getrusage       385.45    0.78%     64Hz    No   995     4     4     0
ftime           207.14    0.81%   1000Hz    No   959     0    40     0
time            156.06    0.86%      1Hz    No  1000     0     0     0

To explain the columns a bit:

The Cost(ns) column is the time cost to read that clock in nanoseconds, as measured by the TSC (except when measuring the cost of reading the TSC itself, which is measured by looking at gettimeofday).

Resol is the observable tick rate of the clock.

Mono indicates whether the clock source is monotonic, with an additional restriction: not only must the clock always move forward, it must never return the same value on consecutive reads (i.e. it ticks at high frequency).

Fail indicates the number of times the clock source failed to advance in >= 200 reads.

Warp indicates the number of times the clock jumped forward by an unexpectedly large span (threshold for this is currently 1ms).

Stal indicates the number of times the clock source failed to advance in more than 1, but less than 200 reads.

Regr indicates the number of times the clock moved backwards. This can happen due to hypervisor behavior, NTP adjustments, and other clock changes. It's extremely bad behavior if you use the clock source for timespan measurement (e.g. profiling, benchmarks, etc).


russalex commented Apr 9, 2016

Amazing bug. Thanks for the details. I am taking this back to the team. Will update here when I have more info.

@HenkPoley

What tool did you use to derive those numbers?


tycho commented Apr 15, 2016

Hi @HenkPoley. I sent the source to Microsoft through other channels. But since people are asking, I've published the source code under the ISC license here: https://github.com/tycho/clockperf


tycho commented Aug 2, 2016

No obvious improvement on 14393.10:

Name          Cost(ns)      +/-    Resol  Mono  Fail  Warp  Stal  Regr
tsc              12.51   86.77%     ----   Yes     0     0     0     0
gettimeofday    154.47    0.33%   1000Hz    No   959    14    40     0
realtime        171.15    0.19%  3333KHz    No     0     0   302     0
realtime_crs    157.35    0.11%   2052Hz    No   953    15    46     0
monotonic       182.39    0.34%  1000KHz    No     0     0   999     0
monotonic_crs   183.29    0.64%  1000KHz    No     0     0   998     0
Failed to read from monotonic_raw clock (1, 4)
boottime        184.65    0.24%  1000KHz    No     0     0   998     0
process         261.79    0.55%     64Hz    No   996     3     3     0
thread          212.44    0.21%     64Hz    No   997     2     2     0
clock           264.02    0.60%     64Hz    No   996     3     3     0
getrusage       369.11    0.34%     64Hz    No   995     4     4     0
ftime           194.95    0.48%   1000Hz    No   961     0    37     0
time            146.61    0.23%      1Hz    No  1000     0     0     0


russalex commented Aug 2, 2016

Still a great bug, but unfortunately we did not have enough time to dive into this as much as I would have liked in this release. As of now, it remains on the backlog.


tycho commented Aug 3, 2016

@russalex Understood. Just wanted to post an update with the latest build. Not trying to force a change ahead of schedule. 😄


tycho commented Apr 27, 2018

It's been a while since I did an update on this issue, so let's see how things are in a very recent build of Windows (built April 20th, according to the version stamp).

It looks like clock sources have gotten much more expensive to read (2-3x?), though I'm not sure which system I originally measured on, so these results aren't an exact comparison. Then again, I'd expect this more modern system to have unfairly better results, not worse.

This system is an Intel Core i7-7820X host, running build 17655:

clockperf v1.1.0-12-gac7b528

Invariant TSC: Yes

tsc                    resolution = 3599MHz
realtime               resolution = 3333KHz
realtime_crs           resolution = 64Hz
monotonic              resolution = 3333KHz
monotonic_crs          resolution = 3333KHz
monotonic_raw          resolution = 3333KHz
boottime               resolution = 3333KHz
clock                  resolution = 1000KHz
time                   resolution = 1Hz

Name                Cost(ns)      +/-    Resol  Mono  Fail  Warp  Stal  Regr
tsc                    12.06    0.45%     ----   Yes     0     0     0     0
gettimeofday          367.32    5.95%  1000KHz    No     0     0   767     0
realtime              415.38    0.75%     ----   Yes     0     0     0     0
realtime_crs          442.56    5.91%    40KHz    No   913    20    86     0
monotonic             426.12    0.49%  1000KHz    No     0     0   367     0
monotonic_crs         534.07    6.44%  1000KHz    No     0     0   323     0
monotonic_raw         566.26    6.90%  1000KHz    No     0     0   301     0
boottime              502.51    5.94%  1000KHz    No     0     0   344     0
process               499.95    4.19%     64Hz    No   993     6     6     0
thread                461.24    4.56%     64Hz    No   994     5     5     0
clock                 576.07    7.53%     64Hz    No   992     7     7     0
getrusage             614.34    6.43%     64Hz    No   992     7     7     0
ftime                 442.92    9.14%   1000Hz    No   915     0    84     0
time                  317.52    6.47%      1Hz    No   999     0     0     0

As an aside... CLOCK_REALTIME may appear to be monotonic above, but it's probably not. The criterion clockperf uses is "did we see a higher value on every read?" But that alleged monotonicity may just be an artifact of the clock source being expensive to read. The claimed resolution (via the clock_getres API) is 3333KHz. In order to actually observe that frequency we'd need to get at least one identical read, and to do that we'd need to be able to read the clock faster than it updates. If the claimed resolution is to be believed, then the cost to read would need to be under 300ns for an identical read to happen. Right now it's far above that.

Also, none of the clock_gettime sources appear to have observable resolutions that match their claimed clock_getres values.

On the plus side, it looks like CLOCK_MONOTONIC_RAW got implemented. Hooray!


tycho commented Mar 2, 2019

Time for a pseudo-yearly update. Same host hardware as before, but now running on insider preview build number 18348.1 (19h1_release.190226-1407).

clockperf v2.0.1-6-g19cd92a

== Reported Clock Frequencies ==

tsc                    3601MHz
realtime               10MHz
realtime_crs           64Hz
monotonic              10MHz
monotonic_crs          10MHz
monotonic_raw          10MHz
boottime               10MHz
process                64Hz
thread                 64Hz
clock                  1000KHz
time                   1Hz


== Clock Behavior Tests ==

Name                Cost(ns)      +/-    Resol  Mono  Fail  Warp  Stal  Regr
tsc                    11.16    2.44%     ----   Yes     0     0     0     0
gettimeofday          554.77    0.76%  1000KHz    No     0     0   161     0
realtime              698.65    0.65%     ----   Yes     0     0     0     0
realtime_crs          650.99    0.59%   1040Hz    No   878    21   121     0
monotonic             687.00    0.63%     ----   Yes     0     0     0     0
monotonic_crs         685.16    0.56%     ----   Yes     0     0     0     0
monotonic_raw         687.93    0.62%     ----   Yes     0     0     0     0
boottime              687.83    0.61%     ----   Yes     0     0     0     0
process              1385.43    0.35%     64Hz    No   982    17    17     0
thread                693.90    0.54%     64Hz    No   991     8     8     0
clock                1387.35    0.32%     64Hz    No   982    17    17     0
getrusage            1524.88    0.67%     64Hz    No   980    19    19     0
ftime                 595.66    0.42%   1000Hz    No   886     0   112     0
time                  528.24    1.08%      1Hz    No   999     0     0     0
                      500.37  113.92%

Observations:

  • The costs to read have gotten worse again across the board. 😞
  • CLOCK_REALTIME (but not coarse variant), CLOCK_MONOTONIC (and coarse/raw variants), and CLOCK_BOOTTIME are all now 10MHz clocks (QueryPerformanceCounter under the hood, I presume?)
  • CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID are now reporting frequencies through clock_getres() (previously they were readable with clock_gettime() but their frequencies could not be queried with clock_getres() for some reason)
  • The reported frequency for the 10MHz+ clocks is observably accurate as far as we can tell right now. But this is mostly because the cost to read them is still too high -- we can only read these clocks at a rate of about 1.5MHz in the best case. We'd need to be able to read them at ~100ns cost in order to determine whether the reported 10MHz frequency is observably accurate.
  • On the plus side, the 10MHz clocks now have a sufficiently high frequency to appear monotonic to anything reading them.
  • The reported frequency for CLOCK_REALTIME_COARSE doesn't match the results. It claims (via clock_getres()) to be a 64Hz clock but it's observably ticking at 1KHz.
  • clock() isn't ticking at the correct frequency. It's observably 64Hz, but it should be ticking at the constant CLOCKS_PER_SEC value (1MHz). This is a side effect of CLOCK_PROCESS_CPUTIME_ID being too low resolution, at 64Hz (glibc uses CLOCK_PROCESS_CPUTIME_ID to implement clock() under the hood).


tycho commented Mar 4, 2019

I think that's a separate problem.

I'm not as concerned with the realtime clock as I am with the high-resolution clock sources that darn near everything uses. If a monotonic clock costs a lot to read, it will artificially hurt performance in some benchmarks (e.g. running a workload like fio with latency measurement enabled would incur massive overhead just from reading the clock). It will also hurt any real-world application that reads the clock frequently -- which is almost anything with a tick rate (e.g. game servers).

In some further testing in the past few days I noticed that getrusage() and clock() start doing really weird stuff after a while on WSL, too. I don't think they actually hooked up proper process/thread time accounting to those APIs, which could screw with some benchmarks. For instance, the openssl speed tool uses getrusage() to determine CPU time used during a run, instead of using wall clock time.


tycho commented Mar 4, 2019

This has nothing to do with timezones or the realtime clock.

@EmpiricEmpire

Yeah, silly Darthie.


h1z1 commented Apr 27, 2020

Been following this for a while. Curious @tycho what hardware were you testing on because I find some systems are abysmal wrt clock performance regardless of what is tried. That is in Linux too.


Hi-Angel commented Apr 27, 2020

Been following this for a while. Curious @tycho what hardware were you testing on because I find some systems are abysmal wrt clock performance regardless of what is tried. That is in Linux too.

I downloaded the code just a second ago, built it, and tested. Looks fine for me. This is a Dell Inspiron 5370 laptop running Arch Linux, kernel 5.6.6, libc 2.31.

[screenshot: clockperf results]

@Hi-Angel

It varies though, I guess it depends on the current clock frequency.
[screenshot: clockperf results]


h1z1 commented Apr 27, 2020

That is what I suspect in some cases - the hardware and or drivers. CPU affinity, CPU speed changes, system clocksource, etc all appear to cause instability that shouldn't be happening.


tycho commented Jun 23, 2020

This is much better under WSL2, since it doesn't need to rely on Win32 clock sources under the hood:

Name                Cost(ns)      +/-    Resol  Mono  Fail  Warp  Stal  Regr
tsc                    10.23    1.42%  1000MHz    No     0     0    58     0
gettimeofday           43.80    0.39%  1000KHz    No     0     0   999     0
realtime               47.34    1.19%    10MHz    No     0     0   246     0
realtime_crs            8.49    5.16%    100Hz    No   999     0     0     0
                        8.34   84.91%
monotonic              46.51    1.61%    10MHz    No     0     0   312     0
monotonic_crs           8.48    2.91%    100Hz    No   999     0     0     0
                        8.34   84.91%
monotonic_raw         168.67    1.06%     ----   Yes     0     0     0     0
boottime              145.65    2.05%     ----   Yes     0     0     0     0
process               326.68    1.48%     ----   Yes     0     0     0     0
thread                198.46    2.26%     ----   Yes     0     0     0     0
clock                 309.48    2.82%  1000KHz    No     0     0   952     0
getrusage             428.42    3.40%  1000KHz    No     0     0   544     0
ftime                  48.09    1.51%   1000Hz    No   990     0     9     0
time                    4.99    7.03%      1Hz    No  1000     0     0     0
$ uname -r
4.19.104-microsoft-standard

I'm going to call this resolved since WSL2 seems to be the way forward.

tycho closed this as completed Jun 23, 2020

WSLUser commented Jun 23, 2020

This is much better under WSL2, since it doesn't need to rely on Win32 clock sources under the hood

WSL2 is a lightweight VM. So, like other VMs, it gets its clock source from the host. So what you said is completely wrong. There is, however, an issue with the clock drifting, which can be corrected by running wsl --shutdown.

So what we can say is "WSL2 uses the Hyper-V clock sync in the backend to the Win32 clock sources, which is much faster and more precise than the clock fetching performed by WSL1."


tycho commented Jun 23, 2020

You misinterpreted what I said.

With WSL1, it had to map all the Linux clock APIs like gettimeofday/CLOCK_REALTIME/etc to Win32/NT clock APIs like GetTickCount/timeGetTime/GetSystemTimeAsFileTime/etc. Nothing in WSL2 is doing a layer of indirection to map Linux clock APIs into Windows APIs. It's all effectively coming straight from the TSC, with a clock divider to make the TSC a constant 10MHz instead of the hardware-native TSC frequency (this is normally done for things like VM live migration, even though that doesn't make much sense in the case of a WSL guest). Even Win32 APIs rely on the TSC or HPET internally, but my point was that the extra layer of indirection WSL1 had between userspace and the hardware clock read is eliminated by WSL2.

The clock drift/sync issues you're referring to sound more like the RTC drifting, which is a different story, and orthogonal to the high resolution clock sources. I'm also surprised to hear the RTC is drifting at all with WSL2, as I would have expected that to be handled by some Hyper-V enlightenment. The next time you see it drift, does hwclock --hctosys --verbose as root in the WSL2 guest temporarily fix it? That should also tell you how bad the drift is at the time (look for the "Calculated Hardware Clock drift is ..." line), and it should avoid the need for a full wsl --shutdown.
