Skip to content
This repository has been archived by the owner on Jul 1, 2022. It is now read-only.

Side effects of imprecise clock #192

Closed
mabn opened this issue May 31, 2017 · 21 comments
Closed

Side effects of imprecise clock #192

mabn opened this issue May 31, 2017 · 21 comments

Comments

@mabn
Copy link
Contributor

mabn commented May 31, 2017

I found an interesting problem that took some time to track down.

Basically spans in the jaeger UI were "centered" in a weird way. Initially I thought that it's some UI issue, but after comparing the json returned by jaeger-query with cassandra it turned out that jaeger-query completely messes span start times. And by completely I mean that span taking 1ms was shifted by 3ms. Here's how it looks like:

Alt text

In this example "child 2" should happen as the last span inside the "parent" - should be stuck to the right edge of the parent. There's no delay because all those spans are reported by the same process.

After digging through the jaeger code I found ClockSkew and there are two problems here:

  1. ClockSkew adjust span start times if they overlap with their parent. It does it only if parent and child are reported by different hosts, but the host is taken from "ip" tag of the process. If it's missing the default is to assume that these are different hosts.

  2. jaeger-client-java by default uses SystemClock which has millisecond precision, but timestamps reported to jaeger are in microseconds. To convert the time to micros it multiplies it by 1000:

    public long currentTimeMicros() {  return System.currentTimeMillis() * 1000; }

    One can think of it as rounding down exact start time to milliseconds.

    Span duration on the other hand is calculated with more precise System.nanoTime.
    As a result of this rounding it's possible that:

    round(parent.start) + parent.duration < round(child.start) + child.duration
    

    for children which end at the same time as the parent it happens very often.

So - child span end time exceeds parent's finish time by between 0 and 1000 micros which in combination with missing process "ip" tag triggers clock skew adjuster. And for very long parents and very short children the adjustment is drastic (around 50% of parent duration).

There are two options:

  1. change the default in jaeger-query
  2. better clock implementation in jaeger-client-java
  3. add "ip" process tag by default

I guess that the default make sense so better clock implementation would be a good improvement.

@yurishkuro
Copy link
Member

Nice catch.

Are you using java client that reports jaeger-thrift model (recently released)? It should not have the problem with "ip" tag.

The behavior of clock skew adjustment aside, the real issue seems to be the difference in precision between startTime and duration. Unfortunately, I don't know of a good way to get accurate startTime precision. The System.nanoSeconds() we're using for duration is not an offset from epoch, but from some random point in time, so not suitable for startTime.

@mabn
Copy link
Contributor Author

mabn commented Jun 1, 2017

Yes, I use the thrift Sender - com.uber.jaeger.senders.UdpSender. But it doesn't add "ip" tag. It probably should be added inside Tracer constructor - ip is obtained there, but added to the tags.

As for the clock - I'm not sure. I'd consider keeping offset between currentTimeMillis and nanoTime and use nanoTime + offset, but the offset is not fixed - ntpd adjusts the clock, there might be leap seconds. So it should be updated from time to time... It might be tricky to get it right.

@yurishkuro
Copy link
Member

Yes, I use the thrift Sender

I mean v0.19.0 of the client - it sends jaeger.thrift instead of zipkin.thrift, where the tracer tags are sent once, but applied to each span on the backend.

@mabn
Copy link
Contributor Author

mabn commented Jun 1, 2017

Yes, this one.

@yurishkuro
Copy link
Member

@mabn is it actually necessary to adjust the offset? The main requirements for the trace are:

  • startTime + duration == endTime
  • For spans started in the same process at true time t1 and t2 we have startTime1 - startTime2 == t1 - t2

It does not matter if both startTime and endTime are "off" by some delta, clocks across processes are skewed anyway, and the ClockSkewAdjuster is expected to fix those. As long as nanoTime() is monotonic, it doesn't even seem to matter if it's consistently faster or slower than the wall clock time, since nanoTime() is defined as the correct way for measuring elapsed time.

So would the following work?

// on tracer initialization, capture
initWallclock = currentTimeMillis()
initElapsed = nanoTime()

// when starting the span
startTimeNanos = nanoTime()
startTimeMicros = initWallclock * 1000 + (startTimeNanos - initElapsed) / 1000

// when finishing the span
durationMicros = (nanoTime() - startTimeNanos) / 1000

cc @sul4bh @pavolloffay

@mabn
Copy link
Contributor Author

mabn commented Jun 3, 2017

That approach comes to mind, but I'm not sure if it works. My concern is that clock drifts and is adjusted from time to time - e.g. by nptd. The rate of drift changes - e.g it depends on the temperature but in the cloud environment there are additional factors caused by virtualization. There are also edge cases like leap seconds.

The approach you mention basically records the offset between nanoTime() and currentTimeMillis() at Tracer initialization and uses it later, so it would be fine if the offset didn't change.

On linux nanoTime() uses clock_gettime with CLOCK_MONOTONIC. Additionally CLOCK_MONOTONIC is adjusted by NTP so it seems that the offset between nanoTime() and currentTimeMillis() should be more or less constant on linux with NTP configured to use adjtime (not sure about ntp_adjtime discipline though)

It would be good to test it, but for example POSIX allows CLOCK_MONOTONIC frequency to vary by up to 500ppm (8ms/day). If the offset is not constant it might lead to significant inaccuracies in start_time for long running processes - which is the normal case for services.

So it gets tricky and OS-specific.

I'd rather see one of following solutions:

  • Calculate the offset inside Tracer.extract() and store it inside Span or SpanContext. Pass it to child spans. Use nanoTime + offset to calculate start_time. This will ensure that inside a single trace within a single process the start times are accurate with nanosecond precision. The drawback is that the offset has to be passed around and is no longer encapsulated inside Clock.
  • Create Clock implementation which updates offset periodically - e.g. once every 30 seconds. If process spends usually significantly less than 30 seconds on processing a single trace then it will be rare for the trace to be affected by offset change. But it will happen and the jumps will be between -1ms and +1ms (assuming that currentTimeMillis has 1ms accuracy, which is not the case on some systems - e.g. on windows xp). The jumps aren't a big issue because JVM pauses happen anyway. If it ensured that time cannot go backwards it would be a pretty good solution. Big plus - it fits the current Clock interface.

I also remember some discussion about implementing better time source in some opentracing-related project, but I wasn't able to find it.

@yurishkuro
Copy link
Member

Additionally CLOCK_MONOTONIC is adjusted by NTP

yes, but it still remains monotonic, the adjustments are probably done as smearing of the delta over some larger time interval, making the change unnoticeable to the application.

It would be good to test it, but for example POSIX allows CLOCK_MONOTONIC frequency to vary by up to 500ppm (8ms/day). If the offset is not constant it might lead to significant inaccuracies in start_time for long running processes - which is the normal case for services.

Now sure how big of a problem this is. In some cases the system's HW timer can be so out of whack that measuring precise elapsed time for tracing is probably not the biggest problem. And for long running processes, 8ms error per day is nothing, and again we'd expect the system to have a proper NTP configuration to compensate for the frequency drift: "This clock's frequency might be adjusted in a PLL control loop once an external reference (NTP, GPS, etc.) has been available long enough to measure the ±500 ppm frequency error and instability of typical motherboard oscillators."

So my preference would be to start with this simplified approach. Of the two other approaches you listed, I think the first one is doable, but certainly more complicated, while the second may still result in unaccounted jumps and bad timing for short spans.

@yurishkuro
Copy link
Member

btw, thanks for the references.

@pavolloffay
Copy link
Member

@yurishkuro
Copy link
Member

@mabn
Copy link
Contributor Author

mabn commented Jul 15, 2017

I did this:
https://gist.github.com/mabn/9587658d78d730917559cd61a274ea4f
Seems to work so far. I haven't seen it logging warnings yet.

@yurishkuro
Copy link
Member

@mabn we're thinking of going with your option 1 where the millis wallclock timestamp is captured in the SpanContext and all other timestamps are calculated as offsets using nanos() (e.g. jaegertracing/jaeger-client-node#122 (comment))

Your gist is a simpler version, which might be good enough. I wouldn't expect it to log the differences unless your machine's HW timer is seriously faster or slower than a true clock AND you have ntpd adjustments happening.

@mabn
Copy link
Contributor Author

mabn commented Aug 1, 2017

I'm pretty sure that a leap second will make my clock 1-second off until application restart. Same with any manual (non-ntpd) adjustments.

@jpkrohling
Copy link
Collaborator

Doesn't the code in the Gist require Java 8? Or is it about a different clock?

https://docs.oracle.com/javase/8/docs/api/java/time/Clock.html

I'd be curious to see if this behavior would also be present for recent JVMs (8+), as date/time manipulation is really better and more precise with the new APIs.

@olivercf
Copy link

olivercf commented May 3, 2018

Is this being worked on? I seem to be experiencing this rather severely:
screenshot from 2018-05-03 19-14-25
(top blue span is the parent, all other blue spans are immediate children that happen consecutively in a loop)

@mabn
Copy link
Contributor Author

mabn commented May 7, 2018

@jpkrohling It's not JDK8 Clock, it's io.jaegertracing.utils.Clock one.

Yes, this behaviour it's present in JDK8 and it will also be present in newer ones because io.jaegertracing.utils.SystemClock uses System.currentTimeMillis.

On JDK9+ it might be a good idea to simply use system clock - it has now sufficient precision due to https://bugs.openjdk.java.net/browse/JDK-8068730 - Increase the precision of the implementation of java.time.Clock.systemUTC().

@olivercf
Copy link

olivercf commented May 8, 2018

Is there a workaround involving setting certain tags? I'm manually setting the process "ip" tag so everything has the same IP (it is all running in docker on the same machine), but still seem to be experiencing the problem.

screenshot from 2018-05-08 10-32-24

screenshot from 2018-05-08 10-32-55

@yurishkuro
Copy link
Member

@olivercf this appears to be a separate, off-topic question. How is it related to the clocks? If it's not, please move to another issue.

@olivercf
Copy link

olivercf commented May 8, 2018

@yurishkuro The start times appear to snapping/rounding to incorrect times as described in the initial message in this issue:

So - child span end time exceeds parent's finish time by between 0 and 1000 micros which in combination with missing process "ip" tag triggers clock skew adjuster. And for very long parents and very short children the adjustment is drastic (around 50% of parent duration).

As you can see from my first post, some of the children are 0ms, and the parent is relatively long. All of the children should be contained within the parent in my example, as the parent is a for loop and the children are each iteration, however it seems all but the first child has been snapped to a particular start time, resulting in a completely incorrect graph. You can see some children's finish times are after the parent's. As they happen consecutively, you'd expect each child's start time to be the finish time of the last.

The first post mentions that if the "ip" tag is the same, the clock skew adjuster shouldn't kick in, but it looks like it does in my case. It seems that is relevant information for this issue, and I'm also wondering if there is a workaround for the meantime until it is fixed.

@mabn
Copy link
Contributor Author

mabn commented May 8, 2018

@olivercf what you are experiencing is because your spans are very short (sub 1ms) but Java system clock has 1ms precision so the start time is truncated and they all seem to start at the same time. The workaround I use is this clock implementation:
https://gist.github.com/mabn/9587658d78d730917559cd61a274ea4f
use it when creating a Tracer instance and it should solve your issue.

@olivercf
Copy link

olivercf commented May 8, 2018

@mabn Thanks, that worked.
screenshot from 2018-05-08 16-15-00

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants