Metrics: Consider introducing dedicated measure instrument for timings #464

jmacd · 2020-02-13T21:49:31Z

This text was considered for the 0.3 release, but held back. Consider this for 0.4.

### Option: Dedicated Measure instrument for timing measurements

As a language-optional feature, the API MAY support a dedicated
instrument for reporting timing measurements.  This kind of
instrument, with recommended name `Timer` (and `BoundTimer`), is
semantically equivalent to a Measure instrument.  Like the Measure
instrument, Timers support a `Record()` function.  The input value to
this instrument is in the language's conventional data type for timing
measurements.

Timer instruments MUST capture only the magnitude of the input value
(i.e., an absolute value).  When the user provides a negative value to
the Timer `Record()` function, the captured measurement is the number
stripped of its sign.

For example, in Go the API will accept a `time.Duration`, and in C++
the API will accept a `std::chrono::duration`.  These instruments
apply the correct units automatically, reducing the potential for
confusion over timing metric events.

See the comment: #430 (comment)

The text was updated successfully, but these errors were encountered:

andrewhsu · 2020-04-21T15:23:26Z

from the spec sig mtg today, sounds like this one could be suitable for v0.5 milestone

jmacd · 2020-06-29T17:10:57Z

This issue is being highlighted in a comparison of Micrometer and OpenTelemetry.
(Not surprisingly!) We really need a timing instrument.

jkwatson · 2020-06-29T17:20:30Z

I continue to think this is a good idea. It provides a very concrete instrument for doing what is likely the most common instrumentation metric recordings.

nnanda2016 · 2020-06-29T18:36:19Z

@jmacd You mentioned few design points in the Gitter channel. Should they be updated here?

jmacd · 2020-06-29T18:52:54Z

I wrote:

There are some design questions perhaps. Two basic APIs are:
(1) This is an instrument semantically like ValueRecorder that takes the builtin duration type, 
(2) this is an instrument that knows about clocks and can time you a section and metric it (which is a lot like a Span).

I think we want (1) and to tell users of (2) they might consider using a Span instead.

james-bebbington · 2020-06-29T22:50:10Z

I think we want (1) and to tell users of (2) they might consider using a Span instead.

If all we want is (1), couldn't we support an additional data type (i.e. int64, double, & duration) rather than creating a new instrument?

kenfinnigan · 2020-08-31T14:02:31Z

What's the status of this?

With respect to the comment above, I think (2) is still valid because not all spans are sampled. It's not possible to determine the latency and throughput of a method without timers being present for all executions.

jonatan-ivanov · 2021-06-01T22:11:30Z

My two cents:

Recording time values is not a first-class concept in OpenTelemetry, there is no component that supports this by default but users need to measure time on their own and record it using a Histogram.

My experience with measuring elapsed time is the following:

It is hard to do it right (see later)
It can be hard to test (see later)
Handling the base time unit can be confusing for different backends (see later)
It is the most used scenario in terms of recording metrics (please let us know if you have a different experience)

Please also notice that having a Timer can (should) also mean more than "just" recording a latency Histogram:

A Counter for the rate of the recordings (by attributes)
A Gauge for the max latency
Configurable "percentile" values (e.g.: TP99)

Why measuring elapsed time is hard?

Since there is no clock abstraction currently in OTel, it is easy for users to calculate time values incorrectly (e.g.: using wall time instead of monotonic time) so depending on how do you get the current time, you might get into various issues.

Granularity

The method that you use maybe returns the current time in milliseconds but the granularity of the time value can depend on the underlying operating system and may be larger than the base unit. For example, many operating systems measure time in units of tens of milliseconds.

Let's say your OS measures time in chunks of 100ms. In this scenario there is no way to measure anything that needs to be more precise than this. So if an operation takes somewhere between 20 and 80ms, your measurement will always be either 0 or 100ms (but this precision can be worse, e.g.: 1s).

Time can change (non-monotonic)

The method that you use to get the current time might consider:

Leap seconds: so measurements could be off when this happens
The un-uniformness of Earth's rotation (slows down and speeds up in complicated ways): so time might not always flow uniformly, measurements could be off
Adjustments due to synchronizing the system clock (Network Time Protocol (NTP)): so measurements could be really off when this happens
Daylight Saving Time (DST): off by one hour?

If you want to have some fun, read this: https://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time.

There might be other issues too but my point is: measuring elapsed time is hard (because of this languages/language-SDKs has support for measuring elapsed time).

Let me show two OpenTelemetry examples that demonstrate some of the problems above:

OpenTelemetry Python example
If you check the Python docs (time.time()), it points out both of the issues above:

... not all systems provide time with a better precision than 1 second. While this function normally returns non-decreasing values, it can return a lower value than a previous call if the system clock has been set back between the two calls.

OpenTelemetry Java example
If you check the Javadocs of System.currentTimeMillis() and the Date class, they also point out both issues above.

Why testing a timer is hard?

Because of the issues above and because of the uncertainty of the duration of operations, it can be hard to verify if your instrumentation is correct. Having a clock abstraction makes this easy since you can mock/fake the clock so that it will return what you want.

Why could the time unit be a problem?

Some metrics backends have a preference for a base time unit (e.g.: Prometheus expects times in seconds). This preference is a concern for exporting but recording should be possible in any time unit. Without having a first-class concept of a Timer, this doesn’t seem possible currently without parsing the String unit to determine if it is a time unit.

/cc: @reyang @jmacd @jkwatson @bogdandrutu

jsuereth · 2021-06-03T18:16:38Z

C++ also has std::chrono::high_resolution_clock. I'd be a fan of a Timer-specific instrument that simplifies measuring latency as I think this is a critical use case for metrics.

victlu · 2021-06-11T01:10:41Z

Per SIG meeting, here's an example for a C# implementation.

API and usage...

    public class MetricApiTimerTest
    {
        [Fact]
        public void TestTimer()
        {
            using var meter = new Meter("BasicAllTest", "0.0.1");

            var timer = meter.CreateTimer<int>("timer", "Timer example");

            var tags = new KeyValuePair<string, object?>("location", "here");

            // Example 1

            using (var mark1 = timer.Start(tags))
            {
                this.DoOperation();
            }


            // Example 2

            var mark2 = timer.Start(tags);
            this.DoOperation();
            timer.Stop(mark2);


            // Example 3

            var mark3 = timer.Start();
            this.DoOperation();
            timer.Stop(mark3, tags);
        }

        private void DoOperation()
        {
            Task.Delay(1000).Wait();
        }
    }

An implementation extending Histogram<int> instrument.

    static class MeterExtension
    {
        static public Timer<T> CreateTimer<T>(this Meter meter, string name, string? description = null) where T : struct
        {
            return new Timer<T>(meter, name, "ms", description);
        }
    }

    public sealed class Timer<T> : Instrument<int> where T : struct
    {
        Histogram<T> histogram;

        public Timer(Meter meter, string name, string? unit, string? description)
            : base(meter, name, unit, description)
        {
            this.histogram = meter.CreateHistogram<T>(name, unit, description);
        }

        public TimerMark<T> Start(params KeyValuePair<string, object?>[] tags)
        {
            return new TimerMark<T>(this, tags);
        }

        public void Stop(TimerMark<T> mark, params KeyValuePair<string, object?>[] tags)
        {
            mark.tags = tags;
            mark.Dispose();
        }

        internal void Record(TimeSpan elapsed, KeyValuePair<string, object?>[] tags)
        {
            T value;

            if (typeof(T) == typeof(int))
            {
                value = (T)(object)elapsed.TotalMilliseconds;
            }
            else if (typeof(T) == typeof(double))
            {
                value = (T)(object)elapsed.TotalMilliseconds;
            }
            else
            {
                throw new Exception("Unsupported Type");
            }

            this.histogram.Record(value, tags);
        }

        public class TimerMark<T1> : IDisposable where T1 : struct
        {
            private Timer<T1> timer;
            private Stopwatch sw = new Stopwatch();
            internal KeyValuePair<string, object?>[] tags;

            public TimerMark(Timer<T1> timer, params KeyValuePair<string, object?>[] tags)
            {
                this.timer = timer;
                this.tags = tags;
                this.sw.Start();
            }

            public void Dispose()
            {
                this.sw.Stop();
                this.timer.Record(this.sw.Elapsed, this.tags);
            }
        }

jonatan-ivanov · 2021-06-11T19:05:21Z

Per our previous SIG meeting, here's a Java example: https://github.com/jonatan-ivanov/otel-spec-gh-464
It tries to demonstrate the followings:

The user is guided while they want to use a clock (monotonic and wall time, see: Clock)
The user is guided while they want to measure time (they don't need to the recorded values themselves, see: Timer)
Testing is easy using a fake clock (see: MockClock and TimerDemo)

Please check out the code and the "tests", here's an example to use the timer and fake the clock:

Timer timer = meter.createTimer("exampleTimer", "just an example");
Timer.Sample sample = timer.start(fakeClock);
fakeClock.add(Duration.ofSeconds(1));
long recordedDuration = timer.stop(sample);

assertThat(recordedDuration).isEqualTo(Duration.ofSeconds(1).toNanos());

This is just an example, there are other arrangements to do this, this is just an example (e.g.: timer methods and how the clock is injected, histogram is not used in the example, etc.).

Aneurysm9 · 2021-06-14T15:59:34Z

Here's a possible implementation in Go:

type TimerInstrument struct {
	syncInstrument
}

type Timer struct {
	mu sync.Mutex
	instr TimerInstrument
	start time.Time
	labels []attribute.KeyValue
}

func (t TimerInstrument) Start(ctx context.Context, labels ...attribute.KeyValue) *Timer {
	return &Timer{instr: t, start: time.Now(), labels: labels}
}

func (t TimerInstrument) Record(ctx context.Context, f func(context.Context) error, labels ...attribute.KeyValue) error {
	timer := t.Start(ctx, labels...)
	err := f(ctx)
	err2 := timer.Stop(ctx)

	if err == nil && err2 != nil {
		return fmt.Errorf("error stopping timer: %w", err2)
	} else if err != nil && err2 == nil {
		return fmt.Errorf("error executing timed function: %w", err)
	} else if err != nil && err2 != nil {
		return fmt.Errorf("error executing timed function (%s) and error stopping timer (%s)", err, err2)
	}

	return nil
}

func (t *Timer) Stop(ctx context.Context, labels ...attribute.KeyValue) error {
	t.mu.Lock()
	defer t.mu.Unlock()

	if t.start.IsZero() {
		return fmt.Errorf("timer not active, unable to stop")
	}

	dur := time.Since(t.start)
	t.instr.directRecord(ctx, number.NewInt64Number(int64(dur)), append(t.labels, labels...))
	t.start = time.Time{}

	return nil
}

It seems largely similar to what @victlu and @jonatan-ivanov have proposed, though it would have the user call Stop() on the value returned from Start() rather than passing that value back to the instrument. I'm not married to this approach, this is just what I've done in the past.

I think it will be important to be able to provide attributes both when starting and stopping a timer.

I've included a Record() method that could be used to record the time consumed by a function. I'm not sure how useful it is, but there was discussion earlier about adding such a capability and this demonstrates that it should be relatively easy once Start() and Stop() are implemented.

jmacd · 2021-06-14T17:36:51Z

Here is a Go prototype as well. The thing I like about this design is that you can write a one-liner to record a duration.

https://github.com/lightstep/opentelemetry-prometheus-sidecar/blob/main/telemetry/timer.go

anuraaga · 2021-06-14T23:43:39Z

For reference, here's our current implementation of request metrics in Java

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/instrumenter/http/HttpServerMetrics.java#L154

At first I was quite sad not to have a Timer API which I would have just put into the Context directly - but since I ended up needing to propagate labels from start to end of the request, adding a long field became less of a deal. The discoverability of the API for Timer and ease of testing are of course still great reasons to have it though.

Edit: But if there were a similar class for the counter too, where add and sub don't operate on the meter but happens based on an intermediate object, then labels don't have to be propagated / grokked and that would be nice too. I think this issue may be not only about timings but about any metric that spans a request's beginning and end which includes active request counting as well.

donbourne · 2022-04-05T14:32:46Z

Hello! I'm working on the next generation of the MicroProfile Metrics spec. We're intending to allow for any metrics API to be used as the implementation of our next API. One important part of that API will be (and has been in past releases of our spec) a Timer metric. So it would be great if OTel could find a solution for this issue so that OTel metrics could be used by implementers of the next version of the MicroProfile Metrics spec.

trask · 2022-04-07T16:54:59Z

hi @donbourne! we are currently capturing "timing" metrics as histograms, e.g. http.server.duration

it's likely that a future "timer" instrument would just be a (very helpful) abstraction on top of histogram

since you would be providing your own Timer abstraction in MicroProfile metrics, would it work for you to bridge that directly to otel histogram for now?

jmacd · 2022-04-07T18:19:04Z

Agreeing with @trask -- to me the reason for a dedicated timing instrument is to abstract away details about handling clocks and/or to have a more stopwatch-like API. If you already have a timing measurement, then just use a histogram with the correct units.

donbourne · 2022-04-08T13:51:55Z

thanks @trask and @jmacd , knowing that the plan for OTel is to continue to use a histogram for Timers is useful. I realize that different metrics APIs will use different ways of representing a Timer, so it would be difficult for us to try to standardize on the exact output expected for timers. That said, using a histogram is similar in spirit to what we've done in the past for Timers (and what Micrometer can do when using a Prometheus Meter Registry).

So while I agree that having a timing instrument seems very useful from a syntax perspective, I get your point that an implementer of MicroProfile Metrics could make do with a histogram (particularly if future timing instrument will be implemented on top of a histogram).

MrAlias · 2023-02-06T17:52:25Z

Related to open-telemetry/oteps#129

EugeneKrapivin · 2023-05-21T15:57:43Z

Hey all! interesting discussion, I'm migrating the organization I work in from multiple metric strategies (Metrics.Net and app-metrics to influxDB) to a single industry standard solution, which I'm happy to embrace.

This conversation caught my eye because OTEL doesn't define a "timer", and here you are discussing a possible implementation.

I'm afraid that defining a Timer as a useful abstraction on top of a histogram might lead to a confusion with developers that are not familiar with the spec and discussion leading to it.

How would such an instrument help me to get mean, min, max and quantiles?
I'm a bit unsure regarding the use of histogram for the measurement of timed operations, since it lacks precision.

jsuereth · 2023-05-22T14:20:00Z

Histograms are designed for scenarios where you want mean, min, max and quantiles. The difference between a Histogram and other types of in-memory aggregations (like Summary/Counter) is that Histogram pushes as much data as reasonable (according to bucket boundaries) so the Quantile computation can happen server side.

Every histogram has these components:

sum of all values
count of all values
min of all values
max of all values
"bucket counts" (used to compute quantiles with sum/count/min/max to varying degrees of accuracy).

While there are many ways of calculating Quantiles, Histograms have become very popular due to better ability to handle time-alignment issues at query time over raw "gauge of quantile" type behavior we had before.

Hope that helps!

jonatan-ivanov · 2023-05-22T17:32:17Z

I'm afraid that defining a Timer as a useful abstraction on top of a histogram might lead to a confusion with developers that are not familiar with the spec and discussion leading to it.

Wouldn't defining a Histogram lead to the same confusion with developers that are not familiar with the spec? :)

To me the biggest advantage of having a Timer is the ability of measuring elapsed time. Which seems to be surprisingly hard because it seems most of the users are not get it right even though the solution is surprisingly easy in most platforms (see above or the article I linked above, though time is hard in general).

There are more advantages like testing and handling time units too, see my comment above.

fstab · 2024-04-18T09:25:26Z

For reference: The Prometheus Java client library has a TimerApi, but that's just syntactic sugar for Histogram, Summary, and Gauge. https://github.com/prometheus/client_java/blob/main/prometheus-metrics-core/src/main/java/io/prometheus/metrics/core/datapoints/TimerApi.java

jmacd added the spec:metrics Related to the specification/metrics directory label Feb 13, 2020

jmacd added this to the v0.4 milestone Feb 13, 2020

jmacd mentioned this issue Feb 13, 2020

Combined Metrics API updates for v0.3 (Add Observer instrument, remove Gauge instrument, etc.) #430

Merged

jmacd mentioned this issue Feb 25, 2020

Metrics API instrument foundation and refinements open-telemetry/oteps#88

Merged

bogdandrutu assigned jmacd Apr 7, 2020

bogdandrutu modified the milestones: v0.4, v0.5 Apr 21, 2020

carlosalberto modified the milestones: v0.5, v0.6 Jun 9, 2020

jmacd mentioned this issue Jun 24, 2020

[RFC] 3.0? yaorg/node-measured#82

Open

bogdandrutu added the area:api Cross language API specification issue label Jun 26, 2020

reyang added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Jul 10, 2020

andrewhsu added this to Required for GA, needs action. Add label:release:required-for-ga to issues and PRs when moving them to this column. in GA Spec Burndown Jul 13, 2020

bogdandrutu added the priority:p3 Lowest priority level label Jul 24, 2020

andrewhsu added release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs and removed release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Oct 20, 2020

jmacd removed their assignment Feb 4, 2021

reyang self-assigned this Mar 3, 2021

reyang modified the milestones: v0.6, Metrics API/SDK Experimental Release Mar 3, 2021

reyang added this to To do in Spec - Metrics API/SDK Feature Freeze May 6, 2021

reyang modified the milestones: Metrics API/SDK Experimental Release, Metrics API/SDK Feature Freeze Jun 10, 2021

victlu mentioned this issue Jun 14, 2021

Introduce Timer instrument open-telemetry/opentelemetry-dotnet#2082

Closed

anuraaga mentioned this issue Jun 15, 2021

Instrumentation API open-telemetry/oteps#165

Closed

reyang mentioned this issue Jul 27, 2021

Move metrics API to Feature-freeze #1833

Merged

reyang added release:after-ga Not required before GA release, and not going to work on before GA and removed release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs labels Sep 14, 2021

reyang removed this from To do in Spec - Metrics API/SDK Feature Freeze Sep 14, 2021

reyang removed this from the Metrics API/SDK Feature Freeze milestone Sep 14, 2021

reyang added this to the Metrics Future Release milestone Nov 7, 2021

reyang removed this from Required/Allowed for GA, todo. in GA Spec Burndown Nov 7, 2021

reyang removed their assignment Nov 7, 2021

jonatan-ivanov mentioned this issue Jun 15, 2023

REQUEST: New membership for jonatan-ivanov open-telemetry/community#1548

Closed

6 tasks

dyladan added triage:accepted:needs-sponsor and removed release:after-ga Not required before GA release, and not going to work on before GA priority:p3 Lowest priority level labels Mar 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metrics: Consider introducing dedicated measure instrument for timings #464

Metrics: Consider introducing dedicated measure instrument for timings #464

jmacd commented Feb 13, 2020

andrewhsu commented Apr 21, 2020

jmacd commented Jun 29, 2020

jkwatson commented Jun 29, 2020

nnanda2016 commented Jun 29, 2020

jmacd commented Jun 29, 2020 •

edited

james-bebbington commented Jun 29, 2020

kenfinnigan commented Aug 31, 2020

jonatan-ivanov commented Jun 1, 2021 •

edited

jsuereth commented Jun 3, 2021

victlu commented Jun 11, 2021 •

edited

jonatan-ivanov commented Jun 11, 2021 •

edited

Aneurysm9 commented Jun 14, 2021

jmacd commented Jun 14, 2021

anuraaga commented Jun 14, 2021 •

edited

donbourne commented Apr 5, 2022

trask commented Apr 7, 2022

jmacd commented Apr 7, 2022

donbourne commented Apr 8, 2022

MrAlias commented Feb 6, 2023

EugeneKrapivin commented May 21, 2023

jsuereth commented May 22, 2023

jonatan-ivanov commented May 22, 2023

fstab commented Apr 18, 2024

Metrics: Consider introducing dedicated measure instrument for timings #464

Metrics: Consider introducing dedicated measure instrument for timings #464

Comments

jmacd commented Feb 13, 2020

andrewhsu commented Apr 21, 2020

jmacd commented Jun 29, 2020

jkwatson commented Jun 29, 2020

nnanda2016 commented Jun 29, 2020

jmacd commented Jun 29, 2020 • edited

james-bebbington commented Jun 29, 2020

kenfinnigan commented Aug 31, 2020

jonatan-ivanov commented Jun 1, 2021 • edited

Why measuring elapsed time is hard?

Granularity

Time can change (non-monotonic)

Why testing a timer is hard?

Why could the time unit be a problem?

jsuereth commented Jun 3, 2021

victlu commented Jun 11, 2021 • edited

jonatan-ivanov commented Jun 11, 2021 • edited

Aneurysm9 commented Jun 14, 2021

jmacd commented Jun 14, 2021

anuraaga commented Jun 14, 2021 • edited

donbourne commented Apr 5, 2022

trask commented Apr 7, 2022

jmacd commented Apr 7, 2022

donbourne commented Apr 8, 2022

MrAlias commented Feb 6, 2023

EugeneKrapivin commented May 21, 2023

jsuereth commented May 22, 2023

jonatan-ivanov commented May 22, 2023

fstab commented Apr 18, 2024

jmacd commented Jun 29, 2020 •

edited

jonatan-ivanov commented Jun 1, 2021 •

edited

victlu commented Jun 11, 2021 •

edited

jonatan-ivanov commented Jun 11, 2021 •

edited

anuraaga commented Jun 14, 2021 •

edited