Making tracing SDK metrics aware #381

Open · sfriberg opened this issue Dec 10, 2019 · 23 comments

Labels: area:sdk (Related to the SDK), spec:metrics (Related to the specification/metrics directory), triage:deciding:community-feedback

Comments

@sfriberg

Making Tracing API metrics aware

One interesting aspect of OpenTelemetry is the goal to provide a tracing API, a metrics API, and later a logging API. Right now these APIs are fairly separate, so as an end-user or library owner I would need to write code to add a span and some more code to add a metric. This is no different from what I would need to do today with OpenCensus, or with OpenTracing and a metrics library (Prometheus, DropWizard, etc.), but I think OpenTelemetry could simplify this thanks to its all-in-one nature.

With a metrics API that takes a Measurement, integration with Spans feels like it would come fairly naturally.

Why

A single extension point in the code: get metrics for "free".

As an end-user I want to write as little code as possible. Instead of a couple of lines for metrics, a few more for tracing, and one for logging, getting it all done in a single place would be helpful.

Increased uptake of metrics library

Easily getting metrics from spans would be one further path by which we could increase the uptake of the metrics library. Switching from OpenCensus to OpenTelemetry is pretty much required for tracing, but metrics doesn't have the same forced migration and will face a much more mature and larger ecosystem.

Span exploration

Currently the naming of Spans is rather close to baking labels into metric names: it ends up as an explosion of names that each have a different path, instead of a single HTTP request span whose properties you can easily filter on.

Filtering and sampling of Traces and Spans using metric data

One of the more interesting longer-term aspects of this closer relationship is that metrics could be used to decide whether a span should be kept after the trace is complete (tail-based sampling). Since the metric name is known (or is the same as the span name), as well as the labels used for the metric, you could easily query a metrics database for, say, the 99th percentile over the last 24 hours and explicitly store all of those traces in addition to any randomly sampled ones.

How

Others will probably have better suggestions, but an initial idea would be for the constructor/builder pattern to simply allow generateMetric(boolean) as part of building the Span.

As for labels, I think it would be interesting to discuss whether labels should be part of the span. One could think of it as different layers of contextual data for a span: the top layer would be Resources (process-level labels), then Labels (general request labels), and then Attributes (request-unique, or high-cardinality, labels). Metrics and spans would then be exported with these as appropriate.

Naming of spans could be the same as the metric for easy association and lookup, but for backwards compatibility we could allow an Operation Name to be constructed through a templated string that reads labels, such as {{label.method}}:{{label.parameterized_url}} for an HTTP request span.
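For illustration, a minimal Java sketch of what such a builder could look like; generateMetric() and operationNameTemplate() are hypothetical methods proposed here, not part of any existing OpenTelemetry API:

```java
// Hypothetical builder extensions for illustration only.
Span span = tracer.spanBuilder("http_request")
    .setAttribute("method", "GET")
    .setAttribute("parameterized_url", "/resource/{id}")
    .generateMetric(true)  // ask the SDK to derive count/duration metrics from this span
    .operationNameTemplate("{{label.method}}:{{label.parameterized_url}}")  // backwards-compatible name
    .startSpan();
try {
    // ... handle the request ...
} finally {
    span.end();
}
```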

@jmacd
Contributor

jmacd commented Dec 10, 2019

The tracing API should not be made aware of metrics. A tracing SDK should be made configurable to support the kinds of extensions you are interested in. The API already supports what is needed in this regard -- every metrics API has a context argument, and an integrated SDK could be configured to aggregate metrics at the per-span level. Issue #259 is a placeholder for this sort of discussion while #347 is still in progress.

This topic was discussed in today's Spec SIG meeting as well. I am looking for a recording.

@jmacd jmacd added the spec:metrics Related to the specification/metrics directory label Dec 12, 2019
@sfriberg
Author

@jmacd Why do you see this more as an SDK integration than as part of the API?

Similar to Resources being shared between the two, why couldn't labels be a shared API construct in the same fashion?

My understanding of the API vs. SDK separation isn't the best either; any pointers to documentation that defines them?

@jmacd
Contributor

jmacd commented Dec 12, 2019

We have a Library Guidelines document that explains the API/SDK separation. The way I think about this is that the API should define the semantics of the operations and leave implementation to the choice of SDK. The SDK should be able to implement metrics-aware tracing and tracing-aware metrics if it pleases, so long as we set up the APIs with the necessary information crossing the API/SDK boundary and give useful semantics to the API operations.

For tracing-aware metrics, the task would be to develop an SDK that implements both tracing and metrics with the awareness that you want.

There are several natural metrics implied by every span that could be automatically instrumented: a counter instrument for the number of spans started, a measure instrument for the latency of each span, and in some cases it's useful to have a distinct counter for the number of spans finished.

Basically, I think we can accomplish the sorts of things you imagine without any new API surface. You had written about a generateMetric(boolean) method on the Span API. I'm proposing that this can be accomplished by a metrics-or-tracing configuration in the SDK. If there is enough interest, we should be able to get features of this nature into the default SDK.
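A sketch of what such a default-SDK feature could look like, assuming the interfaces of the later opentelemetry-java SDK (SpanProcessor, ReadableSpan, and the metrics API), which postdate this comment; this is an illustration of the idea, not an existing feature:

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.ReadWriteSpan;
import io.opentelemetry.sdk.trace.ReadableSpan;
import io.opentelemetry.sdk.trace.SpanProcessor;

/** Sketch of an SDK-level processor that derives the "natural" span metrics. */
public final class SpanMetricsProcessor implements SpanProcessor {
  private final LongCounter started;
  private final LongCounter finished;
  private final DoubleHistogram latency;

  public SpanMetricsProcessor(Meter meter) {
    started = meter.counterBuilder("spans.started").build();
    finished = meter.counterBuilder("spans.finished").build();
    latency = meter.histogramBuilder("span.duration").setUnit("ms").build();
  }

  @Override
  public void onStart(Context parentContext, ReadWriteSpan span) {
    // Count every span start, labeled by span name.
    started.add(1, Attributes.builder().put("span.name", span.getName()).build());
  }

  @Override
  public boolean isStartRequired() { return true; }

  @Override
  public void onEnd(ReadableSpan span) {
    Attributes labels = Attributes.builder().put("span.name", span.getName()).build();
    finished.add(1, labels);
    // Record the span's latency as a histogram measurement in milliseconds.
    latency.record(span.getLatencyNanos() / 1_000_000.0, labels);
  }

  @Override
  public boolean isEndRequired() { return true; }
}
```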

@kbrockhoff
Member

I agree this should be handled by the SDK. I would really like to see this in the default SDK.

@sfriberg
Author

sfriberg commented Jan 3, 2020

To make this feasible and "easily" implementable in the SDK, I think updating the API is required to a certain extent.

With the rather specialized span names currently used in distributed tracing, such as "GET:/resource/{id}", using a more generic name similar to a metric name would be useful. An HTTP request could have the generic span name "http_request", and from this we could automatically generate metrics as required: "http_request_duration" (histogram), "http_request_status" (counter). We could generate these through SDK configuration and specify which data to extract from a Span to include in the metric, and which labels.

For backwards compatibility we could allow a Span to have an "OperationName" format string that would be a template, "{{label.method}}:{{label.path}}", that picks up different labels from the Span to build the name. This could potentially also be just an SDK or exporter configuration so that the name is only generated when required. It would also help solve one of the issues we ran into with OpenTracing, where the operation name is set when the span is created, so to change it, contrib libraries added special hooks to allow users to change the operation name and other labels as part of span creation [1][2].

I think the other interesting API change would be to try to unify the resource/label-set/attributes that are currently used for processes/metrics/spans, and instead use a single type that can be assigned at every level. If we had a LabelSet that can handle merges, we could use it to add labels to a Resource, TraceFactory, NamedTracer, NamedMetricsProvider (not sure what the real name is), Metric, and Span. These would be merged from highest to lowest level with a defined precedence for which labels are kept or overwritten. The SDK could then define which labels to drop, keep, etc. from Spans and Metrics, and a user of the API and/or the SDK would have a single construct to learn and understand instead of four different varieties of sets. (A rough sketch of the merge idea follows the links below.)

1 - https://github.com/opentracing-contrib/java-web-servlet-filter/blob/master/opentracing-web-servlet-filter/src/main/java/io/opentracing/contrib/web/servlet/filter/ServletFilterSpanDecorator.java
2 - https://github.com/opentracing-contrib/java-jaxrs/blob/master/opentracing-jaxrs2/src/main/java/io/opentracing/contrib/jaxrs2/client/ClientSpanDecorator.java
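To make the merge idea concrete, here is a rough Java sketch using a hypothetical LabelSet type with a simple lowest-level-wins precedence; nothing here corresponds to an existing OpenTelemetry construct:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical unified label container shared by resources, metrics, and spans. */
public final class LabelSet {
  private final Map<String, String> labels = new LinkedHashMap<>();

  public LabelSet put(String key, String value) {
    labels.put(key, value);
    return this;
  }

  /** Merge another LabelSet on top of this one; the argument wins on key conflicts. */
  public LabelSet merge(LabelSet overrides) {
    LabelSet merged = new LabelSet();
    merged.labels.putAll(this.labels);
    merged.labels.putAll(overrides.labels);
    return merged;
  }
}

// Usage: merged from highest to lowest level, so Span attributes override
// Tracer labels, which override Resource labels.
// LabelSet effective = resourceLabels.merge(tracerLabels).merge(spanAttributes);
```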

@iNikem
Contributor

iNikem commented May 25, 2020

@sfriberg Do I understand your proposal (or at least part of it) correctly:

  • Span names should be low-cardinality, e.g. http_request, and suitable for use as metric names
  • Spans should have a separate attribute, called OperationName, that can be dynamically composed from other attributes

If yes, then below are my objections :)
First, Otel Collector already allows for span name composition based on arbitrary labels.
Second, if current span names are too high-cardinality for your metrics, why can't you use other attributes/labels for metric name generation? Both of the proposals above seem to me like pushing already existing functionality lower in the stack, to the actual data generation. I would argue that the collector and/or backend are better places for it. As it is currently :)

@jmacd jmacd changed the title Making tracing API metrics aware Making tracing SDK metrics aware May 28, 2020
@sfriberg
Author

Thanks for the feedback @iNikem. Yes, that resembles what I'm looking at.

First, Otel Collector already allows for span name composition based on arbitrary labels.
Didn't know this, do you have a pointer?

The main reason is to make it easier for developers and users of the metric to know what to look for, and to have some control over what gets collected. A span might be sampled away and never make it to the backend, so getting a completely accurate metric won't be possible unless you have a metric as well. So yes, in part it is pushing functionality further down the stack, but mainly to make implementation easier for a developer: instead of adding a span plus two counters (total requests, total duration), those would be created automatically with an easily found name.

@sfriberg
Author

@iNikem Thanks, had missed that. Is this something we aim to do purely in the collector, or is the intent to support it at all levels via SpanProcessors?

Having this capability feels like it would further warrant a basic naming scheme on the generation side, so that you could easily write a rule for the specific span you are interested in using its name. If span names are complex, filtering and renaming get harder.

@jmacd
Contributor

jmacd commented May 29, 2020

I think this issue is straying onto a new topic. There seems to be a point of contention about span naming requirements in order to facilitate generating metrics from spans. I 100% agree, but I don't think we should discuss it in this ticket because it's a large topic.

Responding to a misc point above: it should be possible to either:
(1) Synthesize metrics on the client from a span, even when a span is sampled. You'd have to fire your metrics from inside the Sampler API, which has not been proven out.
(2) Synthesize metrics from sampled spans on the server. Much like the original statsd metric interface supports a "rate" parameter, generating metrics from sampled spans would apply the adjusted count of the sampled span as a multiplier on the metric event. (See e.g. the discussion in open-telemetry/oteps#107.)
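To illustrate option (2), a small sketch of the adjusted-count weighting, assuming the sampling probability is carried with the span (e.g. as a sampler-set attribute; that convention is an assumption here, not an established API):

```java
// Adjusted count: how many original spans a sampled span represents.
// A span kept with probability 0.25 contributes 4 to a counter derived from it,
// analogous to statsd's rate parameter (count += 1 / rate).
static long adjustedCount(double samplingProbability) {
    return Math.round(1.0 / samplingProbability);
}

// Example: requestCounter.add(adjustedCount(0.25), labels);  // adds 4
```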

@reyang reyang added the area:sdk Related to the SDK label Jun 30, 2020
@reyang reyang added question Question for discussion release:after-ga Not required before GA release, and not going to work on before GA labels Jul 10, 2020
@pauldraper

pauldraper commented Aug 6, 2020

For backwards compatibility we could allow a Span to have a "OperationName" format string that would be a template string, "{{label.method}}:{{label.path}}",

FWIW, this is very similar to what I have done for span naming when using Datadog as a backend.

#531 (comment)

open-telemetry/opentelemetry-js#1316 (comment)

Datadog has the concept of a generic "operation" (e.g. http.request) and an application-defined "resource" (GET /users/{id}).

In terms of metrics, everything with the same operation can be sensibly aggregated (you can aggregate over all http.request) and the resource becomes a tag for further filtering/grouping.

I've been naming spans like http.request:GET /users/{id} and sqs.dequeue:my-example, and then processing them for tracing and metrics.
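For what it's worth, a tiny sketch of splitting such names back into an aggregatable operation and a resource tag; the ':' separator is just the naming convention described above, not anything specified by OpenTelemetry:

```java
// Split "http.request:GET /users/{id}" into {"http.request", "GET /users/{id}"}.
static String[] splitSpanName(String spanName) {
    int sep = spanName.indexOf(':');
    if (sep < 0) {
        return new String[] {spanName, ""};  // no resource part
    }
    return new String[] {spanName.substring(0, sep), spanName.substring(sep + 1)};
}
```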

@pauldraper

pauldraper commented Aug 28, 2020

Second, if current span names are too high cardinality for your metrics, why cannot you use any other attributes/labels for metric name generation?

It's not exactly that the current spans are too high cardinality, like too many to fit in a dropdown or a list.

It's that it would be useful to aggregate them and compare them against other spans.

http.request:GET_/accounts/{id} and http.request:GET_/users/{id} are fine low-cardinality names for spans, but in many contexts it would be useful to ask "how many http.request:* have http.status:500"? I.e. you want to combine them.

This is exactly what Datadog does with span operation (http.request) and span resource (GET_/users/{id}). And in fact, it generates metrics automatically. The metric name is the span operation and the resource metric tag is the span resource.

Both metric names and metric tags are low-cardinality; a metric name is not only low-cardinality, it means that everything with that name can be sensibly aggregated (e.g. HTTP requests).

@pauldraper

pauldraper commented Aug 28, 2020

At the end of the day, the important question is:

What should instrumenters be doing?

After instrumenting libraries, it certainly feels like there is substantial duplication. E.g. I'm instrumenting some RPC library:

  • I write metrics for how long the exchange takes, how large the request was, and how large the response was, tagged by RPC operation.
  • I write logs for each RPC operation including how long the exchange takes, how large the request was, how large the response was, and the destination.
  • I write traces named by the RPC operation, with attributes for how large the request was, how large the response was, and the destination.

Should instrumenters be writing to tracing, metrics, and log APIs?

Or should they be writing to the highest-order API (traces) and expecting that SDKs will pull metrics from there?

(Note that metrics are least common denominator; it's quite realistic for a consumer to have a metrics backend, but not an actual tracing backend.)

This issue really deserves some attention; the entire premise of OTel is that instrumentations and backends can be developed in isolation, and this is a very large practical concern for anyone who has actually tried to write a portable instrumentation.

@justinfoote
Member

@pauldraper, I 100% agree.

My answer to the question

Should instrumenters be writing to tracing, metrics, and log APIs?

Is a definite no.
Though I'm not sure I agree with elevating the Trace API to "highest-order" status. Perhaps there's room for a new timing abstraction above all of these.

We discussed this briefly in the metrics SIG last week -- the idea that we may want a way to generate metrics from Spans automatically. This is complicated by the internals of some tracing SDKs, which substitute NoOpSpans when a trace is not sampled.

@pauldraper

pauldraper commented Sep 3, 2020

Perhaps there's room for a new timing abstraction above all of these.

the idea that we may want a way to generate metrics from spans automatically.

Until this is resolved sometime in 2022, I've built a little abstraction that registers metadata about span name and attributes. These are then used to generate several metrics with attributes for each span (currently I use it as a utility at the instrumentation API level):

  1. Convert the span name into a metric name and metric attribute(s). E.g. http.request GET /users/{id} becomes the name http.request and the attribute http.route:GET /users/{id}. That metric name is used to create a count metric and a duration metric.
  2. Add select (numeric) span attributes as metrics. E.g. http.request.size:1059 becomes a metric named http.request.size (or http.message.size with attribute http.message.type:request).
  3. Add select (low-cardinality) span attributes as metric attributes. E.g. http.method:GET is added as http.method:GET.

The downside is that this approach requires a good bit of metadata beyond just the span. The upside is that it generates some high-quality metrics.
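A condensed, hypothetical version of that conversion is sketched below; the helper names and attribute keys are illustrative, not the actual utility:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative conversion of one ended span into metric events. */
final class SpanToMetrics {
  static void record(String spanName, String route, double durationMs,
                     Map<String, Number> numericAttrs, Map<String, String> lowCardAttrs) {
    // 3. Selected low-cardinality span attributes become metric attributes.
    Map<String, String> metricAttrs = new HashMap<>(lowCardAttrs);
    metricAttrs.put("http.route", route);                    // e.g. "GET /users/{id}"
    // 1. The span name becomes a count metric and a duration metric.
    emitCounter(spanName + ".count", 1, metricAttrs);        // e.g. http.request.count
    emitHistogram(spanName + ".duration", durationMs, metricAttrs);
    // 2. Selected numeric span attributes become their own metrics.
    numericAttrs.forEach((name, value) ->                    // e.g. http.request.size
        emitHistogram(name, value.doubleValue(), metricAttrs));
  }

  // Hypothetical sinks; in practice these would call a metrics API.
  static void emitCounter(String name, long delta, Map<String, String> attrs) { /* ... */ }
  static void emitHistogram(String name, double value, Map<String, String> attrs) { /* ... */ }
}
```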

@as-polyakov

as-polyakov commented Sep 8, 2020

Until this is resolved sometime in 2022, I've built a little abstraction that registers metadata about span name and attributes. These are then used to generate several metrics with attributes for each span (currently I use it as a utility at the instrumentation API level)

We are doing a similar thing. But apart from generating metrics from spans, we also allow all metrics to be correlated to spans (not to be confused with exemplars). Our idea is this: spans and metrics are fundamentally semantically linked. Most (synchronous) metrics are emitted during some logical unit of work covered by a span. We use the semantic term "event" for the common abstraction of a logical unit of work. An event is basically a span with some predefined metadata (like type - "http.request", "db.query" - name, userId, etc.) and parent-child connections (just like a span) via eventId/parentEventId. Technically, an event can be boiled down to a subclass of a Span. All metrics, spans, and logs are logically associated with some event, and events are chained in a causal relationship which is propagated across service boundaries with CorrelationContext. Hence metrics now have a causal parent-child connection to each other via eventId/parentEventId, plus a well-known set of metadata associated with each metric.

What this gives us: the ability to query metrics in relation to each other, plus a set of common metadata always associated with each piece of telemetry. Let's talk about the metrics relation. In our case we can have ServiceA and ServiceB both calling ServiceC. We see ServiceA's latency metric spike because of ServiceC latency. If we just look at ServiceC's p90 latency it might not show much, since it aggregates across all calls from ServiceA and ServiceB and thus may be well within range. With a parent-child relationship on metrics, however, we can correlate exactly which ServiceC latency caused ServiceA's latency spike. Something like:
select metrics where the event ID's parent event is from ServiceA AND where its parent event's latency is over 400ms

Implementation-wise, we are experimenting with changing the SDK's metric processors to always tag metrics with the current event's attributes, pulled from Event.current(), similar to how Span.current() works.
Similar to a span, we start the event first; then each time a metric is recorded, we pull the current event from thread-local storage and serialize it as a set of key-value tags, which are appended to the metric's tags.
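A rough sketch of that thread-local mechanism, with a hypothetical Event class modeled on Span.current(); none of this is existing OpenTelemetry API:

```java
/** Hypothetical per-thread event, analogous to Span.current(). */
final class Event {
  private static final ThreadLocal<Event> CURRENT = new ThreadLocal<>();

  final String eventId;
  final String parentEventId;
  final String type;  // e.g. "http.request", "db.query"

  Event(String eventId, String parentEventId, String type) {
    this.eventId = eventId;
    this.parentEventId = parentEventId;
    this.type = type;
  }

  static Event current() { return CURRENT.get(); }
  static void makeCurrent(Event event) { CURRENT.set(event); }
}

// In a metric processor: append the current event's metadata as metric tags.
// Event e = Event.current();
// if (e != null) {
//   tags.put("event.id", e.eventId);
//   tags.put("event.parent_id", e.parentEventId);
//   tags.put("event.type", e.type);
// }
```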

I am curious to hear if the folks here thought about similar problems and how they approach it.

@Oberon00
Member

Oberon00 commented Sep 15, 2020

Our idea is this - spans and metrics are fundamentally semantically linked.

It sounds to me like you have more or less re-implemented what Context is supposed to solve (but doesn't currently). Please take a look at #510 and #875.

@as-polyakov

Our idea is this - spans and metrics are fundamentally semantically linked.

It sounds to me like you have more or-less re-implemented what Context is supposed to solve (but doesn't currently). Please take a look at #510 and #875.

Thanks for the links, Christian. It seems like having Context as a Span parent solves at least half of the problem - now our "event" (context) is present in each Span. But in order to get metrics and spans to share the same context, don't we need a similar parent relationship between Metric and Context, i.e. each emitted metric also has a Context associated with it? Or did I get it wrong?
Ultimately, on my receiving end I would need to pull metrics, pull spans, and have them correlated through some common context. AFAIK, for metrics the only medium I can use to transmit this information is labels (tags).
From what I see currently:

@pauldraper

pauldraper commented Oct 5, 2020

each emitted metric also has Context associated? Or did I get it wrong?

Yes, that's correct.

For languages with automatic context propagation like Java, the context would usually be the current one.

@pauldraper

pauldraper commented Dec 23, 2020

I'm not sure how using Context solves the core issue here.

I wanted my metrics on my spans (or my span attributes in my metrics).

Should instrumenters be writing to tracing, metrics, and log APIs?

Is a definite no.

This is the core concern: should I be instrumenting code with all three APIs?

@pauldraper

There are two opposing opinions expressed here:

  1. @jmacd believes the span API should be kept separate from the metrics API. A library instrumented for metrics and tracing must instrument both separately.

  2. @justinfoote believes that instrumenting with the span API should automatically create metrics as well. That implicitly combines the tracing and metrics APIs.

(I also fear there is a third poorly conceived option floating around, where the consumer of the library has to configure a bunch of SDK stuff to convert spans to metrics, select tags, etc. That is absolutely bad; the library should be able to report metrics without voluminous end-user configuration.)

@hwinkel

hwinkel commented Sep 11, 2023

As far as I remember, tracing could or even should trace only a certain percentage or subset of function calls or requests, to reduce the tracing resource impact that might slow systems down, whereas metrics are typically collected all the time.

Is this assumption still valid these days?

@pauldraper

That is correct.

@trask trask added triage:deciding:community-feedback and removed question Question for discussion release:after-ga Not required before GA release, and not going to work on before GA labels Apr 2, 2024