
[exporter/datadog] Semantic conventions for spans #1909

Open
cemo opened this issue Dec 26, 2020 · 50 comments
Labels
bug Something isn't working data:traces Trace related issues exporter/datadog Datadog components never stale Issues marked with this label will be never staled and automatically removed priority:p2 Medium release:allowed-for-ga

Comments

@cemo

cemo commented Dec 26, 2020

Describe the bug
The current way of displaying spans does not conform to the spec.

image

Steps to reproduce
Use opentelemetry-java-instrumentation and opentelemetry-collector-contrib

What did you expect to see?
I am expecting to see span names conforming to the spec, such as:

  • HTTP GET/POST/etc
  • CREATE TABLE sync_test.users ( id UUID PRIMARY KEY, name text )
  • example.Greeter/Conversation

What did you see instead?

  • An unreadable span name: io.opentelemetry.java.server.SPAN_KIND_xxx

Additional context

See conventions:

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/semantic_conventions/http.md

@ericmustin
Contributor

Thanks for filing this. We recently merged work to improve Datadog span operation and span resource naming here: #1861

The semantic conventions you're referring to are OpenTelemetry Semantic Conventions, which differ from the current format of Datadog traces. In our exporter we attempt to retain all of this information in a way that fits within the Datadog tracing format. Span Operation Names need to be a combination of Instrumentation Library + Span Kind in order to ensure that trace related metrics are unique to the IL+Kind.

The information you're referring to

HTTP GET/POST/etc
CREATE TABLE sync_test.users ( id UUID PRIMARY KEY, name text )
example.Greeter/Conversation

Should be available via the Span Resource. I've highlighted where this is available in the screenshot you provided, here: screenshot.

We have plans this quarter to continue improving the translation of OpenTelemetry spans to Datadog spans, and can add to this issue when there are future PRs in this space. That being said, at this time I do not believe there are immediate plans to change the format of the span operation from Instrumentation Library + Span Kind.

Hope that helps!

@cemo
Author

cemo commented Jan 5, 2021

Thanks @ericmustin for the clarification. Would you accept an optional parameter in the collector for OTel-compatible naming?

@ericmustin
Contributor

@cemo I think that's a thoughtful suggestion, but I'm just trying to understand the use case here. The OTel span name != the Datadog span operation name, and setting the operation name to the OTel span name would create a number of issues in the Datadog tracing UI and backend. The OTel span name is equivalent to a Datadog span resource, however.

Perhaps the root cause here is that the length of instrumentation library + span kind in the UI is hard to read? We could perhaps shorten these further, or expose a mapping configuration option to allow a user to convert any span operation name to a more readable string? So, for example, io.opentelemetry.javaagent.spring_webmvc.span_kind_internal could be converted to something shorter like spring_webmvc.internal? Would that be helpful? We have something similar to this for tag filtering in the datadog-agent and could expose a similar format here.
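For reference, a mapping option along these lines later shipped in the exporter as traces::span_name_remappings (it appears further down in this thread). A minimal collector configuration sketch, with a placeholder API key and an illustrative mapping, could look like:

    exporters:
      datadog:
        api:
          key: ${DD_API_KEY}  # placeholder, typically injected from the environment
        traces:
          span_name_remappings:
            # illustrative: map a long IL + span-kind operation name to a shorter one
            io.opentelemetry.javaagent.spring_webmvc.span_kind_internal: spring_webmvc.internal

Each remapping is an exact match against the operation name the exporter actually produces (after its own sanitization), which is why the kind suffix matters, as later comments in this thread discovered.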

@cemo
Author

cemo commented Jan 5, 2021

Hi @ericmustin, I am just trying to have more readable spans.

  1. I would definitely be happy to see something like spring_webmvc.internal.
  2. I would like to share a sample from Datadog's newsletter. :)

image

I think this is more similar to the OpenTelemetry conventions. The current situation is simply unreadable on our side.

  3. Can you please explain why providing uniqueness is necessary?

Span Operation Names need to be a combination of Instrumentation Library + Span Kind in order to ensure that trace related metrics are unique to the IL+Kind.

@ericmustin
Contributor

  1. @cemo Ok, that's good to hear. I do want to work to expose a way to make this more readable, both for your use case and generally.

  2. The reference you're pointing to (Datadog's own instrumentation) is a good example of how we'd prefer the data be formatted, and is how Datadog-specific tracing SDKs format things (dd-trace-java, for example). Unfortunately, in OpenTelemetry, we have to work with the information that the OpenTelemetry SDKs make available to us. In your case, we'd love to capture only the instrumented library (mongo, spring, redis, etc.), but all we have available to us is the name of the instrumentation library, i.e., the package that applies instrumentation to the library in question. See: https://github.com/open-telemetry/opentelemetry-java/blob/master/exporters/otlp/trace/src/main/java/io/opentelemetry/exporter/otlp/trace/InstrumentationLibraryMarshaler.java#L18-#L19 . So we have to work with what we have, which is the relatively ugly "full instrumentation library name", and it's nearly impossible to add heuristics that could mutate io.opentelemetry.javaagent.spring_webmvc.span_kind_internal => spring_webmvc.internal automatically for all languages and instrumentation library names, as the conventions vary from language to language. I think the best solution here is to expose an optional 'mapping' configuration which lets you match and replace span operation names with more preferable ones, based on a regex.

  3. The reason we need span operation names to be unique to instrumentation library + span kind is that we generate trace metrics for hits, latency, and errors per operation name (tagged with service, resource, and env). So, for example, if there were two database clients within the same application (a redis client and a mongodb client), and we only had a span operation name of client and not redis_instrumentation_library.client and mongodb_instrumentation_library.client, then the metrics for hits/errors/latency for these two database clients would be combined and seem inaccurate, or wouldn't allow a user to determine which database client was having issues.

@ericmustin
Contributor

@cemo one thought I had: in the interim, if the instrumentation_library.name is not being actively used in other parts of your pipeline, you could apply a resource processor that modifies the instrumentation library names to your liking.

If this is something you'd be open to, I could attempt to suggest a configuration to use.
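A hedged sketch of one way to do this with today's collector (not necessarily the resource-processor configuration Eric had in mind): the transform processor can rewrite instrumentation scope names directly, assuming the scope name shown here is the one your SDK actually reports.

    processors:
      transform:
        trace_statements:
          - context: scope
            statements:
              # illustrative rename of a single instrumentation scope name
              - set(name, "spring_webmvc") where name == "io.opentelemetry.javaagent.spring_webmvc"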

@cemo
Author

cemo commented Jan 5, 2021

Thank you so much for the detailed explanation @ericmustin. I will add a processor to mitigate the issue; however, what I actually want is an experience similar to the native Datadog agent's. As you can see in the subsection of the previous Datadog example, HTTP traces are named with a specific part for their endpoints, such as /users.

image

  1. Please correct me if I am wrong, but this requires all kinds of spans to be translated into the Datadog format, since there is no unique way to translate each OpenTelemetry span to a Datadog span?
  2. Maybe for the most common spans, such as Redis, Mongo, and MySQL for databases, one-to-one custom mappings could be considered? I mean, for example, the Datadog exporter could have a special rule to change MySQL spans into Datadog-compatible spans. I know there may be many custom spans too, but at least addressing 15-20 kinds of spans will solve 90% of the problem. What do you think?

@mx-psi
Member

mx-psi commented Jan 7, 2021

Hi @andrewhsu, can you assign @ericmustin instead? Traces issues and feature requests are generally handled by Eric.

@ericmustin
Contributor

@cemo

I know there may be many custom spans too, but at least addressing 15-20 kinds of spans will solve 90% of the problem. What do you think?

Yes, I think this is a good approach and should help the broad majority of users. I'll try to update this thread when there's a PR up; I am not sure what the timeline is on that at this time.

dyladan referenced this issue in dynatrace-oss-contrib/opentelemetry-collector-contrib Jan 29, 2021
…ver (#2089)

* Fix the scraper/discover manager coordination on the Prometheus receiver

The receiver contains various unnecessary sections. Rewriting the
receiver's Start for better maintainability.

Related to #1909.

* Use the background context

* Remove dead code
@alanisaac

alanisaac commented Aug 24, 2021

I also initially had questions on the instrumentation library prefix, but that part makes sense to me now for the reasons given above. Thanks for the explanation!

That said, I have a case where I'm using a server framework that does not make it easy to get a route template for an endpoint. Instead, we opted to use "HTTP {METHOD} {ClassName}" as our span name, where {ClassName} is the class handling the particular route. This is still low cardinality, but provides us with more information than "HTTP {METHOD}".

However, because of the way the Datadog exporter processes the span names, and because we have http.method but not http.route, all we get in Datadog is "{instrumenting_lib} {METHOD}", which is much less useful. Given the constraints above, ideally I'd rather see "{instrumenting_lib} {SpanName}".

In general, what's the reasoning behind the DD exporter forming its own resource name rather than always falling back on the span name for the resource? I.e., why isn't the logic in that function just s.Name()? In theory, if an instrumenting library is following the OpenTelemetry semantic conventions for HTTP, messaging, RPC, etc., shouldn't the span name already be in a suitable format?
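For illustration only, the fallback being asked about might look like the sketch below. This is not the exporter's actual code; ptrace is the collector's pdata trace package, and the attribute keys follow the HTTP semantic conventions linked at the top of this issue.

	package sketch

	import "go.opentelemetry.io/collector/pdata/ptrace"

	// resourceNameSketch is a hypothetical illustration: build
	// "<method> <route>" when both attributes are present, otherwise
	// keep the instrumentation's own span name as the Datadog resource.
	func resourceNameSketch(s ptrace.Span) string {
		attrs := s.Attributes()
		method, hasMethod := attrs.Get("http.method")
		route, hasRoute := attrs.Get("http.route")
		if hasMethod && hasRoute {
			return method.Str() + " " + route.Str()
		}
		// No route template available: fall back to the span name
		// (e.g. "HTTP GET SomeHandlerClass") instead of reducing it.
		return s.Name()
	}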

@ericmustin
Contributor

ericmustin commented Aug 24, 2021

@alanisaac, appreciate the feedback. I guess there are a few things here, but, generally speaking, if folks want to just use s.Name() as the go-to for their resource name instead of using our heuristics/translation (which ought to enhance things but, as you point out, can have cases where it does not), I'm not really looking to die on this hill and don't want to be a blocker for that. I think it's reasonable to provide a config option that lets people toggle on use_otel_span_name_as_datadog_resource_name or something like that. Let me chat with my team about it and follow up here.

@alanisaac

alanisaac commented Aug 24, 2021

@ericmustin

Thank you for the incredibly fast response! Something like use_otel_span_name_as_datadog_resource_name would be a great solution for our use case.

@ericmustin
Contributor

So I spoke a bit with my team here. I think there are some open questions around how this config should work and how much flexibility should be built into it, and in the immediate term I'm not sure this will be prioritised in the upcoming few sprints, but I've added this feature request to our internal tracking and will try to update if/when it becomes available.

@mx-psi mx-psi added the exporter/datadog Datadog components label Sep 23, 2021
@mx-psi mx-psi added the data:traces Trace related issues label Oct 28, 2021
@tonglil
Contributor

tonglil commented Nov 19, 2021

Instrumentation library as span name

As another data point, the naming of spans based on instrumentation library is so long that it is pretty much useless when looking at it by default.

In a simple Go application using the Gorilla mux OTel contrib instrumentation and making HTTP calls, I have two extremely long IL names before I get any useful information. The sub-span is actually unreadable at a glance:
image

So far I've used the collector's span_name_remappings to reduce this, however I don't see this scaling or being feasible for future scenarios where we need to keep updating the collector's configs and redeploying them just so that the visualizations make sense:

    traces:
      span_name_remappings:
        io.opentelemetry.javaagent.spring.client: spring.client
        # instrumentation::express.server: express # example actually breaks the config because of double ::
        go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux: mux # doesn't work, need to specify kind too
        go.opentelemetry.io_contrib_instrumentation_github.com_gorilla_mux_otelmux.server: mux
        go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux.server: mux # doesn't work, need to figure out what DD is doing to "sanitize" it first
        go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp: http # doesn't work, need to figure out / -> _ and SPAN_KIND
        go.opentelemetry.io_contrib_instrumentation_net_http_otelhttp: http # doesn't work, didn't include SPAN_KIND
        go.opentelemetry.io_contrib_instrumentation_net_http_otelhttp.client: http
  1. Tracking this down was not an easy process.
  2. This is going to end up as a huge mapping list at some point, covering all the different kinds of instrumentation libraries.

Missing HTTP client paths

Furthermore, the HTTP client request spans using the net/http OTel contrib instrumentation are also lacking, showing just GET/POST/etc. without the URL information of the span.

What I want is:
image
But all I get is actually just:
image

I've tried to override this with the following, but Datadog just ignores it wholesale (which doesn't seem correct):

	client := otelhttp.DefaultClient
	client.Transport = otelhttp.NewTransport(
		http.DefaultTransport,
		otelhttp.WithSpanNameFormatter(func(op string, r *http.Request) string {
			return fmt.Sprintf("%s %s", r.Method, r.URL.Path)
		}),
	)

Instead, I have to set the route attribute to do this, which is awkward and doesn't make sense for client requests:

	client := otelhttp.DefaultClient
	client.Transport = otelhttp.NewTransport(
		http.DefaultTransport,
		otelhttp.WithSpanOptions(trace.WithAttributes(
			semconv.HTTPRouteKey.String(req.URL.Path), // The http.route tag is necessary in order for Datadog to properly name spans with http.* tags (web type)
		)),
	)

Funnily enough, these keys don't work:

			attribute.Key("resource").String("/custom-resource")
			attribute.Key("operation").String("custom-operation")

There seems to be a related conversation about why Datadog doesn't show this: DataDog/dd-trace-rb#277

Lastly, the span contains all the info (req.URL.Path) Datadog needs to populate the field in the UI, but Datadog is actually processing/omitting part of the http.url span attribute. Here's what I get in Jaeger and Zipkin with WithSpanNameFormatter:
[screenshots: Jaeger and Zipkin showing the full span name with the URL path]

But Datadog strips the field and omits info:
[screenshot: the same trace in Datadog with part of the URL stripped]

@ericmustin
Contributor

@tonglil Hey! Took a quick look; this sounds like reasonable feedback. That being said, I'm no longer employed at Datadog. When I left, there were some backlog items internally to make this more flexible, but as I'm no longer involved, support@datadoghq.com would be the best place to get actionable feedback or movement on this issue. All the best!

@mx-psi
Member

mx-psi commented Nov 19, 2021

Thanks for your detailed message @tonglil; I will also try to flag this internally (and thanks @ericmustin for pointing them to support!). While I am assigned to this issue, I am not involved in the APM side of Datadog, so I can't directly help with code, but I want to at least try to move this forward.

I have tried to list the Datadog exporter-specific problems that you mention in a bite-sized format:

  1. the span_name_remappings option is cumbersome to use because sanitization of names happens first and so you need to know how sanitization works if your span name contains characters like /,
  2. even ignoring sanitization issues, it's still difficult to figure out what span name to use since it's unclear what it ends up being (e.g. it's unclear that the kind needs to be added),
  3. the renaming approach is difficult to do at scale, since you have to rename basically everything and it's a lot of work (partly because of 1 and 2),
  4. it's confusing to try to change the name of a span on the OTel contrib net/http implementation because WithSpanNameFormatter doesn't work well with Datadog (while it does with Jaeger/Zipkin),
  5. the URL path is omitted from the Datadog span name when using the OTel contrib net/http implementation.

Do these descriptions make sense? I want to, if possible, split this issue into smaller and better-defined issues (either here on GitHub or in our internal tracker). In particular, (1), (2), and (4) feel like they belong to the same documentation issue, (3) may be addressed by a solution like #1909 (comment), and without further context (5) sounds like a bug.

@tonglil
Contributor

tonglil commented Nov 20, 2021

@mx-psi

  1. yes, this is correct
  2. yes, this is correct
  3. yes, this is correct; additionally, we have to continuously restart the agent to pick up the changes and distribute this data to multiple collectors that could be running out there (not under centralized control) - perhaps it would be better to set these mappings in Datadog's APM settings UI (since this config is already specific to DD and not OTel's other receivers)
  4. yes, I should mention that the span name set with this method is reported by OTel itself through the stdout exporter and the collector's logging exporter as the span name too - it's just that the Datadog exporter is choosing to ignore it
  5. yes, and I don't know why but it's choosing to omit a portion of the path: for example, it keeps https://httpbin.org/status/ but not 200?name=hello; why wouldn't it omit the entire path or keep the entire path? This is strange behavior.

Thank you.

@cemo
Author

cemo commented Nov 20, 2021

It is very sad that such a critical issue has not been addressed so far. This issue is much worse in the Java world, since Java has more instrumentation support.

I think this could even be solved on the server side. The span has the instrumentation library information, so it could be formatted according to each library. Is this doable on the server side? Am I missing something?

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jan 24, 2023
@tsouza-squid

Hey @gbbr, any updates on this issue?

@github-actions github-actions bot removed the Stale label Mar 14, 2023
@RangelReale

Also looking forward to this; the long operation name makes it very hard to see things in the trace UI.

@RangelReale

I ended up creating a custom TracerProvider which wraps the global one and sets the operation.name and resource.name attributes before calling the source TracerProvider / Tracer.

All integrations I saw support sending a custom WithTracerProvider option, so this should work with any library.

https://gist.github.com/RangelReale/1e50518a1e3c73eb56748192c5746163
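For readers skimming the thread, here is a minimal, hedged sketch of the idea behind that gist (assumed package and helper names, not the gist's actual code): wrap a TracerProvider so every span it starts also carries Datadog-friendly operation.name and resource.name attributes. Integrations that accept a WithTracerProvider-style option can then be pointed at the wrapped provider.

	package ddnames

	import (
		"context"

		"go.opentelemetry.io/otel/attribute"
		"go.opentelemetry.io/otel/trace"
	)

	// WrapTracerProvider is a hypothetical helper: spans started through the
	// returned provider carry operation.name/resource.name attributes that the
	// Datadog exporter maps onto its operation and resource fields.
	func WrapTracerProvider(src trace.TracerProvider, operation string) trace.TracerProvider {
		return &provider{TracerProvider: src, operation: operation}
	}

	type provider struct {
		trace.TracerProvider
		operation string
	}

	func (p *provider) Tracer(name string, opts ...trace.TracerOption) trace.Tracer {
		return &tracer{Tracer: p.TracerProvider.Tracer(name, opts...), operation: p.operation}
	}

	type tracer struct {
		trace.Tracer
		operation string
	}

	func (t *tracer) Start(ctx context.Context, spanName string, opts ...trace.SpanStartOption) (context.Context, trace.Span) {
		// Short, Datadog-style operation name plus the OTel span name reused
		// as the Datadog resource.
		opts = append(opts, trace.WithAttributes(
			attribute.String("operation.name", t.operation),
			attribute.String("resource.name", spanName),
		))
		return t.Tracer.Start(ctx, spanName, opts...)
	}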

@RangelReale

I ended up creating a package for this code, with contrib for common libraries.

https://github.com/RangelReale/opentelemetry-go-datadog-spanname

@gbbr
Member

gbbr commented Apr 12, 2023

Sorry for the lack of responses here. I am actively working on this as we speak.

@gbbr
Member

gbbr commented Apr 13, 2023

I will be away next week at KubeCon EU, but I did want to inform you that we are considering the problem as twofold here:

  1. Change the current way the operation name is computed (perhaps to something more in line with what @RangelReale's provider does). This will result in a breaking change but will likely be for the best.
  2. Analyse how the operation name is converted internally to see if there is a better way.

Ultimately, we want everyone to have a pleasing experience without the need to switch flags such as the span_name_as_resource_name setting, which is mostly a remnant from the very old Datadog Exporter POC.

Please let me know what you think. Would it be an acceptable solution to improve the Datadog span name in such a way that it is shorter and easier to work with, similar to the proposal above?

@RangelReale

I think the operation name is Datadog-specific; there is no good, easy way of guessing it from each of the OTel instrumentations besides having mappings.

@ericmustin
Contributor

ericmustin commented Apr 13, 2023

@gbbr hey bud, hope all is well. Just wanted to provide the Chesterton's fence for this long-standing complaint. When implementing this in the POC (many years ago) we encountered a problem mentioned earlier in this issue, here:

The reason we need span operation names to be unique to instrumentation library + span kind is that we generate trace metrics for hits, latency, and errors per operation name (tagged with service, resource, and env). So, for example, if there were two database clients within the same application (a redis client and a mongodb client), and we only had a span operation name of client and not redis_instrumentation_library.client and mongodb_instrumentation_library.client, then the metrics for hits/errors/latency for these two database clients would be combined and seem inaccurate, or wouldn't allow a user to determine which database client was having issues.

No clue if that's still how the RED (hits/latency/errors) trace metrics are generated internally at Datadog, or if that is just a small edge case at this point and not worth having a bad experience for the majority of users, but just a heads-up not to footgun yourself by re-introducing that issue.

(For anyone following along, please note that I have not worked at Datadog in a long time and am not speaking on the organization's behalf; I'm just trying to provide context so the current folks working on this have all the information available to them.)

@cemo
Author

cemo commented Apr 13, 2023

I cannot understand why this issue is not addressed on the DD server side. They have everything needed to format the span the way the Datadog agent creates it. Each library could be mapped to something like what the native Datadog agent creates. What is the problem with this approach?

@gbbr
Member

gbbr commented Apr 13, 2023

I think the operation name is Datadog-specific; there is no good, easy way of guessing it from each of the OTel instrumentations besides having mappings.

No worries. We'll figure out a good way to extract the right operation name. The question is whether that would be a satisfactory solution for you (and everyone else): having a shorter operation name, more in line with the Datadog SDKs.

I cannot understand why this issue is not addressed on the DD server side. They have everything needed to format the span the way the Datadog agent creates it. Each library could be mapped to something like what the native Datadog agent creates. What is the problem with this approach?

I agree with you @cemo. That is what I tried to point out in my second bullet point above. We are working on it. In the meantime, I'll try to improve the existing code to create better operation names that aren't as long and confusing.

@RangelReale

@gbbr I think the best solution would be to make sure the operation/span names from the OTel library are the same as the equivalent ones in the dd-trace-go library. That's what I did: I went into each instrumentation in that library and copied the names it would output.

@rucciva

rucciva commented Jun 8, 2023

Hi @gbbr, I don't know if this is related or not, but I'm also having a bit of a problem regarding the resource name.

I'm currently instrumenting a PHP application using dd_trace.
When I don't use OpenTelemetry, the resource_name on Datadog is equal to the http.route.action tag:
[screenshot]

But when I redirect the trace to the OpenTelemetry Collector (with the intention of adding several tags to the trace data), it only includes the http.method, making it harder to distinguish the traffic when creating dashboards.
[screenshot]

@gbbr
Member

gbbr commented Jun 8, 2023

@rucciva this seems like a tangent and a slightly different issue. Can we treat it separately from this one to keep things focused here? It would help me to see your code. Can you open a separate issue and give more details? Or, if you don't want to share it publicly, reach out to our support and send everything over, mentioning my name -- I'd be happy to help.

@rucciva

rucciva commented Jun 8, 2023

Noted @gbbr, I'll open a ticket. Thanks for the response.

@utezduyar
Contributor

Maybe I'm misunderstanding something fundamental but... isn't the span_name_as_resource_name flag introduced in #6611 misnamed? I.e., I think enabling that flag will use the span name as the DD operation name, not as the resource name.

Long thread, but this comment is crucial! I believe the intention from the beginning was to introduce a flag that sets the DD resource name to the span name, but the implementation sets the operation name to the span name instead. This might be the desired outcome for some, but I needed the resource name to be the span name so I can have APM Resource pages.

Until DD comes up with something better or fixes the meaning of the flag, I have added the following to the collector.

transform:
  error_mode: ignore
  trace_statements:
    - context: span
      statements:
        # Datadog cannot pick up the span name as the resource on
        # service entry spans, so we set it here. This might have the
        # side effect of overwriting values for an application that is
        # instrumented with the Datadog SDK. However, as long as the
        # OpenTelemetry SDK is used to instrument, this transformation
        # will help.
        - set(attributes["resource.name"], name)

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 11, 2023
@rcollette

Relevant

@github-actions github-actions bot removed the Stale label Sep 11, 2023
@seth-macpherson

Not stale

@songy23 songy23 added the never stale Issues marked with this label will be never staled and automatically removed label Oct 19, 2023
@mx-psi mx-psi unassigned gbbr Dec 15, 2023
@apederson94

Hoping for a nice fix for this as well
