
feat: [do not merge] OpenTelemetry Datadog Exporter #1316

Closed

Conversation

@ericmustin ericmustin commented Jul 15, 2020

DataDog has released an OpenTelemetry exporter which can be found at https://github.com/DataDog/dd-opentelemetry-exporter-js

Which problem is this PR solving?

  • This PR adds a plugin for exporting traces to the 3rd-party vendor Datadog. It is meant for review purposes only and should not be merged since, afaik, the current OpenTelemetry exporter vendor policy dictates that 3rd-party vendor-specific exporters should be hosted under the vendor's GitHub org and linked to via the OpenTelemetry Registry. That being said, per SIG meeting discussion, I was told this would be the appropriate way to get as many eyes on the review as possible.

Any and all feedback is very much appreciated!

Short description of the changes

  • This PR adds a few components which I'll attempt to summarise below

    • DatadogSpanProcessor: An implementation of the SpanProcessor interface. A current constraint when exporting traces to the Datadog trace-agent is that traces must be 'complete', i.e. all started spans in a trace should be sent in one array once they have all finished. The DatadogSpanProcessor is functionally very similar to the BatchSpanProcessor in that it batches spans together for export; the main implementation difference is that it only exports complete traces, and it enforces both a maxQueueSize and a maxTraceSize.

    • DatadogExporter: An exporter implementation that exposes export and shutdown methods. In practice the exporter is very similar to the JaegerExporter in that it mainly serves to transform OpenTelemetry spans into Datadog spans (via the helper methods in ./transform.ts) and then flush them to the Datadog trace-agent. It does so by leveraging the Datadog tracer dd-trace-js's internal AgentExporter, which converts the spans into the appropriate msgpack format and sends them via HTTP to the trace-agent. Additionally, the DatadogExporter allows users to set Datadog-specific tags such as env, service_name, and version, both by environment variable and via in-code configuration.

    • DatadogPropagator: A propagator implementation that injects/extracts Datadog-specific distributed tracing propagation headers. This is similar to the B3Propagator in that it must handle 128-bit => 64-bit trace-id conversion (see the sketch after this list).

    • DatadogProbabilitySampler: A sampler implementation that records all spans but only samples spans according to the sampling rate. This is similar to the ProbabilitySampler, with the main difference being that all spans are recorded (though not all are sampled). The underlying reason for this is that Datadog's trace-agent generates some trace-related metrics such as hits and errors, and in order to properly upscale sampled traces into accurate hit counts, it needs to be aware of all traces/spans. Since the ReadableSpan doesn't expose the sampling rate of the trace, the only way to convey this information is to export all spans and traces (traces that should be dropped due to sampling have their Datadog-specific sampling-rate metric tag adjusted accordingly, but are still exported to the trace-agent).
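As a rough illustration of the trace-id handling mentioned in the DatadogPropagator item above, the 128-bit OpenTelemetry trace id can be reduced to its low 64 bits and rendered as the unsigned decimal string used in Datadog propagation headers. This is a minimal sketch only, not the code in this PR:

// Sketch: convert a 32-hex-char (128-bit) OpenTelemetry trace id into the
// unsigned 64-bit decimal string format used by Datadog propagation headers.
// Illustrative only; the actual propagator implementation may differ.
function toDatadogTraceId(otelTraceId: string): string {
  // keep the low 64 bits (the last 16 hex chars) and print them in base 10
  const low64Hex = otelTraceId.slice(-16);
  return BigInt(`0x${low64Hex}`).toString(10);
}

// e.g. toDatadogTraceId('0000000000000000000000000000000a') === '10'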

Example Usage

import { NodeTracerProvider } from '@opentelemetry/node';
import { DatadogSpanProcessor, DatadogExporter, DatadogPropagator, DatadogProbabilitySampler } from '@opentelemetry/exporter-datadog';

const provider = new NodeTracerProvider();

const exporterOptions = {
  service_name: 'my-service', // optional
  agent_url: 'http://localhost:8126', // optional
  tags: 'example_key:example_value,example_key_two:value_two', // optional
  env: 'production', // optional
  version: '1.0' // optional
};

const exporter = new DatadogExporter(exporterOptions);

// Now, register the exporter.
provider.addSpanProcessor(new DatadogSpanProcessor(exporter));

// Next, register the provider with the Datadog propagator for distributed tracing.
provider.register({
  propagator: new DatadogPropagator(),
  // Datadog suggests the default ALWAYS_ON sampling; if you need probability sampling,
  // use the `DatadogProbabilitySampler` to ensure the datadog-agent still generates
  // accurate tracing metrics.
  sampler: new DatadogProbabilitySampler(0.75)
});
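For context, the 'record everything, sample a fraction' behaviour described above for the DatadogProbabilitySampler can be sketched roughly as follows; the decision type here is a local stand-in, not the actual @opentelemetry/api types:

// Minimal sketch of a record-all / sample-some decision. Names are illustrative.
type SketchDecision = 'RECORD' | 'RECORD_AND_SAMPLED';

function sketchShouldSample(probability: number): SketchDecision {
  // every span is recorded (and later exported with its sampling-rate tag),
  // but only the sampled fraction is marked as such
  return Math.random() < probability ? 'RECORD_AND_SAMPLED' : 'RECORD';
}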

Example Output

(screenshot of example output)

Member

@dyladan dyladan left a comment

Do not merge

@dyladan
Member

dyladan commented Jul 15, 2020

Based on your explanation of your sampler, you may be interested in this open-telemetry/oteps#107

@codecov

codecov bot commented Jul 15, 2020

Codecov Report

Merging #1316 into master will decrease coverage by 0.03%.
The diff coverage is 92.77%.

@@            Coverage Diff             @@
##           master    #1316      +/-   ##
==========================================
- Coverage   93.16%   93.12%   -0.04%     
==========================================
  Files         139      146       +7     
  Lines        3921     4336     +415     
  Branches      804      922     +118     
==========================================
+ Hits         3653     4038     +385     
- Misses        268      298      +30     
Impacted Files Coverage Δ
...metry-exporter-datadog/src/datadogSpanProcessor.ts 87.38% <87.38%> (ø)
...es/opentelemetry-exporter-datadog/src/transform.ts 90.90% <90.90%> (ø)
...ages/opentelemetry-exporter-datadog/src/datadog.ts 94.11% <94.11%> (ø)
...-exporter-datadog/src/datadogProbabilitySampler.ts 94.73% <94.73%> (ø)
...elemetry-exporter-datadog/src/datadogPropagator.ts 98.18% <98.18%> (ø)
...ges/opentelemetry-exporter-datadog/src/defaults.ts 100.00% <100.00%> (ø)
...ckages/opentelemetry-exporter-datadog/src/types.ts 100.00% <100.00%> (ø)
... and 4 more

Member

@vmarchaud vmarchaud left a comment

Overall looks good to me. The only thing that stands out is the specific behavior of waiting for all spans in a trace to end before sending them; the user should be aware that the exporter will drop spans that should have been traced (especially because of the default NoopLogger, it might be hard for users to understand where spans are dropped).

this._traces_spans_started.delete(traceId);
this._traces_spans_finished.delete(traceId);
this._check_traces_queue.delete(traceId);
this._exporter.export(spans, () => {});
Member

you might want to log an error here

Member

I agree this will make an export fail silently which is no good

Author

yup makes sense, i've added an error log here. (are there preferences on log level? error feels appropriate here since most debug logs won't get registered, but not sure if there are otel conventions around noisiness)

Member

The exporter for the otel collector logs an error, for reference, so i think that's the way to go

Author

ty, will keep as error log.

Member

@dyladan dyladan left a comment

Seems ok to me in general. I would worry about the possibility for denial of tracing by an attacker who intentionally makes many requests which don't end, but this is probably minor and there are other ways to combat that.

this._exporter.shutdown();
}

// does nothing.
Member

Suggested change
// does nothing.
// does something.

:)

Author

good catch, updated to be a bit more specific


return (
this._traces_spans_started.get(traceId) -
this._traces_spans_finished.get(traceId) <=
Member

Weird indenting here. Does the linter force this?

Author

yup, there's a few like this, the linter complains otherwise, this is the output of npm run lint:fix i believe.

}
}

getTraceContext(span: ReadableSpan): string | undefined[] {
Member

Suggested change
getTraceContext(span: ReadableSpan): string | undefined[] {
getTraceContext(span: ReadableSpan): (string | undefined)[] {

OR

Suggested change
getTraceContext(span: ReadableSpan): string | undefined[] {
getTraceContext(span: ReadableSpan): Array<string | undefined> {

OR

Suggested change
getTraceContext(span: ReadableSpan): string | undefined[] {
getTraceContext(span: ReadableSpan): [string, string, string | undefined] {

Author

@ericmustin ericmustin Jul 16, 2020

thank you, updated. apologies here, I don't use typescript terribly often, as you can probably tell, so if there are any ts choices here that you think are questionable i'd certainly be open to updating.

);

try {
this._exporter.export(formattedDatadogSpans);
Member

Is this sync?

Author

@ericmustin ericmustin Jul 16, 2020

it is, but it's only appending the trace to the AgentExporter buffer (which gets flushed on an interval), see: https://github.com/DataDog/dd-trace-js/blob/89e9caa74f643fa31e3aba80f608a0f8857f8b74/packages/dd-trace/src/exporters/agent/index.js#L18
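For readers following along, the pattern being described is roughly "append synchronously, flush on a timer". The sketch below is illustrative only and is not the actual dd-trace-js AgentExporter code:

// Sketch of the append-then-flush-on-interval pattern described above.
class BufferingExporter {
  private buffer: unknown[][] = [];

  constructor(
    flushIntervalMs: number,
    private send: (traces: unknown[][]) => void
  ) {
    // periodically flush whatever has accumulated; unref so the timer
    // doesn't keep the process alive
    setInterval(() => this.flush(), flushIntervalMs).unref();
  }

  // export() is synchronous: it only appends to the in-memory buffer
  export(trace: unknown[]): void {
    this.buffer.push(trace);
  }

  private flush(): void {
    if (this.buffer.length === 0) return;
    const traces = this.buffer;
    this.buffer = [];
    this.send(traces); // the HTTP request to the trace-agent happens here
  }
}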

Member

@obecny obecny left a comment

first pass

packages/opentelemetry-exporter-datadog/src/transform.ts: 3 resolved (outdated) review threads
packages/opentelemetry-exporter-datadog/src/datadog.ts: 3 resolved (outdated) review threads
@pauldraper
Contributor

pauldraper commented Jul 16, 2020

The least clear thing is how to populate Datadog's name and resource, because OTel doesn't have both concepts. open-telemetry/opentelemetry-specification#531

When I created my version of the exporter, I split the span names. Like http.request:/accounts/{int}/users/{int}/permissions became name http.request and resource /accounts/{int}/users/{int}/permissions. Or pg.query:SELECT became name pg.query and resource SELECT.
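A rough sketch of that splitting approach (a hypothetical helper, not code from this PR):

// Split a combined span name such as
// 'http.request:/accounts/{int}/users/{int}/permissions' into a Datadog
// operation name and resource. Illustrative only.
function splitSpanName(spanName: string): { name: string; resource: string } {
  const idx = spanName.indexOf(':');
  if (idx === -1) {
    return { name: spanName, resource: spanName };
  }
  return { name: spanName.slice(0, idx), resource: spanName.slice(idx + 1) };
}

// splitSpanName('pg.query:SELECT') => { name: 'pg.query', resource: 'SELECT' }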

I like this approach too though.

(screenshot attached to the original comment)

@ericmustin
Author

Seems ok to me in general. I would worry about the possibility for denial of tracing by an attacker who intentionally makes many requests which don't end, but this is probably minor and there are other ways to combat that.

@dyladan in the otel ruby trace exporter i had some logic for dropping a "random" trace when the queue hits max size, which prevents the sort of denial attack you're referring to and allows the traces that do get exported to represent a better approximation of actual application behavior. Do you think that's worth implementing here, it obviously has some shortfalls but could be better than no traces being exported at all in the case of extremely high load+incomplete traces
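(The "drop a random trace when the queue is full" idea could look something like the sketch below; the names are illustrative, not the PR's actual implementation.)

// Sketch: when the pending-trace queue is at capacity, evict one trace at
// random so that newly started traces can still be buffered.
function evictRandomTrace(
  pendingTraces: Map<string, unknown[]>,
  maxQueueSize: number
): void {
  if (pendingTraces.size < maxQueueSize) return;
  const traceIds = [...pendingTraces.keys()];
  const victim = traceIds[Math.floor(Math.random() * traceIds.length)];
  pendingTraces.delete(victim);
}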

@dyladan
Member

dyladan commented Jul 16, 2020

Seems ok to me in general. I would worry about the possibility for denial of tracing by an attacker who intentionally makes many requests which don't end, but this is probably minor and there are other ways to combat that.

@dyladan in the otel ruby trace exporter i had some logic for dropping a "random" trace when the queue hits max size, which prevents the sort of denial attack you're referring to and allows the traces that do get exported to represent a better approximation of actual application behavior. Do you think that's worth implementing here, it obviously has some shortfalls but could be better than no traces being exported at all in the case of extremely high load+incomplete traces

It's up to you if it's worth the extra work for the minimal benefit. I would probably not worry about it unless it becomes a problem. The better solution would be for the agent to not require full traces so you can export spans at any time :)

@jon-whit

@ericmustin dude, you're a community hero! Thanks for putting this up. I've been wanting this now at my organization for a while. Looking forward to getting this merged in within the coming weeks :)

Member

@obecny obecny left a comment

lgtm, thx for all the changes, there are just a few minor comments from my side, but I'm approving, nice job :)

return ddSpanBase;
}

function addErrors(ddSpanBase: typeof Span, span: ReadableSpan): void {
if (span.status && span.status.code && span.status.code > 0) {
// TODO: set error.msg error.type error.stack based on error events
Member

Nit: would you mind creating an issue in the repo so this is not forgotten?

Author

yup can do, i'll add an open issue linking to the spec pr open-telemetry/opentelemetry-specification#697 (which got approved it looks like 🎉 )

if (span.status && span.status.code && span.status.code > 0) {
// TODO: set error.msg error.type error.stack based on error events
// Opentelemetry-js has not yet implemented https://github.com/open-telemetry/opentelemetry-specification/pull/697
// the type and stacktrace are not officially recorded. Until this is implemented,
// we can infer a type by using the status code and also the non spec `<library>.error_name` attribute
const possibleType = inferErrorType(span);
ddSpanBase.setTag('error', 1);
ddSpanBase.setTag('error.msg', span.status.message);
ddSpanBase.setTag(DatadogDefaults.ERROR_TAG, 1);
Member

What does this 1 mean, can this have different values? maybe add some explanation or some enum for what it really means - I see this in a few other places, so it might make sense to put some const default here

Author

the error span tag can be either 1 or 0 if set. 0 is ok, 1 is error. Generally it only gets set in the case of an error. I've moved this into a more descriptive enum with a code comment to clarify.
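(Roughly along these lines; the enum name below is illustrative:)

// Datadog's `error` span tag is numeric: 0 = ok, 1 = error.
enum SketchErrorTagValue {
  OK = 0,
  ERROR = 1,
}
// ddSpanBase.setTag('error', SketchErrorTagValue.ERROR);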

packages/opentelemetry-exporter-datadog/src/transform.ts: 2 resolved (outdated) review threads
@ericmustin
Author

@pauldraper yup I hear you, appreciate the feedback, and yes i agree with what you've brought up. For now the approach across ddog exporters in other languages (python, and ruby which is soon to ship, and here) is:

  • For resource, use some attribute/tag heuristics to determine if it's a web/http-related span, and if so we set the datadogSpan resource to be the http method + route, ex: GET api/info, POST admin/update, etc. This gets sanitized/normalized at the datadog-agent level (where the exporter sends traces).
  • Otherwise it defaults to the otel span name, which is sometimes not especially granular in the way users might prefer for helping get deeper visibility into a particular service.
  • for the span operation name, we're mainly using the span.kind along with the instrumentationLibrary.name when it matches a supported plugin, so an operation name ought to look something like express.server or http.client (a rough sketch of these heuristics follows below).

I think we're definitely looking to be iterative here and so if there's feedback (when this starts to get used by folks) around a need for better resource/service/operation name visibility (in the datadog senses of the terms) for specific types of spans, we want to address that.
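A rough sketch of the heuristics described above (the types, attribute keys, and helper name are illustrative, not the PR's actual code):

// Sketch of the Datadog resource / operation-name heuristic described above.
interface SpanLike {
  name: string;
  kind: 'client' | 'server' | 'internal';
  attributes: Record<string, unknown>;
  instrumentationLibraryName: string;
}

function toDatadogNames(span: SpanLike): { name: string; resource: string } {
  const method = span.attributes['http.method'];
  const route = span.attributes['http.route'] ?? span.attributes['http.target'];

  // web/http spans: resource becomes "METHOD route", e.g. "GET /api/info";
  // everything else falls back to the otel span name
  const resource =
    typeof method === 'string' && typeof route === 'string'
      ? `${method} ${route}`
      : span.name;

  // operation name combines the instrumentation library with the span kind,
  // e.g. "express.server" or "http.client"
  return { name: `${span.instrumentationLibraryName}.${span.kind}`, resource };
}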

more thoughts about db spans and some general config thoughts

  • DB spans come to mind here as a place where it would be really nice to have the datadog resource set to the full obfuscated sql query, but i think it's a potential footgun. I'm still relatively new to the Otel space, but imo DB span granularity/normalization is an area i'd prefer to see addressed in a more formalised way at the Spec level, instead of having to apply a lot of vendor-specific heuristics at the exporter level. The fact is that the approach to setting some sort of low-cardinality span name that represents the database command varies pretty wildly across languages. JS takes an approach (pg, mysql) that is similar to what we recently added to a ruby instrumentation and imo is a nice middle ground between zero granularity and ...shoving a sql parser into the tracer... But python instrumentation OTOH doesn't attempt to infer the sql command. That being said, to borrow a phrase, you go to war with the army you have, and so if users are finding the mysql.query:SELECT / pg.query:DELETE resource naming convention unhelpful and would prefer just SELECT / DELETE as the resource, I think i'd be open to adding an additional heuristic into the exporter.

  • Another idea we've seen success with on the datadog side of things is just allowing some sort of arbitrary post_processing hook where users can define a function that takes the span as an input and allows them to modify/mutate it to fit their preferred use case (a rough sketch of the hook shape follows below). It introduces some crazy performance issues when used improperly, though, it makes for a really time-consuming config/onboarding experience for many users, and i'm not sure if there's precedent for that approach in otel in general.
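For illustration, a post-processing hook of that sort might have a shape like the following (the hook name and span interface are hypothetical):

// Sketch of an arbitrary per-span post-processing hook applied before export.
interface DatadogSpanLike {
  setTag(key: string, value: unknown): void;
}

type PostProcessor = (span: DatadogSpanLike) => void;

// a user-defined mutation, e.g. tagging every span with a team name
const postProcessor: PostProcessor = span => {
  span.setTag('team', 'checkout');
};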

@ericmustin
Author

@dyladan I went ahead and added some logic for dropping a random trace in the event of very high load + many incomplete traces, lmk what you think, it's relatively straightforward. When appropriate I can close this PR and start to publish this under a datadog repo + make a pr with that repo to get it listed in the registry. And it goes without saying, but I appreciate the really thoughtful and timely feedback from you, @obecny and @vmarchaud on this.

@pauldraper
Contributor

pauldraper commented Jul 18, 2020

DB spans come to mind here as a place where it would be really nice to have the datadog resource set to the full obfuscated sql query

DB queries aren't necessarily low cardinality, though many perf-based tools do rely on that fact.

Another issue is that DB queries can be very long (many hundreds or thousands of characters).

Another idea we've seen success with on the datadog side of things is just allowing some sort of arbitrary post_processing hook

I think this is a good approach, and in practice quite useful.

@dyladan
Member

dyladan commented Aug 4, 2020

Is this now OK to close?

@jon-whit

What's the status on this? I don't see any follow up work from it.

@ericmustin
Author

@jon-whit hey there, we've released a beta exporter if you want to give it a shot. https://github.com/DataDog/dd-opentelemetry-exporter-js

@dyladan
Member

dyladan commented Aug 28, 2020

@ericmustin you may want to add it to the registry if you haven't already

@ericmustin
Author

@dyladan ah yup, will do, i have to add a bunch of ruby stuff in there as well
