error handling proposal #153

SergeyKanzhelev · 2019-06-21T20:16:58Z

This proposal suggests the change comparing to the current API/SDK implementation in Java.

specification/error-handling.md

songy23 · 2019-06-21T20:37:23Z

specification/error-handling.md

+
+1. APIs must not throw unhandled exceptions when the API is used incorrectly by
+   the end user. Smart defaults should be used so that the SDK generally works.
+   For instance, name like `empty` MUST be used when `null` value was passed as


Hmmm I think it's good practice to validate the required argument when end users call our API and throw exceptions when it violates certain constraints. Consider other cases like name being too long, I think it's better to throw exceptions than silently truncate it or use defaults.

Would it be better if application doesn't start due to some environment variable was misconfigured (for resource API for instance) or if application sent a slightly inconsistent telemetry?

Same for span name. Would you rather return 500 to customer or use empty for the null string? (see also comment for @bogdandrutu 's question

We had some conversations in Python SDK open-telemetry/opentelemetry-python#11 (comment).

One thing to consider: there is no simple way to determine if span name is too long since different exporters/backends might have different limitations. We don't want the users to see the application running fine for exporter A, and start to see unhandled exceptions when they switch to exporter B.

We don't want the users to see the application running fine for exporter A, and start to see unhandled exceptions when they switch to exporter B.

+1

bogdandrutu · 2019-06-21T21:40:29Z

specification/error-handling.md

+
+OpenTelemetry SDK must not throw or leak unhandled or user unhandled exceptions.
+
+1. APIs must not throw unhandled exceptions when the API is used incorrectly by


This is an interesting point. I think I have a different opinion:

When a specific implementation is used no different checks are applied - means that if the API says a span name is valid if not null, then implementation cannot throw exception if empty.

API can throw exception for things like null span name because that is documented.

Happy to be convinced otherwise, but my understanding is that as long as we apply the same checks across the API and SDK we can throw exception for obvious things.

one problem that you may run into is that arguments you are passing to telemetry API may be received from external source. So you have a null string. If you are smart and read the doc - you will check for null and pass some random name to the API. If you are smart and haven't read the doc - API will crash and potentially bring this request processing with you.

Both outcomes can be worked around for by applying the same smart default as customer would need to anyway.

Sure, but running a "blind" process with no monitor at all because we cannot export data to the metrics backend, is that better? I feel that fail fast approach should be used here and crashing during the initialization is very good in this case.

I agree with @bogdandrutu on "API can throw exception for things like null span name because that is documented".

Both outcomes can be worked around for by applying the same smart default as customer would need to anyway.

I agree with this. At the same time, we need a balance for reporting errors for truly fatal things (i.e. " cannot export data to the metrics backend"), but otherwise throwing exceptions should stay at a minimum level.

Specific case: I do think using a null name for Span could fallback to using a default name, instead of throwing. Documenting it is nice, but providing defaults for these kind of things is a good alternative ;)

I agree re: failing fast at initialization/construction time, but not in ways that might crash the program (or thread or ____) at runtime.

…ev/opentelemetry-specification into exceptionsHandling

iredelmeier

The doc currently mingles API and SDK error handling in a way that I expect to be confusing to multiple parties, particularly anyone either working on or using an alternative to the SDK.

Part of the problem may be upstream, i.e., that the API spec is missing logic and so the SDK is forced to backfill (and therefore error handle) it.

iredelmeier · 2019-06-24T22:10:46Z

specification/error-handling.md

+4. Beware of any call to external callbacks or override-able interface. Expect
+   them to throw.
+
+## SDK self-diagnostics


While very valuable, self-diagnostics are inherently complicated. This feels like something that needs a larger discussion.

jmacd · 2019-06-26T18:34:24Z

specification/error-handling.md

+4. Beware of any call to external callbacks or override-able interface. Expect
+   them to throw.
+
+## SDK self-diagnostics


I believe that user-facing instrumentation APIs should never return errors (or throw them).
One reason that users should never receive these, is that the'll be tempted to turn around and call the instrumentation library with the error.
The SDK will, however, encounter errors, and it a user has a right to know, but not with in-line code. In the opentracing Go library we addressed this by letting the application register a callback for receiving self diagnostics, including errors, that would allow the application to handle self diagnostics in a separate module.

We are doing exactly the same at Dynatrace with our agent SDK:
The SDK API never throws, since we do not want to change the application's behavior and risk crashing it if they don't catch everything properly, just because monitoring wouldn't work. In order to not let our SDK users in the dark and give them guidance for troubleshooting we also offer a diagnostic callback. It provides immediate logging output in case of errors (see https://github.com/Dynatrace/OneAgent-SDK#logging-callback). This turned out to work well and be pretty helpful in the past.

tedsuo · 2019-08-13T03:58:19Z

This has gone a bit stale. I believe @SergeyKanzhelev is on paternity leave. Any opinions on how we resolve this, or otherwise move it forwards?

SergeyKanzhelev · 2019-10-01T17:40:01Z

@mwear added this paragraph:

Error handling and performance

Error handling and extensive input validation may cause performance degradation,
especially on dynamic languages where the input object types are not guaranteed
in compile time. Runtime type checks will impact performance and are error
prone, exceptions may occur despite the best effort.

It is recommended to have a global exception handling logic that will guarantee
that exceptions are not leaking to the user code. And make a reasonable trade
off of the SDK performance and fullness of type checks that will provide a
better on-error behavior and SDK errors troubleshooting.

mwear · 2019-10-01T18:02:27Z

@SergeyKanzhelev: I like Error Handing and Performance paragraph you added above. I think the approach you describe in #153 (comment) makes sense, although, I don't see that explicitly mentioned in the document. Would it make sense to be more explicit, or are we leaving that out to give more freedom in implementation.

SergeyKanzhelev · 2019-10-01T18:14:31Z

@mwear I've added a note into guidance on returning no-op instead of null

mwear · 2019-10-01T18:52:33Z

Thanks @SergeyKanzhelev I'll bring this up at the Ruby SIG so folks are aware of it.

c24t · 2019-10-01T20:08:04Z

The SDK can do runtime type checks and provide defaults when they fail

Runtime type checks are not free, and will impact performance

These checks can be error prone, and exceptions could still occur despite best efforts

The SDK can rescue errors and return nulls when they are encountered

This leads to a fairly poor user experience as they will have to check for nulls on each API call

I think there's a fourth option here: that the SDK can rescue errors and return non-null defaults when they're encountered. If it's possible to provide defaults for every argument ("do runtime type checks and provide defaults when they fail") then presumably it's possible to provide the default return value that you'd get by calling the function with all the default args.

I don't mean to say this is the right solution, just that it's worth considering with the others.

c24t · 2019-10-01T20:34:12Z

specification/error-handling.md

+SDK authors MAY supply settings that allow end users to change the library's
+default error handling behavior. Application developers may want to run with
+strict error handling in a staging environment to catch invalid uses of the API,
+or malformed config.


@bogdandrutu you brought up the auditing use case in the SIG call, what do you think about expanding on strict error handling here? E.g.:

Under certain circumstances, it's preferable for an application to fail to complete an operation than to complete it but fail to produce telemetry data.

SDK authors SHOULD provide a configuration option that allows the end user to change the library's default error handling behavior. When this option is enabled, the SDK MUST NOT suppress errors or return placeholder objects.

Application developers may also want to run with strict error handling during tests, or in a staging environment, to catch invalid uses of the API, or malformed config.

@reyang IIRC you're interested in this too.

If we want to standardize this I'd suggest to have a separate work stream on it. "MAY" can be enough for a first iteration

mwear · 2019-10-01T21:01:37Z

I think there's a fourth option here: that the SDK can rescue errors and return non-null defaults when they're encountered. If it's possible to provide defaults for every argument ("do runtime type checks and provide defaults when they fail") then presumably it's possible to provide the default return value that you'd get by calling the function with all the default args.

I don't mean to say this is the right solution, just that it's worth considering with the others.

@c24t Is your comment the same as what is being suggested here: #153 (comment). If so, that is the approach that this document currently recommends.

c24t · 2019-10-01T21:41:04Z

If so, that is the approach that this document currently recommends.

Yup. We're on the same page, #153 (comment) seemed to suggest those were the only three options.

c24t · 2019-10-01T21:48:53Z

a8451f1 removes hard line breaks to match #192. It doesn't change the text otherwise.

It's a bikesheddy change, I know, but I think it's better to bite the bullet now than keep submitting PRs with text in two different formats.

SergeyKanzhelev · 2019-10-01T22:26:15Z

@iredelmeier, thank you for review. I feel that it is addressed now and ask to review again.

This proposal was discussed on today's spec SIG and a few before this. And there is a general feeling that we need to document error handling principles.

The doc currently mingles API and SDK error handling in a way that I expect to be confusing to multiple parties, particularly anyone either working on or using an alternative to the SDK.

Part of the problem may be upstream, i.e., that the API spec is missing logic and so the SDK is forced to backfill (and therefore error handle) it.

This doc asks all implementations to be consistent in their error handling practices: "OpenTelemetry implementations MUST NOT throw unhandled exceptions at run time.". Even though it may be hard to control, this is what we believe must simplify operation with the OpenTelemetry given it's nature of a monitoring, not a business critical tool in most cases.

While very valuable, self-diagnostics are inherently complicated. This feels like something that needs a larger discussion.

What forum do you suggest this discussion to be held on?

tedsuo

Thanks @SergeyKanzhelev, this latest version looks great.

specification/error-handling.md

Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>

arminru

LGTM (even though my approval means nothing here 😄)
Thanks a lot for taking care of this, Sergey and Chris!

SergeyKanzhelev · 2019-10-14T19:05:12Z

this PR had a lot of bake time after was approved. Merging

* error handling proposal * added self-monitoring * Reword principles * Reword guidance * Reword diagnostics * Reword exceptions * Add note on logs, callbacks * Formatting * formatting and a mention of ToString * dynamic languages * returning noops, not nulls * Reformat for open-telemetry#192 * Update specification/error-handling.md Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>

error handling proposal

1b6737a

SergeyKanzhelev added this to the API revision: 07-2019 milestone Jun 21, 2019

SergeyKanzhelev requested review from AloisReitbauer, bogdandrutu, c24t, carlosalberto, iredelmeier, reyang, songy23, tedsuo and yurishkuro as code owners June 21, 2019 20:16

songy23 reviewed Jun 21, 2019

View reviewed changes

bogdandrutu reviewed Jun 21, 2019

View reviewed changes

SergeyKanzhelev added 3 commits June 21, 2019 14:46

Merge branch 'master' into exceptionsHandling

ac14996

added self-monitoring

f90437e

Merge branch 'exceptionsHandling' of https://github.com/SergeyKanzhel…

d30f90a

…ev/opentelemetry-specification into exceptionsHandling

iredelmeier suggested changes Jun 24, 2019

View reviewed changes

iredelmeier reviewed Jun 24, 2019

View reviewed changes

jmacd reviewed Jun 26, 2019

View reviewed changes

arminru mentioned this pull request Jun 27, 2019

Change DistributedContextSdk.Builder to use current dctx as implicit parent open-telemetry/opentelemetry-java#439

Merged

mwear mentioned this pull request Jul 3, 2019

API prototype open-telemetry/opentelemetry-ruby#11

Merged

fbogsany mentioned this pull request Jul 11, 2019

API: decide what to do for invalid arguments open-telemetry/opentelemetry-ruby#14

Closed

c24t added 7 commits August 20, 2019 13:04

Merge branch 'master' into exceptionsHandling

b4eee5a

Reword principles

58179c6

Reword guidance

03e937e

Reword diagnostics

55bbad2

Reword exceptions

dcb740c

Add note on logs, callbacks

6c87edf

Formatting

3812350

returning noops, not nulls

872ebd4

Merge branch 'master' into exceptionsHandling

de7cb15

c24t self-requested a review October 1, 2019 19:46

c24t reviewed Oct 1, 2019

View reviewed changes

Reformat for open-telemetry#192

a8451f1

c24t approved these changes Oct 1, 2019

View reviewed changes

Merge branch 'master' into exceptionsHandling

e4ddb0c

tedsuo approved these changes Oct 4, 2019

View reviewed changes

arminru reviewed Oct 8, 2019

View reviewed changes

specification/error-handling.md Outdated Show resolved Hide resolved

SergeyKanzhelev and others added 3 commits October 8, 2019 17:04

Update specification/error-handling.md

eb4bc45

Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>

Merge branch 'master' into exceptionsHandling

30bb855

Merge branch 'master' into exceptionsHandling

6fed7dc

iredelmeier approved these changes Oct 10, 2019

View reviewed changes

arminru approved these changes Oct 11, 2019

View reviewed changes

Merge branch 'master' into exceptionsHandling

83f4477

SergeyKanzhelev merged commit 57978d5 into open-telemetry:master Oct 14, 2019

SergeyKanzhelev deleted the exceptionsHandling branch October 14, 2019 19:05

arminru mentioned this pull request Jul 10, 2020

Add Global Error Handler to SDK #696

Closed

Oberon00 mentioned this pull request Aug 6, 2020

obtain a tracer/meter without a name #586

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

error handling proposal #153

error handling proposal #153

SergeyKanzhelev commented Jun 21, 2019

songy23 Jun 21, 2019

SergeyKanzhelev Jun 21, 2019

reyang Jun 21, 2019

carlosalberto Jun 22, 2019

bogdandrutu Jun 21, 2019

SergeyKanzhelev Jun 21, 2019

bogdandrutu Jun 21, 2019

reyang Jun 21, 2019

carlosalberto Jun 22, 2019

iredelmeier Jun 24, 2019

iredelmeier left a comment

iredelmeier Jun 24, 2019

jmacd Jun 26, 2019

arminru Jun 28, 2019

tedsuo commented Aug 13, 2019

SergeyKanzhelev commented Oct 1, 2019

Error handling and performance

mwear commented Oct 1, 2019

SergeyKanzhelev commented Oct 1, 2019

mwear commented Oct 1, 2019

c24t commented Oct 1, 2019 •

edited

Loading

c24t Oct 1, 2019

c24t Oct 1, 2019

SergeyKanzhelev Oct 1, 2019

mwear commented Oct 1, 2019

c24t commented Oct 1, 2019

c24t commented Oct 1, 2019

SergeyKanzhelev commented Oct 1, 2019

tedsuo left a comment

arminru left a comment •

edited

Loading

SergeyKanzhelev commented Oct 14, 2019


		OpenTelemetry SDK must not throw or leak unhandled or user unhandled exceptions.

		1. APIs must not throw unhandled exceptions when the API is used incorrectly by

error handling proposal #153

error handling proposal #153

Conversation

SergeyKanzhelev commented Jun 21, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iredelmeier left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tedsuo commented Aug 13, 2019

SergeyKanzhelev commented Oct 1, 2019

Error handling and performance

mwear commented Oct 1, 2019

SergeyKanzhelev commented Oct 1, 2019

mwear commented Oct 1, 2019

c24t commented Oct 1, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mwear commented Oct 1, 2019

c24t commented Oct 1, 2019

c24t commented Oct 1, 2019

SergeyKanzhelev commented Oct 1, 2019

tedsuo left a comment

Choose a reason for hiding this comment

arminru left a comment • edited Loading

Choose a reason for hiding this comment

SergeyKanzhelev commented Oct 14, 2019

c24t commented Oct 1, 2019 •

edited

Loading

arminru left a comment •

edited

Loading