Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error handling proposal #153

Merged

Conversation

SergeyKanzhelev
Copy link
Member

Fix #141

This proposal suggests the change comparing to the current API/SDK implementation in Java.

specification/error-handling.md Show resolved Hide resolved

1. APIs must not throw unhandled exceptions when the API is used incorrectly by
the end user. Smart defaults should be used so that the SDK generally works.
For instance, name like `empty` MUST be used when `null` value was passed as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I think it's good practice to validate the required argument when end users call our API and throw exceptions when it violates certain constraints. Consider other cases like name being too long, I think it's better to throw exceptions than silently truncate it or use defaults.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better if application doesn't start due to some environment variable was misconfigured (for resource API for instance) or if application sent a slightly inconsistent telemetry?

Same for span name. Would you rather return 500 to customer or use empty for the null string? (see also comment for @bogdandrutu 's question

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had some conversations in Python SDK open-telemetry/opentelemetry-python#11 (comment).

One thing to consider: there is no simple way to determine if span name is too long since different exporters/backends might have different limitations. We don't want the users to see the application running fine for exporter A, and start to see unhandled exceptions when they switch to exporter B.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want the users to see the application running fine for exporter A, and start to see unhandled exceptions when they switch to exporter B.

+1


OpenTelemetry SDK must not throw or leak unhandled or user unhandled exceptions.

1. APIs must not throw unhandled exceptions when the API is used incorrectly by
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting point. I think I have a different opinion:

  1. When a specific implementation is used no different checks are applied - means that if the API says a span name is valid if not null, then implementation cannot throw exception if empty.
  2. API can throw exception for things like null span name because that is documented.

Happy to be convinced otherwise, but my understanding is that as long as we apply the same checks across the API and SDK we can throw exception for obvious things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one problem that you may run into is that arguments you are passing to telemetry API may be received from external source. So you have a null string. If you are smart and read the doc - you will check for null and pass some random name to the API. If you are smart and haven't read the doc - API will crash and potentially bring this request processing with you.

Both outcomes can be worked around for by applying the same smart default as customer would need to anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but running a "blind" process with no monitor at all because we cannot export data to the metrics backend, is that better? I feel that fail fast approach should be used here and crashing during the initialization is very good in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @bogdandrutu on "API can throw exception for things like null span name because that is documented".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both outcomes can be worked around for by applying the same smart default as customer would need to anyway.

I agree with this. At the same time, we need a balance for reporting errors for truly fatal things (i.e. " cannot export data to the metrics backend"), but otherwise throwing exceptions should stay at a minimum level.

Specific case: I do think using a null name for Span could fallback to using a default name, instead of throwing. Documenting it is nice, but providing defaults for these kind of things is a good alternative ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree re: failing fast at initialization/construction time, but not in ways that might crash the program (or thread or ____) at runtime.

Copy link
Member

@iredelmeier iredelmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc currently mingles API and SDK error handling in a way that I expect to be confusing to multiple parties, particularly anyone either working on or using an alternative to the SDK.

Part of the problem may be upstream, i.e., that the API spec is missing logic and so the SDK is forced to backfill (and therefore error handle) it.

4. Beware of any call to external callbacks or override-able interface. Expect
them to throw.

## SDK self-diagnostics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While very valuable, self-diagnostics are inherently complicated. This feels like something that needs a larger discussion.

4. Beware of any call to external callbacks or override-able interface. Expect
them to throw.

## SDK self-diagnostics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that user-facing instrumentation APIs should never return errors (or throw them).
One reason that users should never receive these, is that the'll be tempted to turn around and call the instrumentation library with the error.
The SDK will, however, encounter errors, and it a user has a right to know, but not with in-line code. In the opentracing Go library we addressed this by letting the application register a callback for receiving self diagnostics, including errors, that would allow the application to handle self diagnostics in a separate module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are doing exactly the same at Dynatrace with our agent SDK:
The SDK API never throws, since we do not want to change the application's behavior and risk crashing it if they don't catch everything properly, just because monitoring wouldn't work. In order to not let our SDK users in the dark and give them guidance for troubleshooting we also offer a diagnostic callback. It provides immediate logging output in case of errors (see https://github.com/Dynatrace/OneAgent-SDK#logging-callback). This turned out to work well and be pretty helpful in the past.

@tedsuo
Copy link
Contributor

tedsuo commented Aug 13, 2019

This has gone a bit stale. I believe @SergeyKanzhelev is on paternity leave. Any opinions on how we resolve this, or otherwise move it forwards?

@SergeyKanzhelev
Copy link
Member Author

@mwear added this paragraph:

Error handling and performance

Error handling and extensive input validation may cause performance degradation,
especially on dynamic languages where the input object types are not guaranteed
in compile time. Runtime type checks will impact performance and are error
prone, exceptions may occur despite the best effort.

It is recommended to have a global exception handling logic that will guarantee
that exceptions are not leaking to the user code. And make a reasonable trade
off of the SDK performance and fullness of type checks that will provide a
better on-error behavior and SDK errors troubleshooting.

@mwear
Copy link
Member

mwear commented Oct 1, 2019

@SergeyKanzhelev: I like Error Handing and Performance paragraph you added above. I think the approach you describe in #153 (comment) makes sense, although, I don't see that explicitly mentioned in the document. Would it make sense to be more explicit, or are we leaving that out to give more freedom in implementation.

@SergeyKanzhelev
Copy link
Member Author

@mwear I've added a note into guidance on returning no-op instead of null

@mwear
Copy link
Member

mwear commented Oct 1, 2019

Thanks @SergeyKanzhelev I'll bring this up at the Ruby SIG so folks are aware of it.

@c24t c24t self-requested a review October 1, 2019 19:46
@c24t
Copy link
Member

c24t commented Oct 1, 2019

  • The SDK can do runtime type checks and provide defaults when they fail
    • Runtime type checks are not free, and will impact performance
    • These checks can be error prone, and exceptions could still occur despite best efforts
  • The SDK can rescue errors and return nulls when they are encountered
    • This leads to a fairly poor user experience as they will have to check for nulls on each API call

I think there's a fourth option here: that the SDK can rescue errors and return non-null defaults when they're encountered. If it's possible to provide defaults for every argument ("do runtime type checks and provide defaults when they fail") then presumably it's possible to provide the default return value that you'd get by calling the function with all the default args.

I don't mean to say this is the right solution, just that it's worth considering with the others.

Comment on lines 84 to 87
SDK authors MAY supply settings that allow end users to change the library's
default error handling behavior. Application developers may want to run with
strict error handling in a staging environment to catch invalid uses of the API,
or malformed config.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu you brought up the auditing use case in the SIG call, what do you think about expanding on strict error handling here? E.g.:

Under certain circumstances, it's preferable for an application to fail to complete an operation than to complete it but fail to produce telemetry data.

SDK authors SHOULD provide a configuration option that allows the end user to change the library's default error handling behavior. When this option is enabled, the SDK MUST NOT suppress errors or return placeholder objects.

Application developers may also want to run with strict error handling during tests, or in a staging environment, to catch invalid uses of the API, or malformed config.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reyang IIRC you're interested in this too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to standardize this I'd suggest to have a separate work stream on it. "MAY" can be enough for a first iteration

@mwear
Copy link
Member

mwear commented Oct 1, 2019

I think there's a fourth option here: that the SDK can rescue errors and return non-null defaults when they're encountered. If it's possible to provide defaults for every argument ("do runtime type checks and provide defaults when they fail") then presumably it's possible to provide the default return value that you'd get by calling the function with all the default args.

I don't mean to say this is the right solution, just that it's worth considering with the others.

@c24t Is your comment the same as what is being suggested here: #153 (comment). If so, that is the approach that this document currently recommends.

@c24t
Copy link
Member

c24t commented Oct 1, 2019

If so, that is the approach that this document currently recommends.

Yup. We're on the same page, #153 (comment) seemed to suggest those were the only three options.

@c24t
Copy link
Member

c24t commented Oct 1, 2019

a8451f1 removes hard line breaks to match #192. It doesn't change the text otherwise.

It's a bikesheddy change, I know, but I think it's better to bite the bullet now than keep submitting PRs with text in two different formats.

@SergeyKanzhelev
Copy link
Member Author

@iredelmeier, thank you for review. I feel that it is addressed now and ask to review again.

This proposal was discussed on today's spec SIG and a few before this. And there is a general feeling that we need to document error handling principles.

The doc currently mingles API and SDK error handling in a way that I expect to be confusing to multiple parties, particularly anyone either working on or using an alternative to the SDK.

Part of the problem may be upstream, i.e., that the API spec is missing logic and so the SDK is forced to backfill (and therefore error handle) it.

This doc asks all implementations to be consistent in their error handling practices: "OpenTelemetry implementations MUST NOT throw unhandled exceptions at run time.". Even though it may be hard to control, this is what we believe must simplify operation with the OpenTelemetry given it's nature of a monitoring, not a business critical tool in most cases.

While very valuable, self-diagnostics are inherently complicated. This feels like something that needs a larger discussion.

What forum do you suggest this discussion to be held on?

Copy link
Contributor

@tedsuo tedsuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SergeyKanzhelev, this latest version looks great.

Copy link
Member

@arminru arminru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (even though my approval means nothing here 😄)
Thanks a lot for taking care of this, Sergey and Chris!

@SergeyKanzhelev
Copy link
Member Author

this PR had a lot of bake time after was approved. Merging

@SergeyKanzhelev SergeyKanzhelev merged commit 57978d5 into open-telemetry:master Oct 14, 2019
@SergeyKanzhelev SergeyKanzhelev deleted the exceptionsHandling branch October 14, 2019 19:05
SergeyKanzhelev added a commit to SergeyKanzhelev/opentelemetry-specification that referenced this pull request Feb 18, 2020
* error handling proposal

* added self-monitoring

* Reword principles

* Reword guidance

* Reword diagnostics

* Reword exceptions

* Add note on logs, callbacks

* Formatting

* formatting and a mention of ToString

* dynamic languages

* returning noops, not nulls

* Reformat for open-telemetry#192

* Update specification/error-handling.md

Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should span creation operations raise exceptions?