
fix(tracing): use globalErrorHandler when flushing fails #1622

Merged

Conversation

johanneswuerbach
Contributor

Fixes #1617

Short description of the changes

Use the global error handler (#1514) when span flushing fails in the BatchSpanProcessor, instead of causing an unhandled rejection.
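For context, a minimal sketch of the change being described (illustrative only, not the actual diff; it assumes globalErrorHandler is the handler added in #1514 and exported from @opentelemetry/core):

import { globalErrorHandler } from '@opentelemetry/core';

// Simplified stand-in for the processor; _flush represents the internal
// buffered export, which may reject when the exporter fails.
class BatchSpanProcessorSketch {
  private _flush(): Promise<void> {
    return Promise.reject(new Error('exporter failed'));
  }

  private _maybeStartTimer(): void {
    // Previously the rejection from _flush() escaped as an unhandledRejection;
    // here it is reported through the global error handler instead.
    this._flush().catch(e => globalErrorHandler(e));
  }
}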

Member

@vmarchaud vmarchaud left a comment

@codecov

codecov bot commented Oct 24, 2020

Codecov Report

Merging #1622 into master will increase coverage by 0.00%.
The diff coverage is 90.00%.

@@           Coverage Diff           @@
##           master    #1622   +/-   ##
=======================================
  Coverage   91.21%   91.22%           
=======================================
  Files         165      165           
  Lines        5064     5069    +5     
  Branches     1038     1039    +1     
=======================================
+ Hits         4619     4624    +5     
  Misses        445      445           
Impacted Files Coverage Δ
...telemetry-tracing/src/export/BatchSpanProcessor.ts 92.18% <83.33%> (+0.25%) ⬆️
...elemetry-tracing/src/export/SimpleSpanProcessor.ts 85.18% <100.00%> (+1.85%) ⬆️

@johanneswuerbach
Contributor Author

@vmarchaud something like this? Your link points to the shutdown method, but the change in BatchSpanProcessor only affects the export. I also wrapped the export in the SimpleSpanProcessor; should I also wrap shutdown in both processors?

@vmarchaud
Member

@johanneswuerbach Yeah, sorry, I linked to the wrong line :/
I don't think we need to wrap shutdown right now; at least no one has complained about this behavior, we'll see in the future.

I'm good with the PR now, though I would like us to report the actual error the exporter had; for now this is fine.

@vmarchaud
Member

@open-telemetry/javascript-approvers I agree with @johanneswuerbach that this is important to fix and release asap; it's common best practice to exit the process when there is an unhandledRejection, so most apps will crash any time an exporter fails :/

Member

@obecny obecny left a comment

I think we are changing the API of BatchSpanProcessor here in a way that might not cover all use cases. If a method returns a Promise, it should handle both cases, resolve and reject. Forcing the use of the global handler is not necessarily something everyone wants, not to mention that the result from export is gone. Using the global handler should be something a user can opt in or out of, instead of dropping the "reject" from the Promise. If someone has already built a mechanism to retry when the result is FAILED_RETRYABLE, that whole logic will be gone. I'm against changing the "natural" API of the Promise returned by this method, as the method then becomes useless for other cases (for example, building auto-retry for a FAILED_RETRYABLE result).
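For illustration, the kind of caller-built retry this refers to might look like the following (hypothetical sketch; it assumes a Promise-returning forceFlush and is not code from the library):

async function flushWithRetry(
  processor: { forceFlush(): Promise<void> },
  attempts = 3
): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    try {
      await processor.forceFlush();
      return;
    } catch (e) {
      // If the processor reports failures via globalErrorHandler and resolves
      // anyway, this branch is never reached and the retry loop is a no-op.
      if (i === attempts - 1) throw e;
    }
  }
}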

@johanneswuerbach
Contributor Author

johanneswuerbach commented Oct 26, 2020

@obecny the problem in this case is that, outside of forceFlush, no method actually exposes that state, so I doubt that somebody has built a (working) retry mechanism at this layer.

One call site where this method is currently called is:
https://github.com/open-telemetry/opentelemetry-js/blob/master/packages/opentelemetry-tracing/src/export/BatchSpanProcessor.ts#L93, which is called from onEnd, which itself returns nothing (https://github.com/open-telemetry/opentelemetry-js/blob/master/packages/opentelemetry-tracing/src/export/BatchSpanProcessor.ts#L60). Another call site is https://github.com/open-telemetry/opentelemetry-js/blob/master/packages/opentelemetry-tracing/src/export/BatchSpanProcessor.ts#L122, where the promise rejection is completely ignored and the result is not returned either.

While I'm not sure what the desired future is, there is #1569, which suggests retry should be implemented at the exporter layer and not here.

@obecny
Member

obecny commented Oct 26, 2020

> @obecny the problem in this case is that, outside of forceFlush, no method actually exposes that state, so I doubt that somebody has built a (working) retry mechanism at this layer.

> One call site where this method is currently called is:
> https://github.com/open-telemetry/opentelemetry-js/blob/master/packages/opentelemetry-tracing/src/export/BatchSpanProcessor.ts#L93, which is called from onEnd, which itself returns nothing (https://github.com/open-telemetry/opentelemetry-js/blob/master/packages/opentelemetry-tracing/src/export/BatchSpanProcessor.ts#L60). Another call site is https://github.com/open-telemetry/opentelemetry-js/blob/master/packages/opentelemetry-tracing/src/export/BatchSpanProcessor.ts#L122, where the promise rejection is completely ignored and the result is not returned either.

> While I'm not sure what the desired future is, there is #1569, which suggests retry should be implemented at the exporter layer and not here.

Then why not do something like this at line 93:

this._flush().catch(e => {
  globalErrorHandler(e);
});

and then line 122 should do the same as line 93.

The flush is also used in shutdown; after your change, shutdown will never raise an error.
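A rough sketch of what this suggestion amounts to (hedged illustration with simplified stand-ins, not the actual implementation):

import { globalErrorHandler } from '@opentelemetry/core';

class SpanProcessorSketch {
  // Placeholder for exporting the buffered spans; may reject on failure.
  private _flush(): Promise<void> {
    return Promise.resolve();
  }

  onEnd(): void {
    // The "line 93"-style call site: nobody awaits this, so report failures
    // via the global error handler instead of leaving an unhandled rejection.
    this._flush().catch(e => globalErrorHandler(e));
  }

  shutdown(): Promise<void> {
    // User-facing path: the rejection is preserved so callers can handle it.
    return this._flush();
  }
}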

@johanneswuerbach
Contributor Author

@vmarchaud / @dyladan would that also be okay? We generally see tracing as best-effort, so we don't care about errors, but I'm happy to catch those errors in our apps.

@dyladan
Member

dyladan commented Oct 28, 2020

> @vmarchaud / @dyladan would that also be okay? We generally see tracing as best-effort, so we don't care about errors, but I'm happy to catch those errors in our apps.

I don't think @obecny is suggesting you wrap the error in your app, but wrap it in shutdown where _flush is called.

I think the way the PR has it now is fine. The global error handler changed error behavior in a lot of places already and this is just one more.

> If someone has already built a mechanism to retry when the result is FAILED_RETRYABLE, that whole logic will be gone.

@obecny note that in the spec, the span processor interface does not return the result type: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/sdk.md#interface-definition

Also see in the spec https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/sdk.md#exportbatch

Any retry logic that is required by the exporter is the responsibility of the exporter.
Returns: ExportResult:

ExportResult is one of:

  • Success - The batch has been successfully exported. For protocol exporters this typically means that the data is sent over the wire and delivered to the destination server.
  • Failure - exporting failed. The batch must be dropped. For example, this can happen when the batch contains bad data and cannot be serialized.

The RETRYABLE vs NON_RETRYABLE failure has been removed from the spec completely and should be removed in another PR.
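For illustration only, the contract the quoted spec text describes looks roughly like this (names are made up for the sketch and are not the library's actual types, which at this point still include the retryable variants):

// Per the quoted spec wording: the exporter reports only success or failure,
// and any retry logic lives inside the exporter itself.
enum SketchExportResult {
  Success, // data delivered; for protocol exporters, sent over the wire
  Failure, // export failed; the batch must be dropped
}

interface SketchSpanExporter {
  // Callback-based delivery: the processor learns the outcome through the
  // callback, and the result never surfaces as a Promise rejection to users.
  export(
    spans: unknown[],
    resultCallback: (result: SketchExportResult) => void
  ): void;
}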

@obecny
Member

obecny commented Oct 28, 2020

> @vmarchaud / @dyladan would that also be okay? We generally see tracing as best-effort, so we don't care about errors, but I'm happy to catch those errors in our apps.

> I don't think @obecny is suggesting you wrap the error in your app, but wrap it in shutdown where _flush is called.

shutdown is called by the end user. Because shutdown returns a Promise, the user should expect it to be either resolved or rejected with information about the error. If we change the logic so that shutdown is never rejected, then we are changing an API that someone might already be using differently, or might want to take advantage of (resolve/reject) when building their own solution; we don't know yet. So I'm against changing this behaviour just to silence the error.
It might be out of scope for this PR, but maybe we should discuss how exactly we want to handle such cases, and then also decide whether we want this API change or not. I don't want to decide that in this particular case; that's why I'm suggesting we resolve this in a way where the original API of shutdown still behaves the same, but the global error handler is used for the unhandled cases (line 93).

As a user I would be really surprised if I suddenly didn't see an error from a method I just called that wasn't successful.

@dyladan
Member

dyladan commented Oct 28, 2020

> @vmarchaud / @dyladan would that also be okay? We generally see tracing as best-effort, so we don't care about errors, but I'm happy to catch those errors in our apps.

> I don't think @obecny is suggesting you wrap the error in your app, but wrap it in shutdown where _flush is called.

> shutdown is called by the end user. Because shutdown returns a Promise, the user should expect it to be either resolved or rejected with information about the error. If we change the logic so that shutdown is never rejected, then we are changing an API that someone might already be using differently, or might want to take advantage of (resolve/reject) when building their own solution; we don't know yet. So I'm against changing this behaviour just to silence the error.
> It might be out of scope for this PR, but maybe we should discuss how exactly we want to handle such cases, and then also decide whether we want this API change or not. I don't want to decide that in this particular case; that's why I'm suggesting we resolve this in a way where the original API of shutdown still behaves the same, but the global error handler is used for the unhandled cases (line 93).

> As a user I would be really surprised if I suddenly didn't see an error from a method I just called that wasn't successful.

👍 seems reasonable to me.

@johanneswuerbach
Contributor Author

Updated the PR, let me know if that looks better :-)

Member

@obecny obecny left a comment

lgtm, thx for changes

@obecny obecny added the enhancement New feature or request label Oct 30, 2020
@dyladan dyladan merged commit b523dab into open-telemetry:master Oct 30, 2020
@johanneswuerbach johanneswuerbach deleted the tracing-unhandled-rejection branch October 30, 2020 15:11
@johanneswuerbach
Contributor Author

Would it be possible to get this into a v0.12.1 patch release (happy to cherry-pick), or do we need to wait for v0.13.0?

pichlermarc pushed a commit to dynatrace-oss-contrib/opentelemetry-js that referenced this pull request Dec 15, 2023
Labels
enhancement New feature or request

Successfully merging this pull request may close these issues.

BatchSpanProcessor ExportResult other than SUCCESS causes UnhandledPromiseRejectionWarning (#1617)

4 participants