
Update transient errors retry timeout and retryable status codes #1842

Merged: 3 commits, May 14, 2021

Conversation

srikanthccv (Member)

Description

We agreed that the OTLP exporter's transient-error retry timeout is way too high, so this changes it to a more reasonable number. It also updates the list of retryable status codes for gRPC; PERMISSION_DENIED and UNAUTHENTICATED do not belong on that list.

Fixes #1670
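
To make the change concrete, here is a minimal sketch (not the exporter's actual code) of the intended behavior: retry only on transient gRPC status codes, with exponential backoff capped at 64 seconds instead of the 900-second budget called out in #1670. The status-code set below follows the OTLP spec's retryable codes and may differ in detail from the final diff; the helper names are placeholders.

```python
# Sketch only: retry-on-transient-codes with capped exponential backoff.
# Names and structure are illustrative, not the exporter's implementation.
import time

import grpc
from grpc import StatusCode

# PERMISSION_DENIED and UNAUTHENTICATED are deliberately excluded:
# retrying cannot fix an authentication/authorization failure.
RETRYABLE_CODES = frozenset({
    StatusCode.CANCELLED,
    StatusCode.DEADLINE_EXCEEDED,
    StatusCode.RESOURCE_EXHAUSTED,
    StatusCode.ABORTED,
    StatusCode.OUT_OF_RANGE,
    StatusCode.UNAVAILABLE,
    StatusCode.DATA_LOSS,
})


def export_with_retry(send, request, max_value=64):
    """Call send(request), retrying transient failures with backoff.

    send is any callable that raises grpc.RpcError on failure (e.g. a bound
    stub method); max_value caps the per-attempt delay in seconds.
    """
    delay = 1
    while True:
        try:
            send(request)
            return True
        except grpc.RpcError as error:
            if error.code() not in RETRYABLE_CODES:
                return False  # non-transient error: give up immediately
            if delay >= max_value:
                return False  # backoff budget exhausted
            time.sleep(delay)
            delay = min(delay * 2, max_value)
```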

@srikanthccv srikanthccv requested a review from a team as a code owner May 11, 2021 23:27
@srikanthccv srikanthccv requested review from codeboten and ocelotl and removed request for a team May 11, 2021 23:27
@@ -289,8 +286,6 @@ def _export(self, data: TypingSequence[SDKDataT]) -> ExportResultT:
if error.code() in [
StatusCode.CANCELLED,
StatusCode.DEADLINE_EXCEEDED,
Contributor

Aren't cancelled and deadline-exceeded errors raised by the client itself? Not 100% sure about deadline exceeded, but I think cancelled should not be retried, since (if I understand correctly) it means the client decided to cancel the request, for example after receiving some sort of shutdown/exit signal.

Member Author

It is typically the client, but I believe this is allowed to cover the case where the server cancels the RPC: https://grpc.io/docs/what-is-grpc/core-concepts/#cancelling-an-rpc.
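
A hedged illustration of that point (placeholder names, not the real collector service): a server handler can cancel an in-progress RPC itself, and the client then typically observes StatusCode.CANCELLED even though it never cancelled anything, which is why CANCELLED stays in the retryable list.

```python
import grpc


class PlaceholderCollector:
    """Illustrative servicer only; not the actual OTLP collector service."""

    def Export(self, request, context):
        # grpc.ServicerContext inherits cancel() from grpc.RpcContext.
        # Calling it terminates the RPC from the server side, and the
        # client's RpcError.code() typically comes back as CANCELLED.
        context.cancel()
```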

@owais (Contributor), May 13, 2021

I see. Not a blocker for me then, but I wonder if there is a simple way for the exporter to cancel an in-flight request on shutdown (with or without a timeout)?
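
For what it's worth, a hedged sketch of one possible shape for that (placeholder names; not how the exporter actually works): issue the RPC with the .future() call form, keep a handle to the returned grpc.Future, and let shutdown() cancel it.

```python
import grpc


class CancellableExportSketch:
    """Illustrative only; not the exporter's real implementation."""

    def __init__(self, stub):
        self._stub = stub       # a generated gRPC stub (placeholder)
        self._inflight = None   # grpc.Future for the current export call

    def export(self, request, timeout=10):
        # The .future() form of a unary-unary stub method returns a
        # grpc.Future, which can be cancelled while the RPC is in flight.
        self._inflight = self._stub.Export.future(request, timeout=timeout)
        try:
            return self._inflight.result()  # blocks until the RPC completes
        except grpc.FutureCancelledError:
            return None                     # shutdown() cancelled the export
        finally:
            self._inflight = None

    def shutdown(self):
        inflight = self._inflight
        if inflight is not None:
            inflight.cancel()
```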

Member Author

Ah, I understand your concern. In any case, we have some work to do to make shutdown spec-compliant; we can address this there.

@@ -262,14 +262,11 @@ def _translate_data(
pass

def _export(self, data: TypingSequence[SDKDataT]) -> ExportResultT:

max_value = 64
Contributor

Would it be possible to make this configurable? For example, as an export_timeout (maximum duration for exporting spans) argument to the exporter?

My use case is CLI applications, where having them hang on exit for 64 seconds because a telemetry backend is down is unacceptable. For comparison, the Go implementation has two separate timeouts: ExportTimeout for the full operation and BatchTimeout for the direct publish.

For now I have a special subclass with a copy of this method to tweak the behavior, but it's a maintenance issue.
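
As an illustration of that kind of workaround (a sketch under assumptions: the wrapper name, the 5-second default, and the thread-based approach are all made up here, not the commenter's actual subclass), any SpanExporter can be wrapped so that export() never blocks longer than a hard wall-clock deadline:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult


class DeadlineExporter(SpanExporter):
    """Hypothetical wrapper that bounds how long export() may block."""

    def __init__(self, wrapped: SpanExporter, deadline_s: float = 5.0):
        self._wrapped = wrapped
        self._deadline_s = deadline_s
        self._executor = ThreadPoolExecutor(max_workers=1)

    def export(self, spans) -> SpanExportResult:
        future = self._executor.submit(self._wrapped.export, spans)
        try:
            return future.result(timeout=self._deadline_s)
        except TimeoutError:
            # Give up rather than block a short-lived CLI process. Note the
            # underlying export (and its retries) keeps running in the worker
            # thread until it finishes or the process exits.
            return SpanExportResult.FAILURE

    def shutdown(self) -> None:
        self._wrapped.shutdown()
        self._executor.shutdown(wait=False)
```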

Member Author

I will bring up the topic of supporting configurability at the next SIG meeting. The spec is not clear (at this point) in that regard, and anything we do might become a breaking change in the future. Speaking of ExportTimeout and BatchTimeout, Python also has them, as schedule_delay_millis and export_timeout_millis; I don't know why, but we only enforce the export timeout during force flush.
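
For reference, a sketch of where those two knobs are passed today (import paths follow the current SDK layout and may differ between versions; the values are just illustrative):

```python
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(),
        schedule_delay_millis=5000,    # how often queued spans are exported
        export_timeout_millis=30000,   # per the comment above, only enforced on force_flush
    )
)
```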

Member Author

We talked about this in the SIG call. It is out of scope for this PR: the consensus was that this is not clear from the spec, so introducing support for configuring it might end up being a backwards-incompatible change. One thing we want to explore is making the retry async and non-blocking, and also looking at how the span processor timeout behaves. Please create another issue and we will track it there.

@lzchen lzchen merged commit f816287 into open-telemetry:main May 14, 2021
@srikanthccv srikanthccv deleted the otlp-transient-errors-timeout branch September 24, 2021 08:40
Successfully merging this pull request may close these issues.

Default otel reporter timeout of 900s is too long.
4 participants