
Update transient errors retry timeout and retryable status codes #1842

Merged: 3 commits, May 14, 2021

Conversation

srikanthccv (Member)

Description

We agreed that the OTLP exporter's transient-error retry timeout is way too high, so this changes it to a more reasonable number. It also updates the list of retryable status codes for gRPC; PERMISSION_DENIED and UNAUTHENTICATED do not belong on that list.

Fixes #1670
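
To make the change concrete, here is a minimal sketch (not the exporter's actual code) of the intended behavior: retry only on transient gRPC status codes, with exponential backoff capped at 64 seconds instead of the 900-second budget called out in #1670. The status-code set below follows the OTLP spec's retryable codes and may differ in detail from the final diff; the helper names are placeholders.

```python
# Sketch only: retry-on-transient-codes with capped exponential backoff.
# Names and structure are illustrative, not the exporter's implementation.
import time

import grpc
from grpc import StatusCode

# PERMISSION_DENIED and UNAUTHENTICATED are deliberately excluded:
# retrying cannot fix an authentication/authorization failure.
RETRYABLE_CODES = frozenset({
    StatusCode.CANCELLED,
    StatusCode.DEADLINE_EXCEEDED,
    StatusCode.RESOURCE_EXHAUSTED,
    StatusCode.ABORTED,
    StatusCode.OUT_OF_RANGE,
    StatusCode.UNAVAILABLE,
    StatusCode.DATA_LOSS,
})


def export_with_retry(send, request, max_value=64):
    """Call send(request), retrying transient failures with backoff.

    send is any callable that raises grpc.RpcError on failure (e.g. a bound
    stub method); max_value caps the per-attempt delay in seconds.
    """
    delay = 1
    while True:
        try:
            send(request)
            return True
        except grpc.RpcError as error:
            if error.code() not in RETRYABLE_CODES:
                return False  # non-transient error: give up immediately
            if delay >= max_value:
                return False  # backoff budget exhausted
            time.sleep(delay)
            delay = min(delay * 2, max_value)
```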

@srikanthccv srikanthccv requested a review from a team as a code owner May 11, 2021 23:27
@srikanthccv srikanthccv requested review from codeboten and ocelotl and removed request for a team May 11, 2021 23:27
@@ -289,8 +286,6 @@ def _export(self, data: TypingSequence[SDKDataT]) -> ExportResultT:
if error.code() in [
StatusCode.CANCELLED,
StatusCode.DEADLINE_EXCEEDED,
Contributor

Aren't cancelled and deadline-exceeded errors raised by the client itself? Not 100% sure about deadline exceeded, but I think cancelled should not be retried, since (if I understand correctly) it means the client decided to cancel the request, for example after receiving some sort of shutdown/exit signal.

Member Author

It is typically the client, but I believe this is allowed to cover the case where the server cancels the RPC: https://grpc.io/docs/what-is-grpc/core-concepts/#cancelling-an-rpc.
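
A hedged illustration of that point (placeholder names, not the real collector service): a server handler can cancel an in-progress RPC itself, and the client then typically observes StatusCode.CANCELLED even though it never cancelled anything, which is why CANCELLED stays in the retryable list.

```python
import grpc


class PlaceholderCollector:
    """Illustrative servicer only; not the actual OTLP collector service."""

    def Export(self, request, context):
        # grpc.ServicerContext inherits cancel() from grpc.RpcContext.
        # Calling it terminates the RPC from the server side, and the
        # client's RpcError.code() typically comes back as CANCELLED.
        context.cancel()
```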

@owais (Contributor), May 13, 2021

I see. Not a blocker for me then, but I wonder if there is a simple way for the exporter to cancel an in-flight request on shutdown (with or without a timeout)?
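
For what it's worth, a hedged sketch of one possible shape for that (placeholder names; not how the exporter actually works): issue the RPC with the .future() call form, keep a handle to the returned grpc.Future, and let shutdown() cancel it.

```python
import grpc


class CancellableExportSketch:
    """Illustrative only; not the exporter's real implementation."""

    def __init__(self, stub):
        self._stub = stub       # a generated gRPC stub (placeholder)
        self._inflight = None   # grpc.Future for the current export call

    def export(self, request, timeout=10):
        # The .future() form of a unary-unary stub method returns a
        # grpc.Future, which can be cancelled while the RPC is in flight.
        self._inflight = self._stub.Export.future(request, timeout=timeout)
        try:
            return self._inflight.result()  # blocks until the RPC completes
        except grpc.FutureCancelledError:
            return None                     # shutdown() cancelled the export
        finally:
            self._inflight = None

    def shutdown(self):
        inflight = self._inflight
        if inflight is not None:
            inflight.cancel()
```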

Member Author

Ah, I understand your concern. In any case, we have some work to do to make shutdown spec-compliant; we can address this there.

@@ -262,14 +262,11 @@ def _translate_data(
pass

def _export(self, data: TypingSequence[SDKDataT]) -> ExportResultT:

max_value = 64
Contributor

Would it be possible to make this configurable? For example, as an export_timeout (maximum duration for exporting spans) argument to the exporter?

My use case is CLI applications, where having them hang on exit for 64 seconds because a telemetry backend is down is unacceptable. For comparison, the Go implementation has two separate timeouts: ExportTimeout for the full operation and BatchTimeout for the direct publish.

For now I have a special subclass with a copy of this method to tweak the behavior, but it's a maintenance issue.
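
As an illustration of that kind of workaround (a sketch under assumptions: the wrapper name, the 5-second default, and the thread-based approach are all made up here, not the commenter's actual subclass), any SpanExporter can be wrapped so that export() never blocks longer than a hard wall-clock deadline:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult


class DeadlineExporter(SpanExporter):
    """Hypothetical wrapper that bounds how long export() may block."""

    def __init__(self, wrapped: SpanExporter, deadline_s: float = 5.0):
        self._wrapped = wrapped
        self._deadline_s = deadline_s
        self._executor = ThreadPoolExecutor(max_workers=1)

    def export(self, spans) -> SpanExportResult:
        future = self._executor.submit(self._wrapped.export, spans)
        try:
            return future.result(timeout=self._deadline_s)
        except TimeoutError:
            # Give up rather than block a short-lived CLI process. Note the
            # underlying export (and its retries) keeps running in the worker
            # thread until it finishes or the process exits.
            return SpanExportResult.FAILURE

    def shutdown(self) -> None:
        self._wrapped.shutdown()
        self._executor.shutdown(wait=False)
```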

Member Author

I will bring up the topic of supporting configurability at the next SIG meeting. The spec is not clear (at this point) in that regard, and anything we do might become a breaking change in the future. Speaking of ExportTimeout and BatchTimeout, Python also has them, as schedule_delay_millis and export_timeout_millis; I don't know why, but we only enforce the export timeout during force flush.
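
For reference, a sketch of where those two knobs are passed today (import paths follow the current SDK layout and may differ between versions; the values are just illustrative):

```python
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(),
        schedule_delay_millis=5000,    # how often queued spans are exported
        export_timeout_millis=30000,   # per the comment above, only enforced on force_flush
    )
)
```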

Member Author

We talked about this in the SIG call. It is out of scope for this PR: the consensus was that this is not clear from the spec, so introducing support for configuring it might end up being a backwards-incompatible change. One thing we want to explore is making the retry async and non-blocking, and also looking at how the span processor timeout behaves. Please create another issue and we will track it there.

@lzchen lzchen merged commit f816287 into open-telemetry:main May 14, 2021
@srikanthccv srikanthccv deleted the otlp-transient-errors-timeout branch September 24, 2021 08:40
Successfully merging this pull request may close these issues.

Default otel reporter timeout of 900s is too long.
4 participants