Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/telemetry/upload: decide upload retry logic #63694

Closed
hyangah opened this issue Oct 23, 2023 · 8 comments
Closed

x/telemetry/upload: decide upload retry logic #63694

hyangah opened this issue Oct 23, 2023 · 8 comments
Assignees
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. telemetry x/telemetry issues
Milestone

Comments

@hyangah
Copy link
Contributor

hyangah commented Oct 23, 2023

The upload package keeps retrying unconditionally if the POST request ends with non-200 status code. code The upload program will keep retrying...
In case the upload fails due to transient network errors or some internal errors etc, a few retries before giving up completely make sense. But currently telemetry.go.dev upload server also returns http.StatusBadRequest, or other errors. In that case, retrying upload with the same data is not desirable.

Either make the upload package distinguish retryable and non-retryable status codes, or make the upload server never return an error if the upload shouldn't be retried.

cc @golang/tools-team @pjweinb @findleyr

@gopherbot gopherbot added the telemetry x/telemetry issues label Oct 23, 2023
@gopherbot gopherbot added this to the Unreleased milestone Oct 23, 2023
@dr2chase dr2chase added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 23, 2023
@pjweinb
Copy link

pjweinb commented Oct 23, 2023

Yes, something should be done. None of the alternatives seem perfect:

  1. same behavior, but give up after 15 minutes. Then after a few weeks of this the report will be removed.
  2. upload server doesn't take the report, but returns no error. Then the report is moved into the local telemetry/upload directory, and looks like it was uploaded.
  3. upload server returns errors. What errors? (E.g., how many retryable erros are there for the upload process to think about?) What happens to the report for unretryable errors?

@hyangah
Copy link
Contributor Author

hyangah commented Oct 23, 2023

I am thinking combination of these:

  • 4XX status code can indicate client didn't behave as expected and client shouldn't need to retry. Alternatively, upload server can return http OK with a non-empty response (e.g. rejected) so client distinguished rejected upload requests.
  • Client will keep retrying (same behavior as now - whenever the uploader runs) and after days of retrying, give up and delete the intermediate reports (.json). This handles 5XX cases.

@hyangah
Copy link
Contributor Author

hyangah commented Oct 24, 2023

From today's discussion:

  • 4XX status code - client shouldn't retry.
  • Other status code - client should retry. Retry is done by leaving the intermediate json file untouched for 7days. Next upload processes will pick it up.

Not a release blocker. We will pick up in the next sprint.

@hyangah hyangah modified the milestones: Unreleased, Backlog Oct 24, 2023
@findleyr
Copy link
Contributor

Tentatively assigning to @pjweinb. Peter -- does this seem like something you could pick up?

@pjweinb
Copy link

pjweinb commented Feb 24, 2024

Looking at the code...

@pjweinb
Copy link

pjweinb commented Feb 28, 2024

The current behavior of the upload package is to try to upload once, and if that fails, it just leaves the uploadable report in the 'local' directory, for some future invocation to try to upload.

But if the upload package is called many times from different packages, each will try to upload the report. It is likely better that however many upload processes run, to only try to upload a particular report once every (say) 12 hours. If this is the right solution it wold be fairly easy to implement.

@findleyr
Copy link
Contributor

Decision:

The uploader should not retry 4xx error codes, but should retry 5xx. When not retrying, the uploader can just delete the uploadable report in the local directory.
Additionally, uploading should be rate limited, but that is integrated with telemetry.Start as part of #65500.

@gopherbot
Copy link

Change https://go.dev/cl/568238 mentions this issue: internal/upload: remove proposed upload on 4xx errors

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. telemetry x/telemetry issues
Projects
None yet
Development

No branches or pull requests

5 participants