fix: write retry delay #335

Merged
merged 4 commits into influxdata:master on Jun 22, 2022

Conversation

seth-hunter
Contributor

The deadlock fix in #332 exposes a design flaw in the retry queue logic: the accumulated retry delay should be tracked at the queue level, not at the batch level. Under the old design, once batches start being discarded due to age, the client can fall into repeated minimum-delay retries. The oldest remaining batch has never attempted a write (it was blocked behind earlier batches, which have since expired), so its delay starts at the minimum, even though the queue as a whole has not yet written successfully and should therefore be at the maximum-delay retry interval.
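
To make the idea concrete, here is a minimal sketch of backoff state kept on the queue rather than on each batch. It is not the code from this PR: the `retryQueue` type, its methods, the 20 ms base interval, and the exact jitter formula are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// retryQueue is a hypothetical, simplified retry queue. The backoff state
// (retryDelay, retryAttempts) belongs to the queue, not to each batch.
type retryQueue struct {
	batches       []string   // queued payloads, oldest first
	retryDelay    uint       // ms to wait before the next write attempt
	retryAttempts uint       // consecutive failed writes for the whole queue
	baseMs, maxMs uint       // assumed base interval and cap (MaxRetryInterval)
	rnd           *rand.Rand // jitter source
}

func newRetryQueue(baseMs, maxMs uint, seed int64) *retryQueue {
	return &retryQueue{baseMs: baseMs, maxMs: maxMs, rnd: rand.New(rand.NewSource(seed))}
}

// computeRetryDelay sketches exponential backoff with jitter: a random value
// in [base*2^attempts, base*2^(attempts+1)], capped at maxMs.
func (q *retryQueue) computeRetryDelay() uint {
	lo := q.baseMs
	for i := uint(0); i < q.retryAttempts && lo < q.maxMs; i++ {
		lo *= 2
	}
	if lo >= q.maxMs {
		return q.maxMs
	}
	hi := lo * 2
	if hi > q.maxMs {
		hi = q.maxMs
	}
	return lo + uint(q.rnd.Intn(int(hi-lo)+1))
}

// onWriteFailed records a failed write for the whole queue and backs off.
func (q *retryQueue) onWriteFailed() {
	q.retryDelay = q.computeRetryDelay()
	q.retryAttempts++
}

// onWriteSucceeded resets the backoff once the server responds again.
func (q *retryQueue) onWriteSucceeded() {
	q.retryDelay = 0
	q.retryAttempts = 0
}

// discardOldest drops an expired batch. The backoff state is deliberately
// left untouched, so expiry cannot reset the delay to its minimum.
func (q *retryQueue) discardOldest() {
	if len(q.batches) > 0 {
		q.batches = q.batches[1:]
	}
}

func main() {
	q := newRetryQueue(20, 300, 1)
	q.batches = []string{"b0", "b1", "b2", "b3", "b4", "b5"}
	for i := 0; i < 6; i++ {
		q.onWriteFailed()
		q.discardOldest()
		fmt.Printf("failure %d: retry delay %d ms\n", i+1, q.retryDelay)
	}
	q.onWriteSucceeded()
	fmt.Printf("after success: retry delay %d ms\n", q.retryDelay)
}
```

Because `discardOldest` never touches the counters, the delay can only grow until a write succeeds, regardless of which batch is currently at the head of the queue.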

codecov-commenter commented Jun 18, 2022

Codecov Report

Merging #335 (337ba61) into master (fe6c7cb) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #335   +/-   ##
=======================================
  Coverage   90.48%   90.48%           
=======================================
  Files          23       23           
  Lines        2490     2490           
=======================================
  Hits         2253     2253           
  Misses        178      178           
  Partials       59       59           
Impacted Files Coverage Δ
api/write.go 93.70% <100.00%> (ø)
api/writeAPIBlocking.go 84.21% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

…carded

Design error, RetryDelay should be tracked in the service/queue and not in the batch. Similarly, a service/queue-level RetryAttempts should be used by computeRetryDelay() instead of a batch-level value.
…t.go

min() function isn't used in service.go, only in service_test.go
@seth-hunter
Contributor Author

I've added a test case that explicitly illustrates the design flaw fixed in this PR. In this scenario, MaxRetryInterval is capped at 300 ms and a client is attempting to write every 20 ms to an unresponsive server.

Prior to this PR's fix, the behavior would be as follows...

Write error: Unexpected status code 429, Batch kept for retrying
Retry interval increased to 21 ms
Write proc: cannot write yet, storing batch to queue
Retry interval still at 21 ms
Write error: Unexpected status code 429, Batch kept for retrying
Retry interval increased to 47 ms
Write proc: cannot write yet, storing batch to queue
Retry interval still at 47 ms
Write proc: cannot write yet, storing batch to queue
Retry interval still at 47 ms
Write proc: oldest batch in retry queue expired, discarding
Write error: Unexpected status code 429, Batch kept for retrying
Retry interval increased to 27 ms
Write proc: oldest batch in retry queue expired, discarding
Write error: Unexpected status code 429, Batch kept for retrying
Retry interval increased to 39 ms
Write proc: oldest batch in retry queue expired, discarding
Write error: Unexpected status code 429, Batch kept for retrying
Retry interval increased to 21 ms
Write proc: oldest batch in retry queue expired, discarding
Write error: Unexpected status code 429, Batch kept for retrying
Retry interval increased to 38 ms
... (continues to attempt a server write every time client attempts to write) ...

... which I claim is incorrect. If the server has not started responding, we do not want to suddenly start attempting a write every time the client adds a new batch to the queue just because the oldest batch has expired. We should instead keep waiting MaxRetryInterval ms between attempts until the server begins responding again.

This improved behavior is demonstrated as follows in the PR, where RetryDelay is now tracked at the Write Queue level instead of the Batch level:

Write error: Unexpected status code 429, batch kept for retrying
Retry interval increased to 21 ms
Write proc: cannot write yet, storing batch to queue
Retry interval still at 21 ms
Write error: Unexpected status code 429, batch kept for retrying
Retry interval increased to 47 ms
Write proc: cannot write yet, storing batch to queue
Retry interval still at 47 ms
Write proc: cannot write yet, storing batch to queue
Retry interval still at 47 ms
Write proc: oldest batch in retry queue expired, discarding
Write error: Unexpected status code 429, batch kept for retrying
Retry interval increased to 87 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 87 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 87 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 87 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 87 ms
Write proc: oldest batch in retry queue expired, discarding
Write error: Unexpected status code 429, batch kept for retrying
Retry interval increased to 219 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 219 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 219 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 219 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 219 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 219 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 219 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 219 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 219 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 219 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 219 ms
Write proc: oldest batch in retry queue expired, discarding
Write error: Unexpected status code 429, batch kept for retrying
Retry interval increased to 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write error: Unexpected status code 429, batch kept for retrying
Retry interval capped at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
Write proc: oldest batch in retry queue expired, discarding
Write proc: cannot write yet, storing batch to queue
Retry interval still at 300 ms
... (continues to attempt a server write only every MaxRetryInterval ms) ...
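
The difference between the two logs can be reproduced with a small, deterministic simulation. This is a sketch under stated assumptions, not the test added in this PR: the `batch` type, the `simulate` and `backoff` functions, the 20 ms base interval, the 100 ms maximum batch age, and the absence of jitter are all hypothetical choices made to keep the example short.

```go
package main

import "fmt"

// batch is a hypothetical queued batch; retryDelayMs and attempts are only
// used by the old per-batch strategy.
type batch struct {
	createdMs    int
	retryDelayMs int
	attempts     int
}

// backoff is deterministic exponential backoff (no jitter), capped at maxMs.
func backoff(attempts, baseMs, maxMs int) int {
	d := baseMs
	for i := 0; i < attempts && d < maxMs; i++ {
		d *= 2
	}
	if d > maxMs {
		d = maxMs
	}
	return d
}

// simulate counts server write attempts over durationMs against a server
// that always fails, with the client adding one batch every tickMs.
func simulate(perQueue bool, durationMs int) int {
	const (
		tickMs   = 20  // client write interval
		baseMs   = 20  // assumed base retry interval
		maxMs    = 300 // MaxRetryInterval
		maxAgeMs = 100 // assumed maximum batch age
	)
	var queue []batch
	writes, lastWriteMs := 0, 0
	queueDelayMs, queueAttempts := 0, 0
	for now := 0; now < durationMs; now += tickMs {
		queue = append(queue, batch{createdMs: now})
		// Discard batches that have expired due to age.
		for len(queue) > 0 && now-queue[0].createdMs > maxAgeMs {
			queue = queue[1:]
		}
		// Old design: the delay comes from the head batch, which is zero
		// for a batch that has never been attempted.
		delay := queue[0].retryDelayMs
		if perQueue {
			delay = queueDelayMs
		}
		if now-lastWriteMs < delay {
			continue // cannot write yet, batch stays queued
		}
		writes++ // the write attempt fails
		lastWriteMs = now
		if perQueue {
			queueDelayMs = backoff(queueAttempts, baseMs, maxMs)
			queueAttempts++
		} else {
			queue[0].retryDelayMs = backoff(queue[0].attempts, baseMs, maxMs)
			queue[0].attempts++
		}
	}
	return writes
}

func main() {
	fmt.Println("per-batch write attempts:", simulate(false, 3000))
	fmt.Println("per-queue write attempts:", simulate(true, 3000))
}
```

With per-batch tracking, every expiry promotes a never-attempted batch to the head of the queue and triggers an immediate write, so the attempt count grows with the client's write rate. With per-queue tracking, the attempts settle at one per MaxRetryInterval.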

Contributor

@vlastahajek vlastahajek left a comment

Thanks for digging more into the issue. I agree that this is the better way to handle retry time.
Adding a test is a good practice! 👍

Just a small change.

Comment on lines 65 to 66
RetryDelay uint
RetryAttempts uint
Contributor

Those fields don't need to be public
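
A minimal sketch of the suggested change, assuming a hypothetical struct name (`writeService`); only the two field names come from the reviewed lines. In Go, lowercasing the first letter makes a field unexported, so it is visible only inside its own package.

```go
package main

import "fmt"

// writeService is a hypothetical name for the struct that holds the
// queue-level retry state.
type writeService struct {
	retryDelay    uint // unexported: only this package reads or updates it
	retryAttempts uint // unexported for the same reason
}

func main() {
	s := writeService{}
	s.retryAttempts++
	s.retryDelay = 20
	fmt.Println(s.retryDelay, s.retryAttempts)
}
```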

@vlastahajek vlastahajek merged commit af34012 into influxdata:master Jun 22, 2022