fix: BulkWriter: add backoff on retries#1447
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1447 +/- ##
==========================================
+ Coverage 98.20% 98.22% +0.01%
==========================================
Files 32 32
Lines 19509 19565 +56
Branches 1276 1283 +7
==========================================
+ Hits 19159 19217 +58
+ Misses 346 344 -2
Partials 4 4
Continue to review full report at Codecov.
|
dev/src/bulk-writer.ts
Outdated
| // An array of pending write operations. Only contains writes that have not | ||
| // been resolved. | ||
| private pendingOps: Array<BulkWriterOperation> = []; | ||
| private _pendingOps: Array<BulkWriterOperation> = []; |
There was a problem hiding this comment.
This can just be renamed to pendingOps and made readonly.
dev/src/bulk-writer.ts
Outdated
| if (failureCount === 0) return 0; | ||
| if (useMaxDelay) return DEFAULT_BACKOFF_MAX_DELAY_MS; |
There was a problem hiding this comment.
While this is correct for our use case, this reads a bit strange. I would flip these two statements. That means that we would end up using useMaxDelay no matter what the failure count is, which I think is easier to understand.
| return Math.min( | ||
| DEFAULT_BACKOFF_MAX_DELAY_MS, | ||
| backoffMs + Math.random() * 0.3 * backoffMs | ||
| ); |
There was a problem hiding this comment.
Applying the jitter to the entire duration of the backoff creates a pretty strong variance, and is different from how we compute the backoff jitter elsewhere. Can we use the last backoff and then only apply the jitter to the next section? This would either be:
DEFAULT_BACKOFF_INITIAL_DELAY_MS *
Math.pow(DEFAULT_BACKOFF_FACTOR, failureCount-1) + (DEFAULT_BACKOFF_FACTOR * DEFAULT_BACKOFF_INITIAL_DELAY_MS * jitter); (Maybe?)
or a simple backoff time in Delayed operation.
There was a problem hiding this comment.
If I understand correctly, the ExponentialBackoff class has even more variance. We use currentBaseMs + (Math.random-0.5)*currentBaseMs, where currentBaseMs = initialDelay * pow(1.5, numAttempts).
I modeled the BulkWriter approach after ExponentialBackoff, and used 0.3 instead of 0.5 to ensure that the backoff increases each time.
This is bit more of a nit (and it probably won't affect performance), but I feel that adding the delay with jitter rather than multiplying it goes against the spirit of exponential backoff, because 100x^k != 100x^(k-1)+x
| get failedAttempts(): number { | ||
| return this._failedAttempts; | ||
| } | ||
|
|
There was a problem hiding this comment.
Some thoughts/ideas:
-
Could each operation calculate its own backoff? We can then use a simpler calculation where each backoff depends on the previous backoff time (including jitter). You would just keep a simple "last_backoff_duration" in BulkWriterOperation and multiple it per failed attempt.
-
Would it be possible to use the
Backoffclass? We should just need to expose a method to return the backoff duration. -
I think it would be cleaner if we stored a
lastErrorinstead of the more specificresourceExhausted. It is cleaner and more versatile for later modifications.
There was a problem hiding this comment.
Could each operation calculate its own backoff?
Done.
Would it be possible to use the Backoff class?
See my below comment on how the backoff used here is different than ExponentialBackoff. I don't think there is much to reuse in terms of logic here.
I think it would be cleaner if we stored a lastError instead of the more specific resourceExhausted.
Done.
| }); | ||
| }); | ||
|
|
||
| it('applies maximum backoff on retries for RESOURCE_EXHAUSTED', async () => { |
There was a problem hiding this comment.
Let's also add a test that uses multiple different backoffs in a batch and a test that verifies that we send a second batch before the first batch if the first batch has a backoff.
| /*! | ||
| * The default jitter to apply. For example, a factor of 0.3 means a 30% jitter | ||
| * is applied. | ||
| */ |
There was a problem hiding this comment.
This should mention that this relates to backoff. It needs some context in this module.
dev/src/bulk-writer.ts
Outdated
| private deferred = new Deferred<WriteResult>(); | ||
| private failedAttempts = 0; | ||
| private _failedAttempts = 0; | ||
| private _lastError?: Status; |
dev/src/bulk-writer.ts
Outdated
| return this.deferred.promise; | ||
| } | ||
|
|
||
| get lastError(): Status | undefined { |
There was a problem hiding this comment.
I am surprised this need to be exported.
dev/src/bulk-writer.ts
Outdated
| const delayMs = this._rateLimiter.getNextRequestDelayMs( | ||
| pendingBatch._opCount | ||
| ); | ||
| const finalDelayMs = Math.max(backoffMsWithJitter, delayMs); |
There was a problem hiding this comment.
You can move out of the if statement and then use it in the if(...) check itself.
dev/src/bulk-writer.ts
Outdated
| const jitter = | ||
| Math.random() * | ||
| DEFAULT_JITTER_FACTOR * | ||
| (Math.round(Math.random()) ? 1 : -1); |
There was a problem hiding this comment.
Math.random() is not cheap. You can remove one call: Math.random() * 2 * DEFAULT_JITTER_FACTOR - DEFAULT_JITTER_FACTOR.
There was a problem hiding this comment.
Ooh that's some nice mathing! Refactored it slightly to make it more understandable (at least to me).
| (1 - DEFAULT_JITTER_FACTOR) * expected[timeoutHandlerCounter], | ||
| (1 + DEFAULT_JITTER_FACTOR) * expected[timeoutHandlerCounter] | ||
| ); | ||
| timeoutHandlerCounter++; |
dev/src/bulk-writer.ts
Outdated
| class BulkWriterOperation { | ||
| private deferred = new Deferred<WriteResult>(); | ||
| private failedAttempts = 0; | ||
| private _failedAttempts = 0; |
There was a problem hiding this comment.
You can rename this back to its old name since there is no getter anymore.
No description provided.