
Add LongSuccessResetBackoff path in backoff function #179

Merged
evanj80-illumio merged 9 commits into main from exponential-backoff-nonterminating-happypath
Apr 18, 2025
Conversation

@AriSweedler
Contributor

Let's say we fail 7 times, then succeed for 45 minutes before failing again. Sometimes this 45-minute period of good behavior should be considered a success, even though the function didn't return "no error". This is the `LongSuccessBackoffReset` path.

To use this feature, configure the `exponentialBackoff` with the opts field `ActionTimeToConsiderSuccess`. A function that runs forever in the happy path should be considered recovered after the action runs for `ActionTimeToConsiderSuccess`.

If you don't configure an `ActionTimeToConsiderSuccess`, this feature is ignored.

`ActionTimeToConsiderSuccess` is used in the case that the happy path of our function blocks forever. A function like this may have issues that require `exponentialBackoff`, but it should be considered recovered after the action runs for `ActionTimeToConsiderSuccess`.
Contributor

Copilot AI left a comment


Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@evanj80-illumio
Contributor

This makes sense to me! What value do you think we should start with? Also, I thought this backoff logic already handled this case?

@AriSweedler
Contributor Author

This makes sense to me! What value do you think we should start with? Also, I thought this backoff logic already handled this case?

I think 15 minutes and 30 minutes are fine. I will make them configurable with these default values, though.

Realistically, the important thing to hit is: can it connect and send a few resources? We basically know that after a few seconds. But I'd rather fail open than fail closed. If we restart streams every 5 minutes, it could be normal or it could be bad. If we restart streams every 24 hours, it's definitely fine. Where do you draw the line? I took my best guess, and that's how I came up with 15 minutes. As for the 30-minute delay: if it's a real issue that requires re-auth, then the 10 failures should take less than 30 minutes (we could spend 10 × 1 min in backoff).

@AriSweedler AriSweedler requested a review from Copilot April 16, 2025 13:57
Contributor

Copilot AI left a comment


Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

internal/controller/streams_retry.go:132

  • [nitpick] The commit title refers to a 'LongSuccessBackoffReset' path, while the function is named 'LongSuccessResetBackoff'. Consider renaming the function to 'LongSuccessBackoffReset' for naming consistency.
func (s *state) LongSuccessResetBackoff() {

@AriSweedler
Contributor Author

Also I thought this backoff logic already handled this case?

That was not my observation. We could be 7 consecutiveFailures deep, and even if we successfully streamed for 3 hours, the next error would immediately jump to the 8th consecutiveFailure instead of starting back at 1.

Actions are `func() error`. We only reset the backoff depth when the Action returns success. Success WAS defined as "returns nil". Now success is defined as "ran for more than `ActionTimeToConsiderSuccess`" OR "returns nil".

@AriSweedler AriSweedler changed the title Added LongSuccessBackoffReset path in backoff function Added LongSuccessResetBackoff path in backoff function Apr 16, 2025
@AriSweedler AriSweedler requested a review from Copilot April 16, 2025 14:00
Contributor

Copilot AI left a comment


Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

@AriSweedler
Contributor Author

As of #180 (review) listing is expected to take longer. 15 minutes as a default may be too small. I will increase it to 1 hour. And then I will increase the severe error time to 2 hours. cc @ganesh-talla thanks for these comments!

AriSweedler and others added 4 commits April 16, 2025 16:19
…k a success

If it takes 20 minutes to ingest resources but then we fail to watch,
then 15 minutes is too short. Bumping this up seemed like a good idea.
And then instead of hardcoding it, I made it configurable
@rlenglet rlenglet changed the title Added LongSuccessResetBackoff path in backoff function Add LongSuccessResetBackoff path in backoff function Apr 18, 2025
rlenglet
rlenglet previously approved these changes Apr 18, 2025
ganesh-talla
ganesh-talla previously approved these changes Apr 18, 2025
Co-authored-by: Romain Lenglet <romain.lenglet@berabera.info>
@AriSweedler AriSweedler dismissed stale reviews from ganesh-talla and rlenglet via 0713bb5 April 18, 2025 16:12
@AriSweedler AriSweedler requested a review from rlenglet April 18, 2025 16:57
Contributor

@evanj80-illumio evanj80-illumio left a comment


LGTM 🥳 Thanks for the general spacing clean up as well!

rlenglet
rlenglet previously approved these changes Apr 18, 2025
@evanj80-illumio evanj80-illumio merged commit 9d48abb into main Apr 18, 2025
8 checks passed
@evanj80-illumio evanj80-illumio deleted the exponential-backoff-nonterminating-happypath branch April 18, 2025 21:09
pavankumarinnamuri pushed a commit that referenced this pull request Jan 16, 2026