Fixes to ensure Delete phase is always handled and prevent deploys from being stuck in a Savepointing phase #142

Merged
merged 6 commits into from
Dec 10, 2019

@glaksh100 glaksh100 commented Dec 9, 2019

Problem

  • A deploy starts: a savepoint is triggered and its triggerID is registered.
  • The GetSavepointStatus call keeps failing, for whatever reason.
  • Once retries for GetSavepointStatus are exhausted, there is no real way to progress, so the app gets stuck in the Savepointing phase.
  • The user tries to force delete the application. However, because retry counts are not reset and LastSeenError is non-nil, the operator never handles the Deleting phase (i.e. isTimeToHandlePhase always evaluates to false).

The PR fixes these problems in the following way:

  • Reduces retries for the GetSavepointStatus call from 20 to 3, to shorten the time it takes to detect savepoint failures (especially 404s where no savepoint is found for a triggerID). I considered making this call non-retryable, but I'm guessing the savepoint status for a triggerID may not become available immediately.
  • When retries for GetSavepointStatus are exhausted, we cannot reliably roll back, as we may not have a running job or a successful savepoint. Instead, we let the remainder of the savepointing block complete, i.e. we try to recover from an externalized checkpoint.
  • When a delete is attempted, always handle the Deleting phase, irrespective of the application's error or retry state.

@glaksh100 glaksh100 changed the title [WIP] Fix for deploy getting stuck Fixes to ensure Delete phase is always handled and prevent deploys from being stuck in a Savepointing phase Dec 9, 2019
@glaksh100

@mwylde can you PTAL?


@mwylde mwylde left a comment

LGTM. I think this is a great start, but we should continue thinking about ways to make this pathway more robust.

@glaksh100 glaksh100 merged commit 4edffc5 into master Dec 10, 2019

2 participants