Recover workspaces from Unavailable #4085
In high-pressure environments a workspace may be marked unavailable because the 10s timeout is not enough for a logical cluster to be propagated and be known to the shard. When this happens the workspace never recovers as there is nothing updating the status.

To prevent this:
1. The timeout is increased to one minute via a documented const (so ~60 retries instead of 10)
2. Added reconcile that allows the Workspace to recover from the logical cluster not being visible during initialization

Signed-off-by: Nelo-T. Wallus <red.brush9525@fastmail.com>
Signed-off-by: Nelo-T. Wallus <n.wallus@sap.com>
Force-pushed from 35e531d to c7d5e64.
    logger.V(3).Info("LogicalCluster reappeared, recovering workspace", "cluster", workspace.Spec.Cluster)
    conditions.MarkTrue(workspace, tenancyv1alpha1.WorkspaceInitialized)
    // Immediately request requeueing to recover the workspace faster.
    r.requeueAfter(workspace, time.Second)
Do we need to consider any other back-off period behaviors (exponential), to not spam too many retries?
The exponential backoff is why I added the requeue: at this point in the reconcile loop the workspace should be in an exponential backoff, and this should mark the workspace as operational again.
By forcing a quick requeue I wanted to trigger the next reconcile loop quickly so the initialization can continue.
/lgtm let's see how it works or does not :D
LGTM label has been added.
Git tree hash: 8526a1d9c492d78e7cb2f47601733ff6240534f9
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mjudeikis
Summary
In high-pressure environments a workspace may be marked unavailable because the 10s timeout is not enough for a logical cluster to be propagated and be known to the shard.
When this happens the workspace never recovers as there is nothing updating the status.
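To prevent this:
1. The timeout is increased to one minute via a documented const (so ~60 retries instead of 10)
2. Added reconcile that allows the Workspace to recover from the logical cluster not being visible during initialization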
What Type of PR Is This?
/kind bug
/kind flake
Related Issue(s)
Fixes #
Release Notes