Recover workspaces from Unavailable #4085

Merged
kcp-ci-bot merged 1 commit into kcp-dev:main from ntnn:workspace-recover-from-unavailable on Apr 30, 2026
@ntnn ntnn commented Apr 28, 2026

Member

Summary

In high-pressure environments a workspace may be marked unavailable because the 10s timeout is not enough for a logical cluster to be propagated and be known to the shard.

When this happens the workspace never recovers as there is nothing updating the status.

To prevent this:

  1. The timeout is increased to one minute via a documented const (so ~60 retries instead of 10)
  2. A reconcile step is added that allows the Workspace to recover when the logical cluster was not visible during initialization

What Type of PR Is This?

/kind bug
/kind flake

Related Issue(s)

Fixes #

Release Notes

NONE

@kcp-ci-bot kcp-ci-bot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. dco-signoff: yes Indicates the PR's author has signed the DCO. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 28, 2026
Signed-off-by: Nelo-T. Wallus <red.brush9525@fastmail.com>
Signed-off-by: Nelo-T. Wallus <n.wallus@sap.com>
@ntnn ntnn force-pushed the workspace-recover-from-unavailable branch from 35e531d to c7d5e64 Compare April 28, 2026 20:51
@ntnn ntnn added this to tbd Apr 28, 2026
@ntnn ntnn moved this to In review in tbd Apr 28, 2026
logger.V(3).Info("LogicalCluster reappeared, recovering workspace", "cluster", workspace.Spec.Cluster)
conditions.MarkTrue(workspace, tenancyv1alpha1.WorkspaceInitialized)
// Immediately request requeueing to recover the workspace faster.
r.requeueAfter(workspace, time.Second)

Contributor

Do we need to consider any other back-off period behaviors (exponential), to not spam too many retries?

Member Author

The exponential backoff is why I added the requeue: at this point in the reconcile loop the workspace should be in an exponential backoff, and this should mark the workspace as operational again.
By forcing a quick requeue I wanted to trigger the next reconcile loop quickly so the initialization can continue.

@mjudeikis

Contributor

/lgtm
/approve

Let's see how it works or does not :D

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2026
@kcp-ci-bot

Contributor

LGTM label has been added.

Git tree hash: 8526a1d9c492d78e7cb2f47601733ff6240534f9

@kcp-ci-bot

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mjudeikis

@kcp-ci-bot kcp-ci-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2026
@kcp-ci-bot kcp-ci-bot merged commit c58f1b3 into kcp-dev:main Apr 30, 2026
14 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in tbd Apr 30, 2026
@ntnn ntnn mentioned this pull request May 12, 2026
@ntnn ntnn deleted the workspace-recover-from-unavailable branch May 13, 2026 05:04