Recover workspaces from Unavailable by ntnn · Pull Request #4085 · kcp-dev/kcp

ntnn · 2026-04-28T20:12:40Z

Summary

In high-pressure environments, a workspace may be marked unavailable because the 10s timeout is not long enough for a logical cluster to propagate and become known to the shard.

When this happens, the workspace never recovers, because nothing updates its status afterwards.

To prevent this:

1. The timeout is increased to one minute via a documented const (so ~60 retries instead of 10).
2. A reconcile step is added that allows the Workspace to recover when the logical cluster was not visible during initialization.
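A minimal sketch of why the longer timeout helps, assuming the controller polls for the LogicalCluster roughly once per second (an inference from "~60 retries instead of 10"). The names `oldVisibilityTimeout`, `logicalClusterVisibilityTimeout`, `pollInterval`, `maxAttempts`, and `waitForLogicalCluster` are hypothetical and are not kcp's actual identifiers; the shard lookup is simulated by a callback so the example runs without a cluster.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical names: kcp's real constant and poll interval may differ.
const (
	oldVisibilityTimeout            = 10 * time.Second
	logicalClusterVisibilityTimeout = time.Minute
	// Assumed poll interval, inferred from "~60 retries instead of 10".
	pollInterval = time.Second
)

// maxAttempts returns how many polls fit into the given timeout.
func maxAttempts(timeout time.Duration) int {
	return int(timeout / pollInterval)
}

// waitForLogicalCluster polls visible() up to maxAttempts(timeout) times and
// reports whether the LogicalCluster became visible in time. The sleep is
// skipped so the sketch runs instantly; a real controller would wait
// pollInterval between attempts (or requeue instead of blocking).
func waitForLogicalCluster(timeout time.Duration, visible func(attempt int) bool) bool {
	for attempt := 1; attempt <= maxAttempts(timeout); attempt++ {
		if visible(attempt) {
			return true
		}
	}
	return false
}

func main() {
	// Simulate a shard under load that only sees the LogicalCluster on the 15th poll.
	slowShard := func(attempt int) bool { return attempt >= 15 }

	fmt.Println(maxAttempts(oldVisibilityTimeout), waitForLogicalCluster(oldVisibilityTimeout, slowShard))
	fmt.Println(maxAttempts(logicalClusterVisibilityTimeout), waitForLogicalCluster(logicalClusterVisibilityTimeout, slowShard))
}
```

With the 10-second budget the simulated shard is given up on after 10 polls; with one minute it is found on poll 15. A longer timeout only narrows the failure window, which is why the second change (the recovery path) is still needed.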

What Type of PR Is This?

/kind bug
/kind flake

Related Issue(s)

Fixes #

Release Notes

NONE

Signed-off-by: Nelo-T. Wallus <red.brush9525@fastmail.com>
Signed-off-by: Nelo-T. Wallus <n.wallus@sap.com>

gman0 · 2026-04-29T07:53:40Z

+						logger.V(3).Info("LogicalCluster reappeared, recovering workspace", "cluster", workspace.Spec.Cluster)
+						conditions.MarkTrue(workspace, tenancyv1alpha1.WorkspaceInitialized)
+						// Immediately request requeueing to recover the workspace faster.
+						r.requeueAfter(workspace, time.Second)


Do we need to consider other back-off behaviors (e.g. exponential) so that we don't spam too many retries?

ntnn · Apr 29, 2026

The exponential backoff is why I added the requeue: at this point in the reconcile loop, the workspace should already be in exponential backoff, and this branch marks the workspace as operational again.
By forcing a quick requeue, I wanted the next reconcile loop to run soon so that initialization can continue.

mjudeikis · 2026-04-30T14:24:00Z

/lgtm
/approve

Let's see whether it works or not :D

kcp-ci-bot · 2026-04-30T14:24:07Z

LGTM label has been added.


Git tree hash: 8526a1d9c492d78e7cb2f47601733ff6240534f9

kcp-ci-bot · 2026-04-30T14:24:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mjudeikis

The full list of commands accepted by this bot can be found here.

The pull request process is described here.


Needs approval from an approver in each of these files:

~~OWNERS~~ [mjudeikis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kcp-ci-bot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. dco-signoff: yes Indicates the PR's author has signed the DCO. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 28, 2026

ntnn force-pushed the workspace-recover-from-unavailable branch from 35e531d to c7d5e64 April 28, 2026 20:51

ntnn added this to tbd Apr 28, 2026

ntnn moved this to In review in tbd Apr 28, 2026

gman0 reviewed Apr 29, 2026


kcp-ci-bot assigned mjudeikis Apr 30, 2026

kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 30, 2026

kcp-ci-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2026

kcp-ci-bot merged commit c58f1b3 into kcp-dev:main Apr 30, 2026
14 checks passed

github-project-automation Bot moved this from In review to Done in tbd Apr 30, 2026

ntnn mentioned this pull request May 12, 2026

kube rebase v1.36.0 #4117

Merged

ntnn deleted the workspace-recover-from-unavailable branch May 13, 2026 05:04

kcp-ci-bot merged 1 commit into kcp-dev:main from ntnn:workspace-recover-from-unavailable

4 participants

Conversation

ntnn commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What Type of PR Is This?

Related Issue(s)

Release Notes

Uh oh!

gman0 Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

ntnn Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

mjudeikis commented Apr 30, 2026

Uh oh!

kcp-ci-bot commented Apr 30, 2026

Uh oh!

kcp-ci-bot commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ntnn commented Apr 28, 2026 •

edited

Loading