🐛 Do not update KCP and MS status when unable to get workload cluster #10229

Closed
wants to merge 3 commits into from

Conversation

jessehu
Contributor

@jessehu jessehu commented Mar 6, 2024

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #10195

Test:

  • Added UT for KCP.Status update logic in the error scenario.
  • The existing UT code for the MS.Status update logic does not contain a fake Reconciler.Tracker object and does not cover the getNode logic, so it needs more changes before a UT for this error scenario can be added.
  • Ran an e2e test that creates a 3 CP & 3 worker cluster and updates the CAPI components version (based on CAPI 1.5.2) after the cluster is created, and observed the following expected behavior:
    • MD.Status.ReadyReplicas won't change from 3 to 0 when getMachineNode() hits the ErrClusterLocked error. The log "Requeuing because another worker has the lock on the ClusterCacheTracker" was printed.
    • KCP.Status.ReadyReplicas won't change from 3 to 0 when controlPlane.GetWorkloadCluster(ctx) hits the ErrClusterLocked error.

@k8s-ci-robot k8s-ci-robot added this to the v1.5 milestone Mar 6, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area PR is missing an area label labels Mar 6, 2024
@k8s-ci-robot
Contributor

Welcome @jessehu!

It looks like this is your first PR to kubernetes-sigs/cluster-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @jessehu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 6, 2024
Contributor

@killianmuldoon killianmuldoon left a comment

Thanks for taking this on!

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 6, 2024
@@ -43,8 +43,7 @@ func (r *KubeadmControlPlaneReconciler) updateStatus(ctx context.Context, contro

 	// set basic data that does not require interacting with the workload cluster
 	controlPlane.KCP.Status.Replicas = replicas
-	controlPlane.KCP.Status.ReadyReplicas = 0
-	controlPlane.KCP.Status.UnavailableReplicas = replicas
+	controlPlane.KCP.Status.UnavailableReplicas = replicas - controlPlane.KCP.Status.ReadyReplicas
Contributor Author

Note: This change avoids setting controlPlane.KCP.Status.ReadyReplicas from non-zero to 0 when GetWorkloadCluster(ctx) hits ErrClusterLocked or another error.

Contributor Author

BTW: https://github.com/kubernetes-sigs/cluster-api/blob/release-1.5/controlplane/kubeadm/internal/controllers/status.go#L93 uses replicas rather than desiredReplicas to calculate UnavailableReplicas. Not sure if it's a bug or on purpose. cc @fabriziopandini please take a look.

Member

it would be odd to have two different ways to calculate the unavailable replicas within the same status update function.

Contributor Author

hi @neolit123, actually it uses the same way to calculate the unavailable replicas. Line 46 only covers the case where KCP.Status.ReadyReplicas is 0, since the default value of UnavailableReplicas is 0 when the KCP is newly created. If ReadyReplicas is not 0, UnavailableReplicas should equal replicas - controlPlane.KCP.Status.ReadyReplicas.

line 45-46:
   controlPlane.KCP.Status.Replicas = replicas
   controlPlane.KCP.Status.UnavailableReplicas = replicas - controlPlane.KCP.Status.ReadyReplicas

line 91-92:
	controlPlane.KCP.Status.ReadyReplicas = status.ReadyNodes
	controlPlane.KCP.Status.UnavailableReplicas = replicas - status.ReadyNodes

Contributor Author

hi @neolit123 @fabriziopandini could you please take some time to help move this forward? We need this patch to fix an issue we hit in our project, which is based on CAPI.

Contributor Author

@jessehu jessehu Apr 6, 2024

@sbueringer Got your point. If we do it this way, we may also need to move controlPlane.KCP.Status.Replicas = replicas to before line 96 controlPlane.KCP.Status.ReadyReplicas = status.ReadyNodes; otherwise only Status.Replicas increases or decreases by 1 while the other replica fields stay unchanged (e.g. UpdatedReplicas, ReadyReplicas, UnavailableReplicas, etc.), which seems unreasonable. Even if it's moved, the conditions are updated while no replica fields are updated (which is unreasonable but maybe acceptable since it will be corrected in the upcoming reconciles soon?).

If updating both Status.Replicas and Status.UnavailableReplicas before getting the ClusterStatus, as this PR currently behaves, is not expected:

  1. it sounds unreasonable when replicas changes from 3 to 4 (when rolling out KCP): Replicas -> 4 & UnavailableReplicas -> 1 (incorrect)
  2. it sounds unreasonable when replicas changes from 3 to 2 (when a KCP Machine was manually deleted?): Replicas -> 2 & UnavailableReplicas -> -1 (should be 1)
    The root cause is that line 93 controlPlane.KCP.Status.UnavailableReplicas = replicas - status.ReadyNodes (and line 47 as well) should be controlPlane.KCP.Status.UnavailableReplicas = desiredReplicas - status.ReadyNodes.
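For illustration, here is a small worked example (a standalone Go sketch with hypothetical numbers, not the actual controller code) of how the two formulas differ numerically in the two scenarios above; whether the result matches the intuitively "right" value also depends on when the workload cluster stops reporting the deleted node as ready:

	package main

	import "fmt"

	func main() {
		desiredReplicas := int32(3) // KCP.Spec.Replicas

		// Scenario 1: rolling out KCP, replicas temporarily surges from 3 to 4
		// while the workload cluster still reports 3 ready nodes.
		replicas, readyNodes := int32(4), int32(3)
		fmt.Println("replicas - readyNodes:", replicas-readyNodes)               // 1
		fmt.Println("desiredReplicas - readyNodes:", desiredReplicas-readyNodes) // 0

		// Scenario 2: a KCP Machine was deleted manually, replicas drops from 3
		// to 2 while the workload cluster still reports 3 ready nodes.
		replicas, readyNodes = 2, 3
		fmt.Println("replicas - readyNodes:", replicas-readyNodes)               // -1, a negative unavailable count
		fmt.Println("desiredReplicas - readyNodes:", desiredReplicas-readyNodes) // 0
	}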

Member

@sbueringer sbueringer Apr 8, 2024

Good point. Let's do the following:

  1. in the create case: let's set .status.replicas, .status.unavailableReplicas, .status.readyReplicas (we'll set readyReplicas to 0 but let's do it anyway for clarity)
  2. Then we set .status.version & conditions (l.55-82)
  3. Then let's get the cluster status and only afterwards modify the status replica fields (let's also move down the update of status updatedReplicas)

I think it makes sense to update the conditions in 2 as they are all based on information from the mgmt cluster.

Overall we're just missing a way to signal that we can't communicate with the wl cluster anymore, but we can follow up on that
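To make the proposed ordering concrete, here is a minimal sketch under stated assumptions: kcpStatus, workloadStatus, getWorkloadStatus and updateStatusSketch are simplified, hypothetical stand-ins (not the real KCP controller types), and the create-case initialization from step 1 is skipped:

	package main

	import (
		"context"
		"errors"
		"fmt"
	)

	// Simplified stand-ins for the real KCP types, for illustration only.
	type kcpStatus struct {
		Replicas, ReadyReplicas, UpdatedReplicas, UnavailableReplicas int32
	}

	type workloadStatus struct{ ReadyNodes int32 }

	// updateStatusSketch mirrors the proposed ordering: management-cluster
	// derived fields first, conditions next, and workload-cluster derived
	// replica fields only after the workload cluster was actually reached.
	func updateStatusSketch(
		ctx context.Context,
		status *kcpStatus,
		desiredReplicas, replicas, upToDateMachines int32,
		getWorkloadStatus func(context.Context) (workloadStatus, error),
	) error {
		// Step 1: fields derived only from the management cluster.
		status.Replicas = replicas

		// Step 2: .status.version and conditions would be set here; they are
		// also based on management-cluster information only.

		// Step 3: reach the workload cluster; if that fails, return before
		// touching the workload-dependent replica fields so they keep their
		// previously reported values instead of being reset.
		ws, err := getWorkloadStatus(ctx)
		if err != nil {
			return fmt.Errorf("failed to get workload cluster status: %w", err)
		}
		status.ReadyReplicas = ws.ReadyNodes
		status.UnavailableReplicas = desiredReplicas - ws.ReadyNodes
		status.UpdatedReplicas = upToDateMachines
		return nil
	}

	func main() {
		s := &kcpStatus{Replicas: 3, ReadyReplicas: 3}
		err := updateStatusSketch(context.Background(), s, 3, 3, 3,
			func(context.Context) (workloadStatus, error) {
				// Simulating an ErrClusterLocked-style failure.
				return workloadStatus{}, errors.New("cluster is locked already")
			})
		fmt.Println(err)        // status update aborted
		fmt.Printf("%+v\n", *s) // ReadyReplicas stays at 3 instead of flipping to 0
	}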

Contributor Author

@sbueringer Got it. I pushed a commit to use Status.UnavailableReplicas = desiredReplicas - status.ReadyNodes, which should solve the problems we discussed above. Please take a look to see if it does. If not, I will make the changes as you described, but I'm not sure why we need to move down the update of status updatedReplicas, because it doesn't depend on the workload cluster status: controlPlane.KCP.Status.UpdatedReplicas = int32(len(controlPlane.UpToDateMachines()))

Member

The idea would have been to not update any status replica fields if we can't communicate with the workload cluster, so that they don't get into an inconsistent state with each other. I'll think a bit about it.

Contributor Author

Yup, I agree that the ideal way would have been to not update any status replica fields or condition fields if we can't communicate with the workload cluster.

@jessehu
Contributor Author

jessehu commented Mar 6, 2024

/area machineset

@k8s-ci-robot k8s-ci-robot added area/machineset Issues or PRs related to machinesets and removed do-not-merge/needs-area PR is missing an area label labels Mar 6, 2024
@sbueringer sbueringer removed this from the v1.5 milestone Mar 12, 2024
Member

@neolit123 neolit123 left a comment

general idea for the change LGTM
approval is up to CAPI maintainers.

Member

@neolit123 neolit123 left a comment

LGTM
leaving lgtm/approve to active maintainers

internal/controllers/machineset/machineset_controller.go (outdated review thread, resolved)
Member

@neolit123 neolit123 left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 28, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: d4abd1501740aab050ed6f27e26ac81af5433520

@sbueringer
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 5, 2024
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 5, 2024
@jessehu jessehu changed the base branch from release-1.5 to main April 5, 2024 07:40
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 5, 2024

@k8s-ci-robot k8s-ci-robot removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 5, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from neolit123. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 5, 2024
@jessehu
Contributor Author

jessehu commented Apr 5, 2024

/retest

g.Expect(kcp.Status.ReadyReplicas).To(BeEquivalentTo(3))
g.Expect(kcp.Status.UnavailableReplicas).To(BeEquivalentTo(0))
g.Expect(kcp.Status.Ready).To(BeTrue())
}
Member

@sbueringer sbueringer Apr 5, 2024

Let's add an additional call to updateStatus here:

  • drop one controlPlane.machines (or add another one)
  • call updateStatus
  • check that status.Replicas changed and the other ones stayed the same

Contributor Author

Sure. Will add the case after the discussion is resolved: #10229 (comment)

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 9, 2024
Member

@fabriziopandini fabriziopandini left a comment

@jessehu sorry but I don't think we should rush this fix in its current state, for a couple of reasons:

  • the problem description in this issue is about the MD controller handling the cluster cache tracker lock while restarting the controllers.
  • the problem description in the issue also talks about KCP flipping from 3 to 2 and back to 3 while restarting controllers. This seems unrelated to the lock above (there is only one reconciler running in KCP, so there are no other reconcilers that could hold the lock for the same cluster); it means this is probably a different issue, but we probably don't have enough data to triage it.
  • the discussion on the issue led to a wider problem about how to handle "we cannot get the node state", and somehow shifted to the semantics of the replica fields, but this IMO requires more thinking / design.

I admit this is not ideal. We should try to do better as maintainers when providing guidance.

Unfortunately, the effect of this lack of analysis / scope creep on the issue is that the current PR might introduce more confusion (and not fix the original issue). I have tried to explain this in the comments below, but it is probably more productive to think about a way forward.

My suggestion is to focus on the specific issue about the MD controller handling the cluster cache tracker lock while restarting the controllers.
We can probably mitigate this with a simple change I have discussed with @sbueringer (at least for the happy case where there is connectivity to the workload cluster)

@@ -956,6 +960,9 @@ func (r *Reconciler) getMachineNode(ctx context.Context, cluster *clusterv1.Clus
 	}
 	node := &corev1.Node{}
 	if err := remoteClient.Get(ctx, client.ObjectKey{Name: machine.Status.NodeRef.Name}, node); err != nil {
+		if apierrors.IsNotFound(err) {
Member

@fabriziopandini fabriziopandini Apr 9, 2024

I'm not sure why we are ignoring not found.
Is that for the case when someone manually deletes a node? If yes, please add a comment (but it also seems unrelated to the issue we are trying to fix)

Member

@sbueringer sbueringer Apr 9, 2024

Just for clarification.

The idea behind this one was to signal to the calling function that the node doesn't exist (vs. it just being an error), because not finding the Node is also valuable information (which basically comes down to the node not being ready).

This was added to preserve the previous behavior one level above (where we only logged the error before, but still considered a not-found node to be a node that is not ready).

But yeah, it becomes pretty hard to understand.

@@ -882,8 +887,7 @@ func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluste

 		node, err := r.getMachineNode(ctx, cluster, machine)
 		if err != nil && machine.GetDeletionTimestamp().IsZero() {
-			log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
-			continue
+			return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node))
Member

@fabriziopandini fabriziopandini Apr 9, 2024

According to the issue

We probably also want to have a timeout on this behaviour. If we haven't seen the Node in x time then we assume its status is unknown

But if I'm not wrong, with the current implementation we are going to freeze the replica count indefinitely, which could be confusing or eventually also wrong, given that the number of replicas/machines might change in the meantime.

Frankly speaking, in order to properly fix this issue I think that we first need to figure out how to get/store the "last seen" information for each node.

Given that info, we can decide whether it is ok to flip the replica status or to wait (but I don't have a concrete idea on how to do so, I need some time to research it)

Member

@sbueringer sbueringer Apr 9, 2024

Important point that I discussed with Fabrizio. We're not only freezing the replica fields, we would also freeze other status fields like ObservedGeneration and conditions.

@@ -844,7 +848,8 @@ func (r *Reconciler) shouldAdopt(ms *clusterv1.MachineSet) bool {
 }
 
 // updateStatus updates the Status field for the MachineSet
-// It checks for the current state of the replicas and updates the Status of the MachineSet.
+// It checks for the current state of the replicas and updates the Status field of the MachineSet.
+// When unable to retrieve the Node status, it returns error and won't update the Status field of the MachineSet.
 func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluster, ms *clusterv1.MachineSet, filteredMachines []*clusterv1.Machine) error {
Member

@sbueringer sbueringer Apr 9, 2024

Here is a proposed solution to the ErrClusterLocked issue (the one Fabrizio was referring to):

	// Retry getting a remote client.
	// Note: This is to ensure that we don't run into errors here just 
	// because multiple reconcilers try to create the client at the same time
	// (for the case where the workload cluster is reachable).
	var canCommunicateWithWorkloadCluster bool
	var remoteClient client.Client
	err = retry.OnError(wait.Backoff{
		Steps:    5,
		Duration: 200 * time.Millisecond,
		Factor:   1.0,
	}, func(err error) bool {
		// Retry as long as we get remote.ErrClusterLocked errors. 
		return errors.Is(err, remote.ErrClusterLocked)
	}, func() error {
		remoteClient, err = r.Tracker.GetClient(ctx, util.ObjectKey(cluster))
		if err != nil {
			return err
		}
		canCommunicateWithWorkloadCluster = true
		return nil
	})
	if err != nil {
		log.Error(err, "Unable to retrieve status of Nodes: failed to create remote client")
	}

	for _, machine := range filteredMachines {
		log := log.WithValues("Machine", klog.KObj(machine))

		if templateLabel.Matches(labels.Set(machine.Labels)) {
			fullyLabeledReplicasCount++
		}

		if machine.Status.NodeRef == nil {
			log.V(4).Info("Waiting for the machine controller to set status.nodeRef on the Machine")
			continue
		}

		if !canCommunicateWithWorkloadCluster {
			// Skip the rest of the for loop if we can't communicate with the workload cluster.
			continue
		}

		node, err := r.getMachineNode(ctx, remoteClient, machine)
		if err != nil && machine.GetDeletionTimestamp().IsZero() {
			log.Error(err, "Unable to retrieve Node status", "Node", klog.KObj(node))
			continue
		}

		if noderefutil.IsNodeReady(node) {
			readyReplicasCount++
			if noderefutil.IsNodeAvailable(node, ms.Spec.MinReadySeconds, metav1.Now()) {
				availableReplicasCount++
			}
		} else if machine.GetDeletionTimestamp().IsZero() {
			log.V(4).Info("Waiting for the Kubernetes node on the machine to report ready state")
		}
	}

Member

The idea is that when the workload cluster is reachable, we only get ErrClusterLocked for a very short amount of time (the time it takes to create a client). For this case it is good enough to simply retry creating the client.

We will fallback to the previous behavior only if the workload cluster is actually not reachable.

Contributor Author

@jessehu jessehu Apr 10, 2024

Cool! I'm not quite sure it's worth adding this chunk of code to resolve the temporary inconsistency of some status replica and condition fields caused by the ErrClusterLocked error. The simple change in the current PR can solve this problem; there is some inconsistency, but it is acceptable and won't cause issues in my view (maybe I missed some cases).
cc @fabriziopandini for awareness.

Member

@sbueringer and I are in total agreement, and we are trying to help re-focus the PR on the original issue. We also took some additional steps to help find a way forward by proposing an alternative solution.

With regards to the current PR, I already tried to explain my concern and I will be happy to add more if you have any doubts (and I already answered one).

But given the concern above, I'm personally -1 to merge the PR in the current state.

Instead, we both think the change proposed by @sbueringer solves the original issues, but ultimately it is up to you to accept it or not

Contributor Author

@jessehu jessehu Apr 11, 2024

Thanks @fabriziopandini for the details. Here are my two cents:

  1. Should this new if block return err instead of continue, since canCommunicateWithWorkloadCluster won't change in the for loop?
		if !canCommunicateWithWorkloadCluster {
			// Skip the rest of the for loop if we can't communicate with the workload cluster.
			continue
		}
  2. There are many places that return ErrClusterLocked (not only in the MS controller and KCP controller). If we want this retry logic, shall we add it in other places as well? Or just here, to resolve the original status updating issue?
  3. The current PR code returns err in the for loop in updateStatus() in the MS controller when hitting ErrClusterLocked or another error. In this case, it won't update MS.Status, because the status update code is after the for loop at line 903 newStatus.Replicas = int32(len(filteredMachines)).
    However, the KCP controller is different and already updates some status fields before hitting ErrClusterLocked.
machineset_controller.go:

		if err != nil && machine.GetDeletionTimestamp().IsZero() {
-		    log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
-			continue
+			return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node))
		}

[screenshot]

Contributor Author

One more thing: I'm not sure if canCommunicateWithWorkloadCluster == true can 100% guarantee that getMachineNode() won't hit ErrClusterLocked, since they are two serial function calls, not an atomic call?

Member

Should this new if block return err instead of continue, since canCommunicateWithWorkloadCluster won't change in the for loop?

Our main concern is basically that if we return an error here as soon as we hit ErrClusterLocked, we don't update the status at all anymore. This should be okay in the happy path, where we can actually communicate with the workload cluster and it's just a matter of waiting until another goroutine has successfully created the client so we can use it. The problem we see is when the workload cluster is actually not reachable, because in that case we will just continuously return the error forever. In this case we "freeze" the values of status: FullyLabeledReplicas, ReadyReplicas, AvailableReplicas, ObservedGeneration and a few conditions.

The concern Fabrizio raised (and I didn't have on my radar before) is that if we freeze the status indefinitely in these cases (or rather until the workload cluster is reachable) this can be pretty confusing for users. So for this case we should actually have a more extensive solution which also covers signalling to users that we can't communicate with the workload cluster anymore.

What we were trying to suggest was a mitigation for the issue for the happy path, where we actually can talk to the workload cluster, but replicas are flipping only because of the way we create the client initially for a cluster.

There are many places that return ErrClusterLocked (not only in MS controller and KCP controller). If we want this retry logic, shall we add it in other places as well? Or just here to resolve the original status updating issue?

I'm not really sure about this one. Basically we introduced the current "TryLock" over a "Lock" to make sure we don't block all reconciles for a cluster when a workload cluster is not reachable. The "Lock" we had before led to really problematic performance issues when some workload clusters (with a lot of Machines) were unreachable.
So I think we should be careful with introducing these retries as they can lead to serious performance degradation (basically by just introducing this here, every reconcile of a MachineSet of an unreachable cluster will take 1 second).
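For a rough sense of that "1 second" figure, here is a small sketch based on the wait.Backoff values from the snippet above (the exact sleep accounting inside client-go's retry.OnError may differ slightly; this is only an approximation):

	package main

	import (
		"fmt"
		"time"
	)

	func main() {
		// Values from the proposed retry: Steps: 5, Duration: 200ms, Factor: 1.0.
		steps := 5
		duration := 200 * time.Millisecond

		// If every attempt keeps failing with ErrClusterLocked (i.e. the cluster
		// is unreachable), roughly one sleep happens between consecutive attempts,
		// adding about (steps-1) * duration of latency per MachineSet reconcile.
		fmt.Println(time.Duration(steps-1) * duration) // 800ms, i.e. roughly 1s
	}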

@sbueringer
Member

sbueringer commented Apr 11, 2024

@jessehu

Just trying to summarize the overall direction of my and Fabrizio's recent comments:

  • We realized that by not updating status at all as soon as we get ErrClusterLocked we freeze the status of unreachable clusters indefinitely. We think we need a more extensive solution for that which should include in some way signalling to the user via status that we can't communicate with the workload cluster anymore (e.g. just like Node conditions can flip to unknown).
  • But it probably requires more work and more extensive changes to figure out how to do this correctly
  • Because of this we were trying to figure out how we can solve your concrete issue for the "happy path" (== the wl cluster is actually reachable, the cluster cache tracker just temporarily returns ErrClusterLocked) in the meantime
  • For KCP, I don't understand yet how it is possible that we get ErrClusterLocked errors. I would like to understand what I'm missing there. (if we understand what is happening there we might be able to just avoid ErrClusterLocked errors in KCP altogether)

Sorry for the confusion overall, it's just a fairly complex topic and we're trying to figure out what the right thing to do is

@jessehu
Contributor Author

jessehu commented Apr 11, 2024

@sbueringer Thanks very much for the summary. I totally agree on bullets 1/2/3.

  • To achieve 1 (the ideal solution), it probably requires more work and more extensive changes (as you mentioned in 2);
  • To achieve 3 (the workable solution in most cases), the current PR is almost enough, except for the KCP status update patch that you and @fabriziopandini have concerns about.
  • For 4, I'm not 100% sure that the unexpected update of KCP.Status.ReadyReplicas is caused by the ErrClusterLocked error. However, the main branch code allows this possibility when hitting ErrClusterLocked or any other error at line 86 or 90:

[screenshots of the referenced main branch code]

@sbueringer
Member

sbueringer commented Apr 11, 2024

For 4, I'm not 100% sure that the unexpected update of KCP.Status.ReadyReplicas is caused by the ErrClusterLocked error. However, the main branch code allows this possibility when hitting ErrClusterLocked or any other error at line 86 or 90.

Yup absolutely no question that if we hit ErrClusterLocked, this happens. What I don't understand is how we can hit ErrClusterLocked. This should only be possible by concurrent access to the ClusterCacheTracker for the same cluster. But we only have one reconciler (KubeadmControlPlaneReconciler) in this controller and this one can't run concurrently for the same KCP/Cluster object.

I think what could help to figure this out is to add a log to print the call stack whenever we enter ClusterCacheTracker.GetClient (+ for which cluster GetClient is called)

Q: Are you also getting ErrClusterLocked printed in the KCP controller logs? (IIRC the issue had a log from the MachineSet controller)
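Such a call-stack log could look roughly like this (a standalone sketch; getClientDebug and its placement are hypothetical, the real ClusterCacheTracker.GetClient lives in the controllers/remote package and uses the controller-runtime logger):

	package main

	import (
		"fmt"
		"runtime/debug"
	)

	// getClientDebug illustrates the suggestion above: log which cluster
	// GetClient is called for, plus the calling goroutine's stack, so that
	// unexpected concurrent callers within the KCP controller can be identified.
	func getClientDebug(clusterKey string) {
		fmt.Printf("GetClient called for cluster %q\n%s\n", clusterKey, debug.Stack())
		// ... the real client creation would follow here ...
	}

	func main() {
		getClientDebug("default/my-cluster")
	}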

@jessehu
Contributor Author

jessehu commented Apr 11, 2024

I observed the issue when upgrading the CAPI components. Could a rolling update of the KCP controller deployment produce the ClusterCacheTracker lock scenario?

@sbueringer
Member

To achieve 3 (the workable solution in most cases), the current PR is almost enough

For MachineSet specifically: it definitely solves the "happy path", but it makes the "unhappy path" worse by freezing the status indefinitely. Without the changes, the ready and available replicas were dropping to 0 when the communication broke down. It's probably a matter of preference whether, in case of a communication breakdown, we prefer to keep ready and available replicas as-is or drop them to 0. But another side effect is that we also don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions)

@sbueringer
Member

sbueringer commented Apr 11, 2024

I observed the issue when upgrading the CAPI components. Could a rolling update of the KCP controller deployment produce the ClusterCacheTracker lock scenario?

No, the lock is purely in-memory; basically it just ensures that only one worker (within a controller) at a time can call GetClient (which times out after 10 seconds when the workload cluster is unreachable). The goal was that one unreachable cluster doesn't block a lot of workers (which would be especially bad if you have an unreachable cluster with a lot of Machines).
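A minimal sketch of that in-memory TryLock behavior (not the actual ClusterCacheTracker implementation; names and structure are simplified assumptions): a second caller for the same cluster fails fast with an ErrClusterLocked-style error instead of queueing up behind a potentially slow client creation.

	package main

	import (
		"errors"
		"fmt"
		"sync"
	)

	// errClusterLocked stands in for remote.ErrClusterLocked.
	var errClusterLocked = errors.New("cluster is locked already")

	type tracker struct {
		mu    sync.Mutex             // guards the per-cluster lock map
		locks map[string]*sync.Mutex // one in-memory lock per cluster
	}

	func (t *tracker) getClient(clusterKey string) error {
		t.mu.Lock()
		l, ok := t.locks[clusterKey]
		if !ok {
			l = &sync.Mutex{}
			t.locks[clusterKey] = l
		}
		t.mu.Unlock()

		// TryLock instead of Lock: a concurrent caller for the same cluster
		// returns immediately instead of blocking behind a slow or unreachable
		// workload cluster and tying up a worker.
		if !l.TryLock() {
			return errClusterLocked
		}
		defer l.Unlock()

		// ... the expensive client creation against the workload cluster
		// (with its ~10s timeout) would happen here ...
		return nil
	}

	func main() {
		t := &tracker{locks: map[string]*sync.Mutex{}}
		fmt.Println(t.getClient("cluster-a")) // <nil>
	}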

@smartxworks smartxworks closed this by deleting the head repository Apr 14, 2024
@jessehu
Contributor Author

jessehu commented Apr 14, 2024

I'm sorry for the confusion; the forked repo was deleted by someone else today by mistake. I will create a new PR to handle the MS status update issue first, as suggested.
