Delete target machinesets #214
Conversation
pkg/apis/clusteroperator/types.go
Outdated
@@ -312,6 +312,10 @@ type ClusterStatus struct {
	// ClusterVersionRef references the resolved clusterversion the cluster should be running.
	// +optional
	ClusterVersionRef *corev1.ObjectReference

	// DeprovisionedComputeMachinesets is true of the compute machinesets of this cluster
s/of/if/
@@ -313,6 +313,10 @@ type ClusterStatus struct {
	// ClusterVersionRef references the resolved clusterversion the cluster should be running.
	// +optional
	ClusterVersionRef *corev1.ObjectReference `json:"clusterVersionRef,omitempty"`

	// DeprovisionedComputeMachinesets is true of the compute machinesets of this cluster
s/of/if/
	return clusterAPIClient, nil
}

func (c *Controller) ensureRemoteMachineSetsAreDeleted(cluster *clusteroperator.Cluster) error {
I have some concern that trying to run operations on a remote cluster that is slow or inaccessible will adversely affect the controller's ability to deal with its other tasks in a timely manner. Is there a way that we can run these remote operations in a separate goroutine and have the controller pick up the results later? I am fine tackling that as later work.
Sgtm, I'll run these in their own goroutine.
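The offload pattern being agreed to here could be sketched as follows. This is a minimal, self-contained illustration with hypothetical names (`remoteResult`, `deleteRemoteAsync`), not the actual controller code: the slow remote call runs in its own goroutine and the sync loop picks the result up from a channel later.

```go
package main

import (
	"fmt"
	"time"
)

// remoteResult carries the outcome of a remote operation back to the
// controller loop. The name and shape are illustrative only.
type remoteResult struct {
	cluster string
	err     error
}

// deleteRemoteAsync dispatches a potentially slow remote call without
// blocking the caller; the result is delivered on the channel.
func deleteRemoteAsync(cluster string, results chan<- remoteResult) {
	go func() {
		time.Sleep(10 * time.Millisecond) // stand-in for a slow remote API call
		results <- remoteResult{cluster: cluster, err: nil}
	}()
}

func main() {
	results := make(chan remoteResult, 1)
	deleteRemoteAsync("test-cluster", results)
	// The controller is free to do other work here, then pick up the result.
	r := <-results
	fmt.Printf("remote deletion for %s finished, err=%v\n", r.cluster, r.err)
}
```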
	for _, ms := range machineSets.Items {
		if ms.DeletionTimestamp != nil {
			err := clusterAPIClient.ClusterV1alpha1().MachineSets(remoteClusterNamespace).Delete(ms.Name, &metav1.DeleteOptions{})
Can we use DeleteCollection to delete all the relevant machine sets with one call rather than making separate Delete calls?
Yup, will fix
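The suggestion could look roughly like the sketch below. Since the real cluster-api client is not available here, `fakeMachineSetClient` is a hypothetical in-memory stand-in that only illustrates the shape of the change: one `DeleteCollection` call with a label selector replaces N individual `Delete` calls, which also addresses the later question about only touching machine sets that belong to this cluster.

```go
package main

import (
	"fmt"
	"strings"
)

// fakeMachineSetClient is a hypothetical stand-in for the cluster-api
// machine-set client; names and behavior are illustrative only.
type fakeMachineSetClient struct {
	machineSets map[string]map[string]string // name -> labels
}

// DeleteCollection removes every machine set whose labels match the selector
// (simplified here to a single "key=value" string).
func (c *fakeMachineSetClient) DeleteCollection(labelSelector string) int {
	kv := strings.SplitN(labelSelector, "=", 2)
	deleted := 0
	for name, labels := range c.machineSets {
		if labels[kv[0]] == kv[1] {
			delete(c.machineSets, name)
			deleted++
		}
	}
	return deleted
}

func main() {
	client := &fakeMachineSetClient{machineSets: map[string]map[string]string{
		"compute-1": {"cluster": "alpha"},
		"compute-2": {"cluster": "alpha"},
		"other":     {"cluster": "beta"},
	}}
	// One call, scoped by label, instead of ranging over Items and issuing
	// a separate Delete for each machine set.
	n := client.DeleteCollection("cluster=alpha")
	fmt.Printf("deleted %d machine sets, %d remain\n", n, len(client.machineSets))
}
```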
			if err != nil {
				return err
			}
	machineSets, err := clusterAPIClient.ClusterV1alpha1().MachineSets(remoteClusterNamespace).List(metav1.ListOptions{})
Can we be absolutely sure that all of the machine sets in "kube-cluster" are relevant machine sets? Is it possible that a user could create their own machine sets in "kube-cluster" that are associated with a different cluster?
For now, we don't have anything linking machinesets to the cluster. I would say that we will need to make kube-cluster (or whatever other namespace we choose) special in that it represents the machinesets that belong to that cluster. I would not expect other machinesets (that do not need to be deprovisioned) to exist in that cluster.
Sorry, in that namespace.
@staebler just fyi... this code will become irrelevant when we switch to syncing clusterapi clusters instead of c-o clusters. However, whatever problems we run into with this one will still need to be addressed in the other.
Force-pushed fdaebf4 to 74fb55e
/test unit
Force-pushed bf6943f to 2229e4f
…e been deprovisioned
Force-pushed 2229e4f to 8752783
/test unit
@staebler this should be ready for another pass. The reason integration was passing for me locally and not on the CI server is that in CI this code was getting merged with master. Now it's rebased to the latest and I've fixed the expected type for the Infra controller owner.
}

func (c *Controller) ensureRemoteMachineSetsAreDeleted(cluster *clusteroperator.Cluster) error {
	clusterAPIClient, err := c.buildRemoteClusterClient(cluster)
Can we rename this variable to include the term "remote" so that as I read the code I immediately know that the client is communicating with the remote cluster and not the local cluster?
Yup, will fix
	masterMachineSet, err := c.machineSetsLister.MachineSets(cluster.Namespace).Get(cluster.Status.MasterMachineSetName)
	if err != nil {
		if errors.IsNotFound(err) {
			return false, nil
Are you sure that we want to swallow this error? The Cluster status indicates that there is a master machine set. I would think that it would be an error worth bubbling up if it turns out that said master machine set does not actually exist.
The problem is this: say the master machine set is deleted for some reason, and then the cluster itself is marked for deletion. The cluster reconcile loop will not recreate the master machineset, and if I return an error from here, we will keep retrying to find a master machineset that doesn't exist.
OK. Makes sense.
	if shouldDeleteRemote {
		return c.ensureRemoteMachineSetsAreDeleted(cluster)
	}
	if !cluster.Status.DeprovisionedComputeMachinesets {
I am concerned about the potential for a race condition here where other controllers may be in the process of installing the cluster-api and provisioning the compute machinesets in the remote cluster.
Valid concern. One thing I can do is check the Conditions instead of the simple bool flag to ensure that the cluster api is not in the process of installing.
That wouldn't be sufficient. That would only decrease the window of the race condition. There is still the chance that this controller uses a revision of the Cluster that does not have the status updated to reflect that the cluster-api is installing. We could avoid this if we did not patch and retry on conflicts.
We talked before about not doing the patch/retry thing, but came back to the conclusion that it could cause more problems having several controllers trying to update the same resource. Something that maybe we could explore doing is having at most one controller per resource. Either change I think would be beyond the scope of this PR :) Would you be ok with creating an issue and looking into fixing it separately?
I'm not really comfortable with leaving this as is here. In other cases, it was more palatable because the processing and undoing was performed by the same controller. In this case, we have the responsibilities distributed such that one controller is provisioning and another controller is deprovisioning. There is no protection against both controllers trying to work on the same resource concurrently.
Could a fix be to move this logic to the controller that provisions the machinesets on to the remote cluster? We would be guaranteed that we're not trying to delete the remote machinesets at the same time that we're provisioning them.
In reconsidering this, I am fine with kicking this can down the road in the interest of pushing forward the migration to using the upstream API fully.
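The mitigation floated earlier in this thread, gating on the Conditions rather than the bool flag, could look like the sketch below. All names (`condition`, `ClusterAPIInstalling`, `canDeprovision`) are illustrative, and, as noted in the discussion, this only shrinks the race window rather than closing it, since the controller may still act on a stale revision of the Cluster.

```go
package main

import "fmt"

// condition mirrors the shape of a status condition; names are illustrative.
type condition struct {
	Type   string
	Status string
}

// findCondition returns the condition of the given type, or nil if absent.
func findCondition(conds []condition, t string) *condition {
	for i := range conds {
		if conds[i].Type == t {
			return &conds[i]
		}
	}
	return nil
}

// canDeprovision gates remote deletion on the ClusterAPIInstalling condition
// not being True, so deprovisioning does not start mid-install. This narrows,
// but does not eliminate, the race described in the review thread.
func canDeprovision(conds []condition) bool {
	c := findCondition(conds, "ClusterAPIInstalling")
	return c == nil || c.Status != "True"
}

func main() {
	installing := []condition{{Type: "ClusterAPIInstalling", Status: "True"}}
	fmt.Println(canDeprovision(installing)) // false
	fmt.Println(canDeprovision(nil))        // true
}
```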
pkg/controller/jobsyncstrategy.go
Outdated
// CheckBeforeDeprovision should be implemented by a strategy to have the sync loop check
// whether a particular owner resource is ready for deprovisioning when a DeletionTimestamp is detected.
type CheckBeforeDeprovision interface {
We should use the same terminology here that we use elsewhere in JobSync. Specifically, the inverse of process is undo. JobSync does not have a concept of deprovision. Either CheckBeforeDeprovision should be CheckBeforeUndo, or we should change the undo terminology to something else in JobSync.
Ack
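Under the rename being acked here, the interface might read as follows. This is a sketch only; the owner parameter in the real code is `metav1.Object`, which is simplified to `interface{}` to keep the example self-contained.

```go
package main

import "fmt"

// CheckBeforeUndo is a sketch of the renamed interface: it keeps JobSync's
// existing process/undo terminology instead of introducing "deprovision".
// The signature here is illustrative, not the actual code.
type CheckBeforeUndo interface {
	// CanUndo reports whether the owner resource is ready to be undone.
	CanUndo(owner interface{}) bool
}

// alwaysReady is a trivial implementation used only to exercise the interface.
type alwaysReady struct{}

func (alwaysReady) CanUndo(owner interface{}) bool { return true }

func main() {
	var s CheckBeforeUndo = alwaysReady{}
	fmt.Println(s.CanUndo(nil)) // true
}
```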
Force-pushed 3a491d0 to 9539e9c
Additional review comments addressed
	if clusterAPIInstalling != nil && clusterAPIInstalling.Status == corev1.ConditionTrue {
		// If cluster API is in the middle of installing, return an error so we can
		// retry when it completes installation.
		return false, fmt.Errorf("cluster API is in the process of installing")
We don't want this to cause an automatic requeue of the cluster. We want to wait to requeue the cluster until a change is made to the MachineSet.
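The suggested change, returning no error so the sync loop does not requeue automatically but instead waits for a watch event on the MachineSet, could be sketched like this (hypothetical function name, simplified signature):

```go
package main

import "fmt"

// checkInstalled sketches the suggested change: while the cluster API is
// still installing, return (false, nil) rather than an error, so the sync
// loop does not requeue the cluster automatically and instead waits for a
// change to the MachineSet to trigger the next sync.
func checkInstalled(installing bool) (bool, error) {
	if installing {
		return false, nil // no error: no automatic requeue
	}
	return true, nil
}

func main() {
	ok, err := checkInstalled(true)
	fmt.Println(ok, err) // false <nil>
}
```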
@@ -507,6 +507,14 @@ func (s *jobSyncStrategy) UpdateOwnerStatus(original, owner metav1.Object) error
	}
}

func (s *jobSyncStrategy) CanDeprovision(owner metav1.Object) bool {
This should be CanUndo now.
Force-pushed 9539e9c to 44ead65
Force-pushed 44ead65 to d1f725b
/test unit
@staebler tests are passing for this now
/lgtm
No description provided.