This repository has been archived by the owner on Jan 28, 2020. It is now read-only.

Delete target machinesets #214

Merged

Conversation

csrwng
Contributor

@csrwng csrwng commented May 10, 2018

No description provided.

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 10, 2018
@@ -312,6 +312,10 @@ type ClusterStatus struct {
// ClusterVersionRef references the resolved clusterversion the cluster should be running.
// +optional
ClusterVersionRef *corev1.ObjectReference

// DeprovisionedComputeMachinesets is true of the compute machinesets of this cluster
Contributor

s/of/if/

@@ -313,6 +313,10 @@ type ClusterStatus struct {
// ClusterVersionRef references the resolved clusterversion the cluster should be running.
// +optional
ClusterVersionRef *corev1.ObjectReference `json:"clusterVersionRef,omitempty"`

// DeprovisionedComputeMachinesets is true of the compute machinesets of this cluster
Contributor

s/of/if/

    return clusterAPIClient, nil
}

func (c *Controller) ensureRemoteMachineSetsAreDeleted(cluster *clusteroperator.Cluster) error {
Contributor

I have some concern that trying to run operations on a remote cluster to which the connection is slow or inaccessible will adversely affect the ability of the controller to deal with its other tasks in a timely manner. Is there a way that we can run these remote operations in a separate goroutine and have the controller pick up the results later? I am fine tackling that as later work.

Contributor Author

Sgtm, I'll run these in their own goroutine.
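A minimal sketch of what running the remote deletion in its own goroutine could look like (hypothetical helper name and queue field; assumes the controller's usual workqueue plus the standard helpers from k8s.io/client-go/tools/cache and k8s.io/apimachinery/pkg/util/runtime):

```go
// Sketch only: kick off the remote call outside the sync loop so a slow or
// unreachable remote cluster does not block other work, then requeue the
// cluster so the controller can pick up the result on a later sync.
func (c *Controller) deleteRemoteMachineSetsAsync(cluster *clusteroperator.Cluster) {
    key, err := cache.MetaNamespaceKeyFunc(cluster)
    if err != nil {
        utilruntime.HandleError(err)
        return
    }
    go func() {
        if err := c.ensureRemoteMachineSetsAreDeleted(cluster); err != nil {
            utilruntime.HandleError(err)
        }
        // Requeue so the next sync observes the outcome of the remote call.
        c.queue.Add(key)
    }()
}
```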


    for _, ms := range machineSets.Items {
        if ms.DeletionTimestamp != nil {
            err := clusterAPIClient.ClusterV1alpha1().MachineSets(remoteClusterNamespace).Delete(ms.Name, &metav1.DeleteOptions{})
Contributor

Can we use DeleteCollection to delete all the relevant machine sets with one call rather than making separate Delete calls?

Contributor Author

Yup, will fix
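Roughly, the DeleteCollection variant could look like the sketch below (assuming the generated cluster-api clientset exposes the standard client-gen signature of that era, taking a *metav1.DeleteOptions and a metav1.ListOptions):

```go
// Sketch: one DeleteCollection call instead of a Delete per MachineSet.
err := clusterAPIClient.ClusterV1alpha1().
    MachineSets(remoteClusterNamespace).
    DeleteCollection(&metav1.DeleteOptions{}, metav1.ListOptions{})
if err != nil {
    return err
}
```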

    if err != nil {
        return err
    }
    machineSets, err := clusterAPIClient.ClusterV1alpha1().MachineSets(remoteClusterNamespace).List(metav1.ListOptions{})
Contributor

Can we be absolutely sure that all of the machine sets in "kube-cluster" are relevant machine sets? Is it possible that a user could create their own machine sets in "kube-cluster" that are associated with a different cluster?

Contributor Author

For now, we don't have anything linking machinesets to the cluster. I would say that we will need to make kube-cluster (or whatever other namespace we choose) special in that it represents the machinesets that belong to that cluster. I would not expect other machinesets (that do not need to be deprovisioned) to exist in that cluster.

Contributor Author

Sorry, in that namespace.

@csrwng
Contributor Author

csrwng commented May 10, 2018

@staebler just fyi... this code will become irrelevant when we switch to syncing clusterapi clusters instead of c-o clusters. However, whatever problems we run into with this one will still need to be addressed in the other.

@csrwng csrwng changed the title WIP: Delete target machinesets Delete target machinesets May 11, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2018
@csrwng
Contributor Author

csrwng commented May 11, 2018

@staebler comments addressed. Created 2 follow-up issues: #215 and #216

@csrwng csrwng force-pushed the delete_target_machinesets branch 4 times, most recently from fdaebf4 to 74fb55e on May 11, 2018 21:12
@csrwng
Contributor Author

csrwng commented May 14, 2018

/test unit

@csrwng csrwng force-pushed the delete_target_machinesets branch from bf6943f to 2229e4f on May 14, 2018 18:11
@csrwng csrwng force-pushed the delete_target_machinesets branch from 2229e4f to 8752783 on May 15, 2018 15:17
@csrwng
Contributor Author

csrwng commented May 15, 2018

/test unit

@csrwng
Contributor Author

csrwng commented May 15, 2018

@staebler this should be ready for another pass. The reason integration was passing for me locally and not on the CI server is that in CI this code was getting merged with master. Now it's rebased to the latest and I've fixed the expected type for the Infra controller owner.

}

func (c *Controller) ensureRemoteMachineSetsAreDeleted(cluster *clusteroperator.Cluster) error {
    clusterAPIClient, err := c.buildRemoteClusterClient(cluster)
Contributor

Can we rename this variable to include the term "remote" so that as I read the code I immediately know that the client is communicating with the remote cluster and not the local cluster?

Contributor Author

Yup, will fix

    masterMachineSet, err := c.machineSetsLister.MachineSets(cluster.Namespace).Get(cluster.Status.MasterMachineSetName)
    if err != nil {
        if errors.IsNotFound(err) {
            return false, nil
Contributor

Are you sure that we want to swallow this error? The Cluster status indicates that there is a master machine set. I would think that it would be an error worth bubbling up if it turns out that said master machine set does not actually exist.

Contributor Author

The problem is this: say that for some reason the master machine set is deleted, and then the cluster itself is marked for deletion. The cluster reconcile loop will not recreate the master machineset, and if I return an error from here, we will keep retrying to find a master machineset that doesn't exist.

Contributor

OK. Makes sense.

    if shouldDeleteRemote {
        return c.ensureRemoteMachineSetsAreDeleted(cluster)
    }
    if !cluster.Status.DeprovisionedComputeMachinesets {
Contributor

I am concerned about the potential for a race condition here where other controllers may be in the process of installing the cluster-api and provisioning the compute machinesets in the remote cluster.

Contributor Author

Valid concern. One thing I can do is check the Conditions instead of the simple bool flag to ensure that the cluster api is not in the process of installing.
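A sketch of the condition check being described (the condition type name and helper are hypothetical; only the Status/corev1.ConditionTrue comparison is taken from the diff further below):

```go
// Sketch: find a condition on the cluster status instead of reading a bare
// bool flag. Type and field names are assumed for illustration.
func findClusterCondition(cluster *clusteroperator.Cluster, condType clusteroperator.ClusterConditionType) *clusteroperator.ClusterCondition {
    for i := range cluster.Status.Conditions {
        if cluster.Status.Conditions[i].Type == condType {
            return &cluster.Status.Conditions[i]
        }
    }
    return nil
}

// Hypothetical usage in the deprovision path:
// clusterAPIInstalling := findClusterCondition(cluster, clusteroperator.ClusterAPIInstalling)
// if clusterAPIInstalling != nil && clusterAPIInstalling.Status == corev1.ConditionTrue { ... }
```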

Contributor

That wouldn't be sufficient. That would only decrease the window of the race condition. There is still the chance that this controller uses a revision of the Cluster that does not have the status updated to reflect that the cluster-api is installing. We could avoid this if we did not patch and retry on conflicts.

Contributor Author

We talked before about not doing the patch/retry thing, but came back to the conclusion that having several controllers trying to update the same resource could cause more problems. Something we could maybe explore is having at most one controller per resource. Either change, I think, would be beyond the scope of this PR :) Would you be ok with creating an issue and looking into fixing it separately?

Contributor

I'm not really comfortable with leaving this as is here. In other cases, it was more palatable because the processing and undoing were performed by the same controller. In this case, we have the responsibilities distributed such that one controller is provisioning and another controller is deprovisioning. There is no protection against both controllers trying to work on the same resource concurrently.

Contributor Author

Could a fix be to move this logic to the controller that provisions the machinesets onto the remote cluster? We would be guaranteed that we're not trying to delete the remote machinesets at the same time that we're provisioning them.

Contributor

In reconsidering this, I am fine with kicking this can down the road in the interest of pushing forward the migration to using the upstream API fully.


// CheckBeforeDeprovision should be implemented by a strategy to have the sync loop check
// whether a particular owner resource is ready for deprovisioning when a DeletionTimestamp is detected.
type CheckBeforeDeprovision interface {
Contributor

We should use the same terminology here that we use elsewhere in JobSync. Specifically, the inverse of process is undo. JobSync does not have a concept of deprovision. Either CheckBeforeDeprovision should be CheckBeforeUndo or we should change the undo terminology to something else in JobSync.

Contributor Author

Ack

@csrwng csrwng force-pushed the delete_target_machinesets branch 2 times, most recently from 3a491d0 to 9539e9c on May 15, 2018 19:22
@csrwng
Contributor Author

csrwng commented May 15, 2018

Additional review comments addressed

    if clusterAPIInstalling != nil && clusterAPIInstalling.Status == corev1.ConditionTrue {
        // If cluster API is in the middle of installing, return an error so we can
        // retry when it completes installation.
        return false, fmt.Errorf("cluster API is in the process of installing")
Contributor

We don't want this to cause an automatic requeue of the cluster. We want to wait to requeue the cluster until a change is made to the MachineSet.
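A hedged sketch of that suggestion (not necessarily the code that was ultimately merged): skip the remote deletion without returning an error, so the sync does not trigger an automatic rate-limited requeue, and rely on the MachineSet watch to requeue the cluster when something changes:

```go
if clusterAPIInstalling != nil && clusterAPIInstalling.Status == corev1.ConditionTrue {
    // Cluster API is still installing; don't treat this as an error (which
    // would requeue immediately). Let the MachineSet watch trigger the next
    // sync once installation finishes.
    return false, nil
}
```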

@@ -507,6 +507,14 @@ func (s *jobSyncStrategy) UpdateOwnerStatus(original, owner metav1.Object) error
}
}

func (s *jobSyncStrategy) CanDeprovision(owner metav1.Object) bool {
Contributor

This should be CanUndo now.
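A sketch of what the rename could look like using the JobSync undo terminology (the interface body is not shown in this excerpt, so the method set simply mirrors the jobSyncStrategy method above and is otherwise assumed):

```go
// CheckBeforeUndo should be implemented by a strategy to have the sync loop
// check whether a particular owner resource is ready for undo processing
// when a DeletionTimestamp is detected.
type CheckBeforeUndo interface {
    // CanUndo returns true if the owner is ready for its undo work to run.
    CanUndo(owner metav1.Object) bool
}
```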

@csrwng csrwng force-pushed the delete_target_machinesets branch from 9539e9c to 44ead65 on May 16, 2018 02:27
@csrwng csrwng force-pushed the delete_target_machinesets branch from 44ead65 to d1f725b on May 16, 2018 13:34
@staebler
Contributor

/test unit

@csrwng
Contributor Author

csrwng commented May 16, 2018

@staebler tests are passing for this now

@staebler
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2018
@openshift-merge-robot openshift-merge-robot merged commit 60218ae into openshift:master May 16, 2018
@csrwng csrwng deleted the delete_target_machinesets branch August 2, 2018 22:17