This repository has been archived by the owner on Jan 28, 2020. It is now read-only.

Delete target machinesets #214

Merged

Conversation

csrwng
Contributor

@csrwng csrwng commented May 10, 2018

No description provided.

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 10, 2018
@@ -312,6 +312,10 @@ type ClusterStatus struct {
// ClusterVersionRef references the resolved clusterversion the cluster should be running.
// +optional
ClusterVersionRef *corev1.ObjectReference

// DeprovisionedComputeMachinesets is true of the compute machinesets of this cluster
Contributor

s/of/if/

@@ -313,6 +313,10 @@ type ClusterStatus struct {
// ClusterVersionRef references the resolved clusterversion the cluster should be running.
// +optional
ClusterVersionRef *corev1.ObjectReference `json:"clusterVersionRef,omitempty"`

// DeprovisionedComputeMachinesets is true of the compute machinesets of this cluster
Contributor

s/of/if/

    return clusterAPIClient, nil
}

func (c *Controller) ensureRemoteMachineSetsAreDeleted(cluster *clusteroperator.Cluster) error {
Contributor

I have some concern that trying to run operations on a remote cluster to which the connection is slow or inaccessible will adversely affect the ability of the controller to deal with its other tasks in a timely manner. Is there a way that we can run these remote operations in a separate goroutine and have the controller pick up the results later? I am fine tackling that as later work.

Contributor Author

Sgtm, I'll run these in their own goroutine.
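A minimal sketch of what running the remote deletion in its own goroutine could look like (hypothetical helper name and queue field; assumes the controller's usual workqueue plus the standard helpers from k8s.io/client-go/tools/cache and k8s.io/apimachinery/pkg/util/runtime):

```go
// Sketch only: kick off the remote call outside the sync loop so a slow or
// unreachable remote cluster does not block other work, then requeue the
// cluster so the controller can pick up the result on a later sync.
func (c *Controller) deleteRemoteMachineSetsAsync(cluster *clusteroperator.Cluster) {
    key, err := cache.MetaNamespaceKeyFunc(cluster)
    if err != nil {
        utilruntime.HandleError(err)
        return
    }
    go func() {
        if err := c.ensureRemoteMachineSetsAreDeleted(cluster); err != nil {
            utilruntime.HandleError(err)
        }
        // Requeue so the next sync observes the outcome of the remote call.
        c.queue.Add(key)
    }()
}
```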


    for _, ms := range machineSets.Items {
        if ms.DeletionTimestamp != nil {
            err := clusterAPIClient.ClusterV1alpha1().MachineSets(remoteClusterNamespace).Delete(ms.Name, &metav1.DeleteOptions{})
Contributor

Can we use DeleteCollection to delete all the relevant machine sets with one call rather than making separate Delete calls?

Contributor Author

Yup, will fix
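Roughly, the DeleteCollection variant could look like the sketch below (assuming the generated cluster-api clientset exposes the standard client-gen signature of that era, taking a *metav1.DeleteOptions and a metav1.ListOptions):

```go
// Sketch: one DeleteCollection call instead of a Delete per MachineSet.
err := clusterAPIClient.ClusterV1alpha1().
    MachineSets(remoteClusterNamespace).
    DeleteCollection(&metav1.DeleteOptions{}, metav1.ListOptions{})
if err != nil {
    return err
}
```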

    if err != nil {
        return err
    }
    machineSets, err := clusterAPIClient.ClusterV1alpha1().MachineSets(remoteClusterNamespace).List(metav1.ListOptions{})
Contributor

Can we be absolutely sure that all of the machine sets in "kube-cluster" are relevant machine sets? Is it possible that a user could create their own machine sets in "kube-cluster" that are associated with a different cluster?

Contributor Author

For now, we don't have anything linking machinesets to the cluster. I would say that we will need to make kube-cluster (or whatever other namespace we choose) special in that it represents the machinesets that belong to that cluster. I would not expect other machinesets (that do not need to be deprovisioned) to exist in that cluster.

Contributor Author

Sorry, in that namespace.

@csrwng
Contributor Author

csrwng commented May 10, 2018

@staebler just fyi... this code will become irrelevant when we switch to syncing clusterapi clusters instead of c-o clusters. However, whatever problems we run into with this one will still need to be addressed in the other.

@csrwng csrwng changed the title WIP: Delete target machinesets Delete target machinesets May 11, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2018
@csrwng
Contributor Author

csrwng commented May 11, 2018

@staebler comments addressed. Created 2 follow-up issues: #215 and #216

@csrwng csrwng force-pushed the delete_target_machinesets branch 4 times, most recently from fdaebf4 to 74fb55e on May 11, 2018 21:12
@csrwng
Contributor Author

csrwng commented May 14, 2018

/test unit

@csrwng csrwng force-pushed the delete_target_machinesets branch from bf6943f to 2229e4f on May 14, 2018 18:11
@csrwng csrwng force-pushed the delete_target_machinesets branch from 2229e4f to 8752783 on May 15, 2018 15:17
@csrwng
Contributor Author

csrwng commented May 15, 2018

/test unit

@csrwng
Contributor Author

csrwng commented May 15, 2018

@staebler this should be ready for another pass. The reason integration was passing for me locally and not on the CI server is that in CI this code was getting merged with master. Now it's rebased to the latest and I've fixed the expected type for the Infra controller owner.

}

func (c *Controller) ensureRemoteMachineSetsAreDeleted(cluster *clusteroperator.Cluster) error {
    clusterAPIClient, err := c.buildRemoteClusterClient(cluster)
Contributor

Can we rename this variable to include the term "remote" so that as I read the code I immediately know that the client is communicating with the remote cluster and not the local cluster?

Contributor Author

Yup, will fix

    masterMachineSet, err := c.machineSetsLister.MachineSets(cluster.Namespace).Get(cluster.Status.MasterMachineSetName)
    if err != nil {
        if errors.IsNotFound(err) {
            return false, nil
Contributor

Are you sure that we want to swallow this error? The Cluster status indicates that there is a master machine set. I would think that it would be an error worth bubbling up if it turns out that said master machine set does not actually exist.

Contributor Author

The problem is this: say that for some reason the master machine set is deleted, and then the cluster itself is marked for deletion. The cluster reconcile loop will not recreate the master machineset, and if I return an error from here, we will keep retrying to find a master machineset that doesn't exist.

Contributor

OK. Makes sense.

    if shouldDeleteRemote {
        return c.ensureRemoteMachineSetsAreDeleted(cluster)
    }
    if !cluster.Status.DeprovisionedComputeMachinesets {
Contributor

I am concerned about the potential for a race condition here where other controllers may be in the process of installing the cluster-api and provisioning the compute machinesets in the remote cluster.

Contributor Author

Valid concern. One thing I can do is check the Conditions instead of the simple bool flag to ensure that the cluster api is not in the process of installing.
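A sketch of the condition check being described (the condition type name and helper are hypothetical; only the Status/corev1.ConditionTrue comparison is taken from the diff further below):

```go
// Sketch: find a condition on the cluster status instead of reading a bare
// bool flag. Type and field names are assumed for illustration.
func findClusterCondition(cluster *clusteroperator.Cluster, condType clusteroperator.ClusterConditionType) *clusteroperator.ClusterCondition {
    for i := range cluster.Status.Conditions {
        if cluster.Status.Conditions[i].Type == condType {
            return &cluster.Status.Conditions[i]
        }
    }
    return nil
}

// Hypothetical usage in the deprovision path:
// clusterAPIInstalling := findClusterCondition(cluster, clusteroperator.ClusterAPIInstalling)
// if clusterAPIInstalling != nil && clusterAPIInstalling.Status == corev1.ConditionTrue { ... }
```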

Contributor

That wouldn't be sufficient. That would only decrease the window of the race condition. There is still the chance that this controller uses a revision of the Cluster that does not have the status updated to reflect that the cluster-api is installing. We could avoid this if we did not patch and retry on conflicts.

Contributor Author

We talked before about not doing the patch/retry thing, but came back to the conclusion that having several controllers trying to update the same resource could cause more problems. Something we could maybe explore is having at most one controller per resource. Either change, I think, would be beyond the scope of this PR :) Would you be ok with creating an issue and looking into fixing it separately?

Contributor

I'm not really comfortable with leaving this as is here. In other cases, it was more palatable because the processing and undoing were performed by the same controller. In this case, we have the responsibilities distributed such that one controller is provisioning and another controller is deprovisioning. There is no protection against both controllers trying to work on the same resource concurrently.

Contributor Author

Could a fix be to move this logic to the controller that provisions the machinesets onto the remote cluster? We would be guaranteed that we're not trying to delete the remote machinesets at the same time that we're provisioning them.

Contributor

In reconsidering this, I am fine with kicking this can down the road in the interest of pushing forward the migration to using the upstream API fully.


// CheckBeforeDeprovision should be implemented by a strategy to have the sync loop check
// whether a particular owner resource is ready for deprovisioning when a DeletionTimestamp is detected.
type CheckBeforeDeprovision interface {
Contributor

We should use the same terminology here that we use elsewhere in JobSync. Specifically, the inverse of process is undo. JobSync does not have a concept of deprovision. Either CheckBeforeDeprovision should be CheckBeforeUndo or we should change the undo terminology to something else in JobSync.

Contributor Author

Ack

@csrwng csrwng force-pushed the delete_target_machinesets branch 2 times, most recently from 3a491d0 to 9539e9c on May 15, 2018 19:22
@csrwng
Contributor Author

csrwng commented May 15, 2018

Additional review comments addressed

    if clusterAPIInstalling != nil && clusterAPIInstalling.Status == corev1.ConditionTrue {
        // If cluster API is in the middle of installing, return an error so we can
        // retry when it completes installation.
        return false, fmt.Errorf("cluster API is in the process of installing")
Contributor

We don't want this to cause an automatic requeue of the cluster. We want to wait to requeue the cluster until a change is made to the MachineSet.
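A hedged sketch of that suggestion (not necessarily the code that was ultimately merged): skip the remote deletion without returning an error, so the sync does not trigger an automatic rate-limited requeue, and rely on the MachineSet watch to requeue the cluster when something changes:

```go
if clusterAPIInstalling != nil && clusterAPIInstalling.Status == corev1.ConditionTrue {
    // Cluster API is still installing; don't treat this as an error (which
    // would requeue immediately). Let the MachineSet watch trigger the next
    // sync once installation finishes.
    return false, nil
}
```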

@@ -507,6 +507,14 @@ func (s *jobSyncStrategy) UpdateOwnerStatus(original, owner metav1.Object) error
}
}

func (s *jobSyncStrategy) CanDeprovision(owner metav1.Object) bool {
Contributor

This should be CanUndo now.
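A sketch of what the rename could look like using the JobSync undo terminology (the interface body is not shown in this excerpt, so the method set simply mirrors the jobSyncStrategy method above and is otherwise assumed):

```go
// CheckBeforeUndo should be implemented by a strategy to have the sync loop
// check whether a particular owner resource is ready for undo processing
// when a DeletionTimestamp is detected.
type CheckBeforeUndo interface {
    // CanUndo returns true if the owner is ready for its undo work to run.
    CanUndo(owner metav1.Object) bool
}
```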

@csrwng csrwng force-pushed the delete_target_machinesets branch from 9539e9c to 44ead65 on May 16, 2018 02:27
@csrwng csrwng force-pushed the delete_target_machinesets branch from 44ead65 to d1f725b on May 16, 2018 13:34
@staebler
Contributor

/test unit

@csrwng
Contributor Author

csrwng commented May 16, 2018

@staebler tests are passing for this now

@staebler
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2018
@openshift-merge-robot openshift-merge-robot merged commit 60218ae into openshift:master May 16, 2018
@csrwng csrwng deleted the delete_target_machinesets branch August 2, 2018 22:17