controller: Wait for template and kubelet before starting renderer #385

Open
wants to merge 1 commit into master from cgwalters:controller-phased-startup

Conversation

@cgwalters (Contributor) commented Feb 5, 2019

Avoid e.g. rendering a MC without the kubelet config.

Closes: #338

@cgwalters cgwalters force-pushed the cgwalters:controller-phased-startup branch from b4b7de1 to 37fac58 Feb 5, 2019

@jlebon (Member) commented Feb 5, 2019

This looks plausible to me. Will let others have a look and test it out.
/approve

@cgwalters cgwalters force-pushed the cgwalters:controller-phased-startup branch from 37fac58 to 4c6732e Feb 5, 2019

@cgwalters cgwalters force-pushed the cgwalters:controller-phased-startup branch from 4c6732e to 0be8861 Feb 5, 2019

@runcom (Member) commented Feb 5, 2019

Approving code-wise, but I won't be able to test it on a cluster until tomorrow, and the tests don't seem to be running anyway :(

/approve

@openshift-merge-robot (Contributor) commented Feb 6, 2019

/retest

1 similar comment
@openshift-merge-robot (Contributor) commented Feb 6, 2019

/retest

@cgwalters (Contributor, Author) commented Feb 6, 2019

/test e2e-aws-op

@cgwalters cgwalters force-pushed the cgwalters:controller-phased-startup branch from 0be8861 to 646fa63 Feb 6, 2019

@runcom (Member) commented Feb 7, 2019

/retest

1 similar comment
@runcom (Member) commented Feb 7, 2019

/retest

@runcom (Member) commented Feb 7, 2019

/lgtm

@runcom (Member) commented Feb 7, 2019

/retest

@runcom (Member) commented Feb 7, 2019

tests are failing because of:

time="2019-02-07T11:12:12Z" level=debug msg="Still waiting for the cluster to initialize: Could not update clusteroperator \"monitoring\" (195 of 264)"
time="2019-02-07T11:18:27Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating deployment object failed: timed out waiting for the condition"
time="2019-02-07T11:22:57Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition"
time="2019-02-07T11:36:04Z" level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"
@runcom (Member) commented Feb 7, 2019

/retest

@runcom (Member) commented Feb 7, 2019

monitoring flake again

/retest

@jlebon (Member) commented Feb 7, 2019

/lgtm

@openshift-ci-robot commented Feb 7, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jlebon, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,jlebon,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom (Member) commented Feb 7, 2019

wtf, still the very same issue with the monitoring operator; reported it to them:

Sergiusz Urbaniak [3 hours ago]
this smells like a communication issue with the API server :thinking_face:

@cgwalters cgwalters force-pushed the cgwalters:controller-phased-startup branch from 646fa63 to 26085f3 Feb 7, 2019

@openshift-ci-robot commented Feb 7, 2019

New changes are detected. LGTM label has been removed.

@cgwalters (Contributor, Author) commented Feb 7, 2019

MCC logs in that run had:

I0207 14:35:20.448940       1 start.go:47] Version: 3.11.0-582-ga0853bdc-dirty
I0207 14:35:20.649338       1 leaderelection.go:187] attempting to acquire leader lease  openshift-machine-config-operator/machine-config-controller...
E0207 14:35:21.249891       1 event.go:259] Could not construct reference to: '&v1.ConfigMap{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"machine-config-controller", GenerateName:"", Namespace:"openshift-machine-config-operator", SelfLink:"/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config-controller", UID:"98d5f51f-2ae5-11e9-b5b4-0a77865eaf18", ResourceVersion:"4601", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63685146921, loc:(*time.Location)(0x1e12080)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"machine-config-controller-766dd5c856-pg2tb_986a83c0-2ae5-11e9-b3a0-0a580a82000e\",\"leaseDurationSeconds\":90,\"acquireTime\":\"2019-02-07T14:35:20Z\",\"renewTime\":\"2019-02-07T14:35:20Z\",\"leaderTransitions\":0}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Data:map[string]string(nil), BinaryData:map[string][]uint8(nil)}' due to: 'no kind is registered for the type v1.ConfigMap in scheme "github.com/openshift/machine-config-operator/cmd/common/helpers.go:39"'. Will not report event: 'Normal' 'LeaderElection' 'machine-config-controller-766dd5c856-pg2tb_986a83c0-2ae5-11e9-b3a0-0a580a82000e became leader'
I0207 14:35:21.250001       1 leaderelection.go:196] successfully acquired lease openshift-machine-config-operator/machine-config-controller
I0207 14:35:21.648942       1 template_controller.go:122] Starting MachineConfigController-TemplateController
I0207 14:35:21.750524       1 kubelet_config_controller.go:149] Starting MachineConfigController-KubeletConfigController
E0207 14:40:23.067785       1 leaderelection.go:252] error retrieving resource lock openshift-machine-config-operator/machine-config-controller: Unauthorized

which is interesting. It didn't actually get to the point of starting the other sub-controllers; it's clearly waiting on the lock. But I'm not sure whether that's because of some other issue.

go template.New(
	rootOpts.templates,
	ctx.InformerFactory.Machineconfiguration().V1().ControllerConfigs(),
	ctx.InformerFactory.Machineconfiguration().V1().MachineConfigs(),
	ctx.ClientBuilder.KubeClientOrDie("template-controller"),
	ctx.ClientBuilder.MachineConfigClientOrDie("template-controller"),
	readyFlag,

@runcom (Member) commented Feb 7, 2019

Mmm, I believe we need to pass a reference to the wg here; the WaitGroup itself contains a mutex (IIRC), and passing it by value gets it copied (causing all sorts of subtle effects, because the .Done() method is then called on the wrong object).
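
For illustration, a minimal standalone sketch (not the PR's code) of the pitfall described above: if a sync.WaitGroup is passed by value, Done() acts on the copy and the original Wait() never unblocks; go vet's copylocks check flags the by-value signature, while passing a pointer keeps a single shared counter.

```go
package main

import (
	"fmt"
	"sync"
)

// workerByValue receives a copy of the WaitGroup; its Done() decrements the
// copy's counter, so a caller waiting on the original would block forever.
// go vet's copylocks check flags this signature ("passes lock by value").
func workerByValue(wg sync.WaitGroup) {
	defer wg.Done()
}

// workerByPointer shares the caller's WaitGroup, so its Done() is observed.
func workerByPointer(wg *sync.WaitGroup) {
	defer wg.Done()
	fmt.Println("worker done")
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go workerByPointer(&wg) // pass &wg, never a copy
	wg.Wait()
	fmt.Println("main: Wait() returned because the WaitGroup was shared by pointer")
}
```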

@runcom (Member) commented Feb 7, 2019

@cgwalters cgwalters force-pushed the cgwalters:controller-phased-startup branch 2 times, most recently from 36bcc49 to d20b04a Feb 8, 2019

@runcom (Member) commented Feb 8, 2019

flake I guess:

time="2019-02-08T16:26:57Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator openshift-cluster-samples-operator is reporting a failure: Samples installation in error at 4.0.0-alpha1-a10652d41: "
time="2019-02-08T16:32:27Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: creating deployment object failed: timed out waiting for the condition"
time="2019-02-08T16:36:42Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Deployment failed: updating deployment object failed: timed out waiting for the condition"
time="2019-02-08T16:50:54Z" level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"
  • aws stuff in the other job...
@ashcrow (Member) commented Feb 8, 2019

/retest

1 similar comment
@runcom (Member) commented Feb 8, 2019

/retest

@runcom (Member) commented Feb 8, 2019

Not sure why the installer keeps failing on the Prometheus operator here (from the logs).

/retest

@cgwalters (Contributor, Author) commented Feb 8, 2019

Looking at the logs here, we're still stalling out in the MCC. So clearly this code isn't working as I thought.

I need to try again to build a local release payload and test locally to more easily debug.

@cgwalters (Contributor, Author) commented Feb 8, 2019

/hold

@runcom (Member) commented Feb 8, 2019

OK, looking at this PR from my editor again, I see there are many code paths in syncKubeletConfig and upward in the call stack which may return early and just leave the WaitGroup undone, thus deadlocking...

@runcom (Member) commented Feb 8, 2019

So, just from my understanding of the flow, we should mark the WaitGroup done anyway, even if we're going through another code path, right? (Even an error path.)
I believe this would do the trick:

diff --git a/pkg/controller/kubelet-config/kubelet_config_controller.go b/pkg/controller/kubelet-config/kubelet_config_controller.go
index 9acc10c..9d368f6 100644
--- a/pkg/controller/kubelet-config/kubelet_config_controller.go
+++ b/pkg/controller/kubelet-config/kubelet_config_controller.go
@@ -304,6 +304,9 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
 	glog.V(4).Infof("Started syncing kubeletconfig %q (%v)", key, startTime)
 	defer func() {
 		glog.V(4).Infof("Finished syncing kubeletconfig %q (%v)", key, time.Since(startTime))
+		ctrl.firstSync.Do(func() {
+			ctrl.readyFlag.Done()
+		})
 	}()
 
 	_, name, err := cache.SplitMetaNamespaceKey(key)
@@ -420,10 +423,6 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
 		glog.Infof("Applied KubeletConfig %v on MachineConfigPool %v", key, pool.Name)
 	}
 
-	ctrl.firstSync.Do(func() {
-		ctrl.readyFlag.Done()
-	})
-
 	return ctrl.syncStatusOnly(cfg, nil)
 }
 
diff --git a/pkg/controller/template/template_controller.go b/pkg/controller/template/template_controller.go
index ec5681b..7bafbdf 100644
--- a/pkg/controller/template/template_controller.go
+++ b/pkg/controller/template/template_controller.go
@@ -324,6 +324,9 @@ func (ctrl *Controller) syncControllerConfig(key string) error {
 	glog.V(4).Infof("Started syncing controllerconfig %q (%v)", key, startTime)
 	defer func() {
 		glog.V(4).Infof("Finished syncing controllerconfig %q (%v)", key, time.Since(startTime))
+		ctrl.firstSync.Do(func() {
+			ctrl.readyFlag.Done()
+		})
 	}()
 
 	_, name, err := cache.SplitMetaNamespaceKey(key)
@@ -376,10 +379,6 @@ func (ctrl *Controller) syncControllerConfig(key string) error {
 		}
 	}
 
-	ctrl.firstSync.Do(func() {
-		ctrl.readyFlag.Done()
-	})
-
 	return ctrl.syncCompletedStatus(cfg)
 }
@runcom (Member) commented Feb 8, 2019

We can enhance the patch above to also return an error through a channel and completely fail the initialization of the other dependent controllers, if we really want to (not sure that's desirable, though).
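
For illustration, a minimal standalone sketch of that idea (hypothetical names, not MCO code): the first sync reports either readiness or an error exactly once, and the dependent startup path selects on both instead of waiting unconditionally.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// firstSyncSignal reports the outcome of a controller's first sync exactly once.
// All names are illustrative; this is not the machine-config-operator's API.
type firstSyncSignal struct {
	once  sync.Once
	ready chan struct{}
	errCh chan error
}

func newFirstSyncSignal() *firstSyncSignal {
	return &firstSyncSignal{ready: make(chan struct{}), errCh: make(chan error, 1)}
}

// done records success (err == nil) or failure; later calls are no-ops.
func (s *firstSyncSignal) done(err error) {
	s.once.Do(func() {
		if err != nil {
			s.errCh <- err
			return
		}
		close(s.ready)
	})
}

func main() {
	templateSync := newFirstSyncSignal()

	// Simulate the template controller finishing its first sync.
	go func() {
		time.Sleep(100 * time.Millisecond)
		// Pass a non-nil error here instead to exercise the failure path.
		templateSync.done(nil)
	}()

	// Dependent startup: start on success, bail out if the first sync failed.
	select {
	case <-templateSync.ready:
		fmt.Println("template synced; starting dependent controllers")
	case err := <-templateSync.errCh:
		fmt.Println("aborting dependent controller startup:", err)
	}
}
```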

@cgwalters cgwalters force-pushed the cgwalters:controller-phased-startup branch from d20b04a to 494c5c2 Feb 11, 2019

@cgwalters (Contributor, Author) commented Feb 11, 2019

Looking at this again, the problem seems obvious: we can't block at that point, or we won't start the kube informers, etc. Changed to start a new goroutine to do an async wait.
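
A rough standalone sketch of that pattern (hypothetical names, not the PR's actual code): the startup path spawns a goroutine that waits on the ready signal and only then starts the renderer, leaving the main path free to keep starting informers.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical stand-ins for the sub-controllers; not the PR's actual code.
func runTemplateController(ready *sync.WaitGroup) {
	time.Sleep(200 * time.Millisecond) // pretend the first full sync takes a while
	fmt.Println("template controller: initial sync done")
	ready.Done()
}

func runRenderController() { fmt.Println("render controller: started") }

func main() {
	var ready sync.WaitGroup
	ready.Add(1)

	go runTemplateController(&ready)

	// Async wait: the main goroutine keeps going (informers etc. still start);
	// the renderer is only launched once the template controller signals ready.
	done := make(chan struct{})
	go func() {
		ready.Wait()
		runRenderController()
		close(done)
	}()

	fmt.Println("main: informers started, not blocked on the wait")
	<-done
}
```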

@@ -409,6 +420,10 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
		glog.Infof("Applied KubeletConfig %v on MachineConfigPool %v", key, pool.Name)
	}

	ctrl.firstSync.Do(func() {
		ctrl.readyFlag.Done()
	})

@runcom (Member) commented Feb 11, 2019

Shouldn't this still be moved into the defer above, so it isn't skipped on errors (leaking the other goroutine at that point) or on other successful return paths?

@cgwalters (Contributor, Author) commented Feb 11, 2019

I don't think so, because we only want to say we're ready on success, right?

@runcom (Member) commented Feb 11, 2019

How do we spot that we're blocked on the wait(), though? Logs? Is that a condition where we just wait forever, or do we need to error out?

@runcom (Member) commented Feb 11, 2019

if that's the case, cool

@cgwalters (Contributor, Author) commented Feb 12, 2019

OK finally did a bit more of a dive into the codebase here to understand things. First, I was totally confused; the 01-{master,worker}-kubelet MCs are generated by the template controller, not the kubelet sub-controller.

The reason this PR is hanging is that, by default:

$ oc get --all-namespaces kubeletconfig
No resources found.

And we were waiting for one.

For now...all we need to do is block on the template controller having done a full sync. Going to rework this PR.
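
A minimal sketch of the reworked approach (illustrative names only, assuming a channel closed via sync.Once on the first successful template sync; not the actual MCO types): the renderer start path blocks only on that signal.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative names only; not the actual MCO types.
type templateController struct {
	firstSyncOnce sync.Once
	initialSynced chan struct{}
}

func newTemplateController() *templateController {
	return &templateController{initialSynced: make(chan struct{})}
}

// syncControllerConfig is called per work-queue key; on its first successful
// pass it closes initialSynced so dependent controllers can be started.
func (c *templateController) syncControllerConfig(key string) error {
	// ... render the 01-{master,worker}-kubelet templates, etc. ...
	c.firstSyncOnce.Do(func() {
		fmt.Println("Initial template sync done")
		close(c.initialSynced)
	})
	return nil
}

func main() {
	c := newTemplateController()
	go func() { _ = c.syncControllerConfig("cluster") }()

	<-c.initialSynced // renderer startup blocks only on this signal
	fmt.Println("starting render controller")
}
```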

@cgwalters cgwalters force-pushed the cgwalters:controller-phased-startup branch from 494c5c2 to d517532 Feb 12, 2019

@cgwalters (Contributor, Author) commented Feb 12, 2019

Looks like this worked; there's now a clear delay between when we finish the templates and when we start the renderer:

I0212 18:17:02.595914       1 leaderelection.go:196] successfully acquired lease openshift-machine-config-operator/machine-config-controller
I0212 18:17:02.796845       1 template_controller.go:122] Starting MachineConfigController-TemplateController
I0212 18:17:02.996081       1 kubelet_config_controller.go:138] Starting MachineConfigController-KubeletConfigController
I0212 18:17:07.218526       1 template_controller.go:380] Initial template sync done
I0212 18:17:07.395747       1 render_controller.go:112] Starting MachineConfigController-RenderController
I0212 18:17:07.396296       1 node_controller.go:121] Starting MachineConfigController-NodeController
I0212 18:17:12.996365       1 render_controller.go:456] Generated machineconfig master-761cbffcf67ae93c8fde4cb77438e3da from 3 configs: [{MachineConfig  00-master  machineconfiguration.openshift.io/v1  } {MachineConfig  00-master-ssh  machineconfiguration.openshift.io/v1  } {MachineConfig  01-master-kubelet  machineconfiguration.openshift.io/v1  }]
I0212 18:17:13.296207       1 render_controller.go:456] Generated machineconfig worker-fb4c0a0f9bd8c642a941ea3d55db1e7b from 3 configs: [{MachineConfig  00-worker  machineconfiguration.openshift.io/v1  } {MachineConfig  00-worker-ssh  machineconfiguration.openshift.io/v1  } {MachineConfig  01-worker-kubelet  machineconfiguration.openshift.io/v1  }]
@cgwalters (Contributor, Author) commented Feb 12, 2019

Something went wrong on the operator side, though; we never exited init mode in that run:

E0212 18:16:38.428010       1 reflector.go:134] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfigPool: the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io)
I0212 18:16:40.640806       1 sync.go:48] [init mode] synced pools in 21.804511ms
I0212 18:17:07.726846       1 sync.go:48] [init mode] synced mcc in 27.077026444s
I0212 18:17:13.840892       1 sync.go:48] [init mode] synced mcs in 6.103865369s
I0212 18:17:19.936551       1 sync.go:48] [init mode] synced mcd in 6.081135133s
I0212 18:27:19.953801       1 sync.go:48] [init mode] synced required-pools in 10m0.000482382s
I0212 18:27:20.053636       1 sync.go:48] [init mode] synced pools in 9.189432ms
I0212 18:27:22.105326       1 sync.go:48] [init mode] synced mcc in 2.042414571s
I0212 18:27:23.174254       1 sync.go:48] [init mode] synced mcs in 1.058831684s
I0212 18:27:24.535234       1 sync.go:48] [init mode] synced mcd in 1.351046342s
I0212 18:37:24.545491       1 sync.go:48] [init mode] synced required-pools in 10m0.000319228s
I0212 18:37:24.646409       1 sync.go:48] [init mode] synced pools in 8.973161ms
I0212 18:37:26.693497       1 sync.go:48] [init mode] synced mcc in 2.038344699s
I0212 18:37:27.759663       1 sync.go:48] [init mode] synced mcs in 1.055972849s
I0212 18:37:29.125584       1 sync.go:48] [init mode] synced mcd in 1.354875482s
I0212 18:47:29.136629       1 sync.go:48] [init mode] synced required-pools in 10m0.000314036s
I0212 18:47:29.236215       1 sync.go:48] [init mode] synced pools in 9.119727ms
I0212 18:47:31.283226       1 sync.go:48] [init mode] synced mcc in 2.037753306s
I0212 18:47:32.354976       1 sync.go:48] [init mode] synced mcs in 1.061297401s
I0212 18:47:33.715687       1 sync.go:48] [init mode] synced mcd in 1.349676074s

Both the master/worker pools are:

                "machineCount": 0,
                "readyMachineCount": 0,
                "unavailableMachineCount": 0,
                "updatedMachineCount": 0

Hmm. Something interesting about the MCC logs is that I don't see any output from the node controller.

@cgwalters cgwalters force-pushed the cgwalters:controller-phased-startup branch from d517532 to a2d9d63 Feb 12, 2019

@cgwalters (Contributor, Author) commented Feb 12, 2019

Not sure what happened in that run. Added two commits which tweak logging; those should be helpful in general.

controller: Wait for template sub-controller before starting renderer
Avoid e.g. rendering a MC without the `01-master-kubelet` config template.

Closes: #338

Co-authored-by: Antonio Murdaca <runcom@linux.com>

@cgwalters cgwalters force-pushed the cgwalters:controller-phased-startup branch from a2d9d63 to 4e4fa04 Feb 14, 2019

@openshift-ci-robot commented Feb 14, 2019

@cgwalters: The following tests failed, say /retest to rerun them all:

Test name           Commit   Details  Rerun command
ci/prow/e2e-aws     4e4fa04  link     /test e2e-aws
ci/prow/e2e-aws-op  4e4fa04  link     /test e2e-aws-op

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters (Contributor, Author) commented Feb 14, 2019

OK I totally don't understand why the node sub-controller apparently isn't reacting with this PR. We're starting it...

Testing this out on a local libvirt run...
