controller: Wait for template and kubelet before starting renderer #385
Conversation
jlebon reviewed on Feb 5, 2019
cgwalters force-pushed the cgwalters:controller-phased-startup branch from b4b7de1 to 37fac58 on Feb 5, 2019
This looks plausible to me. Will let others have a look and test it out.
openshift-ci-robot added the approved label on Feb 5, 2019
runcom reviewed on Feb 5, 2019
cgwalters force-pushed the cgwalters:controller-phased-startup branch from 37fac58 to 4c6732e on Feb 5, 2019
openshift-ci-robot added the size/M label on Feb 5, 2019
cgwalters force-pushed the cgwalters:controller-phased-startup branch from 4c6732e to 0be8861 on Feb 5, 2019
Approving code-wise, but I won't be able to test it on a cluster till tomorrow, nor do tests seem to be running anyway :(

/approve
/retest

/retest
/test e2e-aws-op
cgwalters force-pushed the cgwalters:controller-phased-startup branch from 0be8861 to 646fa63 on Feb 6, 2019
/retest

/retest
/lgtm
openshift-ci-robot assigned runcom on Feb 7, 2019
openshift-ci-robot added the lgtm label on Feb 7, 2019
/retest
tests are failing because of:
runcom referenced this pull request on Feb 7, 2019: pkg/operator: lay down ClusterOperator for MCO #386 (merged)
/retest
monitoring flake again

/retest
/lgtm
openshift-ci-robot assigned jlebon on Feb 7, 2019
openshift-ci-robot commented Feb 7, 2019

[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cgwalters, jlebon, runcom. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
wtf, still the very same issue with the monitoring operator, reported to them:
cgwalters force-pushed the cgwalters:controller-phased-startup branch from 646fa63 to 26085f3 on Feb 7, 2019
openshift-ci-robot commented Feb 7, 2019

New changes are detected. LGTM label has been removed.
MCC logs in that run had:

which is interesting. It didn't actually get to the point of starting the other sub-controllers; it's clearly waiting on the lock. But I'm not sure whether that's because of some other issue.
runcom reviewed on Feb 7, 2019
```go
go template.New(
	rootOpts.templates,
	ctx.InformerFactory.Machineconfiguration().V1().ControllerConfigs(),
	ctx.InformerFactory.Machineconfiguration().V1().MachineConfigs(),
	ctx.ClientBuilder.KubeClientOrDie("template-controller"),
	ctx.ClientBuilder.MachineConfigClientOrDie("template-controller"),
	readyFlag,
```
runcom (Member) commented Feb 7, 2019:

mmm, I believe we need to pass a reference to the wg here; the WaitGroup itself contains a Mutex (iirc), and passing it by value would get it copied (causing all sorts of subtle effects, because the .Done() method is called on the wrong object).
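To illustrate the pitfall runcom is describing, here is a minimal, self-contained Go sketch (illustrative names only, not the PR's code). A sync.WaitGroup copied by value gets its own counter, so Done() on the copy never unblocks the original's Wait(); go vet flags this as passing a lock by value.

```go
package main

import (
	"fmt"
	"sync"
)

// brokenWorker takes the WaitGroup by value: Done() decrements the
// copy's counter, so a caller blocked on the original's Wait() would
// hang forever. `go vet` reports this as "passes lock by value".
func brokenWorker(wg sync.WaitGroup) {
	defer wg.Done() // affects only the copy
	fmt.Println("broken worker finished")
}

// worker takes a pointer, so Done() decrements the shared counter.
func worker(wg *sync.WaitGroup) {
	defer wg.Done()
	fmt.Println("worker finished")
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go worker(&wg) // pass by pointer
	wg.Wait()      // returns once worker calls Done()
}
```

Passing *sync.WaitGroup (or closing over a single shared instance) avoids the copy entirely.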
the deadlock could be because of https://github.com/openshift/machine-config-operator/pull/385/files#r254901533
cgwalters force-pushed the cgwalters:controller-phased-startup branch 2 times, most recently from 36bcc49 to d20b04a on Feb 8, 2019
flake I guess:
/retest

/retest
not sure why the installer keeps failing on the prometheus operator here (from the logs)

/retest
Looking at the logs here, we're still stalling out in the MCC. So clearly this code isn't working as I thought. I need to try again to build a local release payload and test locally so I can debug this more easily.
/hold
openshift-ci-robot added the do-not-merge/hold label on Feb 8, 2019
ok, looking at this PR from my editor again now, I see there are many code paths in
So, just from my understanding of the flow, we should close the WaitGroup anyway, even if we're going through another code path, right? (even an error one)

```diff
diff --git a/pkg/controller/kubelet-config/kubelet_config_controller.go b/pkg/controller/kubelet-config/kubelet_config_controller.go
index 9acc10c..9d368f6 100644
--- a/pkg/controller/kubelet-config/kubelet_config_controller.go
+++ b/pkg/controller/kubelet-config/kubelet_config_controller.go
@@ -304,6 +304,9 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
 	glog.V(4).Infof("Started syncing kubeletconfig %q (%v)", key, startTime)
 	defer func() {
 		glog.V(4).Infof("Finished syncing kubeletconfig %q (%v)", key, time.Since(startTime))
+		ctrl.firstSync.Do(func() {
+			ctrl.readyFlag.Done()
+		})
 	}()

 	_, name, err := cache.SplitMetaNamespaceKey(key)
@@ -420,10 +423,6 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
 		glog.Infof("Applied KubeletConfig %v on MachineConfigPool %v", key, pool.Name)
 	}

-	ctrl.firstSync.Do(func() {
-		ctrl.readyFlag.Done()
-	})
-
 	return ctrl.syncStatusOnly(cfg, nil)
 }
diff --git a/pkg/controller/template/template_controller.go b/pkg/controller/template/template_controller.go
index ec5681b..7bafbdf 100644
--- a/pkg/controller/template/template_controller.go
+++ b/pkg/controller/template/template_controller.go
@@ -324,6 +324,9 @@ func (ctrl *Controller) syncControllerConfig(key string) error {
 	glog.V(4).Infof("Started syncing controllerconfig %q (%v)", key, startTime)
 	defer func() {
 		glog.V(4).Infof("Finished syncing controllerconfig %q (%v)", key, time.Since(startTime))
+		ctrl.firstSync.Do(func() {
+			ctrl.readyFlag.Done()
+		})
 	}()

 	_, name, err := cache.SplitMetaNamespaceKey(key)
@@ -376,10 +379,6 @@ func (ctrl *Controller) syncControllerConfig(key string) error {
 		}
 	}

-	ctrl.firstSync.Do(func() {
-		ctrl.readyFlag.Done()
-	})
-
 	return ctrl.syncCompletedStatus(cfg)
 }
```
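Abstracting the diff above: the pattern is a sync.Once guarding readyFlag.Done(), moved into a defer so it fires on every exit path. A minimal sketch, assuming simplified types (firstSync and readyFlag mirror the PR's fields; everything else is illustrative):

```go
package controllers

import "sync"

// Controller is a stripped-down stand-in for the PR's sub-controllers.
type Controller struct {
	firstSync sync.Once       // guarantees Done() is called at most once
	readyFlag *sync.WaitGroup // shared with controllers waiting on our first sync
}

func (ctrl *Controller) syncConfig(key string) error {
	// Running the Once inside a defer releases the WaitGroup on every
	// exit path, success or error, so waiters cannot deadlock just
	// because the first sync returned early.
	defer ctrl.firstSync.Do(ctrl.readyFlag.Done)

	// ... the actual sync work would go here ...
	return nil
}
```

Whether to signal on all paths (waiters never hang) or only on success ("ready" actually means ready) is exactly the trade-off debated further down.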
we can enhance that patch above to also return an error through a channel and completely fail the initialization of the other dependent controllers if we really want to (not sure that's desirable though)
cgwalters force-pushed the cgwalters:controller-phased-startup branch from d20b04a to 494c5c2 on Feb 11, 2019
Looking at this again, the problem seems obvious: we can't block at that point, or we won't start the kube informers, etc. Changed to start a new goroutine to do an async wait.
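A rough sketch of what that shape could look like, under assumed names (readyFlag matches the PR; startPhased and runFunc are hypothetical):

```go
package controllers

import "sync"

// runFunc stands in for a sub-controller's Run entry point.
type runFunc func(stopCh <-chan struct{})

// startPhased launches the template controller immediately and the
// renderer only after readyFlag is released, without ever blocking the
// caller. Blocking here instead would deadlock: the template controller
// can only finish its first sync (and call readyFlag.Done()) once the
// caller has gone on to start the shared informers.
func startPhased(template, renderer runFunc, readyFlag *sync.WaitGroup, stopCh <-chan struct{}) {
	readyFlag.Add(1)    // released by the template controller's first sync
	go template(stopCh) // assumed to call readyFlag.Done() when synced

	go func() {
		readyFlag.Wait() // async wait, off the startup path
		renderer(stopCh)
	}()
}
```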
runcom reviewed on Feb 11, 2019
```diff
@@ -409,6 +420,10 @@ func (ctrl *Controller) syncKubeletConfig(key string) error {
 		glog.Infof("Applied KubeletConfig %v on MachineConfigPool %v", key, pool.Name)
 	}

+	ctrl.firstSync.Do(func() {
+		ctrl.readyFlag.Done()
+	})
```
runcom (Member) commented Feb 11, 2019:

shouldn't this still be moved to the defer above, to avoid it not being called on errors (leaking the other goroutine at that point) or on other successful paths?
cgwalters (Author) commented Feb 11, 2019:

I don't think so, because we want to only say we're ready on success, right?
runcom (Member) commented Feb 11, 2019:

how do we spot that we're blocked on the wait() though? logs? is that a condition we must wait on forever, or do we need to error?
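One common answer to "how do we spot that we're blocked?", offered as an assumption rather than what this PR ended up doing: race the Wait() against a timer, then log or error.

```go
package controllers

import (
	"fmt"
	"sync"
	"time"
)

// waitWithTimeout wraps wg.Wait() so a stuck wait surfaces as an error
// (or a log line) instead of a silent hang. The helper goroutine leaks
// if the WaitGroup is never released, which is acceptable for a process
// that is about to bail out anyway.
func waitWithTimeout(wg *sync.WaitGroup, d time.Duration) error {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-time.After(d):
		return fmt.Errorf("still waiting for first controller sync after %v", d)
	}
}
```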
OK finally did a bit more of a dive into the codebase here to understand things. First, I was totally confused; the The reason this PR is hanging is because by default
And we were waiting for one. For now...all we need to do is block on the template controller having done a full sync. Going to rewrok this PR. |
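A hypothetical sketch of "block on the template controller having done a full sync": a close-once channel exposed by the template controller, instead of a WaitGroup shared across several controllers (names are illustrative, not necessarily the shape the rework took):

```go
package controllers

import "sync"

// firstSync signals "the first full sync has completed" by closing a
// channel exactly once. Dependants can block on Done() or select on it
// alongside their stop channel.
type firstSync struct {
	once sync.Once
	ch   chan struct{}
}

func newFirstSync() *firstSync {
	return &firstSync{ch: make(chan struct{})}
}

// markDone is called at the end of the first successful sync.
func (s *firstSync) markDone() { s.once.Do(func() { close(s.ch) }) }

// Done returns a channel that is closed once the first sync has completed.
func (s *firstSync) Done() <-chan struct{} { return s.ch }
```

The renderer would then block with `<-templates.Done()` before entering its work loop, where `templates` is the template controller's firstSync instance (again, hypothetical).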
cgwalters force-pushed the cgwalters:controller-phased-startup branch from 494c5c2 to d517532 on Feb 12, 2019
Looks like this worked; there's a clear delay now between when we finish the templates and when we start the renderer:
Something went wrong on the operator side though; we never exited init mode in that run:

Both the master/worker pools are:

Hmm. Something interesting about the MCC logs is that I don't see any output from the node controller.
cgwalters force-pushed the cgwalters:controller-phased-startup branch from d517532 to a2d9d63 on Feb 12, 2019
Not sure what happened in that run. Added 2 commits which tweak logging; they should be helpful in general.
This was referenced on Feb 13, 2019
cgwalters force-pushed the cgwalters:controller-phased-startup branch from a2d9d63 to 4e4fa04 on Feb 14, 2019
openshift-ci-robot commented Feb 14, 2019

@cgwalters: The following tests failed, say /retest to rerun them all:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
OK, I totally don't understand why the node sub-controller apparently isn't reacting with this PR. We're starting it... Testing this out on a local libvirt run...
cgwalters commented Feb 5, 2019

Avoid e.g. rendering a MC without the kubelet config.

Closes: #338