sdn: fix initialization order to prevent crash on node startup #13766
Conversation
I assume this will be backported to 1.5?
@smarterclayton yeah, it should be
Force-pushed from 57a7167 to fc95b86
pkg/sdn/plugin/node.go
Outdated
// podManager must be created early because other goroutines
// may call into it before its started due to event watches
log.V(5).Infof("Creating openshift-sdn pod manager")
node.podManager = newPodManager(node.kClient, node.policy, node.mtu, node.oc)
typo: "its started" should be "it's started" (in the code comment)
should this be in NewNodePlugin() rather than Start()?
also, maybe move node.localSubnetCIDR, err = node.getLocalSubnet()
to before the node_iptables calls, and then add a comment between the two indicating that that's the point after which all of node's fields have been initialized?
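A minimal sketch of the pattern being suggested: do all field initialization in the constructor, so that by the time `Start()` registers watches or spawns goroutines, every field is already set. The names here (`nodePlugin`, `podManager`, `getLocalSubnet`) echo the snippet above but are illustrative stand-ins, not the real openshift-sdn types.

```go
package main

import "fmt"

type podManager struct{ started bool }

type nodePlugin struct {
	podManager *podManager
	subnetCIDR string
}

// newNodePlugin fills in every field up front, mirroring the suggestion
// to create the pod manager in NewNodePlugin() rather than in Start().
func newNodePlugin() *nodePlugin {
	return &nodePlugin{podManager: &podManager{}}
}

func (n *nodePlugin) getLocalSubnet() (string, error) {
	return "10.128.0.0/23", nil // stand-in for the real subnet lookup
}

func (n *nodePlugin) Start() error {
	var err error
	if n.subnetCIDR, err = n.getLocalSubnet(); err != nil {
		return err
	}
	// All of n's fields are initialized past this point; anything that
	// spawns goroutines or registers event watches belongs below here.
	n.podManager.started = true
	return nil
}

func main() {
	n := newNodePlugin()
	if err := n.Start(); err != nil {
		panic(err)
	}
	fmt.Println(n.subnetCIDR, n.podManager.started)
}
```

The point of the comment between `getLocalSubnet()` and the iptables setup is to make the safe/unsafe boundary explicit, so future changes don't accidentally move goroutine-spawning code above it.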
@danwinship all great points. Done.
Force-pushed from fc95b86 to 9b5f066
pkg/sdn/plugin/node.go
Outdated
	return err
}
if err := node.podManager.Start(cniserver.CNIServerSocketPath); err != nil {
// Kubelet has initialized, now we have a valid node.host
Oops, ok, so then the comment above about "all OsdnNode fields have been initialized" is a lie then. Fix that in some way. Then LGTM.
@danwinship PTAL, I think we can just move the kubelet init block above the "everything is initialized" comment; I don't think there's any reason it has to be that far down.
Erm... I think you're right, but given the fun we've had with startup ordering in the past (and that we're continuing to have now) I'd feel a little bit sketchy merging that to release-1.5 without it getting any testing in master first. How about we make that change for master but for release-1.5 leave it where it is and just clarify in the comment that everything except .host and .kubeletCniPlugin are initialized by that point.
If this isn't happening often on 1.5 I'm ok waiting.
[test]
Force-pushed from 9b5f066 to 0e01ff0
OsdnNode.Start() (node.pm == nil at this point)
  -> node.policy.Start() (which is multitenant policy)
  -> mp.vnids.Start()
  -> go vmap.watchNetNamespaces()
  -> (net namespace event happens)
  -> watchNetNamespaces()
  -> vmap.policy.AddNetNamespace() (policy is multitenant)
  -> mp.updatePodNetwork()
  -> mp.node.podManager.UpdateLocalMulticastRules() (and podManager is still nil)

Create the PodManager earlier so it's not nil if we get early events.

Fixes: openshift#13742
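A hypothetical minimal reproduction of that startup race. The type and method names echo the real ones, but the bodies are illustrative, not the actual openshift-sdn code: an event handler that fires before `Start()` assigns `node.podManager` dereferences a nil pointer, and creating the `PodManager` earlier avoids it.

```go
package main

import "fmt"

type PodManager struct{ rules []string }

func (pm *PodManager) UpdateLocalMulticastRules() {
	pm.rules = append(pm.rules, "multicast") // dereferences pm: panics if pm is nil
}

type Node struct{ podManager *PodManager }

// handleEvent stands in for the watchNetNamespaces() event path; it
// recovers the panic so the crash can be shown as an error value.
func (n *Node) handleEvent() (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("crash: %v", r)
		}
	}()
	n.podManager.UpdateLocalMulticastRules()
	return nil
}

func main() {
	broken := &Node{} // podManager not yet created, as in the bug
	fmt.Println(broken.handleEvent())

	fixed := &Node{podManager: &PodManager{}} // created early, as in the fix
	fmt.Println(fixed.handleEvent())
}
```

The fix is purely an ordering change: the goroutines launched by `policy.Start()` are unchanged, they just can no longer observe a nil `podManager`.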
Force-pushed from 0e01ff0 to 94fb0f4
test failure was previous run... [test]
re-[test] issue #13650
re-[test] issue #13827
[test]
re-[test] issue #13831
[merge] once the tests stop flaking...
[merge]
[merge]
Evaluated for origin merge up to 94fb0f4
continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/465/) (Base Commit: 184b859)
[merge]
[test]
(if it comes back green let's merge by hand, since we already merged in older releases)
Evaluated for origin test up to 94fb0f4
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/956/) (Base Commit: c2e95f5)
@openshift/networking