sdn: kill containers that fail to update on node restart #14665
Conversation
I did try various iterations of updating the PodStatus to mark the containers as terminated and stuff like that, but unfortunately kubelet also keeps updating status and overwrites the terminated status with its own good status because it has no idea that networking is busted for those pods.
Force-pushed from 114160f to 85001dd
please file a github issue or trello card
That potentially runs into problems with containerized installs so this is probably better.
pkg/sdn/plugin/node.go
Outdated
	}
	dockerClient := dockertools.ConnectToDockerOrDie(endpoint, time.Minute, time.Minute)

	// Wait until docker has restarted since kubelet will exit it docker isn't running
"if"
pkg/sdn/plugin/node.go
Outdated
	return fmt.Sprintf("%s/%s", pod.Namespace, pod.Name)
}

func dockerSandboxNameToInfraPodNamePrefix(name string) (string, error) {
this is clearly copied from somewhere in the guts of k8s and should probably indicate where, so if it breaks in the future we know where to look
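To make the provenance question concrete: dockershim builds sandbox container names by joining fixed fields with underscores, so the parser presumably inverts that. A rough sketch, with the "k8s_POD_<pod>_<namespace>_<uid>_<attempt>" layout assumed from dockershim's naming convention rather than taken from this PR:

package plugin

import (
	"fmt"
	"strings"
)

// sandboxNameToInfraPodNamePrefix derives the pod-identifying prefix from a
// dockershim-style sandbox container name. The 6-field layout is an
// assumption based on dockershim's makeSandboxName; see upstream kubernetes
// for the authoritative format.
func sandboxNameToInfraPodNamePrefix(name string) (string, error) {
	// Docker returns container names with a leading "/".
	name = strings.TrimPrefix(name, "/")
	parts := strings.Split(name, "_")
	// Expect: "k8s", "POD", pod name, namespace, uid, attempt counter.
	if len(parts) != 6 || parts[0] != "k8s" || parts[1] != "POD" {
		return "", fmt.Errorf("unrecognized sandbox name %q", name)
	}
	// Everything except the attempt counter identifies the pod instance.
	return strings.Join(parts[:5], "_"), nil
}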
pkg/sdn/plugin/node.go
Outdated
	return nil
}

func (node *OsdnNode) killUpdateFailedPods(pods []kapi.Pod) error {
Maybe move all this stuff to a new file? (It shouldn't even need to be part of OsdnNode.)
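Either way it's factored, the overall shape of the kill loop is simple. A self-contained sketch; Pod and killInfraContainerForPod here are stand-ins for kapi.Pod and the PR's docker-backed helper, not its actual code:

package plugin

import "log"

// Pod stands in for kapi.Pod; killInfraContainerForPod stands in for the
// PR's real docker-backed helper (hypothetical).
type Pod struct{ Namespace, Name string }

func killInfraContainerForPod(pod *Pod) error { return nil } // hypothetical

func killUpdateFailedPods(pods []Pod) error {
	for i := range pods {
		pod := &pods[i]
		log.Printf("killing pod %s/%s: network update failed on node restart", pod.Namespace, pod.Name)
		if err := killInfraContainerForPod(pod); err != nil {
			// Keep going; one failed kill shouldn't block the rest.
			log.Printf("failed to kill pod %s/%s: %v", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}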
@danwinship updated, and now with a testcase
Force-pushed from 85001dd to 206e666
pkg/sdn/plugin/node.go
Outdated
	if err != nil {
		return fmt.Errorf("failed to connect to docker after SDN cleanup restart: %v", err)
	}); err != nil {
		return nil, fmt.Errorf("failed to connect to docker after SDN cleanup restart: %v", err)
That error message is not correct for the kill-on-update case
pkg/sdn/plugin/node.go
Outdated
@@ -270,24 +278,38 @@ func (node *OsdnNode) Start() error {
		return err
	}

	var podsToKill []kapi.Pod
	if networkChanged {
we can only have podsToKill if the network changed so maybe all the killing code should be inside the if body?
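Something like this would follow the suggestion (a fragment of Start()'s flow; getPodsWithFailedUpdates is a placeholder name, not the PR's actual helper):

	if networkChanged {
		// Both the collection and the kill live inside the branch, so no
		// kill logic runs when the network configuration is unchanged.
		podsToKill := node.getPodsWithFailedUpdates()
		if len(podsToKill) > 0 {
			if err := killUpdateFailedPods(podsToKill); err != nil {
				return err
			}
		}
	}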
pkg/sdn/plugin/update.go
Outdated
		if c.ID == cid.ID {
			infraPrefix, err = dockerSandboxNameToInfraPodNamePrefix(c.Names[0])
			if err != nil {
				return fmt.Errorf("unparsable container ID %q", c.ID)
shouldn't you return err?
	}
	// Find and kill the infra container
	for _, c := range containers {
		if strings.HasPrefix(c.Names[0], infraPrefix) {
so this seems fragile... do we have some long-term plan to make this code unnecessary or is this expected to stick around forever?
@danwinship the long-term plan is to make kubelet ask for network status on restart, and if some pod fails, kill and restart that pod for us, instead of assuming everything is groovy.
Force-pushed from 206e666 to 0177154
@danwinship PTAL, thanks...
lgtm
I really don't like this, but I guess it is the least bad option available :-(
I'll merge this on Monday.
[Test]ing while waiting on the merge queue
With the move to remote runtimes, we can no longer get the pod's network namespace from kubelet (since we cannot insert ourselves into the remote runtime's plugin list and intercept network plugin calls). As kubelet does not call network plugins in any way on startup if a container is already running, we have no way to ensure the container is using the correct NetNamespace (as it may have changed while openshift-node was down) at startup, unless we encode the required information into OVS flows.

But if OVS was restarted around the same time OpenShift was, those flows are lost, and we have no information with which to recover the pod's networking on node startup. In this case, kill the infra container underneath kubelet so that it will be restarted and we can set its network up again.

NOTE: this is somewhat hacky and will not work with other remote runtimes like CRI-O, but OpenShift 3.6 hardcodes dockershim so this isn't a problem yet. The "correct" solution is to either checkpoint our network configuration at container setup time and recover that ourselves, or to add a GET/STATUS call to CNI and make Kubelet call that operation on startup when recovering running containers.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1453113
Force-pushed from 0177154 to fd6dee8
re[test]
	// Find and kill the infra container
	for _, c := range containers {
		if strings.HasPrefix(c.Names[0], infraPrefix) {
			if err := docker.StopContainer(c.ID, 10); err != nil {
Ignoring grace period is bad - this could result in silent data corruption for some containers.
@smarterclayton this matches what dockershim/docker_sandbox.go uses for sandbox containers, which is defaultSandboxGracePeriod int = 10. Do you think we need to honor the grace period for the entire pod while stopping the infra container, rather than using the default sandbox grace period?
No, as long as kubernetes will apply graceful on the rest this is acceptable.
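So the agreed behavior, sketched: stop only the sandbox with the fixed dockershim-style grace period, and let kubelet apply the pod's own grace period to the remaining containers. dockerClient here is a stand-in interface, and the constant mirrors the dockershim value named above:

package plugin

// dockerClient is a stand-in for the subset of the docker API this code uses.
type dockerClient interface {
	StopContainer(id string, timeoutSeconds uint) error
}

// defaultSandboxGracePeriod mirrors the dockershim constant cited above (seconds).
const defaultSandboxGracePeriod uint = 10

// stopInfraContainer stops only the sandbox with the fixed default; kubelet
// still applies the pod's own grace period to the pod's other containers.
func stopInfraContainer(docker dockerClient, containerID string) error {
	return docker.StopContainer(containerID, defaultSandboxGracePeriod)
}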
[merge]
Removing merge briefly while comment is addressed.
re-[merge]
[test]
re-[test] flake #14929
Evaluated for origin test up to fd6dee8
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2740/) (Base Commit: a67ac87) (PR Branch Commit: fd6dee8)
flake #14043, [merge]
continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 2
flake #14043, [merge]
Evaluated for origin merge up to fd6dee8
Automatic merge from submit-queue

sdn: move sandbox kill-on-failed-update to runtime socket from direct docker

Follow-on to #14892 that reworks #14665 to use the new runtime socket stuff. @openshift/networking @danwinship @pravisankar
With the move to remote runtimes, we can no longer get the pod's
network namespace from kubelet (since we cannot insert ourselves
into the remote runtime's plugin list and intercept network plugin
calls). As kubelet does not call network plugins in any way on
startup if a container is already running, we have no way to ensure
the container is using the correct NetNamespace (as it may have
changed while openshift-node was down) at startup, unless we encode
the required information into OVS flows.
But if OVS was restarted around the same time OpenShift was,
those flows are lost, and we have no information with which to
recover the pod's networking on node startup. In this case, kill
the infra container underneath kubelet so that it will be restarted
and we can set its network up again.
NOTE: this is somewhat hacky and will not work with other remote
runtimes like CRI-O, but OpenShift 3.6 hardcodes dockershim so this
isn't a problem yet. The "correct" solution is to either checkpoint
our network configuration at container setup time and recover that
ourselves, or to add a GET/STATUS call to CNI and make Kubelet call
that operation on startup when recovering running containers.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1453113
@danwinship @openshift/networking @knobunc @eparis
Alternative: we just restart docker which will kill all the pods anyway.