sdn: kill containers that fail to update on node restart #14665

Merged: 1 commit into openshift:master on Jun 28, 2017

Conversation

@dcbw (Contributor) commented Jun 15, 2017

With the move to remote runtimes, we can no longer get the pod's
network namespace from kubelet (since we cannot insert ourselves
into the remote runtime's plugin list and intercept network plugin
calls). As kubelet does not call network plugins in any way on
startup if a container is already running, we have no way to ensure
the container is using the correct NetNamespace (as it may have
changed while openshift-node was down) at startup, unless we encode
the required information into OVS flows.

But if OVS was restarted around the same time OpenShift was,
those flows are lost, and we have no information with which to
recover the pod's networking on node startup. In this case, kill
the infra container underneath kubelet so that it will be restarted
and we can set its network up again.

NOTE: this is somewhat hacky and will not work with other remote
runtimes like CRI-O, but OpenShift 3.6 hardcodes dockershim so this
isn't a problem yet. The "correct" solution is to either checkpoint
our network configuration at container setup time and recover that
ourselves, or to add a GET/STATUS call to CNI and make Kubelet call
that operation on startup when recovering running containers.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1453113

@danwinship @openshift/networking @knobunc @eparis

Alternative: we just restart docker which will kill all the pods anyway.
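
For readers who don't want to dig through the diff, the shape of the approach described above is roughly: list all containers, find each failed pod's infra ("pause") container via dockershim's naming convention, and stop it so kubelet recreates the sandbox and re-runs network setup. A minimal self-contained sketch of that idea; the package, DockerClient interface, and type names below are illustrative stand-ins, not the dockertools types the PR actually uses:

package sdnsketch

import (
    "fmt"
    "strings"
)

// Pod identifies a pod whose OVS flows could not be recovered after restart.
type Pod struct {
    Namespace string
    Name      string
}

// Container is a minimal stand-in for the docker client's container type.
type Container struct {
    ID    string
    Names []string
}

// DockerClient is an illustrative subset of the docker operations the diff
// relies on (list everything, stop one container).
type DockerClient interface {
    ListContainers(all bool) ([]Container, error)
    StopContainer(id string, timeoutSeconds int) error
}

// killUpdateFailedPods stops the infra ("pause") container of each failed pod.
// Kubelet then notices the dead sandbox, recreates it, and calls the network
// plugin again, which rebuilds the pod's OVS flows from scratch.
func killUpdateFailedPods(docker DockerClient, pods []Pod) error {
    containers, err := docker.ListContainers(true)
    if err != nil {
        return fmt.Errorf("failed to list containers: %v", err)
    }
    for _, pod := range pods {
        // dockershim names containers k8s_<container>_<pod>_<namespace>_<uid>_<attempt>,
        // with the literal name "POD" for the sandbox; match on the stable prefix.
        prefix := fmt.Sprintf("/k8s_POD_%s_%s_", pod.Name, pod.Namespace)
        for _, c := range containers {
            if len(c.Names) > 0 && strings.HasPrefix(c.Names[0], prefix) {
                // 10 seconds matches dockershim's default sandbox grace period.
                if err := docker.StopContainer(c.ID, 10); err != nil {
                    return fmt.Errorf("failed to stop infra container %s: %v", c.ID, err)
                }
            }
        }
    }
    return nil
}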

@dcbw (Contributor, Author) commented Jun 15, 2017

I did try various iterations of updating the PodStatus to mark the containers as terminated and stuff like that, but unfortunately kubelet also keeps updating status and overwrites the terminated status with its own good status because it has no idea that networking is busted for those pods.

@danwinship (Contributor) commented:

> The "correct" solution is...

please file a github issue or trello card

> Alternative: we just restart docker which will kill all the pods anyway.

That potentially runs into problems with containerized installs, so this is probably better.

}
dockerClient := dockertools.ConnectToDockerOrDie(endpoint, time.Minute, time.Minute)

// Wait until docker has restarted since kubelet will exit it docker isn't running
Review comment (Contributor):

"if"

return fmt.Sprintf("%s/%s", pod.Namespace, pod.Name)
}

func dockerSandboxNameToInfraPodNamePrefix(name string) (string, error) {
Review comment (Contributor):

this is clearly copied from somewhere in the guts of k8s and should probably indicate where, so if it breaks in the future we know where to look
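
For context on what the copied helper parses: dockershim names docker containers /k8s_<container>_<podname>_<namespace>_<poduid>_<attempt>, with the literal container name "POD" for the sandbox/infra container. An illustrative reimplementation of that parsing (not the kubelet code the diff borrows from):

package sdnsketch

import (
    "fmt"
    "strings"
)

// sandboxNameToInfraPrefix takes a dockershim sandbox container name such as
//   /k8s_POD_mypod_myns_0123abcd-..._0
// and rebuilds the stable "/k8s_POD_<podname>_<namespace>_" prefix that can be
// used later to find the pod's infra container again.
func sandboxNameToInfraPrefix(name string) (string, error) {
    parts := strings.Split(strings.TrimPrefix(name, "/"), "_")
    if len(parts) != 6 || parts[0] != "k8s" || parts[1] != "POD" {
        return "", fmt.Errorf("unrecognized dockershim sandbox name %q", name)
    }
    podName, namespace := parts[2], parts[3]
    return fmt.Sprintf("/k8s_POD_%s_%s_", podName, namespace), nil
}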

return nil
}

func (node *OsdnNode) killUpdateFailedPods(pods []kapi.Pod) error {
Review comment (Contributor):

Maybe move all this stuff to a new file? (It shouldn't even need to be part of OsdnNode.)

@dcbw (Contributor, Author) commented Jun 21, 2017

@danwinship updated, and now with a testcase

if err != nil {
return fmt.Errorf("failed to connect to docker after SDN cleanup restart: %v", err)
}); err != nil {
return nil, fmt.Errorf("failed to connect to docker after SDN cleanup restart: %v", err)
Review comment (Contributor):

That error message is not correct for the kill-on-update case

@@ -270,24 +278,38 @@ func (node *OsdnNode) Start() error {
return err
}

var podsToKill []kapi.Pod
if networkChanged {
Review comment (Contributor):

we can only have podsToKill if the network changed so maybe all the killing code should be inside the if body?
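
A sketch of the restructuring being suggested, with the collection and the kill both scoped to the branch that knows the network changed. This is a fragment in the context of the diff excerpt above (kapi, node, and glog come from the surrounding origin code), not standalone code:

if networkChanged {
    var podsToKill []kapi.Pod
    // ... collect running pods whose OVS flows could not be restored ...
    if len(podsToKill) > 0 {
        if err := node.killUpdateFailedPods(podsToKill); err != nil {
            glog.Warningf("failed to restart pods that failed to update on startup: %v", err)
        }
    }
}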

if c.ID == cid.ID {
infraPrefix, err = dockerSandboxNameToInfraPodNamePrefix(c.Names[0])
if err != nil {
return fmt.Errorf("unparsable container ID %q", c.ID)
Review comment (Contributor):

shouldn't you return err?

}
// Find and kill the infra container
for _, c := range containers {
if strings.HasPrefix(c.Names[0], infraPrefix) {
@danwinship (Contributor) commented:

so this seems fragile... do we have some long-term plan to make this code unnecessary or is this expected to stick around forever?

@dcbw (Contributor, Author) replied:

@danwinship the long-term plan is to make kubelet ask for network status on restart, and if some pod fails, kill and restart that pod for us, instead of assuming everything is groovy.

@dcbw (Contributor, Author) commented Jun 21, 2017

@danwinship PTAL, thanks...

@danwinship (Contributor) left a comment:

lgtm

@knobunc (Contributor) left a comment:

I really don't like this, but I guess it is the least bad option available :-(

@knobunc (Contributor) commented Jun 22, 2017

I'll merge this on Monday.

@openshift-bot (Contributor) commented:

[Test]ing while waiting on the merge queue

@deads2k (Contributor) commented Jun 26, 2017

re[test]

// Find and kill the infra container
for _, c := range containers {
if strings.HasPrefix(c.Names[0], infraPrefix) {
if err := docker.StopContainer(c.ID, 10); err != nil {
@smarterclayton (Contributor) commented:

Ignoring grace period is bad - this could result in silent data corruption for some containers.

@dcbw (Contributor, Author) replied:

@smarterclayton this matches what dockershim/docker_sandbox.go uses for sandbox containers, which is defaultSandboxGracePeriod int = 10. Do you think we need to honor the grace period for the entire pod while stopping the infra container, rather than using the default sandbox grace period?

@smarterclayton (Contributor) replied:

No, as long as kubernetes will apply graceful on the rest this is acceptable.

[merge]
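
To summarize how this thread resolves: the SDN stops only the sandbox (infra) container, with the same fixed 10-second timeout dockershim itself uses for sandboxes, while kubelet keeps applying each pod's own termination grace period to the application containers it tears down and recreates. Schematically, reusing the illustrative DockerClient from the first sketch (the constant name comes from dcbw's comment; the function is a stand-in):

package sdnsketch

// defaultSandboxGracePeriod mirrors the constant dockershim uses when stopping
// sandbox (infra) containers.
const defaultSandboxGracePeriod int = 10

// stopInfraContainer stops only the pod's infra container; the application
// containers are left to kubelet, which honors the pod's grace period when it
// rebuilds the sandbox.
func stopInfraContainer(docker DockerClient, infraContainerID string) error {
    return docker.StopContainer(infraContainerID, defaultSandboxGracePeriod)
}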

@openshift deleted a comment from knobunc on Jun 27, 2017
@smarterclayton (Contributor) commented:

Removing merge briefly while comment is addressed.

@dcbw (Contributor, Author) commented Jun 27, 2017

re-[merge]

@smarterclayton (Contributor) commented:

[test]

@dcbw (Contributor, Author) commented Jun 28, 2017

re-[test] flake #14929

@openshift-bot (Contributor) commented:

Evaluated for origin test up to fd6dee8

@openshift-bot (Contributor) commented:

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2740/) (Base Commit: a67ac87) (PR Branch Commit: fd6dee8)

@danwinship (Contributor) commented:

flake #14043, [merge]

@openshift-bot (Contributor) commented Jun 28, 2017

continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 2

@dcbw (Contributor, Author) commented Jun 28, 2017

flake #14043, [merge]

@openshift-bot (Contributor) commented:

Evaluated for origin merge up to fd6dee8

@smarterclayton merged commit b8eb455 into openshift:master on Jun 28, 2017
openshift-merge-robot added a commit that referenced this pull request Aug 7, 2017
Automatic merge from submit-queue

sdn: move sandbox kill-on-failed-update to runtime socket from direct docker

Follow-on to #14892 that reworks #14665 to use the new runtime socket stuff.

@openshift/networking @danwinship @pravisankar