sdn: kill containers that fail to update on node restart #14665

Merged: 1 commit into openshift:master on Jun 28, 2017

Conversation

@dcbw (Contributor) commented Jun 15, 2017

With the move to remote runtimes, we can no longer get the pod's
network namespace from kubelet (since we cannot insert ourselves
into the remote runtime's plugin list and intercept network plugin
calls). As kubelet does not call network plugins in any way on
startup if a container is already running, we have no way to ensure
the container is using the correct NetNamespace (as it may have
changed while openshift-node was down) at startup, unless we encode
the required information into OVS flows.

But if OVS was restarted around the same time OpenShift was,
those flows are lost, and we have no information with which to
recover the pod's networking on node startup. In this case, kill
the infra container underneath kubelet so that it will be restarted
and we can set its network up again.

NOTE: this is somewhat hacky and will not work with other remote
runtimes like CRI-O, but OpenShift 3.6 hardcodes dockershim so this
isn't a problem yet. The "correct" solution is to either checkpoint
our network configuration at container setup time and recover that
ourselves, or to add a GET/STATUS call to CNI and make Kubelet call
that operation on startup when recovering running containers.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1453113

@danwinship @openshift/networking @knobunc @eparis

Alternative: we just restart docker which will kill all the pods anyway.
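
For readers who don't want to dig through the diff, the shape of the approach described above is roughly: list all containers, find each failed pod's infra ("pause") container via dockershim's naming convention, and stop it so kubelet recreates the sandbox and re-runs network setup. A minimal self-contained sketch of that idea; the package, DockerClient interface, and type names below are illustrative stand-ins, not the dockertools types the PR actually uses:

package sdnsketch

import (
    "fmt"
    "strings"
)

// Pod identifies a pod whose OVS flows could not be recovered after restart.
type Pod struct {
    Namespace string
    Name      string
}

// Container is a minimal stand-in for the docker client's container type.
type Container struct {
    ID    string
    Names []string
}

// DockerClient is an illustrative subset of the docker operations the diff
// relies on (list everything, stop one container).
type DockerClient interface {
    ListContainers(all bool) ([]Container, error)
    StopContainer(id string, timeoutSeconds int) error
}

// killUpdateFailedPods stops the infra ("pause") container of each failed pod.
// Kubelet then notices the dead sandbox, recreates it, and calls the network
// plugin again, which rebuilds the pod's OVS flows from scratch.
func killUpdateFailedPods(docker DockerClient, pods []Pod) error {
    containers, err := docker.ListContainers(true)
    if err != nil {
        return fmt.Errorf("failed to list containers: %v", err)
    }
    for _, pod := range pods {
        // dockershim names containers k8s_<container>_<pod>_<namespace>_<uid>_<attempt>,
        // with the literal name "POD" for the sandbox; match on the stable prefix.
        prefix := fmt.Sprintf("/k8s_POD_%s_%s_", pod.Name, pod.Namespace)
        for _, c := range containers {
            if len(c.Names) > 0 && strings.HasPrefix(c.Names[0], prefix) {
                // 10 seconds matches dockershim's default sandbox grace period.
                if err := docker.StopContainer(c.ID, 10); err != nil {
                    return fmt.Errorf("failed to stop infra container %s: %v", c.ID, err)
                }
            }
        }
    }
    return nil
}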

@dcbw (Contributor, Author) commented Jun 15, 2017

I did try various iterations of updating the PodStatus to mark the containers as terminated and stuff like that, but unfortunately kubelet also keeps updating status and overwrites the terminated status with its own good status because it has no idea that networking is busted for those pods.

@danwinship (Contributor) commented:

> The "correct" solution is...

please file a github issue or trello card

> Alternative: we just restart docker which will kill all the pods anyway.

That potentially runs into problems with containerized installs, so this is probably better.

}
dockerClient := dockertools.ConnectToDockerOrDie(endpoint, time.Minute, time.Minute)

// Wait until docker has restarted since kubelet will exit it docker isn't running
Review comment (Contributor):

"if"

return fmt.Sprintf("%s/%s", pod.Namespace, pod.Name)
}

func dockerSandboxNameToInfraPodNamePrefix(name string) (string, error) {
Review comment (Contributor):

this is clearly copied from somewhere in the guts of k8s and should probably indicate where, so if it breaks in the future we know where to look
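
For context on what the copied helper parses: dockershim names docker containers /k8s_<container>_<podname>_<namespace>_<poduid>_<attempt>, with the literal container name "POD" for the sandbox/infra container. An illustrative reimplementation of that parsing (not the kubelet code the diff borrows from):

package sdnsketch

import (
    "fmt"
    "strings"
)

// sandboxNameToInfraPrefix takes a dockershim sandbox container name such as
//   /k8s_POD_mypod_myns_0123abcd-..._0
// and rebuilds the stable "/k8s_POD_<podname>_<namespace>_" prefix that can be
// used later to find the pod's infra container again.
func sandboxNameToInfraPrefix(name string) (string, error) {
    parts := strings.Split(strings.TrimPrefix(name, "/"), "_")
    if len(parts) != 6 || parts[0] != "k8s" || parts[1] != "POD" {
        return "", fmt.Errorf("unrecognized dockershim sandbox name %q", name)
    }
    podName, namespace := parts[2], parts[3]
    return fmt.Sprintf("/k8s_POD_%s_%s_", podName, namespace), nil
}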

return nil
}

func (node *OsdnNode) killUpdateFailedPods(pods []kapi.Pod) error {
Review comment (Contributor):

Maybe move all this stuff to a new file? (It shouldn't even need to be part of OsdnNode.)

@dcbw (Contributor, Author) commented Jun 21, 2017

@danwinship updated, and now with a testcase

if err != nil {
return fmt.Errorf("failed to connect to docker after SDN cleanup restart: %v", err)
}); err != nil {
return nil, fmt.Errorf("failed to connect to docker after SDN cleanup restart: %v", err)
Review comment (Contributor):

That error message is not correct for the kill-on-update case

@@ -270,24 +278,38 @@ func (node *OsdnNode) Start() error {
return err
}

var podsToKill []kapi.Pod
if networkChanged {
Review comment (Contributor):

we can only have podsToKill if the network changed so maybe all the killing code should be inside the if body?
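
A sketch of the restructuring being suggested, with the collection and the kill both scoped to the branch that knows the network changed. This is a fragment in the context of the diff excerpt above (kapi, node, and glog come from the surrounding origin code), not standalone code:

if networkChanged {
    var podsToKill []kapi.Pod
    // ... collect running pods whose OVS flows could not be restored ...
    if len(podsToKill) > 0 {
        if err := node.killUpdateFailedPods(podsToKill); err != nil {
            glog.Warningf("failed to restart pods that failed to update on startup: %v", err)
        }
    }
}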

if c.ID == cid.ID {
infraPrefix, err = dockerSandboxNameToInfraPodNamePrefix(c.Names[0])
if err != nil {
return fmt.Errorf("unparsable container ID %q", c.ID)
Review comment (Contributor):

shouldn't you return err?

}
// Find and kill the infra container
for _, c := range containers {
if strings.HasPrefix(c.Names[0], infraPrefix) {
@danwinship (Contributor) commented:

so this seems fragile... do we have some long-term plan to make this code unnecessary or is this expected to stick around forever?

@dcbw (Contributor, Author) replied:

@danwinship the long-term plan is to make kubelet ask for network status on restart, and if some pod fails, kill and restart that pod for us, instead of assuming everything is groovy.

@dcbw (Contributor, Author) commented Jun 21, 2017

@danwinship PTAL, thanks...

@danwinship (Contributor) left a comment:

lgtm

@knobunc (Contributor) left a comment:

I really don't like this, but I guess it is the least bad option available :-(

@knobunc (Contributor) commented Jun 22, 2017

I'll merge this on Monday.

@openshift-bot (Contributor) commented:

[Test]ing while waiting on the merge queue

@deads2k (Contributor) commented Jun 26, 2017

re[test]

// Find and kill the infra container
for _, c := range containers {
if strings.HasPrefix(c.Names[0], infraPrefix) {
if err := docker.StopContainer(c.ID, 10); err != nil {
@smarterclayton (Contributor) commented:

Ignoring grace period is bad - this could result in silent data corruption for some containers.

@dcbw (Contributor, Author) replied:

@smarterclayton this matches what dockershim/docker_sandbox.go uses for sandbox containers, which is defaultSandboxGracePeriod int = 10. Do you think we need to honor the grace period for the entire pod while stopping the infra container, rather than using the default sandbox grace period?

@smarterclayton (Contributor) replied:

No, as long as kubernetes will apply graceful on the rest this is acceptable.

[merge]
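
To summarize how this thread resolves: the SDN stops only the sandbox (infra) container, with the same fixed 10-second timeout dockershim itself uses for sandboxes, while kubelet keeps applying each pod's own termination grace period to the application containers it tears down and recreates. Schematically, reusing the illustrative DockerClient from the first sketch (the constant name comes from dcbw's comment; the function is a stand-in):

package sdnsketch

// defaultSandboxGracePeriod mirrors the constant dockershim uses when stopping
// sandbox (infra) containers.
const defaultSandboxGracePeriod int = 10

// stopInfraContainer stops only the pod's infra container; the application
// containers are left to kubelet, which honors the pod's grace period when it
// rebuilds the sandbox.
func stopInfraContainer(docker DockerClient, infraContainerID string) error {
    return docker.StopContainer(infraContainerID, defaultSandboxGracePeriod)
}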

@openshift deleted a comment from knobunc on Jun 27, 2017
@smarterclayton (Contributor) commented:

Removing merge briefly while comment is addressed.

@dcbw (Contributor, Author) commented Jun 27, 2017

re-[merge]

@smarterclayton (Contributor) commented:

[test]

@dcbw (Contributor, Author) commented Jun 28, 2017

re-[test] flake #14929

@openshift-bot (Contributor) commented:

Evaluated for origin test up to fd6dee8

@openshift-bot (Contributor) commented:

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2740/) (Base Commit: a67ac87) (PR Branch Commit: fd6dee8)

@danwinship (Contributor) commented:

flake #14043, [merge]

@openshift-bot (Contributor) commented Jun 28, 2017

continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 2

@dcbw (Contributor, Author) commented Jun 28, 2017

flake #14043, [merge]

@openshift-bot (Contributor) commented:

Evaluated for origin merge up to fd6dee8

@smarterclayton merged commit b8eb455 into openshift:master on Jun 28, 2017
openshift-merge-robot added a commit that referenced this pull request Aug 7, 2017
Automatic merge from submit-queue

sdn: move sandbox kill-on-failed-update to runtime socket from direct docker

Follow-on to #14892 that reworks #14665 to use the new runtime socket stuff.

@openshift/networking @danwinship @pravisankar