
kubelet won't retry PodSandbox creation for pods with restart policy "Never" #79398

Closed
yujuhong opened this issue Jun 26, 2019 · 5 comments · Fixed by #79451
Labels
  • kind/bug — Categorizes issue or PR as related to a bug.
  • priority/important-soon — Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/node — Categorizes an issue or PR as relevant to SIG Node.
Milestone
v1.16

Comments

@yujuhong
Contributor

yujuhong commented Jun 26, 2019

What happened:

If a pod's restart policy is "Never" and kubelet fails to create the PodSandbox (for whatever reason) even once, the pod will be stuck in "ContainerCreating" forever.

What you expected to happen:
Since the user/application containers were never created, kubelet should retry the PodSandbox creation.

The fact that the pod remains in a ContainerCreating state while kubelet explicitly does nothing to resolve the situation also confuses the user.

How to reproduce it (as minimally and precisely as possible):
I don't have an easy way to trigger a PodSandbox creation failure off the top of my head, but it should be doable by messing with the CNI binaries/configuration.

Anything else we need to know?:

Relevant code snippet:
https://github.com/kubernetes/kubernetes/blob/v1.16.0-alpha.0/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L474-L482
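
For context, the guard in that snippet boils down to roughly the following. This is a simplified, self-contained sketch (the names shouldRestartOnFailure, attempt, and CreateSandbox mirror the linked code; everything else is illustrative only), not the actual kubelet source:

// Hypothetical, simplified model of the sandbox-creation decision in the
// linked computePodActions snippet; see the link above for the real code.
package main

import "fmt"

type podActions struct {
    CreateSandbox bool
}

// shouldRestartOnFailure is false only when the restart policy is "Never".
func shouldRestartOnFailure(restartPolicy string) bool {
    return restartPolicy != "Never"
}

func computeSandboxAction(restartPolicy string, attempt uint32) podActions {
    if !shouldRestartOnFailure(restartPolicy) && attempt != 0 {
        // Restart policy is "Never" and a sandbox creation attempt was
        // already made: give up, even if no user container ever started.
        return podActions{CreateSandbox: false}
    }
    return podActions{CreateSandbox: true}
}

func main() {
    // The first sandbox creation failed, so attempt is now 1: kubelet gives
    // up and the pod stays in ContainerCreating forever.
    fmt.Println(computeSandboxAction("Never", 1)) // prints {false}
}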

Environment:

  • Kubernetes version (use kubectl version): I only checked the master branch (1.16), but all supported release branches should be affected
  • Cloud provider or hardware configuration: not relevant
  • OS (e.g: cat /etc/os-release): not relevant
  • Kernel (e.g. uname -a): not relevant
  • Install tools:
  • Network plugin and version (if this is a network-related bug): not relevant; though the initial PodSandbox creation failure could be caused by networking issues.
  • Others:
@yujuhong yujuhong added kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 26, 2019
@mattjmcnaughton
Contributor

@yujuhong thanks for posting!

I'm looking at the code, and it seems like one fix could be checking if this pod has an associated sandbox - if it does, we want to respect the current behavior. If it doesn't, we want to attempt to create the sandbox.

I was thinking something similar to:

if createPodSandbox {
    // Look up any existing sandbox for this pod; getSandboxIDByPodUID
    // returns the sandbox IDs associated with the pod UID.
    sandboxIDs, _ := m.getSandboxIDByPodUID(pod.UID, nil)
    if !shouldRestartOnFailure(pod) && sandboxIDs != nil {
        // A sandbox already exists and the pod must not be restarted:
        // keep the current behavior.
        ...
    }
    // Otherwise, attempt to create the sandbox.
    ...
}

Thoughts? Is there a simpler fix?

@edmorley

Hi! Thank you for filing this - it matches the behaviour we've been seeing.

Am I correct that this is a regression from 5f473bc (#68980), and so only affects Kubernetes >= 1.13?

@yujuhong yujuhong added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 26, 2019
@yujuhong yujuhong added this to the v1.16 milestone Jun 26, 2019
@yujuhong
Contributor Author

> I'm looking at the code, and it seems like one fix could be checking if this pod has an associated sandbox - if it does, we want to respect the current behavior. If it doesn't, we want to attempt to create the sandbox.

Your solution may not work for the one specific failure I checked, where the PodSandbox container was created but the call to the CNI plugin failed (for some reason; it could have been caused by a race condition). This left us with a not-ready PodSandbox that had no IP.

We probably have to determine whether the user containers were ever created before. I need to check the code more closely, though.
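
For illustration, a rough sketch of that idea, assuming the "were user containers ever created" signal can be read from the pod status (this is hypothetical and not the actual patch; the real fix landed via #79451):

// Hypothetical sketch only; podStatusHasContainers stands in for a check
// such as "does the pod status report any created containers?".
package main

import "fmt"

// skipSandboxRecreation reports whether kubelet should skip creating a new
// PodSandbox: only when the pod must not be restarted on failure, a sandbox
// creation attempt was already made, and user containers were actually
// created at some point.
func skipSandboxRecreation(restartOnFailure bool, attempt uint32, podStatusHasContainers bool) bool {
    return !restartOnFailure && attempt != 0 && podStatusHasContainers
}

func main() {
    // restartPolicy=Never pod whose sandbox creation failed once and whose
    // user containers never ran: sandbox creation would now be retried.
    fmt.Println(skipSandboxRecreation(false, 1, false)) // prints false -> retry
}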

@yujuhong
Contributor Author

> Am I correct that this is a regression from 5f473bc (#68980), and so only affects Kubernetes >= 1.13?

Thanks. That's probably correct. I saw this happening in 1.13.

@mattjmcnaughton
Contributor

mattjmcnaughton commented Jun 27, 2019 via email
