Pod that failed to bind, stuck in Pending state forever #49314
Comments
k8s-ci-robot added the kind/bug label Jul 20, 2017

@alena1108 Note: Method 1 will trigger an email to the group. You can find the group list here and label list here.
k8s-merge-robot added the needs-sig label Jul 20, 2017

/sig scheduling
k8s-ci-robot added the sig/scheduling label Jul 20, 2017

k8s-merge-robot removed the needs-sig label Jul 20, 2017
Can you provide the detailed steps for this issue? And the output of …
@alena1108 The log just showed the node was NotReady or unreachable. It seems the kubelet node is not very stable, as its connection with the master keeps getting lost.
alena1108 commented Jul 21, 2017 (edited)
@jianglingxia @dixudx Another pod of the same RC (replicas=2) was started successfully on one of the nodes. I'd expect the failed one to at least be attempted again, maybe on the node where its peer is running (as there are no anti-affinity rules defined on it). But it was stuck in Pending state forever (and there was no other reference to the failed pod besides the original log I've attached). I was able to create new RCs of the same kind on the same set of nodes successfully after the Pod scheduling failure, so it looks like some nodes were available for allocation. Perhaps there are some other logs I can look at / provide?

@jianglingxia No definite steps. We just run extensive validation tests for Kubernetes where a bunch of RCs/Deployments/Services/Ingress controllers get created, and starting with k8s 1.7.x we began observing this failure. Once we see it again, I'll fetch the node status and update the bug.
moelsayed referenced this issue in rancher/rancher Jul 22, 2017: Pods get stuck in Pending state. #9433 (Closed)
@alena1108 From the log, it seemed that … Would you please append …?
alena1108 commented Jul 24, 2017
@moelsayed ^^ Could you start the scheduler with the above param next time for the validation test run?
/cc @k82cn
moelsayed commented Jul 25, 2017
I ran our tests again and hit this issue with several pods: … There were a few more. Since none of them is assigned to a node, I included logs for all kubelets. I also included the scheduler log with kubelet1.log.txt.
Seeing this as well. Hypothesis: ecb962e#diff-67f2b61521299ca8d8687b0933bbfb19R223 broke the error handling when … After that commit, when … I'm doing some experiments with patching that, and I think there's more to this bug than just that (for example, why is …).
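If the hypothesis above is right, the failure mode is that a binding/scheduling error path stops going through the scheduler's error callback, so the failed pod is never put back into the queue for another attempt. Below is a minimal, self-contained Go sketch of that idea; every name here (pod, scheduler, handleError, the errorHandlingBroken flag) is a hypothetical stand-in for illustration, not the real kube-scheduler API.

```go
package main

// Sketch of why a skipped error handler leaves a pod Pending forever:
// the pod is popped off the queue before each attempt, and only the
// error handler puts it back for a retry.

import (
	"errors"
	"fmt"
)

type pod struct{ name string }

type scheduler struct {
	queue []*pod // pods waiting for a scheduling attempt
}

// pop removes the next pod from the queue, mirroring how a scheduling
// loop takes ownership of a pod for one attempt.
func (s *scheduler) pop() *pod {
	p := s.queue[0]
	s.queue = s.queue[1:]
	return p
}

// handleError stands in for the scheduler's error callback: it requeues
// the pod so it gets another scheduling attempt later.
func (s *scheduler) handleError(p *pod, err error) {
	fmt.Printf("scheduling %s failed: %v; requeueing for retry\n", p.name, err)
	s.queue = append(s.queue, p)
}

// scheduleOne tries to bind one pod. If a failure path skips handleError
// (the behaviour this issue describes), the pod is silently dropped and
// stays Pending forever.
func (s *scheduler) scheduleOne(bind func(*pod) error, errorHandlingBroken bool) {
	p := s.pop()
	if err := bind(p); err != nil {
		if errorHandlingBroken {
			// Regression behaviour: the pod is forgotten, never retried.
			fmt.Printf("scheduling %s failed: %v; NOT requeued\n", p.name, err)
			return
		}
		s.handleError(p, err)
		return
	}
	fmt.Printf("%s bound successfully\n", p.name)
}

func main() {
	failingBind := func(p *pod) error { return errors.New("node unreachable") }

	broken := &scheduler{queue: []*pod{{name: "web-1"}}}
	broken.scheduleOne(failingBind, true)
	fmt.Println("broken scheduler, pods left to retry:", len(broken.queue)) // 0: stuck Pending

	fixed := &scheduler{queue: []*pod{{name: "web-2"}}}
	fixed.scheduleOne(failingBind, false)
	fmt.Println("fixed scheduler, pods left to retry:", len(fixed.queue)) // 1: will be retried
}
```

Under that assumption, a fix amounts to routing every failure path back through the requeue handler, which matches the direction of the later change titled "Retry scheduling pods after errors more consistently in scheduler" (#50106).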
/cc
julia-stripe referenced this issue Jul 26, 2017: Fix pods stuck in Pending state forever #49661 (Closed)
@moelsayed Can you post the description of … per namespace? (…)
alena1108 commented Jul 27, 2017

There are no specific parameters passed on the scheduler start besides address and cloudconfig.
I posted a patch that I believe fixes this issue at #49661 -- for us it has resolved the issue so far, but I'm not able to reproduce it very reliably, so it's a bit hard to check. @alena1108 could you apply #49661 and see if your validation tests pass?
alena1108 commented Jul 28, 2017

@julia-stripe thanks!! We will be able to validate it either today or early next week.
This was referenced Jul 31, 2017
@alena1108 Are you able to validate the fix provided by @julia-stripe through #50028? Just saw this issue, and I think it is a serious regression in the 1.7 release and we should patch it. cc/ @kubernetes/kubernetes-release-managers @bsalamat @kubernetes/sig-scheduling-bugs @wojtek-t, the patch manager for the 1.7 patch release.
alena1108 commented Aug 3, 2017

@dchen1107 our QA has been busy with the current release and didn't have a chance to validate the fix yet. Hopefully this week. Patching sounds like a great idea. It is a regression indeed, as we haven't hit this issue in previous versions of k8s with the same set of validation tests.
Thanks everyone who helped investigate this!
dchen1107 added this to the v1.7 milestone Aug 3, 2017

dchen1107 added the status/approved-for-milestone label Aug 3, 2017
alena1108 commented Aug 3, 2017

@julia-stripe @dchen1107 just ran the validation tests against the patched branch - no pods stuck in Pending anymore, so the fix did the job! Thank you @julia-stripe.
julia-stripe referenced this issue Aug 3, 2017: Retry scheduling pods after errors more consistently in scheduler #50106 (Merged)
k8s-merge-robot closed this in #50028 Aug 4, 2017
A commit that referenced this issue was added Aug 4, 2017.

A commit that referenced this issue was added Aug 7, 2017.
guillelb commented Sep 22, 2017

@julia-stripe, I have this problem with k8s v1.7.0. Thank you!
alena1108 commented Jul 20, 2017 (edited 1 time)
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
Pod got stuck in Pending state forever when it failed to bind to a host.
What you expected to happen:
For it to be rescheduled on another host
How to reproduce it (as minimally and precisely as possible):
Not always reproducible, but I simply created an RC with replicas=2 on a 3-host setup, and one of the Pods got stuck in Pending state for a while. The following error messages were found in the scheduler logs:
Describe pod output:
RC yml:
Anything else we need to know?:
Environment:
- Kubernetes version (kubectl version): 1.7.1
- Kernel (uname -a):