Pod that failed to bind, stuck in Pending state forever #49314

Closed
alena1108 opened this Issue Jul 20, 2017 · 23 comments


alena1108 commented Jul 20, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

Pod got stuck in Pending state forever when failed to bind to host.

What you expected to happen:

For it to be rescheduled on another host

How to reproduce it (as minimally and precisely as possible):

Not always reproducible. I simply created an RC with replicas=2 on a 3-host setup, and one of the pods got stuck in Pending state. The following error messages were found in the scheduler logs:

E0720 15:30:40.435703       1 scheduler.go:282] Internal error binding pod: (scheduler cache ForgetPod failed: pod test-677550-rc-edit-namespace/nginx-jvn09 state was assumed on a different node)
W0720 15:31:11.119885       1 cache.go:371] Pod test-677550-rc-edit-namespace/nginx-jvn09 expired

Describe pod output:

nginx-jvn09   0/1       Pending   0          2h
> kubectl describe pod nginx-jvn09 --namespace=test-677550-rc-edit-namespace
Name:           nginx-jvn09
Namespace:      test-677550-rc-edit-namespace
Node:           /
Labels:         name=nginx
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"test-677550-rc-edit-namespace","name":"nginx","uid":"6313354d-6d60-11e...
Status:         Pending
IP:
Controllers:    ReplicationController/nginx
Containers:
  nginx:
    Image:              nginx
    Port:               80/TCP
    Environment:        <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-vz8x6 (ro)
Volumes:
  default-token-vz8x6:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-vz8x6
    Optional:   false
QoS Class:      BestEffort
Node-Selectors: <none>
Tolerations:    node.alpha.kubernetes.io/notReady=:Exists:NoExecute for 300s
                node.alpha.kubernetes.io/unreachable=:Exists:NoExecute for 300s
Events:         <none>

RC yml:

apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    name: nginx
  template:
    metadata:
      labels:
        name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.7.1
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): RHEL7, Docker native 1.12.6
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
Contributor

k8s-merge-robot commented Jul 20, 2017

@alena1108
There are no sig labels on this issue. Please add a sig label by:

  1. mentioning a sig: @kubernetes/sig-<group-name>-<group-suffix>
    e.g., @kubernetes/contributor-experience-<group-suffix> to notify the contributor experience sig, OR

  2. specifying the label manually: /sig <label>
    e.g., /sig scalability to apply the sig/scalability label

Note: Method 1 will trigger an email to the group. You can find the group list here and label list here.
The <group-suffix> in the method 1 has to be replaced with one of these: bugs, feature-requests, pr-reviews, test-failures, proposals

Member

xiangpengzhao commented Jul 20, 2017

/sig scheduling

Contributor

jianzhangbjz commented Jul 21, 2017

Can you provide detailed steps for this issue, and the output of kubectl describe nodes?

Member

dixudx commented Jul 21, 2017

@alena1108 The log just showed the node was notReady or unreachable.

node.alpha.kubernetes.io/notReady=:Exists:NoExecute for 300s
node.alpha.kubernetes.io/unreachable=:Exists:NoExecute for 300s

It seems the kubelet node is not very stable and its connection with the master keeps getting lost.

alena1108 commented Jul 21, 2017

@jianglingxia @dixudx Another pod of the same RC (replicas=2) was started successfully on one of the nodes. I'd expect at least an attempt to start the failed one again, maybe on the node where its peer is running (as there are no anti-affinity rules defined on it). But it was stuck in Pending state forever (and there was no other reference to the failed pod besides the original log I've attached). I was able to create new RCs of the same kind on the same set of nodes successfully after the pod scheduling failure, so it looks like some nodes were available for allocation. Perhaps there are some other logs I can look at or provide?

@jianglingxia no definite steps. We just run extensive validation tests for Kubernetes where a bunch of RCs/Deployments/Services/Ingress controllers get created, and starting with k8s 1.7.x we began observing this failure. Once we see it again, I'll fetch the node status and update the bug.

dixudx commented Jul 23, 2017

@alena1108 From the log, it seems the scheduler failed to bind your pod to the node and also failed to forget the pod, which made it impossible to reschedule the pod to other nodes.

Would you please append --v=10 when you start kube-scheduler and reproduce this issue? Please also provide the kubelet log of that erroneous node.

alena1108 commented Jul 24, 2017

@moelsayed ^^ could you start the scheduler with the above param next time for the validation test run?

Member

xiangpengzhao commented Jul 24, 2017

/cc @k82cn

moelsayed commented Jul 25, 2017

I ran our tests again, and hit this issue with several pods:

$ kubectl get pods -n test-300885-testingress20
NAME                     READY     STATUS    RESTARTS   AGE
k8testrc20-one-4wqmw     1/1       Running   0          1h
k8testrc20-one-wwzmv     1/1       Running   0          1h
k8testrc20-three-hth69   1/1       Running   0          1h
k8testrc20-three-q56k9   0/1       Pending   0          1h
k8testrc20-two-78r3g     1/1       Running   0          1h
k8testrc20-two-ndjzm     1/1       Running   0          1h
$ kubectl get pods -n test-300885-testingress17
NAME                   READY     STATUS    RESTARTS   AGE
k8testrc17-one-d2cr0   0/1       Pending   0          1h
k8testrc17-one-xq39j   1/1       Running   0          1h
$ kubectl get pods -n test-459920-create-namespace
NAME              READY     STATUS    RESTARTS   AGE
testnginx-9xqk6   0/1       Pending   0          1h
testnginx-gngxf   1/1       Running   0          1h

There were a few more. Since none of them is assigned to a node, I included logs for all kubelets. I also included the scheduler log with --v=1:

kubelet1.log.txt
kubelet2.log.txt
kubelet3.log.txt
sched.log.txt

kubectl describe nodes:

describe_nodes.txt

Contributor

julia-stripe commented Jul 26, 2017

Seeing this as well.

hypothesis: ecb962e#diff-67f2b61521299ca8d8687b0933bbfb19R223 broke the error handling when ForgetPod fails. Before that commit, when ForgetPod failed it would log an error and pass the pod to the error handler (sched.config.Error(pod, err)), which is in charge of retrying scheduling the pod.

After that commit, when ForgetPod fails it skips the error handling and scheduling is never retried.

I'm doing some experiments with patching that and I think there's more to this bug than just that (for example why is ForgetPod failing in the first place? is the scheduler cache corrupted? why?), but that's what I've got so far.
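
To make the hypothesis above concrete, here is a minimal, self-contained sketch (toy types only; the early return and the error-handler call mirror the behavior described above, everything else is illustrative, not the real scheduler code) of why skipping the error handler leaves a pod Pending forever:

```go
package main

import (
	"errors"
	"fmt"
)

// Toy stand-ins for the scheduler pieces discussed above.
type pod struct{ name string }

type config struct{ retryQueue []pod }

// Error plays the role of sched.config.Error: it requeues the pod so
// scheduling is retried on a later pass.
func (c *config) Error(p pod, err error) {
	fmt.Printf("requeueing %s after error: %v\n", p.name, err)
	c.retryQueue = append(c.retryQueue, p)
}

// bindPod models the two behaviors: before the refactor, a ForgetPod failure
// was logged and the pod was still handed to the error handler; after the
// refactor, the function returns early and the pod is never requeued.
func bindPod(c *config, p pod, earlyReturn bool) {
	bindErr := errors.New("binding rejected")
	forgetErr := errors.New("scheduler cache ForgetPod failed")
	fmt.Printf("%s: bind failed (%v), cleanup failed (%v)\n", p.name, bindErr, forgetErr)
	if earlyReturn {
		return // post-refactor: no retry, the pod stays Pending forever
	}
	c.Error(p, bindErr) // pre-refactor: the pod goes back through scheduling
}

func main() {
	c := &config{}
	bindPod(c, pod{"nginx-jvn09"}, true)  // lost
	bindPod(c, pod{"nginx-other"}, false) // retried
	fmt.Println("pods queued for retry:", len(c.retryQueue))
}
```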

Member

yastij commented Jul 26, 2017

/cc

Contributor

jianzhangbjz commented Jul 27, 2017

@moelsayed Can you post the description of each namespace (kubectl describe ns xxx)? Have you enabled ResourceQuota? And what's the config of the scheduler component?

alena1108 commented Jul 27, 2017

@jianglingxia

> kubectl describe namespace test-504053-deploymentsrollback-namespace
Name:           test-504053-deploymentsrollback-namespace
Labels:         <none>
Annotations:    <none>
Status:         Active

No resource quota.

No resource limits.
>

There are no specific parameters passed on scheduler start besides address and cloudconfig.

Contributor

julia-stripe commented Jul 28, 2017

I posted a patch that I believe fixes this issue at #49661 -- for us it has resolved the issue so far, but I'm not able to reproduce it very reliably, so it's a bit hard to check. @alena1108 could you apply #49661 and see if your validation tests pass?

alena1108 commented Jul 28, 2017

@julia-stripe thanks!! We will be able to validate it either today or early next week

Member

dchen1107 commented Aug 3, 2017

@alena1108 Are you able to validate the fix provided by @julia-stripe through #50028? Just saw this issue, and I think it is a serious regression in 1.7 release and we should patch it.

cc/ @kubernetes/kubernetes-release-managers @bsalamat @kubernetes/sig-scheduling-bugs @wojtek-t the patch manager for 1.7 patch release.

alena1108 commented Aug 3, 2017

@dchen1107 our QA have been busy with the current release, and didn't have a chance to validate the fix yet. Hopefully this week.

Patching sounds like a great idea. It is a regression indeed as we haven't hit this issue in the previous versions of k8s with the same set of validation tests.

Member

davidopp commented Aug 3, 2017

Thanks everyone who helped investigate this!

dchen1107 added this to the v1.7 milestone Aug 3, 2017

alena1108 commented Aug 3, 2017

@julia-stripe @dchen1107 just ran validation test against the patched branch - no pods stuck in Pending any more, so the fix did the job! thank you @julia-stripe

k8s-merge-robot added a commit that referenced this issue Aug 4, 2017

Merge pull request #50028 from julia-stripe/fix-incorrect-scheduler-bind-call

Automatic merge from submit-queue

Fix incorrect call to 'bind' in scheduler

I previously submitted #49661 -- I'm not sure if that PR is too big or what, but this is an attempt at a smaller PR that makes progress on the same issue and is easier to review.

**What this PR does / why we need it**:

In this refactor (ecb962e#diff-67f2b61521299ca8d8687b0933bbfb19R223) the scheduler code was refactored into separate `bind` and `assume` functions. When that happened, `bind` was called with `pod` as an argument. The argument to `bind` should be the assumed pod, not the original pod. Evidence that `assumedPod` is the correct argument bind and not `pod`: https://github.com/kubernetes/kubernetes/blob/80f26fa8a89ef5863cb19c71a620bb389d025166/plugin/pkg/scheduler/scheduler.go#L229-L234. (and it says `assumed` in the function signature for `bind`, even though it's not called with the assumed pod as an argument).

This is an issue (and causes #49314, where pods that fail to bind to a node get stuck indefinitely) in the following scenario:

1. The pod fails to bind to the node
2. `bind` calls `ForgetPod` with the `pod` argument
3. since `ForgetPod` is expecting the assumed pod as an argument (because that's what's in the scheduler cache), it fails with an error like `scheduler cache ForgetPod failed: pod test-677550-rc-edit-namespace/nginx-jvn09 state was assumed on a different node`
4. The pod gets lost forever because of some incomplete error handling (which I haven't addressed here in the interest of making a simpler PR)

In this PR I've fixed the call to `bind` and modified the tests to make sure that `ForgetPod` gets called with the correct argument (the assumed pod) when binding fails.

**Which issue this PR fixes**: fixes #49314

**Special notes for your reviewer**:

**Release note**:

```release-note
```
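
As a rough, self-contained illustration of the failure mode described in this PR (toy types, not the real scheduler cache API): the cache records the assumed copy of the pod, which carries the chosen node, so calling ForgetPod with the original object does not match and fails with the "assumed on a different node" error.

```go
package main

import (
	"errors"
	"fmt"
)

type pod struct {
	name string
	node string // empty on the original object, set on the assumed copy
}

// cache mimics the relevant bit of the scheduler cache: it remembers which
// node each pod was assumed on.
type cache struct{ assumedNode map[string]string }

func (c *cache) AssumePod(p pod) { c.assumedNode[p.name] = p.node }

// ForgetPod only succeeds when called with the pod exactly as it was assumed.
func (c *cache) ForgetPod(p pod) error {
	if c.assumedNode[p.name] != p.node {
		return errors.New("pod " + p.name + " state was assumed on a different node")
	}
	delete(c.assumedNode, p.name)
	return nil
}

func main() {
	c := &cache{assumedNode: map[string]string{}}

	original := pod{name: "nginx-jvn09"}                // node still unset, like "Node: /" above
	assumed := pod{name: "nginx-jvn09", node: "host-1"} // copy produced by assume()

	c.AssumePod(assumed)

	// Binding fails; cleanup must use the assumed copy, not the original pod.
	fmt.Println("ForgetPod(original):", c.ForgetPod(original)) // fails, cache entry leaks
	fmt.Println("ForgetPod(assumed): ", c.ForgetPod(assumed))  // succeeds
}
```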

k8s-merge-robot added a commit that referenced this issue Aug 7, 2017

Merge pull request #50106 from julia-stripe/improve-scheduler-error-handling

Automatic merge from submit-queue

Retry scheduling pods after errors more consistently in scheduler

**What this PR does / why we need it**:

This fixes 2 places in the scheduler where pods can get stuck in Pending forever.  In both these places, errors happen and `sched.config.Error` is not called afterwards. This is a problem because `sched.config.Error` is responsible for requeuing pods to retry scheduling when there are issues (see [here](https://github.com/kubernetes/kubernetes/blob/2540b333b263c9c2a127395acecdef2eeb716a8e/plugin/pkg/scheduler/factory/factory.go#L958)), so if we don't call `sched.config.Error` then the pod will never get scheduled (unless the scheduler is restarted).

One of these (where it returns when `ForgetPod` fails instead of continuing and reporting an error) is a regression from [this refactor](ecb962e#diff-67f2b61521299ca8d8687b0933bbfb19L234), and with the [old behavior](https://github.com/kubernetes/kubernetes/blob/80f26fa8a89ef5863cb19c71a620bb389d025166/plugin/pkg/scheduler/scheduler.go#L233-L237) the error was reported correctly. As far as I can tell changing the error handling in that refactor wasn't intentional.

When AssumePod fails there's never been an error reported but I think adding this will help the scheduler recover when something goes wrong instead of letting pods possibly never get scheduled.

This will help prevent issues like #49314 in the future.

**Release note**:

```release-note
Fix incorrect retry logic in scheduler
```
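
And a small sketch of the post-fix shape of the scheduling loop (again with toy types; the only idea taken from the PR above is that every failure path must report through the error handler so the pod gets requeued):

```go
package main

import (
	"errors"
	"fmt"
)

type pod struct{ name string }

type config struct{ queue []pod }

// Error stands in for sched.config.Error, which requeues the pod for retry.
func (c *config) Error(p pod, err error) {
	fmt.Printf("requeue %s: %v\n", p.name, err)
	c.queue = append(c.queue, p)
}

// scheduleOne shows the intended shape after the fix: both an assume failure
// and a bind failure are reported, so no pod is silently dropped.
func scheduleOne(c *config, p pod, assumeErr, bindErr error) {
	if assumeErr != nil {
		c.Error(p, assumeErr)
		return
	}
	if bindErr != nil {
		// Even if cache cleanup (ForgetPod) also fails here, the bind error
		// is still reported so the pod goes back through the scheduling queue.
		c.Error(p, bindErr)
		return
	}
	fmt.Printf("%s bound\n", p.name)
}

func main() {
	c := &config{}
	scheduleOne(c, pod{"pod-a"}, errors.New("assume failed"), nil)
	scheduleOne(c, pod{"pod-b"}, nil, errors.New("bind failed"))
	scheduleOne(c, pod{"pod-c"}, nil, nil)
	fmt.Println("pods queued for retry:", len(c.queue))
}
```
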
guillelb commented Sep 22, 2017

@julia-stripe, I have this problem with k8s v1.7.0. Can I fix it by upgrading to v1.7.x?

Thank you!

z-oo commented Feb 2, 2018

I am seeing this in 1.9. What steps do I need to take to solve this problem?

$ kubectl describe pod k8s-claim-test

Name:                           k8s-claim-test
Namespace:                      default
Node:                           simulator4/10.1.1.114
Start Time:                     Thu, 01 Feb 2018 16:44:58 -0800
Labels:                         <none>
Annotations:                    <none>
Status:                         Terminating (expires Thu, 01 Feb 2018 18:27:26 -0800)
Termination Grace Period:       0s
IP:                             172.16.0.89
Controllers:                    <none>
Containers:
  k8s-claim-test:
    Container ID:       docker://d2dc1aac6fcf775b74640b995d09108dcb5bd83916ce72b3a6f9296464fdb149
    Image:              kubernetes/pause
    Image ID:           docker-pullable://kubernetes/pause@sha256:2088df8eb02f10aae012e6d4bc212cabb0ada93cb05f09e504af0c9811e0ca14
    Port:               
    Command:
      sleep
      60000
    State:              Terminated
      Exit Code:        0
      Started:          Mon, 01 Jan 0001 00:00:00 +0000
      Finished:         Mon, 01 Jan 0001 00:00:00 +0000
    Last State:         Terminated
      Reason:           ContainerCannotRun
      Message:          OCI runtime create failed: container_linux.go:296: starting container process caused "exec: \"sleep\": executable file not found in $PATH": unknown
      Exit Code:        127
      Started:          Thu, 01 Feb 2018 16:48:14 -0800
      Finished:         Thu, 01 Feb 2018 16:48:14 -0800
    Ready:              False
    Restart Count:      3
    Environment:        <none>
    Mounts:
      /mnt/rbd from k8s (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-8qmfl (ro)
Conditions:
  Type          Status
  Initialized   True 
  Ready         False 
  PodScheduled  True 
Volumes:
  k8s:
    Type:               RBD (a Rados Block Device mount on the host that shares a pod's lifetime)
    CephMonitors:       [10.1.1.41:6789 10.1.1.253:6789 10.1.0.65:6789]
    RBDImage:           fs
    FSType:             ext4
    RBDPool:            k8s
    RadosUser:          k8s
    Keyring:            /etc/ceph/keyring
    SecretRef:          &{woodenbox-k8s}
    ReadOnly:           true
  default-token-8qmfl:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-8qmfl
    Optional:   false
QoS Class:      BestEffort
Node-Selectors: <none>
Tolerations:    node.kubernetes.io/not-ready=:Exists:NoExecute for 300s
                node.kubernetes.io/unreachable=:Exists:NoExecute for 300s
Events:         <none>
obriensystems commented Mar 12, 2018

I am seeing this periodically on Azure VMs

Member

jdumars commented Mar 12, 2018

@obriensystems any chance you could open a new issue linked to this one with the requisite troubleshooting info (versions, etc.) and label it with /sig azure

Thanks!
