
docker kill hangs, pod stuck in terminating #25456

Closed
bprashanth opened this issue May 11, 2016 · 47 comments
Labels
area/docker priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.


@bprashanth
Contributor

I had a pod stuck in terminating. Similar to some other issues; I didn't check for an exact dupe.

Logged into the node and debugged a little (container cde46198ade6):

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker kill cde46198ade6
...
Hung for like 5m
^C
beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker ps
CONTAINER ID        IMAGE                                                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
cde46198ade6        erkules/galera:basic                                                   "/entrypoint.sh --def"   7 minutes ago       Up 7 minutes                            k8s_mysql.47396615_mysql-2_e2e-tests-petset-hy8ki_e8e98ddf-172d-11e6-b810-42010af00002_36392735

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker kill cde46198ade6
Error response from daemon: Cannot kill container cde46198ade6: [2] Container does not exist: container destroyed
Error: failed to kill containers: [cde46198ade6]

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker ps
CONTAINER ID        IMAGE                                                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
cde46198ade6        erkules/galera:basic                                                   "/entrypoint.sh --def"   7 minutes ago       Up 7 minutes   

Exec still works after the failed kill:

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker exec -it cde46198ade6 /bin/bash
root@mysql-2:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  1.5  0.0  18148  3096 ?        Ss   04:17   0:00 /bin/bash
root        15  0.0  0.0  15572  2176 ?        R+   04:17   0:00 ps aux

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker kill cde46198ade6
Error response from daemon: Cannot kill container cde46198ade6: [2] Container does not exist: container destroyed
Error: failed to kill containers: [cde46198ade6]

Inspect shows Running true:

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker inspect cde46198ade6
[
{
    "Id": "cde46198ade62617e84cc59987669d4e674e83475259680d1952fa60ff90565c",
    "Created": "2016-05-11T04:07:54.627038551Z",
    "Path": "/entrypoint.sh",
    "Args": [
        "--defaults-file=/etc/mysql/my-galera.cnf",
        "--user=root"
    ],
    "State": {
        "Status": "running",
        "Running": true,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 27514,
        "ExitCode": 0,
        "Error": "",
        "StartedAt": "2016-05-11T04:07:54.839755198Z",
        "FinishedAt": "0001-01-01T00:00:00Z"
    },
    "Image": "7108a4321e9900675ba193af33555d0354ab66fc72ff592ae2acd38191db488a",
    "ResolvConfPath": "/var/lib/docker/containers/e3ddae18879c2d6723dd960fecaf32633c726e283a747a5171822622b0ca5236/resolv.conf",
    "HostnamePath": "/var/lib/docker/containers/e3ddae18879c2d6723dd960fecaf32633c726e283a747a5171822622b0ca5236/hostname",
    "HostsPath": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/etc-hosts",
    "LogPath": "/var/lib/docker/containers/cde46198ade62617e84cc59987669d4e674e83475259680d1952fa60ff90565c/cde46198ade62617e84cc59987669d4e674e83475259680d1952fa60ff90565c-json.log",
    "Name": "/k8s_mysql.47396615_mysql-2_e2e-tests-petset-hy8ki_e8e98ddf-172d-11e6-b810-42010af00002_36392735",
    "RestartCount": 0,
    "Driver": "aufs",
    "ExecDriver": "native-0.2",
    "MountLabel": "",
    "ProcessLabel": "",
    "AppArmorProfile": "",
    "ExecIDs": null,
    "HostConfig": {
        "Binds": [
            "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~gce-pd/pv-gce-qf9p5:/var/lib/",
            "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~empty-dir/config:/etc/mysql",
            "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~secret/default-token-ksbzc:/var/run/secrets/kubernetes.io/serviceaccount:ro",
            "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/etc-hosts:/etc/hosts",
            "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/containers/mysql/36392735:/dev/termination-log"
        ],
        "ContainerIDFile": "",
        "LxcConf": null,
        "Memory": 0,
        "MemoryReservation": 0,
        "MemorySwap": -1,
        "KernelMemory": 0,
        "CpuShares": 2,
        "CpuPeriod": 0,
        "CpusetCpus": "",
        "CpusetMems": "",
        "CpuQuota": 0,
        "BlkioWeight": 0,
        "OomKillDisable": false,
        "MemorySwappiness": null,
        "Privileged": false,
        "PortBindings": null,
        "Links": null,
        "PublishAllPorts": false,
        "Dns": null,
        "DnsOptions": null,
        "DnsSearch": null,
        "ExtraHosts": null,
        "VolumesFrom": null,
        "Devices": null,
        "NetworkMode": "container:e3ddae18879c2d6723dd960fecaf32633c726e283a747a5171822622b0ca5236",
        "IpcMode": "container:e3ddae18879c2d6723dd960fecaf32633c726e283a747a5171822622b0ca5236",
        "PidMode": "",
        "UTSMode": "",
        "CapAdd": null,
        "CapDrop": null,
        "GroupAdd": null,
        "RestartPolicy": {
            "Name": "",
            "MaximumRetryCount": 0
        },
        "SecurityOpt": null,
        "ReadonlyRootfs": false,
        "Ulimits": null,
        "LogConfig": {
            "Type": "json-file",
            "Config": {}
        },
        "CgroupParent": "/",
        "ConsoleSize": [
            0,
            0
        ],
        "VolumeDriver": ""
    },
    "GraphDriver": {
        "Name": "aufs",
        "Data": null
    },
    "Mounts": [
        {
            "Source": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~secret/default-token-ksbzc",
            "Destination": "/var/run/secrets/kubernetes.io/serviceaccount",
            "Mode": "ro",
            "RW": false
        },
        {
            "Source": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/etc-hosts",
            "Destination": "/etc/hosts",
            "Mode": "",
            "RW": true
        },
        {
            "Source": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/containers/mysql/36392735",
            "Destination": "/dev/termination-log",
            "Mode": "",
            "RW": true
        },
        {
            "Source": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~gce-pd/pv-gce-qf9p5",
            "Destination": "/var/lib",
            "Mode": "",
            "RW": true
        },
        {
            "Source": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~empty-dir/config",
            "Destination": "/etc/mysql",
            "Mode": "",
            "RW": true
        }
    ],
    "Config": {
        "Hostname": "mysql-2",
        "Domainname": "",
        "User": "",
        "AttachStdin": false,
        "AttachStdout": false,
        "AttachStderr": false,
        "ExposedPorts": {
            "3306/tcp": {}
        },
        "Tty": false,
        "OpenStdin": false,
        "StdinOnce": false,
        "Env": [
            "KUBERNETES_PORT_443_TCP_PROTO=tcp",
            "KUBERNETES_PORT_443_TCP_PORT=443",
            "KUBERNETES_PORT_443_TCP_ADDR=10.0.0.1",
            "KUBERNETES_SERVICE_HOST=10.0.0.1",
            "KUBERNETES_SERVICE_PORT=443",
            "KUBERNETES_SERVICE_PORT_HTTPS=443",
            "KUBERNETES_PORT=tcp://10.0.0.1:443",
            "KUBERNETES_PORT_443_TCP=tcp://10.0.0.1:443",
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "VERSION=20160303",
            "DEBIAN_FRONTEND=noninteractive"
        ],
        "Cmd": [
            "--defaults-file=/etc/mysql/my-galera.cnf",
            "--user=root"
        ],
        "Image": "erkules/galera:basic",
        "Volumes": null,
        "WorkingDir": "",
        "Entrypoint": [
            "/entrypoint.sh"
        ],
        "OnBuild": null,
        "Labels": {
            "io.kubernetes.container.hash": "47396615",
            "io.kubernetes.container.name": "mysql",
            "io.kubernetes.container.restartCount": "0",
            "io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
            "io.kubernetes.pod.name": "mysql-2",
            "io.kubernetes.pod.namespace": "e2e-tests-petset-hy8ki",
            "io.kubernetes.pod.terminationGracePeriod": "30",
            "io.kubernetes.pod.uid": "e8e98ddf-172d-11e6-b810-42010af00002"
        }
    },
    "NetworkSettings": {
        "Bridge": "",
        "SandboxID": "",
        "HairpinMode": false,
        "LinkLocalIPv6Address": "",
        "LinkLocalIPv6PrefixLen": 0,
        "Ports": null,
        "SandboxKey": "",
        "SecondaryIPAddresses": null,
        "SecondaryIPv6Addresses": null,
        "EndpointID": "",
        "Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "IPAddress": "",
        "IPPrefixLen": 0,
        "IPv6Gateway": "",
        "MacAddress": "",
        "Networks": null
    }
}
]

But the pid isn't around:

beeps@e2e-test-beeps-minion-cfl3:~$ ps aux | grep 27514
beeps    31717  0.0  0.0   7852  1948 pts/1    S+   04:20   0:00 grep 27514
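
For the record, a quicker way to cross-check the daemon's recorded PID against the kernel (a sketch; output elided):

# Ask the daemon which PID it has recorded for the container,
# then ask the kernel whether that PID is actually alive.
pid=$(sudo docker inspect --format '{{.State.Pid}}' cde46198ade6)
sudo kill -0 "$pid" 2>/dev/null || echo "pid $pid is gone; the daemon state is stale"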

And the pod remains in terminating:

21:16:48-beeps~/goproj/src/k8s.io/kubernetes] (petset_e2e)$ kn get po
NAME      READY     STATUS        RESTARTS   AGE
mysql-2   0/1       Terminating   0          9m

This is on docker 1.9.1; maybe it's already fixed in a newer release?

beeps@e2e-test-beeps-minion-cfl3:~$ docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64

@kubernetes/goog-node

@bprashanth bprashanth added the sig/node Categorizes an issue or PR as relevant to SIG Node. label May 11, 2016
@bprashanth
Contributor Author

Oh, and the pod was in Running/Ready before I killed it.

@yujuhong
Contributor

yujuhong commented May 11, 2016

This looks similar to #21751 (comment), in which an exec during docker stop may have caused an inconsistent state.
There is a possible explanation in the original docker bug: moby/moby#18758 (comment)
What's the pod spec? Were there any health checks using exec, or any other exec calls to the container?

There have also been some new docker issues about "Container does not exist: container destroyed": moby/moby#12738
Users have reported seeing the same issue in 1.10 as well. Docker 1.11 may be different.

@bprashanth
Contributor Author

The pod stuck around for 13 hours till I restarted docker.

What's the pod spec? Were there any health checks using exec, or any other exec calls to the container?

Yes.

kind: PetSet
metadata:
  name: mysql
spec:
  serviceName: "galera"
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
        pod.alpha.kubernetes.io/init-containers: '[
            {
                "name": "install",
                "image": "bprashanth/galera-install:0.1",
                "imagePullPolicy": "Always",
                "args": ["--work-dir=/work-dir"],
                "volumeMounts": [
                    {
                        "name": "workdir",
                        "mountPath": "/work-dir"
                    },
                    {
                        "name": "config",
                        "mountPath": "/etc/mysql"
                    }
                ]
            },
            {
                "name": "bootstrap",
                "image": "debian:jessie",
                "command": ["/work-dir/peer-finder"],
                "args": ["-on-start=\"/work-dir/on-start.sh\"", "-service=galera"],
                "env": [
                  {
                      "name": "POD_NAMESPACE",
                      "valueFrom": {
                          "fieldRef": {
                              "apiVersion": "v1",
                              "fieldPath": "metadata.namespace"
                          }
                      }
                   }
                ],
                "volumeMounts": [
                    {
                        "name": "workdir",
                        "mountPath": "/work-dir"
                    },
                    {
                        "name": "config",
                        "mountPath": "/etc/mysql"
                    }
                ]
            }
        ]'
    spec:
      containers:
      - name: mysql
        image: erkules/galera:basic
        ports:
        - containerPort: 3306
          name: mysql
        - containerPort: 4444
          name: sst
        - containerPort: 4567
          name: replication
        - containerPort: 4568
          name: ist
        args:
        - --defaults-file=/etc/mysql/my-galera.cnf
        - --user=root
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - "mysql -u root -e 'show databases;'"
          initialDelaySeconds: 15
          timeoutSeconds: 5
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/
        - name: config
          mountPath: /etc/mysql
      volumes:
      - name: config
        emptyDir: {}
      - name: workdir
        emptyDir: {}
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

Is there a workaround or do we have to swap docker exec for the http-exec-bridge?

@yujuhong
Contributor

We should check if docker 1.11 fixed this issue (#23397 (comment)).

Besides that, we may be able to add extra plumbing in kubelet to always stop the probers (and stop receiving exec requests) before killing the container, although this doesn't stop users from running exec directly on the node.

/cc @kubernetes/sig-node

@bprashanth
Contributor Author

Actually this probe is gross with http-exec, because I'd need to install mysql and mount my database in the http-exec sidecar. Can't we use nsenter for the exec probe instead of docker exec, if it's available on the host?

@yujuhong
Contributor

yujuhong commented May 11, 2016

The exec handler is configurable, e.g., "--docker-exec-handler=nsenter". The default is docker exec. There were more discussions in the bug you filed previously: #6342

Maybe we should reconsider using nsenter as default. @vishh @dchen1107, what do you think?

EDIT: of course we should verify whether nsenter would cause the same problem for docker.
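
For reference, a minimal sketch of what switching the handler looks like on the kubelet command line (the non-handler flags here are illustrative placeholders):

# Run the kubelet with the nsenter-based exec handler instead of the
# default "native" handler, which shells out to docker exec.
kubelet \
  --docker-exec-handler=nsenter \
  --kubeconfig=/var/lib/kubelet/kubeconfig  # remaining flags elided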

@vishh
Contributor

vishh commented May 11, 2016

AFAIK the issue with nsenter was that it is an extra dependency that we had to ship with kubelet somehow. What happens if we issue a docker start once kill fails, but inspect and ps work?

@bprashanth
Contributor Author

I'd much rather have an extra dependency and predictable outcomes. Docker exec has just proven to be unstable, repeatedly. Wedging a database in an unknown state is terrible for consistency, and I doubt anyone is going to read the fine print that describes the effects of exec probes on dying containers.

The exec handler is configurable, e.g., "--docker-exec-handler=nsenter". The default is docker exec. There were more discussions in the bug you filed previously: #6342

Consider this my pound of flesh :)
I need it to be configurable at the pod spec level, or always default to the safer option.

@chrislovecnm
Contributor

I am guessing that I recreated this with PetSet. Let me know if this is a different issue, but this smells like the same one.

  1. Created a PetSet with the wrong image.
  2. Swore.
  3. Deleted the PetSet.
  4. Fixed the yaml.
  5. Created the same PetSet again.
  6. Same error; cussed again.
  7. Rinse, wash, repeat (did the same thing a few times).
  8. kubectl get po - wth, I still have a pod.

The pod failed on create because I put the wrong damn image url in, but it hangs on the delete. If this is a specific PetSet issue, let me know.

@bprashanth
Contributor Author

but it hangs on the delete

@chrislovecnm elaborate? do you mean ssh into node + docker delete, kubectl delete hangs, or you scaled down the petset?

@chrislovecnm
Contributor

@bprashanth
kubectl delete -f cassandra-petset-local.yaml
petset deleted

bash-3.2$ kubectl get po
NAME          READY     STATUS             RESTARTS   AGE
cassandra-0   0/1       ImagePullBackOff   0          24m
bash-3.2$ kubectl get po

here: https://gist.github.com/chrislovecnm/22ace1559dcb0ba7af64f74123b1fff8

I can still see the pod, and when I create the petset again the pod cassandra-0 is still there.

You on slack?

@chrislovecnm
Contributor

A full k8s restart did not clear it.

Also, I can see the pause container; I don't know what it is.

core@k8solo-01 ~ $ docker ps | grep cass
28f99a6e27df        gcr.io/google_containers/pause-amd64:3.0                     "/pause"                 6 minutes ago       Up 6 minutes                                 k8s_POD.98a3788b_cassandra-0_default_4e0f1aee-19eb-11e6-942e-a6fe4615cf32_984984d6
core@k8solo-01 ~ $

@bprashanth
Contributor Author

Yeah, that's because the petset won't scale+delete like the rc. It errs on the side of safety, so it won't touch your volumes if you simply delete it while it has replicas: N. Doesn't sound like this bug; let's get on slack.

@ncdc
Member

ncdc commented May 15, 2016

It's important to note that nsenter just drops you in as root with a minimal environment unless we take steps to do otherwise. Docker exec runs as the same user as the primary container process, with the same environment and presumably the same protections (cap drops, etc.). So it's not as easy as just switching to nsenter and having it be functionally equivalent.
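
Concretely, the two paths look roughly like this (a sketch using the container ID from this issue; the nsenter flags are the standard util-linux ones):

# docker exec: the daemon starts the process with the container's own
# user, environment, and capability set.
sudo docker exec -it cde46198ade6 /bin/bash

# nsenter: join the namespaces of the container's init process directly;
# you land in the container as root, with a minimal host environment and
# none of the container's cap drops applied.
pid=$(sudo docker inspect --format '{{.State.Pid}}' cde46198ade6)
sudo nsenter --target "$pid" --mount --uts --ipc --net --pid -- /bin/bash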


@dchen1107 dchen1107 added area/docker priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels May 16, 2016
@dchen1107
Member

I marked this as P0 to decide: should we disable docker exec completely? Switch to nsenter? Or suggest users use the http-exec-bridge?

One caveat for nsenter is that it simply enters the namespaces, without the proper cgroups or even capabilities, besides carrying another dependency.

@bprashanth
Contributor Author

Or suggest users use the http-exec-bridge?

So that's a slight pain for the cases I mentioned because I need to at least install mysql or whatever db client into my pet (#25456 (comment))

@dchen1107
Member

cc/ @timstclair Can we try to reproduce this failure case with docker 1.11?

@dchen1107
Member

@bprashanth the docker 1.11.1 image went out a while back; can we reproduce the issue?

@bprashanth
Contributor Author

bprashanth commented May 27, 2016

Oh, hmm, I swapped out my exec probe for an http probe + mysql container. I'll try once I've got the other bugs sorted out and report back.

@dchen1107 dchen1107 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels May 27, 2016
@dchen1107 dchen1107 self-assigned this May 27, 2016
@dchen1107
Member

Dropped the priority to p1 based on @bprashanth's comment above. Will try to reproduce against docker 1.11.1.

@ApsOps
Contributor

ApsOps commented Dec 15, 2016

I'm hitting this on k8s v1.4.7, docker v1.12.1.

Deleting the pod gets stuck in Terminating forever with:

Error syncing pod, skipping: error killing pod: failed to "KillContainer" for "<container_name>" with KillContainerError: "operation timeout: context deadline exceeded"

Docker says container is Running with these logs, any process starts using 100% CPU.

time="2016-12-15T11:02:30.280574522Z" level=info msg="Container 7d45eb7c499fa82a2e087f0eb20dc41dcd73ed83801a327d35d359b0f0d28d09 failed to exit within 10 seconds of signal 15 - using the force"
time="2016-12-15T11:02:40.281334541Z" level=info msg="Container 7d45eb7c499f failed to exit within 10 seconds of kill - trying direct SIGKILL"
time="2016-12-15T11:03:52.427456023Z" level=info msg="Container 7d45eb7c499fa82a2e087f0eb20dc41dcd73ed83801a327d35d359b0f0d28d09 failed to exit within 30 seconds of signal 15 - using the force"
time="2016-12-15T11:04:02.428313157Z" level=info msg="Container 7d45eb7c499f failed to exit within 10 seconds of kill - trying direct SIGKILL"

@nsidhaye

nsidhaye commented Feb 5, 2018

@ApsOps: Were you able to get this issue resolved?

I am getting the exact same error, but the root cause is different.

failed to get container status {"" ""}: rpc error: code = 2 desc = json: cannot unmarshal array into Go value of type types.ContainerJSON

@ApsOps
Contributor

ApsOps commented Feb 6, 2018

@nsidhaye I had upgraded Kubernetes and docker and haven't seen this issue again, though I've seen a couple of other docker-related issues. It's almost always a docker problem.

@rachirib-zz

Is this related to this one?
#52996

@ApsOps
Contributor

ApsOps commented Feb 23, 2018

I'm still hitting this on k8s v1.8.6 and Docker version 1.13.1, build 092cba3

Feb 23 05:29:31 ip-10-1-57-70 kubelet[30214]: , failed to "KillPodSandbox" for "2423f1b6-0293-11e8-91ef-1259591cb356" with KillPodSandboxError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 72af6e5d6cceb319873fca55bc5964642a0f2cfed04cca01dc1480e660328f10: Cannot kill container 72af6e5d6cceb319873fca55bc5964642a0f2cfed04cca01dc1480e660328f10: rpc error: code = 14 desc = grpc: the connection is unavailable"
Feb 23 05:29:31 ip-10-1-57-70 kubelet[30214]: ]
Feb 23 05:29:31 ip-10-1-57-70 kubelet[30214]: E0223 05:29:31.973871   30214 docker_sandbox.go:240] Failed to stop sandbox "71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e": Error response from daemon: Cannot stop container 71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e: Cannot kill container 71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e: rpc error: code = 14 desc = grpc: the connection is unavailable
Feb 23 05:29:31 ip-10-1-57-70 kubelet[30214]: E0223 05:29:31.974087   30214 remote_runtime.go:115] StopPodSandbox "71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e" from runtime service failed: rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e: Cannot kill container 71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e: rpc error: code = 14 desc = grpc: the connection is unavailable

Docker runtime is up. I'm able to run docker images and docker ps commands fine.

Restarting the docker service fixes it.
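
Concretely, the restart is just the node-level service restart (assuming a systemd-based node):

# On the affected node:
sudo systemctl restart docker
# The kubelet should then re-sync and let the pod finish terminating.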

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Feb 23, 2018
@ApsOps
Contributor

ApsOps commented Jun 6, 2018

I'm still hitting this on k8s v1.9.6 and Docker version 17.03.2-ce, build f5ec1e2 based on kops AMI.

@dchen1107 This is kind of a critical issue, since it doesn't resolve itself and containers get stuck in Terminating and ContainerCreating. Can we please bump the priority?

/cc @kubernetes/sig-node

@fernandrone

fernandrone commented Jul 17, 2018

Hi, I've experienced the same issue when attempting to delete mysql containers with k8s v1.9.3.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

I've started the cluster using kops and I'm running the debian stretch AMI debian-stretch-hvm-x86_64-gp2-2018-06-13-59294 (ami-810b35e4). Disclaimer: this is not the AMI supported by kops (they're still using jessie), so I'm not sure if this could be related to the issue.

I've tried kubectl delete --now --force and they're still there. Even removing the container with docker rm -f didn't solve it.

mysql                        0/2       Terminating   4          1d
mysql2                       0/2       Terminating   7          23h

The issue seems pretty consistent, as I got it on both mysql containers I tried to run.

Didn't find anything helpful in the kubelet logs either (tried to grep for mysql and filter for error messages and got nothing).

@kisshore

kisshore commented Sep 3, 2018

Yes, even on Kubernetes 1.10 with Docker 17.03.2-ce this issue is still consistent. I used the sysbench tool to create 1 GiB files in the container, and then it went stale. I tried to delete the pod normally, which didn't work; then I tried "--grace-period 0" and it was still stuck in the "Terminating" state. I strongly suspect this might be an I/O issue.

Pod:

# kubectl get pods -n myname
NAME                                          READY     STATUS        RESTARTS   AGE
benchmark-app-1535722045 		      1/1       Terminating   0          3d

Log:

pod_workers.go:186] Error syncing pod 995fb967-ad21-11e8-8837-a81e847d8f7c ("benchmark-app-1535722045_myname(995fb967-ad21-11e8-8837-a81e847d8f7c)"), skipping: error killing pod: failed to "KillContainer" for "benchmark-app-1535722045" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

Here are some very interesting stats for the docker container:

# docker stats 87bb32150c0e

CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
87bb32150c0e        --                  -- / --             --                  --                  --                  --

@florianrusch

florianrusch commented Oct 10, 2018

Same error on kubernetes version 1.12.0 and docker version 18.06.1-ce

Log:

Oct 10 14:32:50 xxx kubelet[29519]: E1010 14:32:50.833454   29519 remote_runtime.go:233] StopContainer "8d2c946887ea16a3a85f12868f9908888de399d6b7fc57278d4c7048c49128e8" from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Oct 10 14:32:50 xxx kubelet[29519]: E1010 14:32:50.833457   29519 kuberuntime_container.go:577] Container "docker://817c7c1df1934d122baeafc69af53b680a8cb624e17b596ff659bb715f039931" termination failed with gracePeriod 30: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Oct 10 14:32:50 xxx kubelet[29519]: E1010 14:32:50.835904   29519 kubelet.go:1551] error killing pod: failed to "KillContainer" for "rabbitmq" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Oct 10 14:32:50 xxx kubelet[29519]: E1010 14:32:50.836973   29519 pod_workers.go:186] Error syncing pod 06e4f6ba-cc70-11e8-9bec-4437e678cc01 ("rabbitmq-0_dev(06e4f6ba-cc70-11e8-9bec-4437e678cc01)"), skipping: error killing pod: failed to "KillContainer" for "rabbitmq" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

@fernandrone


Realized I never gave a follow-up on my issue above. In the end, the problem manifested only when running Istio with the canal CNI. When we switched canal for calico, the problem was gone.

@scruplelesswizard

/remove-lifecycle frozen
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. labels Jun 16, 2019
@xudifsd
Contributor

xudifsd commented Jun 18, 2019

Same here, hanging for more than 12 hours. I tried restarting the docker daemon and the kubelet; neither worked. But restarting etcd worked for me.

@kisshore

Has anyone tried clearing the finalizers?

kubectl patch pod <pod> -p '{"metadata":{"finalizers":null}}'
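
For example (the pod name and namespace here are placeholders; note this only deletes the stuck API object and skips whatever cleanup the finalizers guarded, so the container may still need manual cleanup on the node):

# Clear the finalizers so the API server can delete the stuck pod object.
kubectl patch pod mysql-2 --namespace e2e-tests-petset-hy8ki \
  -p '{"metadata":{"finalizers":null}}'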

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.


@kongkongk

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@kongkongk: You can't reopen an issue/PR unless you authored it or you are a collaborator.


@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 22, 2019
@2rs2ts
Contributor

2rs2ts commented Mar 4, 2020

Can this please get reopened?

@easvera

easvera commented Jul 23, 2020

-p '{"metadata":{"finalizers":null}}'

You are a life saver.

@shachar-ash

shachar-ash commented Sep 7, 2020

I'm facing the same error on k8s 1.17.7, Docker version 19.03.4, build 9013bf583a.

Warning FailedKillPod 10s kubelet, ip-1-2-3-4.region.compute.internal error killing pod: failed to "KillContainer" for "pod-name" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

@lichaojacobs

Me too. How did you fix it?

@easvera

easvera commented Sep 10, 2020

@lichaojacobs @shachar-ash
Use the command below, as mentioned by @kisshore:
kubectl patch pod <pod> -p '{"metadata":{"finalizers":null}}'

@milieu

milieu commented Sep 26, 2020

We are experiencing this bug on Kubernetes 1.17.8

Can this issue please be reopened?

  Warning  FailedCreatePodSandBox  86s (x42 over 92m)  kubelet, ip-12345.some-aws-region.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "some-job-in-production-1601071200-bg4s9": operation timeout: context deadline exceeded

@JacobHenner

For AWS users who are using Amazon's EKS AMI, it looks like this is due to a containerd compatibility issue. See awslabs/amazon-eks-ami#563 for details.

@diskun00

diskun00 commented May 8, 2021

Just FYI: if the above solutions don't work, instead of restarting the Docker service, you can just kill the corresponding container process of the stuck pod (see the sketch after this list):

  1. Go to the corresponding node.
  2. Use docker ps to find the container id for the corresponding pod.
  3. Run ps aux | grep $container_id.
  4. Kill the process: kill -9 $process_id.
  5. Remove the pod from k8s: kubectl delete pod $pod_name --grace-period=0 --force --namespace $namespace.
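
Putting those steps together as a single rough sketch (pod and namespace names are placeholders; the grep on the host typically matches the container's shim process):

#!/bin/sh
# Manual cleanup sketch for a pod stuck in Terminating; run the docker/ps
# steps on the affected node.
pod_name=mysql-2        # placeholder
namespace=default       # placeholder

# Steps 1-2: find the container ID for the pod.
container_id=$(sudo docker ps | grep "$pod_name" | awk '{print $1}' | head -n1)

# Step 3: find the host process associated with that container.
process_id=$(ps aux | grep "$container_id" | grep -v grep | awk '{print $2}' | head -n1)

# Step 4: kill it.
sudo kill -9 "$process_id"

# Step 5: force-remove the pod object from the API server.
kubectl delete pod "$pod_name" --grace-period=0 --force --namespace "$namespace"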
