
docker kill hangs, pod stuck in terminating #25456

Closed
bprashanth opened this issue May 11, 2016 · 47 comments
Labels
area/docker priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.


@bprashanth
Contributor

I had a pod stuck in terminating. Similar to some other issues; I didn't check for an exact dupe.

Logged into the node and debugged a little (container cde46198ade6):

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker kill cde46198ade6
...
Hung for like 5m
^C
beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker ps
CONTAINER ID        IMAGE                                                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
cde46198ade6        erkules/galera:basic                                                   "/entrypoint.sh --def"   7 minutes ago       Up 7 minutes                            k8s_mysql.47396615_mysql-2_e2e-tests-petset-hy8ki_e8e98ddf-172d-11e6-b810-42010af00002_36392735

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker kill cde46198ade6
Error response from daemon: Cannot kill container cde46198ade6: [2] Container does not exist: container destroyed
Error: failed to kill containers: [cde46198ade6]

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker ps
CONTAINER ID        IMAGE                                                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
cde46198ade6        erkules/galera:basic                                                   "/entrypoint.sh --def"   7 minutes ago       Up 7 minutes   

Exec still works after the failed kill:

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker exec -it cde46198ade6 /bin/bash
root@mysql-2:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  1.5  0.0  18148  3096 ?        Ss   04:17   0:00 /bin/bash
root        15  0.0  0.0  15572  2176 ?        R+   04:17   0:00 ps aux

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker kill cde46198ade6
Error response from daemon: Cannot kill container cde46198ade6: [2] Container does not exist: container destroyed
Error: failed to kill containers: [cde46198ade6]

Inspect shows Running true:

beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker inspect cde46198ade6
[
{
    "Id": "cde46198ade62617e84cc59987669d4e674e83475259680d1952fa60ff90565c",
    "Created": "2016-05-11T04:07:54.627038551Z",
    "Path": "/entrypoint.sh",
    "Args": [
        "--defaults-file=/etc/mysql/my-galera.cnf",
        "--user=root"
    ],
    "State": {
        "Status": "running",
        "Running": true,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 27514,
        "ExitCode": 0,
        "Error": "",
        "StartedAt": "2016-05-11T04:07:54.839755198Z",
        "FinishedAt": "0001-01-01T00:00:00Z"
    },
    "Image": "7108a4321e9900675ba193af33555d0354ab66fc72ff592ae2acd38191db488a",
    "ResolvConfPath": "/var/lib/docker/containers/e3ddae18879c2d6723dd960fecaf32633c726e283a747a5171822622b0ca5236/resolv.conf",
    "HostnamePath": "/var/lib/docker/containers/e3ddae18879c2d6723dd960fecaf32633c726e283a747a5171822622b0ca5236/hostname",
    "HostsPath": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/etc-hosts",
    "LogPath": "/var/lib/docker/containers/cde46198ade62617e84cc59987669d4e674e83475259680d1952fa60ff90565c/cde46198ade62617e84cc59987669d4e674e83475259680d1952fa60ff90565c-json.log",
    "Name": "/k8s_mysql.47396615_mysql-2_e2e-tests-petset-hy8ki_e8e98ddf-172d-11e6-b810-42010af00002_36392735",
    "RestartCount": 0,
    "Driver": "aufs",
    "ExecDriver": "native-0.2",
    "MountLabel": "",
    "ProcessLabel": "",
    "AppArmorProfile": "",
    "ExecIDs": null,
    "HostConfig": {
        "Binds": [
            "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~gce-pd/pv-gce-qf9p5:/var/lib/",
            "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~empty-dir/config:/etc/mysql",
            "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~secret/default-token-ksbzc:/var/run/secrets/kubernetes.io/serviceaccount:ro",
            "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/etc-hosts:/etc/hosts",
            "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/containers/mysql/36392735:/dev/termination-log"
        ],
        "ContainerIDFile": "",
        "LxcConf": null,
        "Memory": 0,
        "MemoryReservation": 0,
        "MemorySwap": -1,
        "KernelMemory": 0,
        "CpuShares": 2,
        "CpuPeriod": 0,
        "CpusetCpus": "",
        "CpusetMems": "",
        "CpuQuota": 0,
        "BlkioWeight": 0,
        "OomKillDisable": false,
        "MemorySwappiness": null,
        "Privileged": false,
        "PortBindings": null,
        "Links": null,
        "PublishAllPorts": false,
        "Dns": null,
        "DnsOptions": null,
        "DnsSearch": null,
        "ExtraHosts": null,
        "VolumesFrom": null,
        "Devices": null,
        "NetworkMode": "container:e3ddae18879c2d6723dd960fecaf32633c726e283a747a5171822622b0ca5236",
        "IpcMode": "container:e3ddae18879c2d6723dd960fecaf32633c726e283a747a5171822622b0ca5236",
        "PidMode": "",
        "UTSMode": "",
        "CapAdd": null,
        "CapDrop": null,
        "GroupAdd": null,
        "RestartPolicy": {
            "Name": "",
            "MaximumRetryCount": 0
        },
        "SecurityOpt": null,
        "ReadonlyRootfs": false,
        "Ulimits": null,
        "LogConfig": {
            "Type": "json-file",
            "Config": {}
        },
        "CgroupParent": "/",
        "ConsoleSize": [
            0,
            0
        ],
        "VolumeDriver": ""
    },
    "GraphDriver": {
        "Name": "aufs",
        "Data": null
    },
    "Mounts": [
        {
            "Source": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~secret/default-token-ksbzc",
            "Destination": "/var/run/secrets/kubernetes.io/serviceaccount",
            "Mode": "ro",
            "RW": false
        },
        {
            "Source": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/etc-hosts",
            "Destination": "/etc/hosts",
            "Mode": "",
            "RW": true
        },
        {
            "Source": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/containers/mysql/36392735",
            "Destination": "/dev/termination-log",
            "Mode": "",
            "RW": true
        },
        {
            "Source": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~gce-pd/pv-gce-qf9p5",
            "Destination": "/var/lib",
            "Mode": "",
            "RW": true
        },
        {
            "Source": "/var/lib/kubelet/pods/e8e98ddf-172d-11e6-b810-42010af00002/volumes/kubernetes.io~empty-dir/config",
            "Destination": "/etc/mysql",
            "Mode": "",
            "RW": true
        }
    ],
    "Config": {
        "Hostname": "mysql-2",
        "Domainname": "",
        "User": "",
        "AttachStdin": false,
        "AttachStdout": false,
        "AttachStderr": false,
        "ExposedPorts": {
            "3306/tcp": {}
        },
        "Tty": false,
        "OpenStdin": false,
        "StdinOnce": false,
        "Env": [
            "KUBERNETES_PORT_443_TCP_PROTO=tcp",
            "KUBERNETES_PORT_443_TCP_PORT=443",
            "KUBERNETES_PORT_443_TCP_ADDR=10.0.0.1",
            "KUBERNETES_SERVICE_HOST=10.0.0.1",
            "KUBERNETES_SERVICE_PORT=443",
            "KUBERNETES_SERVICE_PORT_HTTPS=443",
            "KUBERNETES_PORT=tcp://10.0.0.1:443",
            "KUBERNETES_PORT_443_TCP=tcp://10.0.0.1:443",
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "VERSION=20160303",
            "DEBIAN_FRONTEND=noninteractive"
        ],
        "Cmd": [
            "--defaults-file=/etc/mysql/my-galera.cnf",
            "--user=root"
        ],
        "Image": "erkules/galera:basic",
        "Volumes": null,
        "WorkingDir": "",
        "Entrypoint": [
            "/entrypoint.sh"
        ],
        "OnBuild": null,
        "Labels": {
            "io.kubernetes.container.hash": "47396615",
            "io.kubernetes.container.name": "mysql",
            "io.kubernetes.container.restartCount": "0",
            "io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
            "io.kubernetes.pod.name": "mysql-2",
            "io.kubernetes.pod.namespace": "e2e-tests-petset-hy8ki",
            "io.kubernetes.pod.terminationGracePeriod": "30",
            "io.kubernetes.pod.uid": "e8e98ddf-172d-11e6-b810-42010af00002"
        }
    },
    "NetworkSettings": {
        "Bridge": "",
        "SandboxID": "",
        "HairpinMode": false,
        "LinkLocalIPv6Address": "",
        "LinkLocalIPv6PrefixLen": 0,
        "Ports": null,
        "SandboxKey": "",
        "SecondaryIPAddresses": null,
        "SecondaryIPv6Addresses": null,
        "EndpointID": "",
        "Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "IPAddress": "",
        "IPPrefixLen": 0,
        "IPv6Gateway": "",
        "MacAddress": "",
        "Networks": null
    }
}
]

But the pid isn't around:

beeps@e2e-test-beeps-minion-cfl3:~$ ps aux | grep 27514
beeps    31717  0.0  0.0   7852  1948 pts/1    S+   04:20   0:00 grep 27514
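
For the record, a quicker way to cross-check the daemon's recorded PID against the kernel (a sketch; output elided):

# Ask the daemon which PID it has recorded for the container,
# then ask the kernel whether that PID is actually alive.
pid=$(sudo docker inspect --format '{{.State.Pid}}' cde46198ade6)
sudo kill -0 "$pid" 2>/dev/null || echo "pid $pid is gone; the daemon state is stale"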

And the pod remains in terminating:

21:16:48-beeps~/goproj/src/k8s.io/kubernetes] (petset_e2e)$ kn get po
NAME      READY     STATUS        RESTARTS   AGE
mysql-2   0/1       Terminating   0          9m

This is on docker 1.9.1; maybe it's already fixed in a newer release?

beeps@e2e-test-beeps-minion-cfl3:~$ docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
beeps@e2e-test-beeps-minion-cfl3:~$ sudo docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64

@kubernetes/goog-node

@bprashanth bprashanth added the sig/node Categorizes an issue or PR as relevant to SIG Node. label May 11, 2016
@bprashanth
Contributor Author

Oh, and the pod was in Running/Ready before I killed it.

@yujuhong
Contributor

yujuhong commented May 11, 2016

This looks similar to #21751 (comment), in which an exec during docker stop may have caused an inconsistent state.
There is a possible explanation in the original docker bug: moby/moby#18758 (comment)
What's the pod spec? Were there any health checks using exec, or any other exec calls to the container?

There have also been some new docker issues about "Container does not exist: container destroyed": moby/moby#12738
Users have reported seeing the same issue in 1.10 as well. Docker 1.11 may be different.

@bprashanth
Contributor Author

The pod stuck around for 13 hours till I restarted docker.

What's the pod spec? Were there any health checks using exec, or any other exec calls to the container?

Yes.

kind: PetSet
metadata:
  name: mysql
spec:
  serviceName: "galera"
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
        pod.alpha.kubernetes.io/init-containers: '[
            {
                "name": "install",
                "image": "bprashanth/galera-install:0.1",
                "imagePullPolicy": "Always",
                "args": ["--work-dir=/work-dir"],
                "volumeMounts": [
                    {
                        "name": "workdir",
                        "mountPath": "/work-dir"
                    },
                    {
                        "name": "config",
                        "mountPath": "/etc/mysql"
                    }
                ]
            },
            {
                "name": "bootstrap",
                "image": "debian:jessie",
                "command": ["/work-dir/peer-finder"],
                "args": ["-on-start=\"/work-dir/on-start.sh\"", "-service=galera"],
                "env": [
                  {
                      "name": "POD_NAMESPACE",
                      "valueFrom": {
                          "fieldRef": {
                              "apiVersion": "v1",
                              "fieldPath": "metadata.namespace"
                          }
                      }
                   }
                ],
                "volumeMounts": [
                    {
                        "name": "workdir",
                        "mountPath": "/work-dir"
                    },
                    {
                        "name": "config",
                        "mountPath": "/etc/mysql"
                    }
                ]
            }
        ]'
    spec:
      containers:
      - name: mysql
        image: erkules/galera:basic
        ports:
        - containerPort: 3306
          name: mysql
        - containerPort: 4444
          name: sst
        - containerPort: 4567
          name: replication
        - containerPort: 4568
          name: ist
        args:
        - --defaults-file=/etc/mysql/my-galera.cnf
        - --user=root
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - "mysql -u root -e 'show databases;'"
          initialDelaySeconds: 15
          timeoutSeconds: 5
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/
        - name: config
          mountPath: /etc/mysql
      volumes:
      - name: config
        emptyDir: {}
      - name: workdir
        emptyDir: {}
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

Is there a workaround or do we have to swap docker exec for the http-exec-bridge?

@yujuhong
Contributor

We should check if docker 1.11 fixed this issue (#23397 (comment)).

Besides that, we may be able to add extra plumbing in kubelet to always stop the probers (and stop receiving exec requests) before killing the container, although this doesn't stop users from running exec directly on the node.

/cc @kubernetes/sig-node

@bprashanth
Contributor Author

Actually this probe is gross with http-exec, because I'd need to install mysql and mount my database in the http-exec sidecar. Can't we use nsenter for the exec probe instead of docker exec, if it's available on the host?

@yujuhong
Contributor

yujuhong commented May 11, 2016

The exec handler is configurable, e.g., "--docker-exec-handler=nsenter". The default is docker exec. There were more discussions in the bug you filed previously: #6342

Maybe we should reconsider using nsenter as default. @vishh @dchen1107, what do you think?

EDIT: of course we should verify whether nsenter would cause the same problem for docker.
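
For reference, a minimal sketch of what switching the handler looks like on the kubelet command line (the non-handler flags here are illustrative placeholders):

# Run the kubelet with the nsenter-based exec handler instead of the
# default "native" handler, which shells out to docker exec.
kubelet \
  --docker-exec-handler=nsenter \
  --kubeconfig=/var/lib/kubelet/kubeconfig  # remaining flags elided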

@vishh
Contributor

vishh commented May 11, 2016

AFAIK the issue with nsenter was that it is an extra dependency that we had to ship with kubelet somehow. What happens if we issue a docker start once kill fails, but inspect and ps work?

@bprashanth
Contributor Author

I'd much rather have an extra dependency and predictable outcomes. Docker exec has just proven to be unstable, repeatedly. Wedging a database in an unknown state is terrible for consistency, and I doubt anyone is going to read the fine print that describes the effects of exec probes on dying containers.

The exec handler is configurable, e.g., "--docker-exec-handler=nsenter". The default is docker exec. There were more discussions in the bug you filed previously: #6342

Consider this my pound of flesh :)
I need it to be configurable at the pod spec level, or always default to the safer option.

@chrislovecnm
Contributor

I am guessing that I recreated this with PetSet. Let me know if this is a different issue, but this smells like the same one.

  1. Created a PetSet with the wrong image.
  2. Swore.
  3. Deleted the PetSet.
  4. Fixed the yaml.
  5. Created the same PetSet again.
  6. Same error; cussed again.
  7. Rinse, wash, repeat (did the same thing a few times).
  8. kubectl get po - wth, I still have a pod.

The pod failed on create because I put the wrong damn image url in, but it hangs on the delete. If this is a specific PetSet issue, let me know.

@bprashanth
Contributor Author

but it hangs on the delete

@chrislovecnm elaborate? do you mean ssh into node + docker delete, kubectl delete hangs, or you scaled down the petset?

@chrislovecnm
Contributor

@bprashanth
kubectl delete -f cassandra-petset-local.yaml
petset deleted

bash-3.2$ kubectl get po
NAME          READY     STATUS             RESTARTS   AGE
cassandra-0   0/1       ImagePullBackOff   0          24m
bash-3.2$ kubectl get po

here: https://gist.github.com/chrislovecnm/22ace1559dcb0ba7af64f74123b1fff8

I can still see the pod, and when I create the petset again the pod cassandra-0 is still there.

You on slack?

@chrislovecnm
Contributor

A full k8s restart did not clear it.

Also, I can see the pause container; I don't know what it is.

core@k8solo-01 ~ $ docker ps | grep cass
28f99a6e27df        gcr.io/google_containers/pause-amd64:3.0                     "/pause"                 6 minutes ago       Up 6 minutes                                 k8s_POD.98a3788b_cassandra-0_default_4e0f1aee-19eb-11e6-942e-a6fe4615cf32_984984d6
core@k8solo-01 ~ $

@bprashanth
Contributor Author

Yeah, that's because the petset won't scale+delete like the rc. It errs on the side of safety, so it won't touch your volumes if you simply delete it while it has replicas: N. Doesn't sound like this bug; let's get on slack.

@ncdc
Member

ncdc commented May 15, 2016

It's important to note that nsenter just drops you in as root with a minimal environment unless we take steps to do otherwise. Docker exec runs as the same user as the primary container process, with the same environment and presumably the same protections (cap drops, etc.). So it's not as easy as just switching to nsenter and having it be functionally equivalent.
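
Concretely, the two paths look roughly like this (a sketch using the container ID from this issue; the nsenter flags are the standard util-linux ones):

# docker exec: the daemon starts the process with the container's own
# user, environment, and capability set.
sudo docker exec -it cde46198ade6 /bin/bash

# nsenter: join the namespaces of the container's init process directly;
# you land in the container as root, with a minimal host environment and
# none of the container's cap drops applied.
pid=$(sudo docker inspect --format '{{.State.Pid}}' cde46198ade6)
sudo nsenter --target "$pid" --mount --uts --ipc --net --pid -- /bin/bash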


@dchen1107 dchen1107 added area/docker priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels May 16, 2016
@dchen1107
Member

I marked this as P0 to decide: should we disable docker exec completely? Switch to nsenter? Or suggest users use the http-exec-bridge?

One caveat for nsenter is that it simply enters the namespaces, without the proper cgroups or even capabilities, besides carrying another dependency.

@bprashanth
Contributor Author

Or suggest users use the http-exec-bridge?

So that's a slight pain for the cases I mentioned because I need to at least install mysql or whatever db client into my pet (#25456 (comment))

@dchen1107
Member

cc/ @timstclair Can we try to reproduce this failure case with docker 1.11?

@dchen1107
Member

@bprashanth the docker 1.11.1 image went out a while back; can we reproduce the issue?

@bprashanth
Contributor Author

bprashanth commented May 27, 2016

Oh, hmm, I swapped out my exec probe for an http probe + mysql container. I'll try once I've got the other bugs sorted out and report back.

@dchen1107 dchen1107 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels May 27, 2016
@dchen1107 dchen1107 self-assigned this May 27, 2016
@dchen1107
Member

Dropped the priority to p1 based on @bprashanth's comment above. Will try to reproduce against docker 1.11.1.

@ApsOps
Contributor

ApsOps commented Dec 15, 2016

I'm hitting this on k8s v1.4.7, docker v1.12.1.

Deleting the pod gets stuck in Terminating forever with:

Error syncing pod, skipping: error killing pod: failed to "KillContainer" for "<container_name>" with KillContainerError: "operation timeout: context deadline exceeded"

Docker says container is Running with these logs, any process starts using 100% CPU.

time="2016-12-15T11:02:30.280574522Z" level=info msg="Container 7d45eb7c499fa82a2e087f0eb20dc41dcd73ed83801a327d35d359b0f0d28d09 failed to exit within 10 seconds of signal 15 - using the force"
time="2016-12-15T11:02:40.281334541Z" level=info msg="Container 7d45eb7c499f failed to exit within 10 seconds of kill - trying direct SIGKILL"
time="2016-12-15T11:03:52.427456023Z" level=info msg="Container 7d45eb7c499fa82a2e087f0eb20dc41dcd73ed83801a327d35d359b0f0d28d09 failed to exit within 30 seconds of signal 15 - using the force"
time="2016-12-15T11:04:02.428313157Z" level=info msg="Container 7d45eb7c499f failed to exit within 10 seconds of kill - trying direct SIGKILL"

@nsidhaye

nsidhaye commented Feb 5, 2018

@ApsOps: Were you able to get this issue resolved?

I am getting the exact same error, but the root cause is different.

failed to get container status {"" ""}: rpc error: code = 2 desc = json: cannot unmarshal array into Go value of type types.ContainerJSON

@ApsOps
Contributor

ApsOps commented Feb 6, 2018

@nsidhaye I had upgraded Kubernetes and docker and haven't seen this issue again, though I've seen a couple of other docker-related issues. It's almost always a docker problem.

@rachirib-zz

Is this related to this one?
#52996

@ApsOps
Contributor

ApsOps commented Feb 23, 2018

I'm still hitting this on k8s v1.8.6 and Docker version 1.13.1, build 092cba3

Feb 23 05:29:31 ip-10-1-57-70 kubelet[30214]: , failed to "KillPodSandbox" for "2423f1b6-0293-11e8-91ef-1259591cb356" with KillPodSandboxError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 72af6e5d6cceb319873fca55bc5964642a0f2cfed04cca01dc1480e660328f10: Cannot kill container 72af6e5d6cceb319873fca55bc5964642a0f2cfed04cca01dc1480e660328f10: rpc error: code = 14 desc = grpc: the connection is unavailable"
Feb 23 05:29:31 ip-10-1-57-70 kubelet[30214]: ]
Feb 23 05:29:31 ip-10-1-57-70 kubelet[30214]: E0223 05:29:31.973871   30214 docker_sandbox.go:240] Failed to stop sandbox "71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e": Error response from daemon: Cannot stop container 71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e: Cannot kill container 71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e: rpc error: code = 14 desc = grpc: the connection is unavailable
Feb 23 05:29:31 ip-10-1-57-70 kubelet[30214]: E0223 05:29:31.974087   30214 remote_runtime.go:115] StopPodSandbox "71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e" from runtime service failed: rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e: Cannot kill container 71905e02c6e37f5e07c0d4b46ade531e50bf2577fdb5ed254b98f5f83886da6e: rpc error: code = 14 desc = grpc: the connection is unavailable

Docker runtime is up. I'm able to run docker images and docker ps commands fine.

Restarting the docker service fixes it.
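
Concretely, the restart is just the node-level service restart (assuming a systemd-based node):

# On the affected node:
sudo systemctl restart docker
# The kubelet should then re-sync and let the pod finish terminating.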

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Feb 23, 2018
@ApsOps
Contributor

ApsOps commented Jun 6, 2018

I'm still hitting this on k8s v1.9.6 and Docker version 17.03.2-ce, build f5ec1e2 based on kops AMI.

@dchen1107 This is kind of a critical issue, since it doesn't resolve itself and containers get stuck in Terminating and ContainerCreating. Can we please bump the priority?

/cc @kubernetes/sig-node

@fernandrone

fernandrone commented Jul 17, 2018

Hi, I've experienced the same issue when attempting to delete mysql containers with k8s v1.9.3.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:17:28Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

I've started the cluster using kops and I'm running the debian stretch AMI debian-stretch-hvm-x86_64-gp2-2018-06-13-59294 (ami-810b35e4). Disclaimer: this is not the AMI supported by kops (they're still using jessie), so I'm not sure if this could be related to the issue.

I've tried kubectl delete --now --force and they're still there. Even removing the container with docker rm -f didn't solve it.

mysql                        0/2       Terminating   4          1d
mysql2                       0/2       Terminating   7          23h

The issue seems pretty consistent, as I got it on both mysql containers I tried to run.

Didn't find anything helpful in the kubelet logs either (tried to grep for mysql and filter for error messages and got nothing).

@kisshore

kisshore commented Sep 3, 2018

Yes, even on Kubernetes 1.10 with Docker 17.03.2-ce this issue is still consistent. I used the sysbench tool to create 1 GiB files in the container, and then it went stale. I tried to delete the pod normally, which didn't work; then I tried "--grace-period 0" and it was still stuck in the "Terminating" state. I strongly suspect this might be an I/O issue.

Pod:

# kubectl get pods -n myname
NAME                                          READY     STATUS        RESTARTS   AGE
benchmark-app-1535722045 		      1/1       Terminating   0          3d

Log:

pod_workers.go:186] Error syncing pod 995fb967-ad21-11e8-8837-a81e847d8f7c ("benchmark-app-1535722045_myname(995fb967-ad21-11e8-8837-a81e847d8f7c)"), skipping: error killing pod: failed to "KillContainer" for "benchmark-app-1535722045" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

Here are some very interesting stats for the docker container:

# docker stats 87bb32150c0e

CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
87bb32150c0e        --                  -- / --             --                  --                  --                  --

@florianrusch

florianrusch commented Oct 10, 2018

Same error on kubernetes version 1.12.0 and docker version 18.06.1-ce

Log:

Oct 10 14:32:50 xxx kubelet[29519]: E1010 14:32:50.833454   29519 remote_runtime.go:233] StopContainer "8d2c946887ea16a3a85f12868f9908888de399d6b7fc57278d4c7048c49128e8" from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Oct 10 14:32:50 xxx kubelet[29519]: E1010 14:32:50.833457   29519 kuberuntime_container.go:577] Container "docker://817c7c1df1934d122baeafc69af53b680a8cb624e17b596ff659bb715f039931" termination failed with gracePeriod 30: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Oct 10 14:32:50 xxx kubelet[29519]: E1010 14:32:50.835904   29519 kubelet.go:1551] error killing pod: failed to "KillContainer" for "rabbitmq" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Oct 10 14:32:50 xxx kubelet[29519]: E1010 14:32:50.836973   29519 pod_workers.go:186] Error syncing pod 06e4f6ba-cc70-11e8-9bec-4437e678cc01 ("rabbitmq-0_dev(06e4f6ba-cc70-11e8-9bec-4437e678cc01)"), skipping: error killing pod: failed to "KillContainer" for "rabbitmq" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

@fernandrone


Realized I never gave a follow-up on my issue above. In the end, the problem manifested only when running Istio with the canal CNI. When we switched canal for calico, the problem was gone.

@scruplelesswizard

/remove-lifecycle frozen
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. labels Jun 16, 2019
@xudifsd
Contributor

xudifsd commented Jun 18, 2019

Same here, hanging for more than 12 hours. I tried restarting the docker daemon and the kubelet; neither worked. But restarting etcd worked for me.

@kisshore

Has anyone tried clearing the finalizers?

kubectl patch pod <pod> -p '{"metadata":{"finalizers":null}}'
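
For example (the pod name and namespace here are placeholders; note this only deletes the stuck API object and skips whatever cleanup the finalizers guarded, so the container may still need manual cleanup on the node):

# Clear the finalizers so the API server can delete the stuck pod object.
kubectl patch pod mysql-2 --namespace e2e-tests-petset-hy8ki \
  -p '{"metadata":{"finalizers":null}}'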

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.


@kongkongk

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@kongkongk: You can't reopen an issue/PR unless you authored it or you are a collaborator.


@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 22, 2019
@2rs2ts
Contributor

2rs2ts commented Mar 4, 2020

Can this please get reopened?

@easvera

easvera commented Jul 23, 2020

-p '{"metadata":{"finalizers":null}}'

You are a life saver.

@shachar-ash

shachar-ash commented Sep 7, 2020

I'm facing the same error on k8s 1.17.7, Docker version 19.03.4, build 9013bf583a.

Warning FailedKillPod 10s kubelet, ip-1-2-3-4.region.compute.internal error killing pod: failed to "KillContainer" for "pod-name" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

@lichaojacobs

Me too. How did you fix it?

@easvera

easvera commented Sep 10, 2020

@lichaojacobs @shachar-ash
Use the command below, as mentioned by @kisshore:
kubectl patch pod <pod> -p '{"metadata":{"finalizers":null}}'

@milieu

milieu commented Sep 26, 2020

We are experiencing this bug on Kubernetes 1.17.8

Can this issue please be reopened?

  Warning  FailedCreatePodSandBox  86s (x42 over 92m)  kubelet, ip-12345.some-aws-region.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "some-job-in-production-1601071200-bg4s9": operation timeout: context deadline exceeded

@JacobHenner

For AWS users who are using Amazon's EKS AMI, it looks like this is due to a containerd compatibility issue. See awslabs/amazon-eks-ami#563 for details.

@diskun00

diskun00 commented May 8, 2021

Just FYI: if the above solutions don't work, instead of restarting the Docker service, you can just kill the corresponding container process of the stuck pod (see the sketch after this list):

  1. Go to the corresponding node.
  2. Use docker ps to find the container id for the corresponding pod.
  3. Run ps aux | grep $container_id.
  4. Kill the process: kill -9 $process_id.
  5. Remove the pod from k8s: kubectl delete pod $pod_name --grace-period=0 --force --namespace $namespace.
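
Putting those steps together as a single rough sketch (pod and namespace names are placeholders; the grep on the host typically matches the container's shim process):

#!/bin/sh
# Manual cleanup sketch for a pod stuck in Terminating; run the docker/ps
# steps on the affected node.
pod_name=mysql-2        # placeholder
namespace=default       # placeholder

# Steps 1-2: find the container ID for the pod.
container_id=$(sudo docker ps | grep "$pod_name" | awk '{print $1}' | head -n1)

# Step 3: find the host process associated with that container.
process_id=$(ps aux | grep "$container_id" | grep -v grep | awk '{print $2}' | head -n1)

# Step 4: kill it.
sudo kill -9 "$process_id"

# Step 5: force-remove the pod object from the API server.
kubectl delete pod "$pod_name" --grace-period=0 --force --namespace "$namespace"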
