docker kill hangs, pod stuck in terminating #25456
Oh, and the pod was in Running/Ready before killing it |
This looks similar to #21751 (comment). There have also been some new docker issues about "Container does not exist: container destroyed": moby/moby#12738 |
The pod stuck around for 13 hours until I restarted docker.
Yes.

```yaml
kind: PetSet
metadata:
  name: mysql
spec:
  serviceName: "galera"
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
        pod.alpha.kubernetes.io/init-containers: '[
          {
            "name": "install",
            "image": "bprashanth/galera-install:0.1",
            "imagePullPolicy": "Always",
            "args": ["--work-dir=/work-dir"],
            "volumeMounts": [
              {
                "name": "workdir",
                "mountPath": "/work-dir"
              },
              {
                "name": "config",
                "mountPath": "/etc/mysql"
              }
            ]
          },
          {
            "name": "bootstrap",
            "image": "debian:jessie",
            "command": ["/work-dir/peer-finder"],
            "args": ["-on-start=\"/work-dir/on-start.sh\"", "-service=galera"],
            "env": [
              {
                "name": "POD_NAMESPACE",
                "valueFrom": {
                  "fieldRef": {
                    "apiVersion": "v1",
                    "fieldPath": "metadata.namespace"
                  }
                }
              }
            ],
            "volumeMounts": [
              {
                "name": "workdir",
                "mountPath": "/work-dir"
              },
              {
                "name": "config",
                "mountPath": "/etc/mysql"
              }
            ]
          }
        ]'
    spec:
      containers:
      - name: mysql
        image: erkules/galera:basic
        ports:
        - containerPort: 3306
          name: mysql
        - containerPort: 4444
          name: sst
        - containerPort: 4567
          name: replication
        - containerPort: 4568
          name: ist
        args:
        - --defaults-file=/etc/mysql/my-galera.cnf
        - --user=root
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - "mysql -u root -e 'show databases;'"
          initialDelaySeconds: 15
          timeoutSeconds: 5
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/
        - name: config
          mountPath: /etc/mysql
      volumes:
      - name: config
        emptyDir: {}
      - name: workdir
        emptyDir: {}
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
```

Is there a workaround, or do we have to swap docker exec for the http-exec-bridge? |
We should check if docker 1.11 fixed this issue (#23397 (comment)). Besides that, we may be able to add extra plumbing in the kubelet to always stop the probers (and stop receiving exec requests) before killing the container, although that doesn't stop users from running exec directly on the node. /cc @kubernetes/sig-node |
Actually this probe is gross with http-exec, because I need to install mysql and mount my database in the http-exec sidecar. Can't we use nsenter for the exec probe instead of docker exec, if it's available on the host? |
The exec handler is configurable, e.g., "--docker-exec-handler=nsenter"; the default is docker exec. Maybe we should reconsider using nsenter as the default. @vishh @dchen1107, what do you think? EDIT: of course, we should verify whether nsenter would cause the same problem for docker. |
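Since the thread is about replacing docker exec with nsenter, here is a rough sketch of what an nsenter-based exec amounts to. This is only an illustration, not kubelet's actual handler; `probe_via_nsenter` is a hypothetical helper, and it assumes root plus `nsenter` on the host and a docker daemon that still answers read-only `docker inspect` queries:

```shell
# Run CMD inside a container's namespaces via nsenter instead of docker exec.
# Usage: probe_via_nsenter CONTAINER_ID CMD [ARGS...]
probe_via_nsenter() {
  local container=$1; shift
  local pid
  # Look up the container's init PID (a read-only query to the daemon):
  pid=$(docker inspect --format '{{.State.Pid}}' "$container") || return 1
  # Enter its mount, UTS, IPC, network, and PID namespaces and run the command.
  # Caveat (raised below in the thread): this does not join the container's
  # cgroups or drop capabilities.
  nsenter --target "$pid" --mount --uts --ipc --net --pid -- "$@"
}

# Example (container ID is a placeholder):
# probe_via_nsenter 28f99a6e27df sh -c "mysql -u root -e 'show databases;'"
```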
AFAIK the issue with nsenter was that it is an extra dependency that we had to ship with kubelet somehow. What happens if we issue a |
I'd much rather have an extra dependency and predictable outcomes. Docker exec has just proven to be unstable, repeatedly. Wedging a database in an unknown state is terrible for consistency, and I doubt anyone is going to read the fine print that describes the effects of exec probes on dying containers.
Consider this my pound of flesh :) |
I am guessing that I recreated this with PetSet. Let me know if this is a different issue, but it smells like the same one.
The pod failed on create because I put the wrong damn image url in, but it hangs on the delete. If this is a specific petset issue let me know. |
@chrislovecnm elaborate? do you mean ssh into node + docker delete, kubectl delete hangs, or you scaled down the petset? |
@bprashanth
```
bash-3.2$ kubectl get po
NAME          READY     STATUS             RESTARTS   AGE
cassandra-0   0/1       ImagePullBackOff   0          24m
bash-3.2$ kubectl get po
```
here: https://gist.github.com/chrislovecnm/22ace1559dcb0ba7af64f74123b1fff8 I can still see the pod, and when I create the petset again the pod
You on slack? |
Full k8s restart did not clear it. Also I can see the pause docker container. Don't know what it is.
```
core@k8solo-01 ~ $ docker ps | grep cass
28f99a6e27df   gcr.io/google_containers/pause-amd64:3.0   "/pause"   6 minutes ago   Up 6 minutes   k8s_POD.98a3788b_cassandra-0_default_4e0f1aee-19eb-11e6-942e-a6fe4615cf32_984984d6
core@k8solo-01 ~ $
```
|
Yeah, that's because the petset won't scale down and delete like the RC. It errs on the side of safety, so it won't touch your volumes if you simply delete it while it has |
It's important to note that nsenter just drops you in as root with a |
I marked this as P0 for a decision: should we disable docker exec completely, switch to nsenter, or suggest users use the http-exec-bridge? One caveat for nsenter is that it simply enters the namespaces, without the proper cgroups or even capabilities, besides carrying an extra dependency. |
So that's a slight pain for the cases I mentioned, because I need to at least install mysql or whatever db client into my pet (#25456 (comment)) |
cc/ @timstclair Can we try to reproduce this failure case with docker 1.11? |
@bprashanth docker 1.11.1 image was out a while back, can we reproduce the issue? |
Oh, hmm, I swapped out my exec probe for an http probe + mysql container. I'll try once I've got the other bugs sorted out and report back. |
Dropped the priority to p1 based on @bprashanth's comment above. Will try to reproduce against docker 1.11.1. |
I'm hitting this on k8s Deleting the pod gets stuck in
Docker says container is
|
@ApsOps: Were you able to get this issue resolved? I am getting the exact same error, but the root cause is different. |
@nsidhaye I had upgraded Kubernetes and docker and haven't seen this issue again, though I've seen a couple of other docker-related issues. It's almost always a docker problem. |
Is this related to this one? |
I'm still hitting this on
Docker runtime is up. I'm able to run
Restarting the docker service fixes it. /remove-lifecycle rotten |
I'm still hitting this on
@dchen1107 This is kind of a critical issue, since it doesn't resolve itself and containers get stuck in Terminating. /cc @kubernetes/sig-node |
Hi, I've experienced the same issue when attempting to delete mysql containers with k8s v1.9.3.
I've started the cluster using Kops, and I'm running the Debian stretch AMI. I've tried
The issue seems pretty consistent, as I got it on both mysql containers I tried to run. I didn't find anything helpful in the kubelet logs either (I tried to grep for mysql and filter for error messages and got nothing). |
Yes, even in
Pod:
Log:
Here are some very interesting statistics for the docker container:
|
Same error on
Log:
|
Realized I never gave a follow-up on my issue above. In the end, the problem manifested only when running Istio with the Canal CNI. When we switched Canal for Calico, the problem was gone. |
/remove-lifecycle frozen |
Same here, hanging for more than 12 hours. I've tried restarting the docker daemon and kubelet; that didn't work. But restarting etcd worked for me. |
Has anyone tried with finalizers? |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen |
@kongkongk: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Can this please get reopened? |
You are a life saver. |
I'm facing the same error on
|
Me too. How do you fix that? |
@lichaojacobs @shachar-ash |
We are experiencing this bug on Kubernetes 1.17.8. Can this issue please be reopened? |
For AWS users who are using Amazon's EKS AMI, it looks like this is due to a containerd compatibility issue. See awslabs/amazon-eks-ami#563 for details. |
Just FYI: if the above solutions don't work, instead of restarting the Docker service, we may just kill the corresponding container process of the stuck pod, i.e.,
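A minimal sketch of that last step, assuming `docker inspect` still responds even though `docker kill` hangs; `kill_stuck_container` is a hypothetical helper name, and this must run as root on the node:

```shell
# Kill the stuck container's process directly instead of restarting dockerd.
# Usage: kill_stuck_container CONTAINER_ID
kill_stuck_container() {
  local container=$1 pid
  # Find the container's init PID as recorded by the docker daemon:
  pid=$(docker inspect --format '{{.State.Pid}}' "$container") || return 1
  if [ "$pid" -le 0 ]; then
    echo "container $container has no recorded process (already exited?)" >&2
    return 1
  fi
  # SIGKILL the process; kubelet/docker can then finish tearing the pod down.
  kill -9 "$pid"
}
```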
|
I had a pod stuck in terminating. Similar to some other issues, didn't check for exact dupe.
Logged into the node and debugged a little (container cde46198ade6):
Exec works after the kill failed
Inspect shows Running: true. But the pid isn't around:
And the pod remains in terminating:
This is on 1.9.1, maybe fixed?
@kubernetes/goog-node
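The mismatch debugged above (inspect reports the container as running while its PID no longer exists) can be checked with a small diagnostic like this; `container_state` is a hypothetical helper, assuming a working docker CLI on the node:

```shell
# Compare docker's view of a container with the kernel's.
# Usage: container_state CONTAINER_ID
container_state() {
  local c=$1 running pid
  running=$(docker inspect --format '{{.State.Running}}' "$c") || return 1
  pid=$(docker inspect --format '{{.State.Pid}}' "$c") || return 1
  echo "docker says: Running=$running Pid=$pid"
  # Signal 0 only checks for the process's existence:
  if kill -0 "$pid" 2>/dev/null; then
    echo "kernel says: pid $pid is alive"
  else
    echo "kernel says: pid $pid is gone (docker's state is stale)"
  fi
}
```

When docker reports Running=true but the PID is gone, restarting the docker daemon (or killing the container's process directly) has been the usual way to unwedge it, as several comments above note.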