New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node DNS sync container takes 30-60s to shut down because it doesn't handle SIGTERM properly #65
Comments
|
/assign @ironcladlou |
|
Fixed by #68 |
|
@pravisankar: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
rbbratta
added a commit
to rbbratta/verification-tests
that referenced
this issue
Nov 5, 2020
Pods run the entrypoint as PID 1, whatever process is run as PID 1 won't receive SIGTERM and won't shutdown quickly (< 1min) We need to trap SIGTERM and exit immediately See: https://github.com/openshift/cluster-network-operator/pull/859/files#diff-6f3fb59cc9404c296e2f7aa6b442137bdb2d4559ced7bfc51b7a188c2b64233fR65 openshift/cluster-dns-operator#65 We should: > Listen for SIGTERM explicitly > Always `sleep X & wait` instead of just `sleep X` https://vagga.readthedocs.io/en/latest/pid1mode.html https://hynek.me/articles/docker-signals/ The current `["/bin/bash", "-c", "sleep 2000000000000"]` is causing us to sleep ~42 seconds each time we delete the pod. So we need to trap SIGTERM and then exit ```shell command: ["/bin/bash", "-c", "trap 'kill $(jobs -p); exit 0' TERM ; sleep 2000000000000 & wait"] ``` Time to delete for `["/bin/bash", "-c", "sleep 2000000000000"]` ``` pod "test-pod" deleted real 0m41.459s user 0m0.175s sys 0m0.050s ``` Time to delete for `["/bin/bash", "-c", "trap 'kill $(jobs -p); exit 0' TERM ; sleep 2000000000000 & wait"]` ``` pod "test-pod" deleted real 0m2.738s user 0m0.186s sys 0m0.057s ``` averages over 20 runs with ``` for f in {1..20} ; do oc create -f generic_multus_pod.yaml ; oc wait --for=condition=Ready pod/test-pod ; time oc delete pod test-pod ; done grep -P -o '(?<=real 0m)[^s]+' | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' ``` `["/bin/bash", "-c", "sleep 2000000000000"]` 36.317s `["/bin/bash", "-c", "trap 'kill $(jobs -p); exit 0' TERM ; sleep 2000000000000 & wait"]` 6.9327 Not handling SIGTERM is probably costing us ~30s every time we delete a project with a sleeping Pod after the test.
openshift-merge-robot
pushed a commit
to openshift/verification-tests
that referenced
this issue
Nov 5, 2020
Pods run the entrypoint as PID 1, whatever process is run as PID 1 won't receive SIGTERM and won't shutdown quickly (< 1min) We need to trap SIGTERM and exit immediately See: https://github.com/openshift/cluster-network-operator/pull/859/files#diff-6f3fb59cc9404c296e2f7aa6b442137bdb2d4559ced7bfc51b7a188c2b64233fR65 openshift/cluster-dns-operator#65 We should: > Listen for SIGTERM explicitly > Always `sleep X & wait` instead of just `sleep X` https://vagga.readthedocs.io/en/latest/pid1mode.html https://hynek.me/articles/docker-signals/ The current `["/bin/bash", "-c", "sleep 2000000000000"]` is causing us to sleep ~42 seconds each time we delete the pod. So we need to trap SIGTERM and then exit ```shell command: ["/bin/bash", "-c", "trap 'kill $(jobs -p); exit 0' TERM ; sleep 2000000000000 & wait"] ``` Time to delete for `["/bin/bash", "-c", "sleep 2000000000000"]` ``` pod "test-pod" deleted real 0m41.459s user 0m0.175s sys 0m0.050s ``` Time to delete for `["/bin/bash", "-c", "trap 'kill $(jobs -p); exit 0' TERM ; sleep 2000000000000 & wait"]` ``` pod "test-pod" deleted real 0m2.738s user 0m0.186s sys 0m0.057s ``` averages over 20 runs with ``` for f in {1..20} ; do oc create -f generic_multus_pod.yaml ; oc wait --for=condition=Ready pod/test-pod ; time oc delete pod test-pod ; done grep -P -o '(?<=real 0m)[^s]+' | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' ``` `["/bin/bash", "-c", "sleep 2000000000000"]` 36.317s `["/bin/bash", "-c", "trap 'kill $(jobs -p); exit 0' TERM ; sleep 2000000000000 & wait"]` 6.9327 Not handling SIGTERM is probably costing us ~30s every time we delete a project with a sleeping Pod after the test.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Bash processes run as pid 1 need to follow a certain set of patterns in order to avoid being unresponsive when the container runtime invokes them.
Right now, the sidecar sync container is ignoring SIGTERM and so pod deletion takes however long a single wait lasts, instead of happening near instantly. This is frustrating and delays upgrade and can lead to extended downtime on the node for DNS.
Specifically, you need to
sleep X & waitinstead of justsleep X: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_node_group/files/sync.yaml#L83The text was updated successfully, but these errors were encountered: