terraform destroy does not respect stop_grace_period for docker swarm services running on DOCKER_HOST #16
Labels: bug, r/service, stale, waiting for response
Hi,
When creating a docker swarm service whose container happens to run on the DOCKER_HOST (parameter host in the provider), the container started as part of the service receives a SIGKILL immediately after terraform destroy, rather than stop_signal (optionally followed by SIGKILL after stop_grace_period if the container has not shut down by then). For containers created on worker nodes the code works as expected (and likely also for containers created on manager nodes that are not the DOCKER_HOST).
Note: the docker CLI (i.e. docker service rm) does work as expected.
This is a show stopper for all database applications and for containers that embed a database (such as Nexus3), and it can lead to catastrophic data loss, as we have learned the hard way.
I already reported this here but did not receive any answer (maybe the wrong place to post issues):
hashicorp/terraform-provider-docker#313
Terraform Version
Terraform v0.13.5
Note: kreuzwerker/terraform-provider-docker V2.8.0 also has this problem.
Affected Resource(s)
docker_service
Terraform Configuration Files (test.tf)
Debug Output
none
Panic Output
none
Expected Behavior
On shutdown, the container first receives SIGTERM:
Actual Behavior
When the container runs on the DOCKER_HOST, it immediately receives SIGKILL:
Steps to Reproduce
Create a docker swarm cluster with just 1 manager and 1 worker (this just makes it easy to get the test case working). Then:
terraform init
terraform apply -var 'node=manager' -var 'docker_host=tcp://<place here the DOCKER_HOST ip>:2375'
DOCKER_HOST=tcp://<place here the DOCKER_HOST ip>:2375 docker service logs -f test
terraform destroy -var 'node=manager' -var 'docker_host=tcp://<place here the DOCKER_HOST ip>:2375'
To run the working test case:
Do the same as before, but replace -var 'node=manager' with -var 'node=worker'.
Important Factoids
Looking at the source code (I am not a Go expert!), I found the following:
https://github.com/terraform-providers/terraform-provider-docker/blob/master/docker/resource_docker_service_funcs.go
Line 270ff
func deleteService(serviceID string, d *schema.ResourceData, client *client.Client)
Line 297 actually removes the service (I believe it is the only line needed and that everything further down exists only for historic reasons; docker now does everything required by itself):
if err := client.ServiceRemove(context.Background(), serviceID); err != nil {
Note: you can see in the docker daemon debug log of the DOCKER_HOST a line such as:
time="2020-11-18T18:05:10.871048408Z" level=debug msg="Calling DELETE /v1.40/services/so1zejuqgruzyrsz9c5bz1isn"
... which is the only REST API call made when running
docker service rm so1zejuqgruzyrsz9c5bz1isn
using the docker CLI.
Line 309 is supposed to wait until the container has been removed, but actually always returns immediately:
exitCode, _ := client.ContainerWait(ctx, containerID, container.WaitConditionRemoved)
2020-11-19T16:46:45.960+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Found container ['running'] for destroying: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
2020-11-19T16:46:45.960+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Deleting service: 'i1b30zzgweff3he63r35ky3i3'
2020-11-19T16:46:45.994+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Waiting for container: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736' to exit: max 30s
2020-11-19T16:46:46.027+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:46 [INFO] Container exited with code [0xc000094660]: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
2020-11-19T16:46:46.027+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:46 [INFO] Removing container: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
Note: just 33 ms pass between waiting for the container and removing it.
I don't know why it always returns immediately.
Line 318 is supposed to remove the container (why? docker does this by itself!).
if err := client.ContainerRemove(context.Background(), containerID, removeOpts); err != nil {
There are 2 problems here:
a) This line has the same effect as
docker container rm --force <containerID>
which, as a result of --force, immediately sends a SIGKILL to the container. Because of the first problem (no waiting happens), this occurs before docker has had a chance to remove the service and send SIGTERM to all the containers. Hence the containers on the manager node are killed.
time="2020-11-18T17:56:36.644210735Z" level=debug msg="Calling DELETE /v1.40/services/dll5o060pagmxr0sg9lasjit3"
time="2020-11-18T17:56:36.689439754Z" level=debug msg="Calling POST /v1.40/containers/b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a/wait?condition=removed"
time="2020-11-18T17:56:36.769538131Z" level=debug msg="Calling DELETE /v1.40/containers/b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a?force=1&v=1"
time="2020-11-18T17:56:36.769598227Z" level=debug msg="Sending kill signal 9 to container b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a"
Note: have a look at the timestamps!
b) The code does not seem to anticipate that container commands (as opposed to service commands) have to be sent to the exact node that runs the container. This is why the problem only occurs on the DOCKER_HOST: the workers (or any other node; I have not checked this) simply never receive the ContainerRemove command, and the DOCKER_HOST reports that the container is unknown to it.
Disclaimer: these are just my findings from looking hard at the source code. I wanted to share them in the hope that they are useful and maybe save some time.
References
none