This issue was originally opened by @SebastianFaulborn as hashicorp/terraform-provider-docker#313. It was migrated here as part of the community provider takeover by @kreuzwerker. The original body of the issue is below.
Hi,
When creating a docker swarm service, and the container happens to run on the DOCKER_HOST (parameter host in the provider), the container started as part of the service receives an immediate SIGKILL after terraform destroy, rather than stop_signal (and, optionally, SIGKILL after stop_grace_period if the container has not shut down by then). For containers which happen to be created on worker nodes the code works as expected (and likely also for containers created on manager nodes which are not the DOCKER_HOST).

This is a show stopper for all database applications and for containers which embed databases (such as Nexus3), and can lead to catastrophic data loss, as we have learned the hard way.
Terraform Version
Terraform v0.13.5
Affected Resource(s)
docker_service
Terraform Configuration Files (test.tf)
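The original test.tf did not survive the migration. A minimal configuration along these lines could reproduce the issue; everything here (image, trap script, constraint syntax) is a hypothetical reconstruction from the -var flags in the reproduction steps and the service logs, not the reporter's actual file:

```hcl
# Hypothetical reconstruction of test.tf -- names and the trap script
# are assumptions, not the original file.

variable "node" {}        # "manager" or "worker"
variable "docker_host" {} # e.g. tcp://<ip>:2375

provider "docker" {
  host = var.docker_host
}

resource "docker_service" "test" {
  name = "test"

  task_spec {
    container_spec {
      image = "busybox"
      # Print "test" at startup and "SIGTERM" when the stop signal
      # arrives, matching the service logs shown below.
      command = ["sh", "-c", "trap 'echo SIGTERM; exit 0' TERM; echo test; while true; do sleep 1; done"]
    }

    placement {
      constraints = ["node.role==${var.node}"]
    }
  }
}
```

The trap handler is what makes the bug visible: a container that receives SIGTERM logs SIGTERM and exits cleanly, while a SIGKILLed one logs nothing.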
Debug Output
none
Panic Output
none
Expected Behavior
Container receives SIGTERM first upon shutdown:

xyz$ docker service logs -f test
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-worker-node-0 | test
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-worker-node-0 | SIGTERM
Actual Behavior
When the container runs on the DOCKER_HOST it immediately receives SIGKILL:

xyz$ docker service logs -f test
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-manager-node-0 | test
Steps to Reproduce
Create a docker swarm cluster with just 1 manager and 1 worker (this just makes it easy to get the test case working). Then:
terraform init
terraform apply -var 'node=manager' -var 'docker_host=tcp://<place here the DOCKER_HOST ip>:2375'
In a second console: DOCKER_HOST=tcp://<place here the DOCKER_HOST ip>:2375 docker service logs -f test
Back in the original console: terraform destroy -var 'node=manager' -var 'docker_host=tcp://<place here the DOCKER_HOST ip>:2375'
Watch the output in the second console.
To run the working test case:
Do the same as before, but replace -var 'node=manager' with -var 'node=worker'.
Important Factoids
Looking at the source code (I am not a Go expert!), I found the following:
https://github.com/terraform-providers/terraform-provider-docker/blob/master/docker/resource_docker_service_funcs.go
Line 270ff
func deleteService(serviceID string, d *schema.ResourceData, client *client.Client)
Line 297 actually removes the service. I believe this is the only line needed; everything further down exists only for historic reasons, since Docker now does everything needed by itself:
if err := client.ServiceRemove(context.Background(), serviceID); err != nil {
Note: you can see in the docker daemon debug log of the DOCKER_HOST a line such as:
time="2020-11-18T18:05:10.871048408Z" level=debug msg="Calling DELETE /v1.40/services/so1zejuqgruzyrsz9c5bz1isn"
... which is the only REST API call made when running docker service rm so1zejuqgruzyrsz9c5bz1isn with the docker CLI.

Line 309 is supposed to wait until the container has been removed, but actually always returns immediately:
exitCode, _ := client.ContainerWait(ctx, containerID, container.WaitConditionRemoved)
2020-11-19T16:46:45.960+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Found container ['running'] for destroying: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
2020-11-19T16:46:45.960+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Deleting service: 'i1b30zzgweff3he63r35ky3i3'
2020-11-19T16:46:45.994+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Waiting for container: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736' to exit: max 30s
2020-11-19T16:46:46.027+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:46 [INFO] Container exited with code [0xc000094660]: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
2020-11-19T16:46:46.027+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:46 [INFO] Removing container: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
Note: only 33 ms pass between waiting for the container and removing it.
Line 318 is supposed to remove the container (why? Docker already does this by itself!).
if err := client.ContainerRemove(context.Background(), containerID, removeOpts); err != nil {
There are 2 problems here:
a) This line has the same effect as docker container rm --force <containerID>, which, as a result of --force, immediately sends SIGKILL to the container. Because of the first problem (no waiting actually happens), this occurs before Docker has had a chance to remove the service and send SIGTERM to all the containers. Hence the containers on the manager node get killed.

time="2020-11-18T17:56:36.644210735Z" level=debug msg="Calling DELETE /v1.40/services/dll5o060pagmxr0sg9lasjit3"
time="2020-11-18T17:56:36.689439754Z" level=debug msg="Calling POST /v1.40/containers/b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a/wait?condition=removed"
time="2020-11-18T17:56:36.769538131Z" level=debug msg="Calling DELETE /v1.40/containers/b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a?force=1&v=1"
time="2020-11-18T17:56:36.769598227Z" level=debug msg="Sending kill signal 9 to container b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a"
Note: look at the timestamps!
b) The code does not seem to anticipate that container commands (as opposed to service commands) have to be sent to the exact node which runs the container. This is why the problem only shows up on the DOCKER_HOST: the workers (or any other node; this I haven't checked) simply never receive the ContainerRemove command, and the DOCKER_HOST reports that the container is unknown to it.

Disclaimer: these are just my findings from looking hard at the source code. I wanted to share them in the hope that they are useful and maybe save some time.
References
none