terraform destroy does not respect stop_grace_period for docker swarm services running on DOCKER_HOST #16

Closed · SebastianFaulborn opened this issue Dec 2, 2020 · 4 comments
Labels: bug (Something isn't working), r/service (Relates to the service resource), stale, waiting for response (Waiting for a response of the creator)

Comments

SebastianFaulborn commented Dec 2, 2020

Hi,

When a docker swarm service is created and its container happens to run on the DOCKER_HOST (the host parameter of the provider), the container started as part of the service immediately receives SIGKILL after terraform destroy, rather than stop_signal first (followed by SIGKILL after stop_grace_period if the container has not shut down by then). For containers that happen to be created on worker nodes the code works as expected (and likely also for containers created on manager nodes that are not the DOCKER_HOST).

Note: the docker CLI (i.e. docker service rm) does work as expected.

This is a show stopper for all database applications and for containers that embed databases (such as Nexus3), and it can lead to catastrophic data loss, as we have learned the hard way.

I already reported this here but did not receive any answers at all (maybe it was the wrong place to post issues):
hashicorp/terraform-provider-docker#313

Terraform Version

Terraform v0.13.5

docker = {
  source  = "terraform-providers/docker"
  version = "2.7.2"
}

Note: kreuzwerker/terraform-provider-docker v2.8.0 also has this problem.

Affected Resource(s)

docker_service

Terraform Configuration Files (test.tf)

variable "docker_host" {
  type    = string
}

variable "node" {
  type = string
}

terraform {
  required_providers {
    docker = {
      source = "terraform-providers/docker"
      version = "2.7.2"
    }
  }
}

provider "docker" {
  host = var.docker_host
}

resource "docker_service" "test" {
  name = "test"
  
  task_spec {
    container_spec {
      image = "ubuntu"
      
      stop_signal       = "SIGTERM"
      stop_grace_period = "30s"
      
      command = [ "/bin/bash" ]
      args    = [ "-c", "echo test; trap 'echo SIGTERM && exit 0' SIGTERM; while true; do sleep 1; done" ]
    }
    
    placement {
      constraints = [
        "node.role==${var.node}"
      ]
    }
  }
}

Debug Output

none

Panic Output

none

Expected Behavior

Upon shutdown, the container first receives SIGTERM:

xyz$ docker service logs -f test
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-worker-node-0    | test          (Note: output after start)
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-worker-node-0    | SIGTERM       (Note: output after receiving SIGTERM signal)

Actual Behavior

When the container runs on the DOCKER_HOST, it immediately receives SIGKILL:

xyz$ docker service logs -f test
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-manager-node-0    | test  (Note: output after start. Output after receiving SIGTERM is missing)

Steps to Reproduce

Create a docker swarm cluster with just 1 manager and 1 worker (this just makes it easy to get the test case working). Then:

  1. terraform init
  2. terraform apply -var 'node=manager' -var 'docker_host=tcp://<place here the DOCKER_HOST ip>:2375'
  3. In a second console, enter: DOCKER_HOST=tcp://<place here the DOCKER_HOST ip>:2375 docker service logs -f test
  4. Back in the original console, enter:
     terraform destroy -var 'node=manager' -var 'docker_host=tcp://<place here the DOCKER_HOST ip>:2375'
  5. Watch the output in the second console.

To run the working test case:
Do the same as before, but replace -var 'node=manager' with -var 'node=worker'.

Important Factoids

Looking at the source code (I am not a Go expert!), I found the following:
https://github.com/terraform-providers/terraform-provider-docker/blob/master/docker/resource_docker_service_funcs.go
Line 270ff
func deleteService(serviceID string, d *schema.ResourceData, client *client.Client)

Line 297 actually removes the service (and I believe it is the only line needed; everything further down is there just for historic reasons, since Docker now does everything needed by itself):
if err := client.ServiceRemove(context.Background(), serviceID); err != nil {

Note: in the docker daemon debug log of the DOCKER_HOST you can see a line such as:
time="2020-11-18T18:05:10.871048408Z" level=debug msg="Calling DELETE /v1.40/services/so1zejuqgruzyrsz9c5bz1isn"
... which is the only REST API call made when running docker service rm so1zejuqgruzyrsz9c5bz1isn with the docker CLI.
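
To illustrate: if this observation is right, deleting a swarm service would only need something like the following minimal sketch (my own illustration against the standard github.com/docker/docker/client API, not the provider's actual code; the function name deleteServiceOnly is mine), with the swarm orchestrator handling stop_signal and stop_grace_period for each task by itself:

import (
	"context"

	"github.com/docker/docker/client"
)

// deleteServiceOnly removes the swarm service and lets the orchestrator stop
// the tasks itself (stop_signal first, SIGKILL after stop_grace_period).
func deleteServiceOnly(cli *client.Client, serviceID string) error {
	return cli.ServiceRemove(context.Background(), serviceID)
}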

Line 309 is supposed to wait until the container has been removed, but it actually always returns immediately:
exitCode, _ := client.ContainerWait(ctx, containerID, container.WaitConditionRemoved)

2020-11-19T16:46:45.960+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Found container ['running'] for destroying: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
2020-11-19T16:46:45.960+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Deleting service: 'i1b30zzgweff3he63r35ky3i3'
2020-11-19T16:46:45.994+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Waiting for container: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736' to exit: max 30s
2020-11-19T16:46:46.027+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:46 [INFO] Container exited with code [0xc000094660]: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
2020-11-19T16:46:46.027+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:46 [INFO] Removing container: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'

Note: only 33 ms pass between waiting for the container and removing it.
I don't know why it always returned immediately.
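
One plausible explanation (an assumption on my part, though the maintainer's reply further below seems to confirm it): newer versions of the Docker Go client changed ContainerWait to return two channels instead of a value. The two-value assignment above still compiles, but exitCode then holds the status channel itself, so nothing blocks, and printing it yields a channel address, which would explain the odd [0xc000094660] "exit code" in the log. A minimal sketch (the demo function name is mine):

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func demo(cli *client.Client, containerID string) {
	// Newer clients: ContainerWait returns (<-chan status, <-chan error).
	// exitCode therefore receives the channel, not a result, and the call
	// returns immediately instead of blocking until the container is gone.
	exitCode, _ := cli.ContainerWait(context.Background(), containerID, container.WaitConditionRemoved)

	// %v on a channel prints its address, e.g. [0xc000094660], matching the
	// "Container exited with code [0xc000094660]" debug line above.
	fmt.Printf("[INFO] Container exited with code [%v]\n", exitCode)
}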

Line 318 is supposed to remove the container (why? Docker already does this by itself!).
if err := client.ContainerRemove(context.Background(), containerID, removeOpts); err != nil {

There are two problems here:

a) This line has the same effect as docker container rm --force <containerID>, which, as a result of --force, immediately sends SIGKILL to the container. Due to the first problem (no waiting actually happens), this occurs before Docker has had a chance to remove the service and send SIGTERM to all the containers. Hence the containers on the manager node get killed (see the sketch after the daemon log below).

time="2020-11-18T17:56:36.644210735Z" level=debug msg="Calling DELETE /v1.40/services/dll5o060pagmxr0sg9lasjit3"
time="2020-11-18T17:56:36.689439754Z" level=debug msg="Calling POST /v1.40/containers/b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a/wait?condition=remo
ved"
time="2020-11-18T17:56:36.769538131Z" level=debug msg="Calling DELETE /v1.40/containers/b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a?force=1&v=1"
time="2020-11-18T17:56:36.769598227Z" level=debug msg="Sending kill signal 9 to container b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a"

Note: have a look at the timestamps!
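
For reference, here is roughly what such a forced removal looks like with the Docker Go client (my own sketch, not the provider's exact code; the function name forceRemove is mine). Force: true is what produces force=1 in the DELETE request above and the immediate SIGKILL:

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// forceRemove mirrors `docker container rm --force <containerID>`: the daemon
// sends SIGKILL right away, skipping stop_signal and stop_grace_period.
func forceRemove(cli *client.Client, containerID string) error {
	removeOpts := types.ContainerRemoveOptions{
		RemoveVolumes: true, // shows up as v=1 in the daemon log above
		Force:         true, // shows up as force=1, i.e. immediate SIGKILL
	}
	return cli.ContainerRemove(context.Background(), containerID, removeOpts)
}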

b) The code does not seem to anticipate that container commands (as opposed to service commands) have to be sent to the exact node that runs the container. This is why the problem only happens on the DOCKER_HOST: the workers (or any other node; this I haven't checked) simply never receive the ContainerRemove command, and the DOCKER_HOST reports that the container is unknown to it.

Disclaimer: these are just my findings from looking hard at the source code; I wanted to share them in the hope that they are useful and maybe save some time.

References

none

mavogel added the bug and r/service labels on Dec 25, 2020
github-actions bot commented:

This issue is stale because it has been open 60 days with no activity.
Remove stale label or comment or this will be closed in 7 days.
If you don't want this issue to be closed, please set the label pinned.


github-actions bot commented Jun 6, 2021

This issue is stale because it has been open 60 days with no activity.
Remove stale label or comment or this will be closed in 7 days.
If you don't want this issue to be closed, please set the label pinned.

github-actions bot added the stale label on Jun 6, 2021
mavogel removed the stale label on Jun 14, 2021

mavogel commented Jun 23, 2021

Hi @SebastianFaulborn, you might want to check this again with the latest v2.13.0 release, where we fixed this in #227.

TL;DR: the function returned a channel which we did not wait on; that's why it returned immediately. I think the signature changed in one of the updates.
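
For anyone landing here later, the corrected pattern looks roughly like this (a sketch of the channel-based client API, not necessarily the exact code from #227; the function name waitForRemoval is illustrative):

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// waitForRemoval blocks until the container is actually removed, an error
// occurs, or the timeout (e.g. stop_grace_period) expires.
func waitForRemoval(cli *client.Client, containerID string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// ContainerWait returns two channels; selecting on both is what makes
	// the call block, instead of assigning a channel to a variable.
	waitCh, errCh := cli.ContainerWait(ctx, containerID, container.WaitConditionRemoved)
	select {
	case err := <-errCh:
		return fmt.Errorf("waiting for container %s: %w", containerID, err)
	case resp := <-waitCh:
		log.Printf("[INFO] Container %s exited with status code %d", containerID, resp.StatusCode)
		return nil
	}
}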

mavogel added the waiting for response label on Jun 23, 2021
github-actions bot commented:

This issue is stale because it has been open 60 days with no activity.
Remove stale label or comment or this will be closed in 7 days.
If you don't want this issue to be closed, please set the label pinned.
