
terraform destroy does not respect stop_grace_period for docker swarm services running on DOCKER_HOST #313

Open
SebastianFaulborn opened this issue Nov 19, 2020 · 0 comments


Hi,

When creating a docker swarm service whose container happens to run on the DOCKER_HOST (the host parameter of the provider), the container started as part of the service receives an immediate SIGKILL on terraform destroy, rather than stop_signal (optionally followed by SIGKILL after stop_grace_period if the container has not shut down by then). For containers that happen to be created on worker nodes the code works as expected (and likely also for containers created on manager nodes which are not the DOCKER_HOST).

This is a showstopper for all database applications and for containers that embed a database (such as Nexus3), and it can lead to catastrophic data loss, as we have learned the hard way.

Terraform Version

Terraform v0.13.5

Affected Resource(s)

  • docker_service

Terraform Configuration Files (test.tf)

variable "docker_host" {
  type    = string
}

variable "node" {
  type = string
}

terraform {
  required_providers {
    docker = {
      source = "terraform-providers/docker"
      version = "2.7.2"
    }
  }
}

provider "docker" {
  host = var.docker_host
}

resource "docker_service" "test" {
  name = "test"
  
  task_spec {
    container_spec {
      image = "ubuntu"
      
      stop_signal       = "SIGTERM"
      stop_grace_period = "30s"
      
      command = [ "/bin/bash" ]
      args    = [ "-c", "echo test; trap 'echo SIGTERM && exit 0' SIGTERM; while true; do sleep 1; done" ]
    }
    
    placement {
      constraints = [
        "node.role==${var.node}"
      ]
    }
  }
}

Debug Output

none

Panic Output

none

Expected Behavior

Upon shutdown, the container first receives SIGTERM:

xyz$ docker service logs -f test
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-worker-node-0    | test
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-worker-node-0    | SIGTERM

Actual Behavior

When the container runs on the DOCKER_HOST, it immediately receives SIGKILL:

xyz$ docker service logs -f test
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-manager-node-0    | test

Steps to Reproduce

Create a docker swarm cluster with just 1 manager and 1 worker (this simply makes it easy to get the test case working). Then:

  1. terraform init
  2. terraform apply -var 'node=manager' -var 'docker_host=tcp://<place here the DOCKER_HOST ip>:2375'
  3. In a second console, enter: DOCKER_HOST=tcp://<place here the DOCKER_HOST ip>:2375 docker service logs -f test
  4. Back in the original console, enter:
    terraform destroy -var 'node=manager' -var 'docker_host=tcp://<place here the DOCKER_HOST ip>:2375'
  5. Watch the output on the second console

To run the working test case, do the same as before but replace -var 'node=manager' with -var 'node=worker'.

Important Factoids

Looking at the source code (I am not a Go expert!), I found the following:
https://github.com/terraform-providers/terraform-provider-docker/blob/master/docker/resource_docker_service_funcs.go
Line 270ff
func deleteService(serviceID string, d *schema.ResourceData, client *client.Client)

Line 297 actually removes the service (and, as far as I can tell, it is the only line needed; everything further down is there for historic reasons, since docker now does everything needed by itself):
if err := client.ServiceRemove(context.Background(), serviceID); err != nil {

Note: in the docker daemon debug log of the DOCKER_HOST you can see a line such as:
time="2020-11-18T18:05:10.871048408Z" level=debug msg="Calling DELETE /v1.40/services/so1zejuqgruzyrsz9c5bz1isn"
... which is the only REST API call made when running docker service rm so1zejuqgruzyrsz9c5bz1isn with the docker CLI.

Line 309 is supposed to wait until the container has been removed, but actually always returns immediately:
exitCode, _ := client.ContainerWait(ctx, containerID, container.WaitConditionRemoved)

2020-11-19T16:46:45.960+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Found container ['running'] for destroying: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
2020-11-19T16:46:45.960+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Deleting service: 'i1b30zzgweff3he63r35ky3i3'
2020-11-19T16:46:45.994+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Waiting for container: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736' to exit: max 30s
2020-11-19T16:46:46.027+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:46 [INFO] Container exited with code [0xc000094660]: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
2020-11-19T16:46:46.027+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:46 [INFO] Removing container: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'

Note: only 33 ms pass between waiting for the container and removing it.
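
A plausible explanation (I have not stepped through it, so treat this as a hypothesis): in the Docker Go SDK vendored here, ContainerWait with a wait condition returns two channels, (<-chan container.ContainerWaitOKBody, <-chan error), not an exit code. Assigning the first return value to exitCode therefore stores the channel itself without ever receiving from it, so no waiting happens; the [0xc000094660] in the log above looks like a printed channel value, not an exit code. A minimal sketch of how the channel-based API is meant to be consumed (waitRemoved is a hypothetical helper; cli is a *client.Client):

// Sketch only: consume both channels returned by ContainerWait,
// assuming the SDK signature
// ContainerWait(ctx, id, cond) (<-chan container.ContainerWaitOKBody, <-chan error).
func waitRemoved(cli *client.Client, containerID string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	statusCh, errCh := cli.ContainerWait(ctx, containerID, container.WaitConditionRemoved)
	select {
	case status := <-statusCh:
		// Only reached once the container is actually removed.
		log.Printf("container %s removed, exit code %d", containerID, status.StatusCode)
		return nil
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

(imports: context, log, time, github.com/docker/docker/api/types/container, github.com/docker/docker/client)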

Line 318 is supposed to remove the container (why? docker does this by itself!).
if err := client.ContainerRemove(context.Background(), containerID, removeOpts); err != nil {

There are 2 problems here:

a) This line has the same effect as docker container rm --force <containerID>, which, as a result of --force, immediately sends a SIGKILL to the container. Due to the first problem (no waiting actually happens), this occurs before docker has had a chance to tear down the service and send SIGTERM to all the containers. Hence the killed containers on the manager node.

time="2020-11-18T17:56:36.644210735Z" level=debug msg="Calling DELETE /v1.40/services/dll5o060pagmxr0sg9lasjit3"
time="2020-11-18T17:56:36.689439754Z" level=debug msg="Calling POST /v1.40/containers/b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a/wait?condition=remo
ved"
time="2020-11-18T17:56:36.769538131Z" level=debug msg="Calling DELETE /v1.40/containers/b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a?force=1&v=1"
time="2020-11-18T17:56:36.769598227Z" level=debug msg="Sending kill signal 9 to container b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a"

Note: have a look at the timestamps!
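
For reference, this is roughly what such a force remove looks like in the SDK (a sketch; removeOpts here is my guess at what the provider builds, based on the force=1&v=1 in the log above):

// Equivalent of `docker container rm --force <containerID>`;
// cli and containerID are assumed to be in scope.
removeOpts := types.ContainerRemoveOptions{
	RemoveVolumes: true, // v=1 in the log above
	Force:         true, // force=1 in the log above; SIGKILLs a still-running container
}
err := cli.ContainerRemove(context.Background(), containerID, removeOpts)

(types is github.com/docker/docker/api/types)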

b) The code does not seem to anticipate that container commands (as opposed to service commands) have to be sent to the exact node which runs the container. This is why the issue only occurs on the DOCKER_HOST: the workers (or any other node; this I haven't checked) simply never receive the ContainerRemove command, and the DOCKER_HOST reports that the container is unknown to it.
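
Putting a) and b) together, a fix could stay entirely at the service level. A sketch, not a tested patch (removeServiceAndWait is a hypothetical helper, and whether the task filter keeps resolving after the service object is gone would need verification):

// Sketch: remove the service, then poll its tasks until the swarm has
// shut them all down. The swarm itself honors stop_signal and
// stop_grace_period on every node, so no container-level calls (and no
// forced SIGKILL) are needed.
func removeServiceAndWait(ctx context.Context, cli *client.Client, serviceID string) error {
	if err := cli.ServiceRemove(ctx, serviceID); err != nil {
		return err
	}
	f := filters.NewArgs()
	f.Add("service", serviceID)
	for {
		tasks, err := cli.TaskList(ctx, types.TaskListOptions{Filters: f})
		if err != nil {
			// Once the service object itself is gone, this lookup may
			// fail; treating that as "nothing left to wait for" is an
			// assumption that would need to be checked.
			return nil
		}
		if len(tasks) == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}

(imports: context, time, github.com/docker/docker/api/types, github.com/docker/docker/api/types/filters, github.com/docker/docker/client)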

Disclaimer: these are just my findings from looking hard at the source code; I wanted to share them with you in the hope that they are useful and maybe save some time.

References

none
