terraform destroy does not respect stop_grace_period for docker swarm services running on DOCKER_HOST #16

Closed · SebastianFaulborn opened this issue Dec 2, 2020 · 4 comments
Labels: bug (Something isn't working), r/service (Relates to the service resource), stale, waiting for response (Waiting for a response of the creator)

Comments

SebastianFaulborn commented Dec 2, 2020

Hi,

When a docker swarm service is created and its container happens to run on the DOCKER_HOST (the host parameter of the provider), the container started as part of the service immediately receives SIGKILL after terraform destroy, rather than stop_signal first (followed by SIGKILL after stop_grace_period if the container has not shut down by then). For containers that happen to be created on worker nodes the code works as expected (and likely also for containers created on manager nodes that are not the DOCKER_HOST).

Note: the docker CLI (i.e. docker service rm) does work as expected.

This is a show stopper for all database applications and for containers that embed databases (such as Nexus3), and it can lead to catastrophic data loss, as we have learned the hard way.

I already reported this here but did not receive any answers at all (maybe it was the wrong place to post issues):
hashicorp/terraform-provider-docker#313

Terraform Version

Terraform v0.13.5

docker = {
  source  = "terraform-providers/docker"
  version = "2.7.2"
}

Note: kreuzwerker/terraform-provider-docker v2.8.0 also has this problem.

Affected Resource(s)

docker_service

Terraform Configuration Files (test.tf)

variable "docker_host" {
  type    = string
}

variable "node" {
  type = string
}

terraform {
  required_providers {
    docker = {
      source = "terraform-providers/docker"
      version = "2.7.2"
    }
  }
}

provider "docker" {
  host = var.docker_host
}

resource "docker_service" "test" {
  name = "test"
  
  task_spec {
    container_spec {
      image = "ubuntu"
      
      stop_signal       = "SIGTERM"
      stop_grace_period = "30s"
      
      command = [ "/bin/bash" ]
      args    = [ "-c", "echo test; trap 'echo SIGTERM && exit 0' SIGTERM; while true; do sleep 1; done" ]
    }
    
    placement {
      constraints = [
        "node.role==${var.node}"
      ]
    }
  }
}

Debug Output

none

Panic Output

none

Expected Behavior

Upon shutdown, the container first receives SIGTERM:

xyz$ docker service logs -f test
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-worker-node-0    | test          (Note: output after start)
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-worker-node-0    | SIGTERM       (Note: output after receiving SIGTERM signal)

Actual Behavior

When the container runs on the DOCKER_HOST, it immediately receives SIGKILL:

xyz$ docker service logs -f test
test.1.z243xk5gqgk0@ci-jct-swarm-cluster-manager-node-0    | test  (Note: output after start. Output after receiving SIGTERM is missing)

Steps to Reproduce

Create a docker swarm cluster with just 1 manager and 1 worker (this just makes it easy to get the test case working). Then:

  1. terraform init
  2. terraform apply -var 'node=manager' -var 'docker_host=tcp://<place here the DOCKER_HOST ip>:2375'
  3. In a second console, enter: DOCKER_HOST=tcp://<place here the DOCKER_HOST ip>:2375 docker service logs -f test
  4. Back in the original console, enter:
     terraform destroy -var 'node=manager' -var 'docker_host=tcp://<place here the DOCKER_HOST ip>:2375'
  5. Watch the output in the second console.

To run the working test case:
Do the same as before, but replace -var 'node=manager' with -var 'node=worker'.

Important Factoids

Looking at the source code (I am not a Go expert!), I found the following:
https://github.com/terraform-providers/terraform-provider-docker/blob/master/docker/resource_docker_service_funcs.go
Line 270ff
func deleteService(serviceID string, d *schema.ResourceData, client *client.Client)

Line 297 actually removes the service (and I believe it is the only line needed; everything further down is there just for historic reasons, since Docker now does everything needed by itself):
if err := client.ServiceRemove(context.Background(), serviceID); err != nil {

Note: in the docker daemon debug log of the DOCKER_HOST you can see a line such as:
time="2020-11-18T18:05:10.871048408Z" level=debug msg="Calling DELETE /v1.40/services/so1zejuqgruzyrsz9c5bz1isn"
... which is the only REST API call made when running docker service rm so1zejuqgruzyrsz9c5bz1isn with the docker CLI.
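
To illustrate: if this observation is right, deleting a swarm service would only need something like the following minimal sketch (my own illustration against the standard github.com/docker/docker/client API, not the provider's actual code; the function name deleteServiceOnly is mine), with the swarm orchestrator handling stop_signal and stop_grace_period for each task by itself:

import (
	"context"

	"github.com/docker/docker/client"
)

// deleteServiceOnly removes the swarm service and lets the orchestrator stop
// the tasks itself (stop_signal first, SIGKILL after stop_grace_period).
func deleteServiceOnly(cli *client.Client, serviceID string) error {
	return cli.ServiceRemove(context.Background(), serviceID)
}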

Line 309 is supposed to wait until the container has been removed, but it actually always returns immediately:
exitCode, _ := client.ContainerWait(ctx, containerID, container.WaitConditionRemoved)

2020-11-19T16:46:45.960+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Found container ['running'] for destroying: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
2020-11-19T16:46:45.960+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Deleting service: 'i1b30zzgweff3he63r35ky3i3'
2020-11-19T16:46:45.994+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:45 [INFO] Waiting for container: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736' to exit: max 30s
2020-11-19T16:46:46.027+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:46 [INFO] Container exited with code [0xc000094660]: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'
2020-11-19T16:46:46.027+0100 [DEBUG] plugin.terraform-provider-docker_v2.7.2_x4: 2020/11/19 16:46:46 [INFO] Removing container: '7b00d2424f16fa19c4cd6e39dcd148f1337f9c383c0a3373091aa7c6b2f11736'

Note: only 33 ms pass between waiting for the container and removing it.
I don't know why it always returned immediately.
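
One plausible explanation (an assumption on my part, though the maintainer's reply further below seems to confirm it): newer versions of the Docker Go client changed ContainerWait to return two channels instead of a value. The two-value assignment above still compiles, but exitCode then holds the status channel itself, so nothing blocks, and printing it yields a channel address, which would explain the odd [0xc000094660] "exit code" in the log. A minimal sketch (the demo function name is mine):

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func demo(cli *client.Client, containerID string) {
	// Newer clients: ContainerWait returns (<-chan status, <-chan error).
	// exitCode therefore receives the channel, not a result, and the call
	// returns immediately instead of blocking until the container is gone.
	exitCode, _ := cli.ContainerWait(context.Background(), containerID, container.WaitConditionRemoved)

	// %v on a channel prints its address, e.g. [0xc000094660], matching the
	// "Container exited with code [0xc000094660]" debug line above.
	fmt.Printf("[INFO] Container exited with code [%v]\n", exitCode)
}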

Line 318 is supposed to remove the container (why? Docker already does this by itself!).
if err := client.ContainerRemove(context.Background(), containerID, removeOpts); err != nil {

There are two problems here:

a) This line has the same effect as docker container rm --force <containerID>, which, as a result of --force, immediately sends SIGKILL to the container. Due to the first problem (no waiting actually happens), this occurs before Docker has had a chance to remove the service and send SIGTERM to all the containers. Hence the containers on the manager node get killed (see the sketch after the daemon log below).

time="2020-11-18T17:56:36.644210735Z" level=debug msg="Calling DELETE /v1.40/services/dll5o060pagmxr0sg9lasjit3"
time="2020-11-18T17:56:36.689439754Z" level=debug msg="Calling POST /v1.40/containers/b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a/wait?condition=remo
ved"
time="2020-11-18T17:56:36.769538131Z" level=debug msg="Calling DELETE /v1.40/containers/b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a?force=1&v=1"
time="2020-11-18T17:56:36.769598227Z" level=debug msg="Sending kill signal 9 to container b00490988356851e84099901a2011264c22024f3bed977b1059fbe0591aa6b5a"

Note: have a look at the timestamps!
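
For reference, here is roughly what such a forced removal looks like with the Docker Go client (my own sketch, not the provider's exact code; the function name forceRemove is mine). Force: true is what produces force=1 in the DELETE request above and the immediate SIGKILL:

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// forceRemove mirrors `docker container rm --force <containerID>`: the daemon
// sends SIGKILL right away, skipping stop_signal and stop_grace_period.
func forceRemove(cli *client.Client, containerID string) error {
	removeOpts := types.ContainerRemoveOptions{
		RemoveVolumes: true, // shows up as v=1 in the daemon log above
		Force:         true, // shows up as force=1, i.e. immediate SIGKILL
	}
	return cli.ContainerRemove(context.Background(), containerID, removeOpts)
}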

b) The code does not seem to anticipate that container commands (as opposed to service commands) have to be sent to the exact node that runs the container. This is why the problem only happens on the DOCKER_HOST: the workers (or any other node; this I haven't checked) simply never receive the ContainerRemove command, and the DOCKER_HOST reports that the container is unknown to it.

Disclaimer: these are just my findings from looking hard at the source code; I wanted to share them in the hope that they are useful and maybe save some time.

References

none

mavogel added the bug and r/service labels on Dec 25, 2020
github-actions bot commented:

This issue is stale because it has been open 60 days with no activity.
Remove stale label or comment or this will be closed in 7 days.
If you don't want this issue to be closed, please set the label pinned.


github-actions bot commented Jun 6, 2021

This issue is stale because it has been open 60 days with no activity.
Remove stale label or comment or this will be closed in 7 days.
If you don't want this issue to be closed, please set the label pinned.

github-actions bot added the stale label on Jun 6, 2021
mavogel removed the stale label on Jun 14, 2021

mavogel commented Jun 23, 2021

Hi @SebastianFaulborn, you might want to check this again with the latest v2.13.0 release, where we fixed this in #227.

TL;DR: the function returned a channel which we did not wait on; that's why it returned immediately. I think the signature changed in one of the updates.
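
For anyone landing here later, the corrected pattern looks roughly like this (a sketch of the channel-based client API, not necessarily the exact code from #227; the function name waitForRemoval is illustrative):

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// waitForRemoval blocks until the container is actually removed, an error
// occurs, or the timeout (e.g. stop_grace_period) expires.
func waitForRemoval(cli *client.Client, containerID string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// ContainerWait returns two channels; selecting on both is what makes
	// the call block, instead of assigning a channel to a variable.
	waitCh, errCh := cli.ContainerWait(ctx, containerID, container.WaitConditionRemoved)
	select {
	case err := <-errCh:
		return fmt.Errorf("waiting for container %s: %w", containerID, err)
	case resp := <-waitCh:
		log.Printf("[INFO] Container %s exited with status code %d", containerID, resp.StatusCode)
		return nil
	}
}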

mavogel added the waiting for response label on Jun 23, 2021
github-actions bot commented:

This issue is stale because it has been open 60 days with no activity.
Remove stale label or comment or this will be closed in 7 days.
If you don't want this issue to be closed, please set the label pinned.
