ECS service parameter wait_for_steady_state can lead to inconsistent new deployments #16012

Closed
maximelenair opened this issue Nov 4, 2020 · 6 comments · Fixed by #25641
Labels
bug Addresses a defect in current functionality. service/ecs Issues and PRs that pertain to the ecs service.
Comments

@maximelenair

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

Terraform core version: 0.13.3
Terraform AWS provider version: 3.13.0

Affected Resource(s)

  • aws_ecs_service

Terraform Configuration Files

resource "aws_ecs_service" "my_service" {
  name          = "my-service"
  cluster       = "my-cluster"
  desired_count = 1

  launch_type             = "FARGATE"
  enable_ecs_managed_tags = true
  propagate_tags          = "SERVICE"

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 50
  wait_for_steady_state              = true

  network_configuration {
    subnets         = ["subnet-*******", "subnet-*******", "subnet-*******"]
    security_groups = ["sg-********"]
  }

  service_registries {
    registry_arn = "arn:aws:servicediscovery:*******:********:service/srv-*********"
  }

  task_definition = "my-service:10"
  tags            = local.common_tags
}

Expected Behavior

When wait_for_steady_state is set while creating a service, there are several possible service/task outcomes:

  1. Both the service and the task are in a healthy state, terraform apply is successful
  2. The service has an issue and terraform apply fails after a timeout (e.g., the specified Docker image does not exist)
  3. The service is running but the task has a persistent issue that prevents it from passing its initial health check, and terraform apply fails after a timeout (e.g., a container fails to start because of a missing environment variable)

Actual Behavior

In cases 1 and 2, the actual behavior matches the expected behavior.
In case 3, the actual behavior is inconsistent across runs of the same Terraform configuration.

Example on 5 different tests (deploy/destroy without any configuration change):

  • Test 1: Success - after 2 minutes 30 seconds
  • Test 2: Success - after 7 minutes 40 seconds
  • Test 3: Success - after 1 minute 20 seconds
  • Test 4: Failure - timeout after 10 minutes
  • Test 5: Success - after 5 minutes 20 seconds

Steps to Reproduce

  • Create a Docker container that will fail on startup
  • Create an ECS service using the wait_for_steady_state parameter
  • Create and destroy the resource multiple times

References

@ghost ghost added the service/ecs Issues and PRs that pertain to the ecs service. label Nov 4, 2020
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Nov 4, 2020
@bill-rich bill-rich added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Nov 4, 2020
@jacob-israel-turner

Assuming the provider uses ecs.wait under the hood, I ran into a similar issue (outside of Terraform) on a project. ecs.wait waits only 10 minutes before failing, while ECS deployments can occasionally take up to 15 minutes to reach a steady state. We solved this locally by calling ecs.wait twice in a row, in case the first call timed out. We haven't run into this issue since.
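
For Terraform users, the closest equivalent to waiting longer is a resource-level timeout. The following is a minimal sketch, assuming a provider version in which aws_ecs_service accepts a timeouts block; the 3.13.0 version reported above may not, so check the resource documentation for the version in use. The variables subnet_ids and security_group_ids are hypothetical placeholders, and the "20m" values are illustrative, chosen only to exceed the 15 minutes mentioned above.

```hcl
# Hypothetical inputs standing in for the redacted IDs in the original report.
variable "subnet_ids" {
  type = list(string)
}

variable "security_group_ids" {
  type = list(string)
}

resource "aws_ecs_service" "my_service" {
  name            = "my-service"
  cluster         = "my-cluster"
  task_definition = "my-service:10"
  desired_count   = 1
  launch_type     = "FARGATE"

  # Block terraform apply until the service reports a steady state.
  wait_for_steady_state = true

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids
  }

  # Assumption: the installed provider version supports these timeouts.
  # Values are illustrative, not recommendations.
  timeouts {
    create = "20m"
    update = "20m"
  }
}
```

A longer timeout only changes how long apply waits. It does not make a task that fails on startup (case 3 above) succeed.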

@straygar

This is super painful. We have to choose between:

  • Fire and forget with the Terraform update, then verify the deployment in the ECS service
  • Accept flaky deployments that take close to 10 minutes or more

Are there any workarounds we can use for now, until this is fixed?

@nickfaughey

If it fits your app's architecture, you can look into lowering deregistration_delay, which defaults to 5 minutes and may eat up half of this hardcoded 10 minute wait time.
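
As a minimal sketch of that suggestion, deregistration_delay is set on the aws_lb_target_group resource, so it only applies when the service sits behind a load balancer (the configuration in the original report uses service_registries and shows no load balancer). The resource name, port, and the vpc_id variable below are hypothetical, and 30 seconds is an illustrative value rather than a recommendation.

```hcl
# Hypothetical input; replace with the VPC that hosts the service.
variable "vpc_id" {
  type = string
}

resource "aws_lb_target_group" "my_service" {
  name     = "my-service"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  # Fargate tasks use awsvpc networking, so targets are registered by IP.
  target_type = "ip"

  # Seconds the load balancer waits before a draining target is deregistered.
  # The default is 300 (5 minutes).
  deregistration_delay = 30
}
```

Lowering the delay shortens connection draining, so it suits services whose requests finish quickly; long-lived connections may be cut off sooner.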

zackads added a commit to mcagov/beacons that referenced this issue Mar 14, 2022
We intended to replace wait_for_steady_state with a
custom shell script in commit 6a4fac2.

We found the shell script was buggy and difficult to maintain across
local (macOS) and CI/CD (ubuntu) environments.

Instead, we'll restore wait_for_steady_state. If flakiness continues
to be an issue, we'll investigate reducing the deregistration_delay on
the ALB to allow services to reach a steady state more quickly, as
suggested here hashicorp/terraform-provider-aws#16012 (comment)

Co-authored-by: Olly Swanson <olly.swanson95@gmail.com>
@anGie44
Contributor

anGie44 commented Apr 19, 2022

Hi @maximelenair, some changes to the handling of wait_for_steady_state were recently released in v4.10.0 (edit: and more recently in v4.13.0), so if you are able to upgrade to the latest version and give it a go, it's possible this particular issue has been addressed as well. If you or anyone following this issue is able to provide feedback after upgrading, it would be much appreciated!

Relates #24223

@github-actions

github-actions bot commented Jul 8, 2022

This functionality has been released in v4.22.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@github-actions

github-actions bot commented Aug 7, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 7, 2022