
Inconsistent Timeout on CloudFront distribution creation (150+ distros) #6197

Closed
Djiit opened this issue Oct 18, 2018 · 13 comments · Fixed by #7809

Djiit commented Oct 18, 2018

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

Terraform v0.11.7

  • provider.aws v1.40.0

Affected Resource(s)

  • aws_cloudfront_distribution

Terraform Configuration Files

100 CF distributions with 50 rules each.

Debug Output

* aws_cloudfront_distribution.xxx.1: error updating CloudFront Distribution (xxx): timeout while waiting for state to become 'success' (timeout: 1m0s)
(...)

(Repeated 20 or so times; mileage varies.)

Expected Behavior

Terraform should have applied our changes to AWS (here, creating roughly 100 CloudFront (CF) distributions). We also didn't expect a one-minute timeout on this resource; according to https://github.com/terraform-providers/terraform-provider-aws/blob/master/aws/resource_aws_cloudfront_distribution.go, the timeout for CF distributions is 70 minutes.

Actual Behavior

Terraform errored and left multiple CF distros behind on AWS, forcing us to clean up the mess by hand.

Steps to Reproduce

  1. Create a .tf file with approx. 100 CF distributions
  2. terraform apply
@bflad bflad added the service/cloudfront Issues and PRs that pertain to the cloudfront service. label Oct 18, 2018

bflad commented Oct 18, 2018

Hi @Djiit 👋 Very sorry for the trouble! Would it be possible to share what change(s) you were attempting? The terraform plan output would be very helpful in troubleshooting. If you also have the debug logging from Terraform, that might help determine whether the timeout was occurring because of the AWS Go SDK automatically retrying. It's likely that when so many changes occur at once, the CloudFront API throttles the requests in some fashion, so we may need to tweak the logic around this.

That error message (with its 1-minute timeout for retries) occurs here in the code, which is called during updates or during deletion (to disable the distribution first).
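
In rough outline, that code path looks like the sketch below (illustrative only, not the provider's exact source; `conn`, `input`, and the error classification are placeholders): the AWS Go SDK's internal retries all run inside the one-minute timebox.

```go
package example

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/cloudfront"
	"github.com/hashicorp/terraform/helper/resource"
)

// updateDistribution sketches the timeboxed retry pattern: the inner
// function has one minute total, and any automatic AWS Go SDK retries
// (throttling, 5XX) count against that same budget.
func updateDistribution(conn *cloudfront.CloudFront, input *cloudfront.UpdateDistributionInput) error {
	err := resource.Retry(1*time.Minute, func() *resource.RetryError {
		_, err := conn.UpdateDistribution(input)
		if err != nil {
			// Retry only errors the resource considers recoverable.
			if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == cloudfront.ErrCodePreconditionFailed {
				return resource.RetryableError(err)
			}
			return resource.NonRetryableError(err)
		}
		return nil
	})
	if err != nil {
		// If SDK-internal throttling backoff consumed the whole minute, err is
		// the generic "timeout while waiting for state to become 'success'".
		return fmt.Errorf("error updating CloudFront Distribution: %s", err)
	}
	return nil
}
```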

Djiit commented Oct 18, 2018

I'm currently destroying these CF distros (might have to wait a couple of hours), but I'll send you that as soon as I can. The plan output is huge (100×50 changes to apply).

Thanks for your quick answer, I appreciate this.

bflad commented Oct 18, 2018

If it's giving you lots of trouble, you can also try reducing your concurrency with the -parallelism flag on terraform plan/apply, e.g. terraform apply -parallelism=2 -- it defaults to 10. This will obviously slow down the process, but it might work around the apply failing outright or leaving behind resources.

Djiit commented Oct 18, 2018

Hmmm the plan output encoding is weird. I'll try again tomorrow.

Djiit commented Oct 19, 2018

So, applying again on a fresh, empty environment with parallelism set to 2, I now get this error:

* aws_cloudfront_distribution.front-app.157: error creating CloudFront Distribution: CNAMEAlreadyExists: One or more of the CNAMEs you provided are already associated with a different resource.

on 20 or so resources.

But AFAIK, there isn't any resource associated with this specific CNAME. EDIT: more on this: all the affected distributions are there, activated, and seemingly healthy! But the linked Route 53 records were not created (as they need to be created after the distributions).

Do you have an email address where I can send you my plan and debug output (as it might contain some sensitive information)?

bflad commented Oct 19, 2018

Feel free to drop a Gist which can be encrypted with the HashiCorp GPG Key.

Djiit commented Oct 19, 2018

Thanks, I'll do that. Some new information here:

  • When I try to re-apply, it tells me it needs to create 21 of the 161 resources (CF distributions), as if it doesn't know they were created; these are the same resources that errored with "CNAMEAlreadyExists". The apply then fails with the exact same 21 errors. I can run this multiple times with the same effect.

  • When I manually deactivate and then destroy the 21 distributions, there is no more error.

Djiit commented Oct 19, 2018

Here is the encrypted output: https://gist.github.com/Djiit/cd40c6ad858b3ffa797ae466a3adf734. I hope I'm doing this right, ahah.

Djiit commented Nov 12, 2018

Hi there, any update on this?

FWIW, when we force the parallelism to 1, it's OK. But hell, it takes a long time.

@bflad bflad added the bug Addresses a defect in current functionality. label Mar 4, 2019
bflad added a commit that referenced this issue Mar 4, 2019
… timeout retry for AWS Go SDK retries

Reference:
* #6197

When using `resource.Retry()` for handling eventual consistency, it timeboxes the inner function to the configured timeout, which we generally set to a minute or two. When the AWS Go SDK encounters recoverable conditions such as 5XX errors or throttling errors, it automatically retries within itself up to the configured session `MaxRetries` (Terraform AWS Provider `max_retries` configuration) before returning to the calling code. For heavily utilized AWS accounts, the throttling errors will trip the outer timeout, which does not give the resource the opportunity to keep retrying outside the timebox.

Here we implement this final retry by checking for the timeout error from `resource.Retry()` outside the timebox, so the AWS Go SDK can return the proper error messaging in these situations or (hopefully) finally succeed in the case of throttling. Since triggering this error-handling condition would require an extraneous amount of resources, we do not generally implement covering acceptance testing for this code, but it may be a good candidate for special handling within a future planned Terraform Provider linting tool.
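
In sketch form, the change amounts to the following pattern (a minimal illustration under the assumptions above, not the merged diff; the error classification inside the retry is elided):

```go
package example

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/service/cloudfront"
	"github.com/hashicorp/terraform/helper/resource"
)

// updateWithFinalRetry adds one last attempt outside the timebox, so the
// AWS Go SDK can retry internally up to MaxRetries and either succeed or
// surface its underlying error instead of the generic timeout message.
func updateWithFinalRetry(conn *cloudfront.CloudFront, input *cloudfront.UpdateDistributionInput) error {
	err := resource.Retry(1*time.Minute, func() *resource.RetryError {
		_, err := conn.UpdateDistribution(input)
		if err != nil {
			// (Real code classifies errors as retryable or non-retryable here.)
			return resource.RetryableError(err)
		}
		return nil
	})
	// The final retry: if the timebox expired, call the API once more outside it.
	if _, ok := err.(*resource.TimeoutError); ok {
		_, err = conn.UpdateDistribution(input)
	}
	if err != nil {
		return fmt.Errorf("error updating CloudFront Distribution: %s", err)
	}
	return nil
}
```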

Output from acceptance testing:

```
--- PASS: TestAccAWSCloudFrontDistribution_Origin_EmptyOriginID (2.08s)
--- PASS: TestAccAWSCloudFrontDistribution_Origin_EmptyDomainName (2.08s)
--- PASS: TestAccAWSCloudFrontDistribution_ViewerCertificate_AcmCertificateArn (1821.71s)
--- PASS: TestAccAWSCloudFrontDistribution_ViewerCertificate_AcmCertificateArn_ConflictsWithCloudFrontDefaultCertificate (1821.72s)
--- PASS: TestAccAWSCloudFrontDistribution_noCustomErrorResponseConfig (2086.99s)
--- PASS: TestAccAWSCloudFrontDistribution_orderedCacheBehavior (2090.63s)
--- PASS: TestAccAWSCloudFrontDistribution_HTTP11Config (2092.43s)
--- PASS: TestAccAWSCloudFrontDistribution_noOptionalItemsConfig (2092.72s)
--- PASS: TestAccAWSCloudFrontDistribution_IsIPV6EnabledConfig (2097.43s)
--- PASS: TestAccAWSCloudFrontDistribution_S3Origin (2277.83s)
--- PASS: TestAccAWSCloudFrontDistribution_multiOrigin (2280.49s)
--- PASS: TestAccAWSCloudFrontDistribution_customOrigin (2282.05s)
--- PASS: TestAccAWSCloudFrontDistribution_S3OriginWithTags (3345.90s)
```

bflad commented Mar 4, 2019

Pull request submitted: #7809

@bflad bflad added this to the v2.1.0 milestone Mar 5, 2019

bflad commented Mar 5, 2019

The fix for this has been merged and will be released with version 2.1.0 of the Terraform AWS Provider, likely in the next day or two.

bflad commented Mar 8, 2019

This has been released in version 2.1.0 of the AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

ghost commented Mar 31, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Mar 31, 2020