Implement isResourceTimeoutError() Conditionals After All resource.Retry() Usage #7873

bflad · 2019-03-08T20:40:26Z

Community Note

* Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
* Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request.
* If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Description

When using resource.Retry() so that Terraform AWS Provider resources can retry AWS API requests, either for eventual consistency errors or for known concurrency errors, the code immediately following the call should check isResourceTimeoutError() and, when it matches, retry the request once more outside the timeout.

If the AWS Go SDK encounters an error such as a 5XX AWS API response, a network-related error, or certain throttling errors, it automatically retries the AWS API request itself without returning to the calling code. If these automatic AWS Go SDK retries last longer than the resource.Retry() timeout and we do not retry the request outside the resource.Retry() block before returning the error, operators will receive an "empty" timeout error without any context about the underlying AWS Go SDK error:

timeout while waiting for state to become 'success' (timeout: 1m0s)

Debugging these with Terraform is only possible by enabling debug logging, e.g. TF_LOG=debug terraform plan/apply.

Example incorrect implementation in pseudocode:

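err := resource.Retry(1*time.Minute, func() *resource.RetryError {
  _, err := conn.SomeRequest(input)
  // ... other logic including resource.RetryableError() ...
  if err != nil {
    return resource.NonRetryableError(err)
  }
  return nil
})

// A timeout caused by AWS Go SDK retries is returned here without context
if err != nil {
  return err
}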

Fixed implementation:

err := resource.Retry(1*time.Minute, func() *resource.RetryError {
  _, err := conn.SomeRequest(input)
  // ... other logic including resource.RetryableError() ...
  if err != nil {
    return resource.NonRetryableError(err)
  }
  return nil
})

if isResourceTimeoutError(err) {
  // Using = here intentionally to overwrite the err
  _, err = conn.SomeRequest(input)
}

if err != nil {
  return err
}
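For anyone who wants to see the behavior without an AWS account, here is a minimal, standard-library-only sketch of the situation described above. Everything in it is a hypothetical stand-in, not the provider's or the Terraform SDK's actual code: `retry` models resource.Retry() giving up while a request is still in flight, `timeoutError` models the timeout error, `isTimeoutWithoutCause` is an assumed shape for the isResourceTimeoutError() check (a timeout that carries no underlying error), and `someRequest` simulates an AWS Go SDK call whose internal retries outlast the timeout.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// timeoutError is a hypothetical stand-in for the timeout error returned by
// resource.Retry(). lastError stays nil when the timeout fires while a
// request is still in flight.
type timeoutError struct {
	lastError error
	timeout   time.Duration
}

func (e *timeoutError) Error() string {
	return fmt.Sprintf("timeout while waiting for state to become 'success' (timeout: %s)", e.timeout)
}

// retryError plays the role of *resource.RetryError in this sketch.
type retryError struct {
	err       error
	retryable bool
}

// retry is a simplified model of resource.Retry(): it calls f until f stops
// asking for a retry, and gives up with a timeoutError once timeout elapses,
// even if a call to f has not returned yet.
func retry(timeout time.Duration, f func() *retryError) error {
	done := make(chan error, 1) // buffered so the goroutine never blocks
	deadline := time.Now().Add(timeout)
	go func() {
		for {
			rerr := f()
			switch {
			case rerr == nil:
				done <- nil
				return
			case !rerr.retryable:
				done <- rerr.err
				return
			case time.Now().After(deadline):
				done <- &timeoutError{lastError: rerr.err, timeout: timeout}
				return
			}
			time.Sleep(time.Millisecond)
		}
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return &timeoutError{timeout: timeout}
	}
}

// isTimeoutWithoutCause is an assumed shape for isResourceTimeoutError():
// true only for a timeout that carries no underlying error.
func isTimeoutWithoutCause(err error) bool {
	var te *timeoutError
	return errors.As(err, &te) && te.lastError == nil
}

// someRequest simulates an SDK call whose internal retries take 50ms before
// the real error is surfaced to the caller.
func someRequest() error {
	time.Sleep(50 * time.Millisecond)
	return errors.New("ThrottlingException: Rate exceeded")
}

func attempt() *retryError {
	if err := someRequest(); err != nil {
		return &retryError{err: err}
	}
	return nil
}

// createWithoutFinalRetry follows the incorrect pattern.
func createWithoutFinalRetry() error {
	return retry(10*time.Millisecond, attempt)
}

// createWithFinalRetry follows the fixed pattern.
func createWithFinalRetry() error {
	err := retry(10*time.Millisecond, attempt)
	if isTimeoutWithoutCause(err) {
		// Using = here intentionally to overwrite the err
		err = someRequest()
	}
	return err
}

func main() {
	fmt.Println("without final retry:", createWithoutFinalRetry())
	fmt.Println("with final retry:", createWithFinalRetry())
}
```

In this model the first call reports only the context-free timeout message, while the second surfaces the simulated throttling error, which is the difference operators would see in practice.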

References

There are numerous bugs and pull requests relating to this behavior, but most recently: #7871


… after creation due to eventual consistency

References:

* #7891
* #6560
* #7873
* hashicorp/terraform#17220

The KMS service has eventual consistency considerations and the `aws_kms_alias` resource immediately tries to read the KMS alias after creation, which may not find the KMS alias. When not able to find the KMS alias, the resource logic returns an empty API object instead of an error. Since a `nil` check was already performed on the error, the error will always be `nil`. Invoking `return resource.RetryableError(nil)` is equivalent to `return nil`. The resource during its Read performs an error check first, which is skipped because the error is `nil`, then assumes the resource has been deleted outside Terraform and triggers recreation.

Here, when we cannot find a KMS alias after allowing some time for eventual consistency, we return a resource not found error and ensure we handle any timeouts due to automatic AWS Go SDK retries.

Output from acceptance testing:

```
--- PASS: TestAccAWSKmsAlias_no_name (37.63s)
--- PASS: TestAccAWSKmsAlias_name_prefix (37.80s)
--- PASS: TestAccAWSKmsAlias_multiple (38.38s)
--- PASS: TestAccAWSKmsAlias_importBasic (40.13s)
--- PASS: TestAccAWSKmsAlias_ArnDiffSuppress (43.61s)
--- PASS: TestAccAWSKmsAlias_basic (46.76s)
```
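The `RetryableError(nil)` pitfall mentioned in that commit message can be illustrated with a small standard-library-only sketch. The types and helpers below (`retryError`, `retryableError`, `retryN`, `newLookup`) are hypothetical stand-ins written for this example, modeled only on the behavior the commit message describes (a `nil` error yields a `nil` retry result); they are not the provider's actual code, and the toy loop is bounded by attempts rather than a timeout.

```go
package main

import (
	"errors"
	"fmt"
)

// retryError plays the role of *resource.RetryError in this sketch.
type retryError struct {
	err       error
	retryable bool
}

// retryableError models the described behavior: wrapping a nil error
// produces nil, which the retry loop treats as success.
func retryableError(err error) *retryError {
	if err == nil {
		return nil
	}
	return &retryError{err: err, retryable: true}
}

// retryN is a toy retry loop bounded by attempts instead of a timeout.
func retryN(attempts int, f func() *retryError) error {
	var last error
	for i := 0; i < attempts; i++ {
		rerr := f()
		if rerr == nil {
			return nil
		}
		if !rerr.retryable {
			return rerr.err
		}
		last = rerr.err
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, last)
}

// newLookup simulates an eventually consistent read: the first two calls
// return an empty result with no error, the third returns the alias.
func newLookup() func() (string, error) {
	calls := 0
	return func() (string, error) {
		calls++
		if calls < 3 {
			return "", nil
		}
		return "alias/example", nil
	}
}

// readBuggy reuses the already-checked nil error for the retry.
func readBuggy() (string, error) {
	lookup := newLookup()
	var alias string
	err := retryN(5, func() *retryError {
		var err error
		alias, err = lookup()
		if err != nil {
			return &retryError{err: err}
		}
		if alias == "" {
			// err is nil here, so this is the same as `return nil`
			// and the loop stops with an empty result.
			return retryableError(err)
		}
		return nil
	})
	return alias, err
}

// readFixed returns an explicit not-found error so the loop keeps retrying.
func readFixed() (string, error) {
	lookup := newLookup()
	var alias string
	err := retryN(5, func() *retryError {
		var err error
		alias, err = lookup()
		if err != nil {
			return &retryError{err: err}
		}
		if alias == "" {
			return retryableError(errors.New("alias not found"))
		}
		return nil
	})
	return alias, err
}

func main() {
	fmt.Println(readBuggy())
	fmt.Println(readFixed())
}
```

In this model the buggy read stops after the first empty response and reports success with no alias, which is the condition that would lead a Read function to conclude the resource was deleted. The fixed read keeps retrying until the alias appears.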


ryndaniels · 2019-09-17T08:21:42Z

HURRAY WE DID IT

ghost · 2019-11-01T15:11:14Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

bflad added the technical-debt and provider labels Mar 8, 2019

bflad mentioned this issue Mar 13, 2019

resource/aws_kms_alias: Prevent state removal of resource immediately after creation due to eventual consistency #7907

Merged

bflad mentioned this issue Mar 20, 2019

aws_lb_listener read taking longer than 1 minute timeout #8025

Closed

bflad mentioned this issue Apr 24, 2019

fix: Increases the timeout for TransitGateway routes #8417

Closed

ryndaniels self-assigned this May 13, 2019

bflad mentioned this issue May 14, 2019

Retry LB listener methods after timeout #8630

Merged

bflad mentioned this issue May 22, 2019

Add AWS MSK cluster resource #8635

Merged

This was referenced Jun 4, 2019

Retry timeout error for acmpca cert authority #8856

Merged

Fixes for more resource.Retry calls #8893

Merged

Retry for deleting default DHCP options, plus pagination plus a test sweeper #8907

Merged

This was referenced Jun 20, 2019

Cleaning up some resource retries for cloudwatch functions #9065

Merged

Cleanup around resource.Retry methods for various api gateway resources #9068

Merged

Retries after timeouts on spot resources #9078

Merged

Timeout retries for ECR resources #9079

Merged

ryndaniels mentioned this issue Jul 4, 2019

final retry when waiting for transfer user deletion #9241

Merged

This was referenced Aug 20, 2019

Final ACL retries #9830

Merged

Final retries for s3 timeouts #9861

Merged

Final retries for ACM cert #9863

Merged

Final retries for instances #9879

Merged

ryndaniels mentioned this issue Aug 27, 2019

Final retries for elasticsearch domain resources #9892

Merged

ryndaniels closed this as completed Sep 17, 2019

nywilken mentioned this issue Sep 20, 2019

Revert "resource/aws_acm_certificate: Retry logic refactor" #10184

Merged

ghost locked and limited conversation to collaborators Nov 1, 2019
