Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Automatically retry when encounter connection reset by peer error from aws api #20300

Closed
wants to merge 1 commit into from
Closed

WIP: Automatically retry when encounter connection reset by peer error from aws api #20300

wants to merge 1 commit into from

Conversation

jiashuChen
Copy link
Contributor

@jiashuChen jiashuChen commented Jul 24, 2021

Community Note

  • Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for pull request followers and do not help prioritize the request

Relates #10715

Draft PR, I'm not sure if this is the most correct way to solve this issue, raising as a draft, feedback will be appreciated. The goal of this pull request is to resolve to have a global wrapper function to deal with connection reset by peer error from aws api, which seems to impact multiple aws region and services.

│ Error: error deleting IAM Role (IAM-ROLE-NAME): RequestError: send request failed
│ caused by: Post "https://iam.amazonaws.com/": read tcp IP:PORT->DIFFERENT_IP:PORT: read: connection reset by peer

Furthermore, this helps resolve the issue with resource.Retry being marked deprecated.

https://github.com/hashicorp/terraform-plugin-sdk/blob/main/helper/resource/wait.go#L66-L69

Lastly, this pull request now seems to have more than 100 labels automatically applied to it, which has caused verification to fail. I'm not sure how to resolve it. Some help will be appreciated.

NOTE: although there are many files changes, the core of the change is here in this file:

func RetryOnConnectionResetByPeer(timeout time.Duration, f resource.RetryFunc) error {
return resource.RetryContext(context.Background(), timeout, func() *resource.RetryError {
err := f()
if err != nil && !err.Retryable && tfawserr.ErrMessageContains(err.Err, request.ErrCodeRequestError, "read: connection reset by peer") {
return resource.RetryableError(err.Err)
}
return err
})
}

@jiashuChen jiashuChen marked this pull request as draft July 24, 2021 15:31
@github-actions github-actions bot added size/XL Managed by automation to categorize the size of a PR. documentation Introduces or discusses updates to documentation. provider Pertains to the provider itself, rather than any interaction with AWS. service/accessanalyzer Issues and PRs that pertain to the accessanalyzer service. service/acm Issues and PRs that pertain to the acm service. service/acmpca Issues and PRs that pertain to the acmpca service. service/apigateway Issues and PRs that pertain to the apigateway service. service/applicationautoscaling service/appmesh Issues and PRs that pertain to the appmesh service. service/autoscaling Issues and PRs that pertain to the autoscaling service. service/backup Issues and PRs that pertain to the backup service. service/cloud9 Issues and PRs that pertain to the cloud9 service. service/cloudformation Issues and PRs that pertain to the cloudformation service. service/cloudfront Issues and PRs that pertain to the cloudfront service. service/cloudtrail Issues and PRs that pertain to the cloudtrail service. service/codebuild Issues and PRs that pertain to the codebuild service. service/codestarnotifications Issues and PRs that pertain to the codestarnotifications service. service/configservice Issues and PRs that pertain to the configservice service. service/datapipeline Issues and PRs that pertain to the datapipeline service. service/datasync Issues and PRs that pertain to the datasync service. service/dax Issues and PRs that pertain to the dax service. service/directconnect Issues and PRs that pertain to the directconnect service. service/docdb Issues and PRs that pertain to the docdb service. labels Jul 24, 2021
@github-actions github-actions bot added service/s3control Issues and PRs that pertain to the s3control service. service/sagemaker Issues and PRs that pertain to the sagemaker service. service/secretsmanager Issues and PRs that pertain to the secretsmanager service. service/securityhub Issues and PRs that pertain to the securityhub service. service/servicecatalog Issues and PRs that pertain to the servicecatalog service. service/ses Issues and PRs that pertain to the ses service. service/sfn Issues and PRs that pertain to the sfn service. service/signer Issues and PRs that pertain to the signer service. service/sns Issues and PRs that pertain to the sns service. service/sqs Issues and PRs that pertain to the sqs service. service/ssm Issues and PRs that pertain to the ssm service. service/ssoadmin Issues and PRs that pertain to the ssoadmin service. service/storagegateway Issues and PRs that pertain to the storagegateway service. service/synthetics Issues and PRs that pertain to the synthetics service. service/transfer Issues and PRs that pertain to the transfer service. service/waf Issues and PRs that pertain to the waf service. service/wafv2 Issues and PRs that pertain to the wafv2 service. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure. labels Jul 24, 2021
@jiashuChen jiashuChen changed the title Automatically retry when encounter connection reset by peer error from aws api WIP: Automatically retry when encounter connection reset by peer error from aws api Jul 24, 2021
@github-actions
Copy link

Thank you for your contribution! 🚀

Please note that typically Go dependency changes are handled in this repository by dependabot or the maintainers. This is to prevent pull request merge conflicts and further delay reviews of contributions. Remove any changes to the go.mod or go.sum files and commit them into this pull request.

Additional details:

  • Check open pull requests with the dependencies label to view other dependency updates.
  • If this pull request includes an update the AWS Go SDK (or any other dependency) version, only updates submitted via dependabot will be merged. This pull request will need to remove these changes and will need to be rebased after the existing dependency update via dependabot has been merged for this pull request to be reviewed.
  • If this pull request is for supporting a new AWS service:
    • Ensure the new AWS service changes are following the Contributing Guide section on new services, in particular that the dependency addition and initial provider support are in a separate pull request from other changes (e.g. new resources). Contributions not following this item will not be reviewed until the changes are split.
    • If this pull request is already a separate pull request from the above item, you can ignore this message.

@@ -114,3 +115,15 @@ func RetryConfigContext(ctx context.Context, delay time.Duration, delayRand time
// more likely to be useful
return resultErr
}

func RetryOnConnectionResetByPeer(timeout time.Duration, f resource.RetryFunc) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main Change

@@ -97,3 +98,55 @@ func TestRetryConfigContext_error(t *testing.T) {
t.Fatal("timeout")
}
}

func TestRetryOnConnectionResetByPeer(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit Test

@AdamTylerLynch
Copy link
Collaborator

Question for the author: the AWS GoSDK has retry configured by default for all client operations. The retry mechanism will retry 3 times by default and use exponential back off with jitter.

I’m curious as to how this PR will be any different? I would suggest modifying the default policy for the SDK rather than implement additional retry logic, specifically because that logic does not include exponential backoff with jitter.

https://aws.github.io/aws-sdk-go-v2/docs/configuring-sdk/retries-timeouts/

Additionally, TCP reset could be happening for a number of reasons. I use Terraform all the time, running the Acceptance Tests, 20 threads at a time, and I’ve never gotten a TCP reset error.

Have you checked your network route to AWS? A proxy, machine-in-the-middle, or AWS API throttling are all potential sources of TCP reset. I would recommend checking your CloudTrail logs for your account to see if you are being throttled.

@jiashuChen
Copy link
Contributor Author

Question for the author: the AWS GoSDK has retry configured by default for all client operations. The retry mechanism will retry 3 times by default and use exponential back off with jitter.

I’m curious as to how this PR will be any different? I would suggest modifying the default policy for the SDK rather than implement additional retry logic, specifically because that logic does not include exponential backoff with jitter.

https://aws.github.io/aws-sdk-go-v2/docs/configuring-sdk/retries-timeouts/

Additionally, TCP reset could be happening for a number of reasons. I use Terraform all the time, running the Acceptance Tests, 20 threads at a time, and I’ve never gotten a TCP reset error.

Have you checked your network route to AWS? A proxy, machine-in-the-middle, or AWS API throttling are all potential sources of TCP reset. I would recommend checking your CloudTrail logs for your account to see if you are being throttled.

Thanks. I'll close the current pull requests.

@jiashuChen jiashuChen closed this Aug 18, 2021
@jiashuChen jiashuChen deleted the b-connection-reset-by-peer branch August 18, 2021 07:49
@ewbankkit ewbankkit removed the needs-triage Waiting for first response or review from a maintainer. label Aug 18, 2021
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 18, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
documentation Introduces or discusses updates to documentation. provider Pertains to the provider itself, rather than any interaction with AWS. service/accessanalyzer Issues and PRs that pertain to the accessanalyzer service. service/acm Issues and PRs that pertain to the acm service. service/acmpca Issues and PRs that pertain to the acmpca service. service/apigateway Issues and PRs that pertain to the apigateway service. service/appmesh Issues and PRs that pertain to the appmesh service. service/autoscaling Issues and PRs that pertain to the autoscaling service. service/backup Issues and PRs that pertain to the backup service. service/cloud9 Issues and PRs that pertain to the cloud9 service. service/cloudformation Issues and PRs that pertain to the cloudformation service. service/cloudfront Issues and PRs that pertain to the cloudfront service. service/cloudtrail Issues and PRs that pertain to the cloudtrail service. service/codebuild Issues and PRs that pertain to the codebuild service. service/codestarnotifications Issues and PRs that pertain to the codestarnotifications service. service/configservice Issues and PRs that pertain to the configservice service. service/datapipeline Issues and PRs that pertain to the datapipeline service. service/datasync Issues and PRs that pertain to the datasync service. service/dax Issues and PRs that pertain to the dax service. service/directconnect Issues and PRs that pertain to the directconnect service. service/docdb Issues and PRs that pertain to the docdb service. service/dynamodb Issues and PRs that pertain to the dynamodb service. service/ec2 Issues and PRs that pertain to the ec2 service. service/ecr Issues and PRs that pertain to the ecr service. service/ecrpublic Issues and PRs that pertain to the ecrpublic service. service/ecs Issues and PRs that pertain to the ecs service. service/eks Issues and PRs that pertain to the eks service. service/elasticache Issues and PRs that pertain to the elasticache service. service/elasticbeanstalk Issues and PRs that pertain to the elasticbeanstalk service. service/elasticsearch Issues and PRs that pertain to the elasticsearch service. service/elb Issues and PRs that pertain to the elb service. service/elbv2 Issues and PRs that pertain to the elbv2 service. service/emr Issues and PRs that pertain to the emr service. service/firehose Issues and PRs that pertain to the firehose service. service/gamelift Issues and PRs that pertain to the gamelift service. service/glue Issues and PRs that pertain to the glue service. service/guardduty Issues and PRs that pertain to the guardduty service. service/iam Issues and PRs that pertain to the iam service. service/imagebuilder Issues and PRs that pertain to the imagebuilder service. service/inspector Issues and PRs that pertain to the inspector service. service/iot Issues and PRs that pertain to the iot service. service/kafka Issues and PRs that pertain to the kafka service. service/kinesisanalytics Issues and PRs that pertain to the kinesisanalytics service. service/kinesisanalyticsv2 Issues and PRs that pertain to the kinesisanalyticsv2 service. service/kms Issues and PRs that pertain to the kms service. service/lakeformation Issues and PRs that pertain to the lakeformation service. service/lambda Issues and PRs that pertain to the lambda service. service/lexmodels Issues and PRs that pertain to the lexmodels service. service/location Issues and PRs that pertain to the location service. service/mediapackage Issues and PRs that pertain to the mediapackage service. service/mediastore Issues and PRs that pertain to the mediastore service. service/neptune Issues and PRs that pertain to the neptune service. service/opsworks Issues and PRs that pertain to the opsworks service. service/organizations Issues and PRs that pertain to the organizations service. service/pinpoint Issues and PRs that pertain to the pinpoint service. service/qldb Issues and PRs that pertain to the qldb service. service/ram Issues and PRs that pertain to the ram service. service/rds Issues and PRs that pertain to the rds service. service/redshift Issues and PRs that pertain to the redshift service. service/route53 Issues and PRs that pertain to the route53 service. service/s3control Issues and PRs that pertain to the s3control service. service/s3 Issues and PRs that pertain to the s3 service. service/sagemaker Issues and PRs that pertain to the sagemaker service. service/secretsmanager Issues and PRs that pertain to the secretsmanager service. service/securityhub Issues and PRs that pertain to the securityhub service. service/servicecatalog Issues and PRs that pertain to the servicecatalog service. service/ses Issues and PRs that pertain to the ses service. service/sfn Issues and PRs that pertain to the sfn service. service/signer Issues and PRs that pertain to the signer service. service/sns Issues and PRs that pertain to the sns service. service/sqs Issues and PRs that pertain to the sqs service. service/ssm Issues and PRs that pertain to the ssm service. service/ssoadmin Issues and PRs that pertain to the ssoadmin service. service/storagegateway Issues and PRs that pertain to the storagegateway service. service/synthetics Issues and PRs that pertain to the synthetics service. service/transfer Issues and PRs that pertain to the transfer service. service/waf Issues and PRs that pertain to the waf service. service/wafv2 Issues and PRs that pertain to the wafv2 service. size/XL Managed by automation to categorize the size of a PR. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants