Add retry handling when a request's connection is reset by peer #10715

Open
davegallant opened this issue Nov 1, 2019 · 18 comments
Labels
bug Addresses a defect in current functionality. provider Pertains to the provider itself, rather than any interaction with AWS.

Comments

@davegallant

davegallant commented Nov 1, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

Terraform v0.12.10

Affected Resource(s)

  • aws_iam_instance_profile

Terraform Configuration Files

data "aws_iam_role" "my_role" {
  name = "0f9f1e2t-instance"
}

Debug Output

N/A

Panic Output

N/A

Expected Behavior

It would be nice if a retry mechanism were implemented for this resource, since it only performs a read.

Actual Behavior

Error: Error reading IAM instance profile 0f9f1e2t-instance: RequestError: send request failed

caused by: Post https://iam.amazonaws.com/: read tcp 172.17.0.2:36404->59.133.22.207:443: read: connection reset by peer

Steps to Reproduce

  1. terraform apply

Important Factoids

Does not look like there is any retry logic when reading an IAM instance profile:

https://github.com/terraform-providers/terraform-provider-aws/blob/98b8b848ca94031b20c3e626c9d40484e3af80de/aws/resource_aws_iam_instance_profile.go#L287-L305

An example of retrying within the same file:
https://github.com/terraform-providers/terraform-provider-aws/blob/98b8b848ca94031b20c3e626c9d40484e3af80de/aws/resource_aws_iam_instance_profile.go#L163-L175

References

None

@ghost ghost added the service/iam Issues and PRs that pertain to the iam service. label Nov 1, 2019
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Nov 1, 2019
@davegallant davegallant changed the title Missing retry logic when reading IAM Instance Profile IAM Instance Profile - Add retry logic when reading Nov 1, 2019
@camlow325
Contributor

We saw something similar, but with an aws_iam_account_alias data source instead. In that case, at least, Terraform appeared to retry the failed API call, up to the value configured for max_retries on the AWS provider instance, when the request failed due to an i/o timeout. When a connection reset by peer failure occurred, though, like the one mentioned in this issue, no further retries were attempted. Would it make sense to extend the more generic API retry handling to cover connection reset by peer errors?

@analogrithems

This is also happening for us when working with s3 buckets

Error: error getting S3 Bucket CORS configuration: RequestError: send request failed caused by: Get https://example-config-us-west-2-prod-sandbox.s3.us-west-2.amazonaws.com/?cors=: read tcp 192.168.208.3:57182->52.218.237.161:443: read: connection reset by peer

@fattybenji

This is also happening for us with a cloudfront distribution:

Error: RequestError: send request failed
 caused by: Get https://cloudfront.amazonaws.com/<date>/distribution/<id>>: read tcp <ip>:<port>-><ip>:443: read: connection reset by peer

This is happening with a simple plan in CI, so retry logic would be nice here too.

@davegallant maybe this issue could be renamed to be more generic since it's not only a problem for IAM instance profiles.

@gaspo53

gaspo53 commented Aug 12, 2020

This is happening to us, frequently (using both 0.12.29 and 0.13):

Error: Error retrieving DB Instances: RequestError: send request failed
19:26:37 | caused by: Post "https://rds.us-east-1.amazonaws.com/": read tcp 10.98.196.183:58901->52.119.197.147:443: read: connection reset by peer

@mattburgess
Collaborator

We're seeing this as well. The specific case for us just now was on the ec2.eu-west-2.amazonaws.com service that was being reached via a VPC endpoint. But it also happens quite frequently for us on calls to services that have to traverse through our Internet Proxy because VPC endpoints aren't available for those services (or the services exist in a different region to our CI tooling).

Interestingly for us, this happened whilst trying to investigate "hangs" during a terraform plan/apply cycle which seem somewhat related. In that case, TF would hit some kind of network issue, then not bother retrying for around 15 minutes, but would then retry and succeed.

Still digging into this as I'm not sure whether this is a provider or TF issue.

@davegallant davegallant changed the title IAM Instance Profile - Add retry logic when reading Add retry handling when a request's connection is reset by peer Oct 2, 2020
@lagrianitis

lagrianitis commented Dec 17, 2020

The same happens with the aws_vpc_endpoint data source when I run either plan or apply.

@ag-TJNII

ag-TJNII commented Feb 3, 2021

I think this is more serious than just needing a simple retry. I had an apply wedge badly today due to this error, and it looks like this causes Terraform to lose track of resources it has created. I had to manually hunt down and destroy EC2 instances that it built but didn't save into the state in order to unwedge it.

@acdha
Contributor

acdha commented Mar 23, 2021

This is still an issue with Terraform v0.14.8. I have a project which manages some cross region resources and the us-west-1 ones are failing somewhat regularly while us-east-1 (~20 minutes from my house) is rock-solid.

Error: Error retrieving list of aggregate authorizations: RequestError: send request failed
caused by: Post https://config.us-west-1.amazonaws.com/: read tcp …->176.32.118.187:443: read: connection reset by peer
Error: RequestError: send request failed
caused by: Post https://logs.us-west-1.amazonaws.com/: read tcp …->52.119.176.231:443: read: connection reset by peer
  • aws_config_aggregate_authorization
  • aws_cloudwatch_log_group

@dimisjim
Contributor

Also encountered this issue with ACM:

Error: error listing tags for ACM Certificate (arn:aws:acm:eu-west-1:<accID>:certificate/<certID>): RequestError: send request failed
caused by: Post "https://acm.eu-west-1.amazonaws.com/": read tcp <privIp>:34550->54.239.33.223:443: read: connection reset by peer

@mbijon

mbijon commented May 24, 2021

Having the same "connection reset by peer" issue during state checks on ElasticIPs. Have checked the AWS Service Health & Personal service health dashboards, both show all services up in the region this is running, us-west-2.

Haven't seen any examples of EIP failures when searching, so noting here:

module.bastion.aws_eip.default[0]: Refreshing state... [id=eipalloc-xxxxxx]
╷
│ Error: RequestError: send request failed
│ caused by: Post "https://ec2.us-west-2.amazonaws.com/": read tcp 192.168.86.33:51541->54.xxxx:443: read: connection reset by peer
│ 
│ Error: RequestError: send request failed
│ caused by: Post "https://ec2.us-west-2.amazonaws.com/": read tcp 192.168.86.33:51524->54.xxxx:443: read: connection reset by peer

Similar error in Security Group state check, about an hour after the above error. AWS Service & Personal health dashboards both show VPC & EC2 services are healthy:

│ Error: Error authorizing security group rule type egress: RequestError: send request failed
│ caused by: Post "https://ec2.us-west-2.amazonaws.com/": read tcp 192.168.86.33:51771->54.xxxx:443: read: connection reset by peer
│ 
│   on main.tf line 526, in resource "aws_security_group_rule" "egress_sec_to_webresource":
│  526: resource "aws_security_group_rule" "egress_sec_to_webresource" {

@davi5e

davi5e commented Jun 29, 2021

> Also encountered this issue with ACM:

Same here using Terraform v1.0.1 and aws v3.47.0...

> it looks like this causes Terraform to lose track of resources it has created

Also the same thing here.

@BNMetrics

Seeing the same error with ACM; it started happening this evening.
Terraform v1.0.0, aws v3.44.0
Region: us-east-2

@jiashuChen
Contributor

jiashuChen commented Jul 24, 2021

Seeing the same error with IAM as well, using
Terraform v1.0.1 and aws provider v3.51.0
Region: ap-southeast-2

│ Error: error deleting IAM Role (IAM-ROLE-NAME): RequestError: send request failed
│ caused by: Post "https://iam.amazonaws.com/": read tcp IP:PORT->DIFFERENT_IP:PORT: read: connection reset by peer

 

@justinretzolk justinretzolk added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Dec 9, 2021
@idharper

idharper commented Jan 21, 2022

Suddenly seeing the same error with CloudWatch log groups and SQS queues.
TF v 1.0.1 aws provider v 3.73.0
Region: us-west-2 and us-east-1

aws cli equivalent commands work fine.

[UPDATE] After more digging I found someone suggesting network issues. I dropped off the corporate VPN and everything works OK; I reconnected to the VPN and it fails. It was working a few days ago. Will go and bash Corporate IT and see what they have to say for themselves.

@breathingdust
Member

Hi all 👋 Just letting you know that this issue is featured on this quarter's roadmap. If a PR exists to close the issue, a maintainer will review it and either make changes directly or work with the original author to get the contribution merged. If you have written a PR to resolve the issue, please ensure the "Allow edits from maintainers" box is checked. Thanks for your patience; we are looking forward to getting this merged soon!

@wenqiglantz-agi

Hi @breathingdust, any update on the progress?

@mgusiew-guide
Contributor

mgusiew-guide commented Jan 9, 2023

FTR, this also happens when the wait loop is waiting for a resource to change state (e.g. become active). Here is an example for an MSK cluster:

TestClusterConfig 2023-01-09T12:51:51Z logger.go:66: Error: waiting for MSK Cluster (arn:aws:kafka:xxx) create: RequestError: send request failed
TestClusterConfig 2023-01-09T12:51:51Z logger.go:66: caused by: Get "https://kafka.us-west-2.amazonaws.com/api/v2/clusters/xxx": read tcp xxx:55924->xxx:443: read: connection reset by peer
TestClusterConfig 2023-01-09T12:51:51Z logger.go:66:
TestClusterConfig 2023-01-09T12:51:51Z logger.go:66:   on ../../../../../cluster/cluster.tf line 19, in resource "aws_msk_cluster" "msk_cluster":
TestClusterConfig 2023-01-09T12:51:51Z logger.go:66:   19: resource aws_msk_cluster msk_cluster {
TestClusterConfig 2023-01-09T12:51:51Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1;
Error: waiting for MSK Cluster (arn:aws:kafka:xxx) create: RequestError: send request failed
caused by: Get "https://kafka.us-west-2.amazonaws.com/api/v2/clusters/xxx": read tcp xxx:55924->xxx:443: read: connection reset by peer

  on ../../../../../cluster/cluster.tf line 19, in resource "aws_msk_cluster" "msk_cluster":
  19: resource aws_msk_cluster msk_cluster {
}

@BrianLovelace128

This would be very useful. I'm running into this issue with the msk module.

@gdavison gdavison added provider Pertains to the provider itself, rather than any interaction with AWS. and removed service/iam Issues and PRs that pertain to the iam service. labels Jan 15, 2024