Intermittent AWS Eventual Consistency Issues on CircleCI? #5335

Closed
josh-padnick opened this issue Feb 26, 2016 · 10 comments

Comments

@josh-padnick

I have a terraform configuration whose job is to create a VPC. The basic structure is this:

prod-vpc
- creates vpc
- adds vpc peering route with mgmt-vpc

stage-vpc
- creates vpc
- adds vpc peering route with mgmt-vpc

mgmt-vpc
- adds vpc peering routes with prod and stage

create-vpc-peering-connections

As you can see, there are many interdependencies; however, applying and destroying this from my MacBook consistently works fine.

When I run this same terraform configuration as part of a CircleCI build, however, I get intermittent failures. Here are some of the errors I've gotten:

* aws_route.internet-gateway: 
error finding matching route for Route table (rtb-dff811b8) and destination CIDR block (0.0.0.0/0)
* aws_route.vpc-peering.1: 
error finding matching route for Route table (rtb-a39b72c4) and destination CIDR block (10.110.0.0/18)

Notice that the same error sometimes shows up for a different resource; the second one occurred twice. Here are some other errors. Each build fails with one of these sets or, in rare cases, succeeds.

4 error(s) occurred:

* aws_internet_gateway.main: InvalidInternetGatewayID.NotFound: The internetGateway ID 'igw-eab0118e' does not exist
    status code: 400, request id: 
* Resource 'aws_eip.nat' does not have attribute 'id' for variable 'aws_eip.nat.id'
* aws_route.vpc-peering.0: 
error finding matching route for Route table (rtb-eb99708c) and destination CIDR block (10.100.0.0/18)
* aws_route.vpc-peering.1: 
error finding matching route for Route table (rtb-eb99708c) and destination CIDR block (10.110.0.0/18)

1 error(s) occurred:

* aws_nat_gateway.nat.0: Error waiting for NAT Gateway (nat-0526389328c14da4e) to become available: unexpected state 'failed', wanted target '[available]'

4 error(s) occurred:

* aws_route.internet-gateway: 
error finding matching route for Route table (rtb-5ca64f3b) and destination CIDR block (0.0.0.0/0)
* Resource 'aws_eip.nat' does not have attribute 'id' for variable 'aws_eip.nat.*.id'
* Resource 'aws_eip.nat' does not have attribute 'id' for variable 'aws_eip.nat.*.id'
* Resource 'aws_eip.nat' does not have attribute 'id' for variable 'aws_eip.nat.*.id'

Every so often the build succeeds, but even then I sometimes receive non-fatal "diffs didn't match during apply" warnings:

* aws_iam_instance_profile.instance_profile: diffs didn't match during apply. This is a bug with Terraform and should be reported as a GitHub Issue.

Please include the following information in your report:

Terraform Version: 0.6.12
    Resource ID: aws_iam_instance_profile.instance_profile
    Mismatch reason: attribute mismatch: roles.2877750799
    Diff One (usually from plan): *terraform.InstanceDiff{Attributes:map[string]*terraform.ResourceAttrDiff{"create_date":*terraform.ResourceAttrDiff{Old:"", New:"", NewComputed:true, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Type:0x0}, "unique_id":*terraform.ResourceAttrDiff{Old:"", New:"", NewComputed:true, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Type:0x0}, "name":*terraform.ResourceAttrDiff{Old:"", New:"lc-instance-profile-${var.app_name}-${var.vpc_name}", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:true, Type:0x0}, "path":*terraform.ResourceAttrDiff{Old:"", New:"/", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:true, Type:0x0}, "roles.#":*terraform.ResourceAttrDiff{Old:"", New:"1", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Type:0x0}, "roles.2877750799":*terraform.ResourceAttrDiff{Old:"", New:"lc-instance-role-${var.app_name}-${var.vpc_name}", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Type:0x0}, "arn":*terraform.ResourceAttrDiff{Old:"", New:"", NewComputed:true, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Type:0x0}}, Destroy:false, DestroyTainted:false}
    Diff Two (usually from apply): *terraform.InstanceDiff{Attributes:map[string]*terraform.ResourceAttrDiff{"name":*terraform.ResourceAttrDiff{Old:"", New:"lc-instance-profile-asg-example-app-stg", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:true, Type:0x0}, "path":*terraform.ResourceAttrDiff{Old:"", New:"/", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:true, Type:0x0}, "roles.#":*terraform.ResourceAttrDiff{Old:"", New:"1", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Type:0x0}, "roles.4293975883":*terraform.ResourceAttrDiff{Old:"", New:"lc-instance-role-asg-example-app-stg", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Type:0x0}, "arn":*terraform.ResourceAttrDiff{Old:"", New:"", NewComputed:true, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Type:0x0}, "create_date":*terraform.ResourceAttrDiff{Old:"", New:"", NewComputed:true, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Type:0x0}, "unique_id":*terraform.ResourceAttrDiff{Old:"", New:"", NewComputed:true, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Type:0x0}}, Destroy:false, DestroyTainted:false}

So, based on these data points, and the fact that it runs fine in my localdev, I'm guessing something specific to the CircleCI environment is involved. Most (all?) of these errors indicate AWS eventual consistency issues, but I'm struggling to explain why the CircleCI environment would be more likely to trigger them.

Some other datapoints:

  • My localdev is a MacBook Pro with 8 cores
  • My localdev's GOMAXPROCS = 27984
  • CircleCI's environment has 32 cores (found by SSHing into the build)
  • CircleCI's GOMAXPROCS = ~4000000
  • CircleCI's environment is running Ubuntu (I believe 14.04 LTS)
  • Setting GOMAXPROCS = 1 via an env var in CircleCI did yield a successful build, but obviously slowed things down.
  • Using terraform apply -parallelism=3 (just to reduce this down from 10) did not seem to affect things, but I'm guessing GOMAXPROCS and parallelism ultimately have the same effect.

Does anyone have any ideas for how I might further debug this or why this is happening? Thanks for your help and for this outstanding piece of software!

josh-padnick changed the title from "Intermittent AWS Eventual Consistency Issues?" to "Intermittent AWS Eventual Consistency Issues on CircleCI?" on Feb 26, 2016

@carlossg
Contributor

The InvalidInternetGatewayID.NotFound is a known issue, tracked in #2174

@josh-padnick
Author

@carlossg Thanks for the heads-up. That would corroborate my initial hypothesis that these are all eventual consistency issues. It's pretty crazy this hasn't been more of a problem for all Terraform users. I'm not sure I understand why that is.

@jrnt30
Contributor

jrnt30 commented May 27, 2016

This is more of a consistency issue with the AWS API. We have been seeing the "error finding matching route for Route table" error significantly more often in the past few days. Looking at the debug logs, the initial create call for the aws_route returns success, but ec2/DescribeRouteTables does not return the new route immediately.

@carlossg
Contributor

A bunch of eventual consistency problems are fixed in #6775.

@josh-padnick
Author

@carlossg Thanks for submitting this! We have an automated test suite for our terraform templates and, as you indicate in #6775, the problem seems to have gotten worse lately, probably due to AWS's own issues handling load. Very excited about this getting merged.

@bkc1

bkc1 commented Jun 22, 2016

This issue seems related to #7038 and is biting me right now.

@bkc1

bkc1 commented Aug 11, 2016

I may have found a workaround for this issue, which seems related to the order in which AWS network resources get created. After making all 'aws_route' resources dependent (using 'depends_on') on the 'aws_internet_gateway' resource, I have not run into these errors.

See the example below, which is a terraform project (with 2 subnets) that VPC peers into 2 other terraform project VPCs, including routes and reverse routes.

resource "aws_route" "internet_access" {
  route_table_id         = "${aws_vpc.TSM.main_route_table_id}"
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = "${aws_internet_gateway.TSM.id}"
  depends_on             = ["aws_internet_gateway.TSM"]
}

resource "aws_route" "to-PM" {
  route_table_id         = "${aws_vpc.TSM.main_route_table_id}"
  destination_cidr_block = "${terraform_remote_state.pm_tf_state.output.vpc_cidr_block}"
  gateway_id             = "${aws_vpc_peering_connection.TSM-PM.id}"
  depends_on             = ["aws_internet_gateway.TSM"]
}

resource "aws_route" "to-UTIL" {
  route_table_id         = "${aws_vpc.TSM.main_route_table_id}"
  destination_cidr_block = "${terraform_remote_state.util_tf_state.output.vpc_cidr_block}"
  gateway_id             = "${aws_vpc_peering_connection.TSM-UTIL.id}"
  depends_on             = ["aws_internet_gateway.TSM"]
}

resource "aws_route" "from-PM1" {
  route_table_id         = "${terraform_remote_state.pm_tf_state.output.vpc_main_route_table_id}"
  destination_cidr_block = "${aws_subnet.TSM1.cidr_block}"
  gateway_id             = "${aws_vpc_peering_connection.TSM-PM.id}"
  depends_on             = ["aws_internet_gateway.TSM"]
}

resource "aws_route" "from_PM2" {
  route_table_id         = "${terraform_remote_state.pm_tf_state.output.vpc_main_route_table_id}"
  destination_cidr_block = "${aws_subnet.TSM2.cidr_block}"
  gateway_id             = "${aws_vpc_peering_connection.TSM-PM.id}"
  depends_on             = ["aws_internet_gateway.TSM"]
}

resource "aws_route" "from-UTIL1" {
  route_table_id         = "${terraform_remote_state.util_tf_state.output.vpc_main_route_table_id}"
  destination_cidr_block = "${aws_subnet.TSM1.cidr_block}"
  gateway_id             = "${aws_vpc_peering_connection.TSM-UTIL.id}"
  depends_on             = ["aws_internet_gateway.TSM"]
}

resource "aws_route" "from_UTIL2" {
  route_table_id         = "${terraform_remote_state.util_tf_state.output.vpc_main_route_table_id}"
  destination_cidr_block = "${aws_subnet.TSM2.cidr_block}"
  gateway_id             = "${aws_vpc_peering_connection.TSM-UTIL.id}"
  depends_on             = ["aws_internet_gateway.TSM"]
}
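
A possible variant of the peering routes above, offered as a sketch rather than a confirmed fix: the aws_route resource also has a dedicated vpc_peering_connection_id argument, which may be a better fit than gateway_id when the target is a peering connection. The resource name "to-PM-peering" is made up for illustration; every other name is copied from the example above. Whether this changes the eventual consistency behavior has not been tested here.

```hcl
# Sketch only: same route as "to-PM" above, but targeting the peering
# connection through vpc_peering_connection_id instead of gateway_id.
# "to-PM-peering" is a hypothetical name chosen for this example.
resource "aws_route" "to-PM-peering" {
  route_table_id            = "${aws_vpc.TSM.main_route_table_id}"
  destination_cidr_block    = "${terraform_remote_state.pm_tf_state.output.vpc_cidr_block}"
  vpc_peering_connection_id = "${aws_vpc_peering_connection.TSM-PM.id}"

  # Same explicit ordering as in the workaround above.
  depends_on = ["aws_internet_gateway.TSM"]
}
```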

@brikis98
Contributor

@bkc1 Your suggestion of adding a depends_on parameter to each aws_route pointing to the Internet Gateway definitely reduced some of my eventual consistency errors. On top of that, I also found that adding a depends_on to each aws_route pointing to the corresponding aws_route_table was necessary:

resource "aws_route" "nat" {
    count = "${var.num_availability_zones}"
    route_table_id = "${element(aws_route_table.private.*.id, count.index)}"
    destination_cidr_block = "0.0.0.0/0"
    nat_gateway_id = "${element(aws_nat_gateway.nat.*.id, count.index)}"

    depends_on = ["aws_internet_gateway.main", "aws_route_table.private"]
}

With these two additions, most of the eventual consistency errors have gone away, at least in my last ~10 or 15 apply/destroy attempts. The only exception I've seen is #8542. But at least it's progress.

@josh-padnick
Author

An update on this. I finally realized why we see more errors in CircleCI than when running locally. It's because our Terraform test framework picks an AWS region at random, whereas I believe CircleCI runs in us-east-1. We've independently seen that running Terraform VPC commands against a region physically far away naturally results in higher latency, which exposes more of the underlying eventual consistency bugs.

It'd be great if the HashiCorp folks could create a canonical VPC in Terraform and test it with high latencies to smoke out these issues since, two years into Terraform, they continue to be a problem.
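
For anyone wanting to reproduce the latency effect deliberately instead of waiting for a random region pick, one option is to make the region an explicit input. This is a minimal sketch, not our actual test framework; the variable name aws_region and its default are assumptions made for illustration.

```hcl
# Hypothetical variable: pins the region for a run so that a far-away
# (high-latency) region can be chosen on purpose.
variable "aws_region" {
  description = "AWS region to run against"
  default     = "us-east-1"
}

provider "aws" {
  region = "${var.aws_region}"
}
```

A run against a distant region could then look like terraform apply -var 'aws_region=ap-southeast-2', where the region value is only an example.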

@ghost

ghost commented Apr 11, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

ghost locked and limited conversation to collaborators on Apr 11, 2020

7 participants