Network ACLs must wait on Internet and NAT Gateways (finally found a workaround for lots of random eventual consistency errors) #7527

Closed
brikis98 opened this issue Jul 7, 2016 · 13 comments
Labels
bug core provider/aws waiting-response An issue/pull request is waiting for a response from the community

Comments

@brikis98
Contributor

brikis98 commented Jul 7, 2016

While working on a complicated set of VPC templates that created multiple VPCs, subnets, route tables, and network ACLs, I was hitting a huge number of seemingly random eventual consistency issues, including #7038, #5335, #5185, #6813, #7516, and many others. Sometimes I'd get one error, sometimes I'd get a dozen (I listed an example in the "Actual Behavior" section below). Re-running terraform apply would get past some of these errors, only to reveal others, and often, I couldn't get the templates to apply successfully at all.

After lots of digging, I've finally found a workaround. I'm not sure if this is a bug that needs to be fixed in Terraform, or AWS, or just documentation that should be added, but I figured I'd describe my findings here in case other folks hit the same problems. See below for details of the problem plus a description of the workaround.

Terraform Version

Terraform v0.6.16

Affected Resource(s)

These errors seem to come up when you create network ACLs at the same time as you are creating a new VPC with Internet and NAT Gateways, so the affected resources are:

  • aws_vpc
  • aws_internet_gateway
  • aws_nat_gateway
  • aws_network_acl
  • aws_network_acl_rule

Terraform Configuration Files

I was creating my VPC and its Internet and NAT Gateways in one module and the Network ACLs in another. I don't know if this matters, but I figured I'd list it here just in case.

Key excerpts from the VPC module:

# Create the VPC
resource "aws_vpc" "main" {
    cidr_block = "${var.cidr_block}"
}

# Create an Internet Gateway
resource "aws_internet_gateway" "main" {
    vpc_id = "${aws_vpc.main.id}"
}

# Create a route in the route table for the public subnets that points to the Internet Gateway
resource "aws_route" "internet" {
    route_table_id = "${aws_route_table.public.id}"
    destination_cidr_block = "0.0.0.0/0"
    gateway_id = "${aws_internet_gateway.main.id}"
}

# Create NAT Gateways in the public subnet
resource "aws_nat_gateway" "nat" {
    count = "${var.num_nat_gateways}"
    allocation_id = "${element(aws_eip.nat.*.id, count.index)}"
    subnet_id = "${element(aws_subnet.public.*.id, count.index)}"

    depends_on = [
        "aws_internet_gateway.main",
        "aws_eip.nat"
    ]
}

# Create routes in the route tables for the private subnets that point to the NAT Gateways
resource "aws_route" "nat" {
    count = "${length(split(",", var.aws_availability_zones))}"
    route_table_id = "${element(aws_route_table.private.*.id, count.index)}"
    destination_cidr_block = "0.0.0.0/0"
    nat_gateway_id = "${element(aws_nat_gateway.nat.*.id, count.index)}"
}

Example excerpts from the Network ACLs module:

# Create an ACL in the private subnets
resource "aws_network_acl" "private_app_subnets" {
  vpc_id = "${var.vpc_id}"
  subnet_ids = ["${split(",", var.private_subnet_ids)}"]
}

Note that the exact details of the Network ACLs probably don't matter. All that matters is that you are trying to create ACLs at more or less the same time as you're creating the VPC and its subnets.

Expected Behavior

The VPC and Network ACLs should be created without errors.

Actual Behavior

I get a huge number of seemingly random errors about route tables not being found, or subnets not being found, or Network ACLs not being found, and so on. Sometimes I'd get one error, sometimes more than a dozen, as shown in this example output:

14 error(s) occurred:

* aws_route_table.private-app.1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-47fef520' does not exist
    status code: 400, request id: 4d22846d-00cc-4a37-8acb-e32351ca0f80
* aws_internet_gateway.main: InvalidInternetGatewayID.NotFound: The internetGateway ID 'igw-a15c7bc5' does not exist
    status code: 400, request id: f69bbacc-a3ee-423f-8df2-8938b5710481
* aws_subnet.public.2: InvalidSubnetID.NotFound: The subnet ID 'subnet-49fedf63' does not exist
    status code: 400, request id: 8cdafcd9-ec25-4392-9168-0d0da1bcdbc3
* aws_subnet.private-app.2: InvalidSubnetID.NotFound: The subnet ID 'subnet-4efedf64' does not exist
    status code: 400, request id: 3701fac2-6009-4d82-a53c-b16a8115d22d
* aws_subnet.private-app.1: InvalidSubnetID.NotFound: The subnet ID 'subnet-aadffcf2' does not exist
    status code: 400, request id: 0447360a-875d-4181-b40f-c94124003dc8
* aws_subnet.public.1: InvalidSubnetID.NotFound: The subnet ID 'subnet-acdffcf4' does not exist
    status code: 400, request id: d472a1c2-1dac-42ce-9ccf-c19ea4d87d50
* aws_route_table.private.1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-40fef527' does not exist
    status code: 400, request id: 571e930c-6349-4048-8e69-cb0b44f0ddb5
* aws_route.internet: 
error finding matching route for Route table (rtb-79fef51e) and destination CIDR block (0.0.0.0/0)
* aws_route.internet: 
error finding matching route for Route table (rtb-4cfef52b) and destination CIDR block (0.0.0.0/0)
* aws_network_acl.private_persistence_subnets: InvalidNetworkAclID.NotFound: The networkAcl ID 'acl-0b99f26c' does not exist
    status code: 400, request id: 6014a441-c326-4fe9-bc6b-54e4652316f9
* Resource 'aws_security_group.mgmt_example' does not have attribute 'id' for variable 'aws_security_group.mgmt_example.id'
* Resource 'aws_security_group.mgmt_example' does not have attribute 'id' for variable 'aws_security_group.mgmt_example.id'
* aws_subnet.private-persistence.2: InvalidSubnetID.NotFound: The subnet ID 'subnet-4ffedf65' does not exist
    status code: 400, request id: cce8523d-fa2b-4ed5-9619-d259f49ec717
* aws_network_acl.private_subnets: InvalidNetworkAclID.NotFound: The networkAcl ID 'acl-0999f26e' does not exist
    status code: 400, request id: 32ec6777-d82b-4729-a769-bc2ef415f13c

Workaround

I came across a comment by @mitchellh that said the following:

From googling, this appears to be caused by trying to do this route table update prior to both VPCs being connected to an internet gateway. I honestly haven't used this feature of AWS so I have no idea why that is related, but for CloudFormation users AWS support says to add a depends on to the routing table to the internet gateway attachments.

I was a bit desperate for a way forward, so I figured I'd give it a shot: I would force my Network ACLs module to wait until all the Internet Gateways, and, just in case, the NAT Gateways, and all relevant routes, were fully created. To do this, I added a new null_resource, and a corresponding output for it, to the VPC module:

resource "null_resource" "vpc_ready" {
    depends_on = ["aws_internet_gateway.main", "aws_nat_gateway.nat", "aws_route.internet", "aws_route.nat"]
}

output "vpc_ready" {
    value = "${null_resource.vpc_ready.id}"
}

Note how the null_resource explicitly depends on the Internet Gateway, the NAT Gateways, and their corresponding routes. I then added a vpc_ready variable to the Network ACL module, a null_resource that depends on that variable, and a depends_on pointing at that null_resource in each ACL in those templates.

variable "vpc_ready" {
  description = "Use this variable to ensure the Network ACL does not get created until the VPC is ready. This can help to work around a Terraform or AWS issue where trying to create certain resources, such as Network ACLs, before the VPC's Gateway and NATs are ready, leads to a huge variety of eventual consistency bugs. You should typically point this variable at the vpc_ready output from the Gruntwork VPCs."
}

resource "null_resource" "vpc_ready" {
  triggers {
    # Explicitly wait on the passed in vpc_ready variable
    vpc_ready = "${var.vpc_ready}"
  }
}

resource "aws_network_acl" "private_app_subnets" {
  vpc_id = "${var.vpc_id}"
  subnet_ids = ["${split(",", var.private_subnet_ids)}"]
  # Here, we ensure no ACLs are created until all the Gateways are ready
  depends_on = ["null_resource.vpc_ready"]
}

Finally, when I use the two modules together, I set the vpc_ready input in the Network ACL module to the vpc_ready output from the VPC module to ensure the Network ACLs do not get created until all the Gateways are created:

module "vpc" {
  source = "./vpc"
  # ... lots of params omitted
}

module "acls" {
  source = "./acls"
  # ... lots of params omitted
  vpc_ready = "${module.vpc.vpc_ready}"
}

As soon as I added this, all the errors magically went away.

Note: this workaround would be much simpler (i.e. not require any extra variables, null_resources, etc) if Terraform supported depends_on for modules (see #1178).

@brikis98 brikis98 changed the title Network ACLs must wait on Internet and NAT Gateways Network ACLs must wait on Internet and NAT Gateways (finally found a workaround to lots of random eventual consistency errors) Jul 7, 2016
@brikis98 brikis98 changed the title Network ACLs must wait on Internet and NAT Gateways (finally found a workaround to lots of random eventual consistency errors) Network ACLs must wait on Internet and NAT Gateways (finally found a workaround for lots of random eventual consistency errors) Jul 7, 2016
@phinze
Contributor

phinze commented Aug 5, 2016

@brikis98 wow this is some seriously great reporting. Thanks so much for all of your work in putting this together!

As soon as I added this, all the errors magically went away.

This is a huge finding. Give us a chance to chew on this a bit and we'll follow up!

@catsby
Member

catsby commented Aug 10, 2016

Hey @brikis98 sorry for the silence here. I have a question about the setup you've shared here (excellent details by the way 😄 )

In your acls module, you have this:

resource "aws_network_acl" "private_app_subnets" {
  vpc_id = "${var.vpc_id}"
  ...
}

Can you tell me, where does var.vpc_id get its value?

In your workaround, you have:

module "acls" {
  source = "./acls"
  # ... lots of params omitted
  vpc_ready = "${module.vpc.vpc_ready}"
}

so I'm curious where the vpc_id value comes from. Is vpc_id output from the vpc module? And if so, does using that as the input variable for vpc_id in the acls module not work?

I created a demo project based on the description of your example above, which can be found here:

In that demo, I export vpc_id from the vpc module, and use that as the value for vpc_id in the second module. I included a visualization of the plan to create them in out.png, which shows that module acl should wait until the vpc is up before it tries to create.

I’d like to know how you’re setting your var.vpc_id. I’m curious if my example is simply too simple to trigger what you’re seeing.

Thanks!

@catsby catsby added the waiting-response An issue/pull request is waiting for a response from the community label Aug 10, 2016
@brikis98
Contributor Author

@catsby The vpc_id parameter of the acls module is set to the vpc_id output of the vpc module:

module "acls" {
  source = "./acls"
  # ... lots of params omitted
  vpc_ready = "${module.vpc.vpc_ready}"
  vpc_id = "${module.vpc.vpc_id}"
}

The vpc_id output, in turn, is set to the output of the aws_vpc resource:

output "vpc_id" {
    value = "${aws_vpc.main.id}"
}

@catsby
Member

catsby commented Aug 11, 2016

Hey @brikis98 thanks for getting back.

I'm still not able to reproduce this with Terraform v0.6.16 or v0.7. I've updated my demo app (which requires v0.7, but I have a v0.6.16 version as well) to include subnets et al., but I'm still not hitting the issues.

Looking at the errors you shared:

14 error(s) occurred:

* aws_route_table.private-app.1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-47fef520' does not exist
* aws_internet_gateway.main: InvalidInternetGatewayID.NotFound: The internetGateway ID 'igw-a15c7bc5' does not exist
* aws_subnet.public.2: InvalidSubnetID.NotFound: The subnet ID 'subnet-49fedf63' does not exist
* aws_subnet.private-app.2: InvalidSubnetID.NotFound: The subnet ID 'subnet-4efedf64' does not exist
* aws_subnet.private-app.1: InvalidSubnetID.NotFound: The subnet ID 'subnet-aadffcf2' does not exist
* aws_subnet.public.1: InvalidSubnetID.NotFound: The subnet ID 'subnet-acdffcf4' does not exist
* aws_route_table.private.1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-40fef527' does not exist

Can you tell me which module those resources belong to? I'm assuming the VPC module, can you confirm?

Also, can you please tell me what region you are using?

Please let me know if there's anything in my demo app that I can expand on to try and hit these errors.

Thanks!

@brikis98
Contributor Author

@catsby All of those resources belong to the VPC module. When we run our tests, we pick a random region each time, so I saw those failures randomly in us-east-1, us-west-2, and many others.

Here are a few items in our VPC module that are not in your demo app:

  • We do not use propagating_vgws in our aws_route_table resources. I don't remember seeing it before and am not sure what it does, either. Is that a new parameter? Does removing it cause any of those errors to pop up for you?
  • We create aws_nat_gateway resources in the public subnets and attach aws_eip resources to them. The number of NAT gateways is configurable using a num_nat_gateways variable, which we pass to the count parameter of the aws_nat_gateway resources.
  • We create a number of aws_network_acl_rule resources, not just the aws_network_acl resource.

No idea if any of these would make a difference, but thought I'd mention them just in case.
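
For anyone trying to reproduce this, the second and third items might look roughly like the sketch below. The aws_eip.nat resource and the num_nat_gateways variable are referenced in the VPC module excerpt above, but the arguments on the EIP, the rule name allow_all_inbound_from_vpc, the rule number, and var.vpc_cidr_block are assumptions for illustration, not copied from the actual modules.

```hcl
# In the VPC module: one Elastic IP per NAT Gateway (arguments are assumed)
resource "aws_eip" "nat" {
    count = "${var.num_nat_gateways}"
    vpc = true
    depends_on = ["aws_internet_gateway.main"]
}

# In the Network ACLs module: a hypothetical rule attached to one of the ACLs
resource "aws_network_acl_rule" "allow_all_inbound_from_vpc" {
    network_acl_id = "${aws_network_acl.private_app_subnets.id}"
    rule_number = 100
    egress = false
    protocol = "-1"
    rule_action = "allow"
    cidr_block = "${var.vpc_cidr_block}"
    from_port = 0
    to_port = 0
}
```

Each aws_network_acl_rule is a separate API call against the ACL, so a module with many rules makes many more calls that can race with the VPC setup than a module with the aws_network_acl resource alone.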

@brikis98
Contributor Author

@catsby I just upgraded to Terraform 0.7.2, and as far as I can tell, my workaround is less effective now. I'm seeing far more eventual consistency issues in general with this new version of Terraform (e.g. #7993 (comment), #6813 (comment), #8229 (comment), #8530), and with Network ACLs in particular, I'm getting a large number of eventual consistency errors, despite this workaround, and more often than not, the templates will not apply or destroy successfully. Not sure where to go from here.

@brikis98
Contributor Author

Update: I've found, through trial and error and copying code examples I found online, that most of the issues I describe in this bug are resolved by adding two depends_on entries to each aws_route resource: one that points to the Internet Gateway in the VPC and one that points to the corresponding aws_route_table resource.

resource "aws_route" "internet" {
    route_table_id = "${aws_route_table.public.id}"
    destination_cidr_block = "0.0.0.0/0"
    gateway_id = "${aws_internet_gateway.main.id}"

    # A workaround for a series of eventual consistency bugs in Terraform. For a list of the errors, see the related
    # bugs described in this issue: https://github.com/hashicorp/terraform/issues/8542. The workaround is based on:
    # https://github.com/hashicorp/terraform/issues/5335 and https://charity.wtf/2016/04/14/scrapbag-of-useful-terraform-tips/
    depends_on = ["aws_internet_gateway.main", "aws_route_table.public"]
}

I have no idea why that helps, but it gets rid of most issues. The only one it does NOT get rid of is #8542.
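
Applied to the NAT routes from the VPC module excerpt, the same pattern might look like the sketch below. The resource names come from that excerpt; the exact depends_on entries for the NAT routes are an assumption based on the description above ("each aws_route resource"), not a copy of the real module.

```hcl
resource "aws_route" "nat" {
    count = "${length(split(",", var.aws_availability_zones))}"
    route_table_id = "${element(aws_route_table.private.*.id, count.index)}"
    destination_cidr_block = "0.0.0.0/0"
    nat_gateway_id = "${element(aws_nat_gateway.nat.*.id, count.index)}"

    # Same workaround as for aws_route.internet: wait for the Internet Gateway
    # and for the private route tables before creating any of these routes
    depends_on = ["aws_internet_gateway.main", "aws_route_table.private"]
}
```

Naming aws_route_table.private without an index makes every NAT route wait on all of the private route tables, not just the one it references.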

@brikis98
Contributor Author

brikis98 commented Sep 1, 2016

@catsby From browsing the Terraform code, I've noticed that some of the functions, after creating a resource, start making repeated API calls to AWS until the API says the resource exists. I'm guessing this is done to ensure that anything that depends on that resource doesn't execute until information about it has propagated. The catch is that those API calls are only repeated up to some maximum time out, such as waiting at most 15 seconds for a route to be created (see #8542 (comment)).

My suspicion is that for read API calls, AWS routes you to a replica in a nearby region. For example, you might be deploying a VPC in us-east-1, but if you're running the Terraform client while sitting in Europe, your read API calls will be routed to a replica in eu-west-1. The catch is that the further away a replica is, the longer it will take for information to propagate to it. So if you're deploying to a far away data center, you're much more likely to hit these timeouts.

Perhaps the reason you weren't able to repro the issues I was seeing was that you always deployed to a data center near you? Perhaps you need to try to deploy to something as far away as possible?

@mitchellh mitchellh removed the waiting-response An issue/pull request is waiting for a response from the community label Dec 1, 2016
@catsby
Member

catsby commented Dec 13, 2016

Hey @brikis98, how have things been here? Last I looked I was unable to reproduce this issue.

We do make repeated calls to confirm resources are fully created, and we still add more and more polling to cover edge cases, as we find them.

Can you tell me, are you still seeing this issue with any frequency? I don't feel these kinds of eventual consistency issues are still prevalent, but I'd like your feedback before closing this issue.

Please let me know!

@catsby catsby added the waiting-response An issue/pull request is waiting for a response from the community label Dec 13, 2016
@brikis98
Contributor Author

@catsby Thanks for checking in. I have not seen this error in a while. It's an intermittent issue by nature, so I don't know if that really means it has been fixed, but it's probably safe to close the bug for now. I can reopen if I hit this problem again.

@catsby
Member

catsby commented Dec 13, 2016

Thank you, @brikis98, I appreciate the quick turnaround. I wish I had a solid resolution here 😦

Please let us know if you do happen to stumble on anything conclusive in the future.

Thanks!

@kitforbes

I had a similar issue when trying to add a route to my public subnet's route table. The route needed our site-to-site VPN's vpn_gateway_id. Essentially, it would time out with this error: aws_route_table.public: Gateway.NotAttached: resource vgw-*. Thanks to your workaround, I'm now past this problem. I think this will go away entirely when module-to-module dependencies are implemented (using depends_on within the module block).
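
For readers on later releases: Terraform 0.13 and newer accept depends_on on module blocks, so the null_resource and vpc_ready plumbing from the original workaround should no longer be needed there. A minimal sketch, reusing the module names from the issue and the interpolation-free syntax of Terraform 0.12+ (the omitted parameters are left as in the original):

```hcl
module "vpc" {
  source = "./vpc"
  # ... lots of params omitted
}

module "acls" {
  source = "./acls"
  # ... lots of params omitted
  vpc_id = module.vpc.vpc_id

  # Wait for everything in the vpc module, including gateways and routes,
  # before creating anything in this module
  depends_on = [module.vpc]
}
```

This only orders resource creation. Whether it avoids the eventual consistency errors described in this issue on a given Terraform and AWS provider version is not something the thread confirms.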

@ghost

ghost commented Apr 14, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 14, 2020
6 participants