Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Referencing Attributes from Resources with count = 0" change in Terraform 0.11 should be a warning! #16726

Closed
brikis98 opened this issue Nov 21, 2017 · 22 comments

Comments

@brikis98
Copy link
Contributor

brikis98 commented Nov 21, 2017

I understand why Terraform 0.11 is trying to fix the Referencing Attributes from Resources with count = 0 issue (silent failures are definitely a problem!), but it seems like fixing this by making a massively backwards incompatible change is not a good solution. Changing from a silent failure that mostly didn't cause problems to a loud failure that breaks in a bunch of places means a huge percentage of the Terraform code out there that uses count is now broken. Worse yet, this failure only happens at run-time, and only if the count value happens to be set to 0, so there is no easy way to find these!

Every Terraform user now needs to scan through their code base looking for every resource that uses count, then find every usage of that resource, and decide if (a) that count could ever be 0 and (b) if it could, fix how you look up output attributes on that resource. For large Terraform codebases, this is a massive and highly-prone undertaking, making upgrading to 0.11 nearly impossible.

Is it possible to update 0.11.1 to:

  1. Keep the old behavior where this does NOT generate an error. Instead, show a WARNING at run-time that this will generate an error in the future (e.g., in 0.12). In other words, this is a deprecation warning, which is usually what you do before making any highly backwards incompatible changes.
  2. At "compile time" (e.g., during terraform validate), show a WARNING about all invalid lookups on resources with count to make it easier to find these issues.
@danhart
Copy link

danhart commented Nov 21, 2017

I'm having these issues since upgrading. Several of our modules attempt to output values from resources that sometimes have a count of 0. Example:

resource "aws_s3_bucket" "assets" {
  count  = "${terraform.env == "production" ? 1 : 0}"
  bucket = "assets.spyscape.net"
}

output "s3" {
  value = "${aws_s3_bucket.assets.id}"
}

When the workspace is not production this results in:

Error: Error running plan: 1 error(s) occurred:

* module.spyscapeassets.output.s3: Resource 'aws_s3_bucket.assets' not found for variable 'aws_s3_bucket.assets.id'

Does anyone have a workaround for this? I tried adding count to the output, but no luck there.

@danhart
Copy link

danhart commented Nov 21, 2017

@jbardin Looking through your PR #16204 suggests that you did try and cater for this issue. Do you have any suggestions?

@brikis98
Copy link
Contributor Author

@danhart The workaround is to use this ugly idiom:

output "s3" {
  value = "${element(concat(aws_s3_bucket.assets.*.id, list("")), 0)}"
}

Finding all the places where this needs to be fixed and updating the code is a mess :(

@danhart
Copy link

danhart commented Nov 21, 2017

Sorry @brikis98 I literally just spotted this in the upgrade guides. Thank you.

Hopefully a cleaner solution in the future 🤞. Perhaps just a warning for this specific use-case.

@jbardin
Copy link
Member

jbardin commented Nov 21, 2017

Hi @brikis98, @danhart,

Yes, this of course is a known issue, and one that we are planning on taking care of. Making output errors unconditional was important to diagnose and fix many issues customers have had within complex configurations.

What we intend to have is the ability for conditionals to short-circuit so that the invalid portion is not evaluated at all (hashicorp/hil#50). Until then, using a * is the only method we have to asses if there is resource to evaluate at all.

@jbardin jbardin closed this as completed Nov 21, 2017
@jbardin
Copy link
Member

jbardin commented Nov 21, 2017

See also #15605 #16688

@brikis98
Copy link
Contributor Author

brikis98 commented Nov 21, 2017

@jbardin Short circuiting doesn't have anything to do with this. The issue here is that Terraform 0.11 is backwards incompatible in the worst way possible: it introduces errors that may happen, in certain circumstances, at run time, to a very common coding pattern. This change breaks a bunch of existing code and provides no easy way to fix it.

The responsible solution for this sort of thing is to (1) issue a deprecation warning before the breaking change, so users can update their code and (2) to make it as easy as possible for users to find the places where the code needs to be changed.

@jbardin
Copy link
Member

jbardin commented Nov 21, 2017

@brikis98,

Sorry, I didn't mean that this is directly related to short-circuiting interpolation, only that there will be a cleaner syntax coming than the "ugly" idiom we have now.

I understand the frustration of having to update a possibly large amount of configuration. We may have also underestimated the number of users relying on the behavior of invalid interpolation strings in outputs. In our experience this is almost always a source of hard to diagnose configuration bugs, and warranted the breaking change.

While we are reviewing options for making this easier, this is probably a good issue for tracking that progress.

@jbardin jbardin reopened this Nov 21, 2017
@brikis98
Copy link
Contributor Author

@jbardin We can discuss the ugly syntax in a separate issue :)

The focus of this one is that this backwards incompatible change offers no obvious upgrade path. Companies with a lot of Terraform code are now forced to manually dig through every single resource that uses count and make a (hard) judgment call on whether anything that uses this resource needs the ugly workaround or not. It's a very common pattern. I'd bet many of the modules in the Terraform Registry (including Vault, Consul, and Nomad), as well as a ton of code at all large enterprise companies using Terraform, is now broken.

So, if at all possible, I'd strongly push for reverting this change so that is initially a deprecation warning, and adding tooling to track this down automatically, rather than relying on people to do it manually, or no large company will be able to upgrade to Terraform 0.11.

@brikis98
Copy link
Contributor Author

brikis98 commented Nov 21, 2017

We may have also underestimated the number of users relying on the behavior of invalid interpolation strings in outputs

Terraform doesn't have explicit support for conditional syntax, so it is a VERY common pattern to use count to emulate it:

resource "xxx" "yyy" {
  count = "${some_condition ? 1 : 0}"
}

As soon as you have the code above, anything that references the xxx.yyy resource may now be broken. Examples of such references:

  1. xxx.yyy.attribute: I agree it's weird for something that uses count to be referenced this way, but for count = 1 and count = 0, this worked just fine. Now you have replace all such references with element(concat(xxx.yyy.attribute, list("")), 0), which I could probably live with (despite the ugly syntax), except it's not obvious how to find those references in the first place!

  2. element(concat(xxx.yyy.*.attribute, aaa.bbb.*.attribute), 0): If you had multiple conditional resources, this used to work to select the first non-empty one. Now you have to replace it with element(concat(xxx.yyy.*.attribute, aaa.bbb.*.attribute, list("")), 0), but again, it's not clear how to find these.

Right now, the only way to find these issues is to look for all resources that use count, and then find every reference to those resources, and figure out if the references need to be fixed. This is obviously very time consuming and error prone, especially for large enterprise customers with a lot of Terraform code, and if you miss any, you don't get an error until run time. In fact, you only get the error when count is 0, so anywhere in your code where count happens to be positive will work for a while, and then after tweaking some random param, perhaps months later, it'll break completely unexpectedly.

So, again, I'm not defending the old behavior and totally agree it should be fixed, but it needs to be done in such a way that there is a sane upgrade path.

@JBirdVegas
Copy link

The validation of output variables also appears to have other side effects like introducing runtime errors when external datasources are used as values for output variables. See #16728

@apparentlymart
Copy link
Member

Hi all! Thanks for the great discussion here.

First of all, I want to apologize for how long it took to respond here. With the thanksgiving holiday followed by some sick days I've been out of action for a while. 😖

When we discussed the rollout of this feature we had not considered the situation that @brikis98 pointed out here, where a module may currently have suitable input values to work, but then can catch the user out if input values are later change such that count is anything other than 1. For major releases we try to make sure that any breaking changes are immediately actionable by those following the upgrade guide, but in this case we fell short and I apologize for that.

Since the new feature was already rolled out, and since development for 0.12 depends on the foundational changes that enabled it, we decided to compromise with an opt-out mechanism in 0.11.1, so those with configurations containing problematic output expressions have a means to use 0.11.1 without first fixing all modules. We understand that this is not the most ideal migration path -- if we could do this over again we would've introduced the warning in one of the 0.10 point releases -- but we hope that this compromise is acceptable so that we can continue to make progress towards 0.12.

Along with that, we have added a warning along the lines of what @brikis98 suggested which is produced for all cases where the singleton attribute syntax is used with a resource with count set, even if the current dynamic value of count happens to be 1. This should make it easier to quickly identify problematic expressions and fix them, even if they are not currently causing errors.

We are very conscious of the need to carefully walk the line between stability and improvement of the product as we make these workflow and architectural changes on our way to Terraform 1.0. We will be more vigilant about such changes in future, to ensure there's always a reasonable upgrade path. Thanks for pointing out our miss here, and for your patience while we worked towards a solution.

@brikis98
Copy link
Contributor Author

@apparentlymart Thank you for getting these fixes into 0.11.1. Hopefully, the warnings will let us track these issues down, and the opt out gives us a reasonable path forward until we get them all fixed. We'll give it a shot and report back on how it goes.

@ebarault
Copy link

ebarault commented Feb 6, 2018

hi,

As per the various recommendations, I have the following configuration to manage outputs of my conditional resources:

resource "aws_cloudwatch_log_group" "log_group" {
  count   = "${var.cloudwatch_log_group != 0 ? 1 : 0}"
  name    = "${local.cloudwatch_log_group_name}"
}

resource "aws_cloudwatch_log_stream" "log_stream" {
  count          = "${var.cloudwatch_log_group != 0 ? length(var.cloudwatch_log_streams) : 0}"
  name           = "${var.cloudwatch_log_streams[count.index]}"
  log_group_name = "${aws_cloudwatch_log_group.log_group.name}"
}

output "cloudwatch_log_group_arn" {
  value = "${element(concat(aws_cloudwatch_log_group.log_group.*.arn, list("")), 0)}"
}

output "cloudwatch_log_streams_arns" {
  value = "${compact(concat(aws_cloudwatch_log_stream.log_stream.*.arn, list("")))}"
}

While this works fine to compute the outputs initially, whatever the resource count (0, 1, or *), it fails when changing from 1 to 0, or from * to 0.

Say for example i created a log stream, or a log group, and then decides to remove it, terraform crashes with

# example for cloudwatch_log_streams_arns
...
aws_iam_policy.cloudwatch_logs_access_policy: Destruction complete after 1s
Error: Error applying plan:

1 error(s) occurred:

* output.cloudwatch_log_group_arn: Resource 'aws_cloudwatch_log_group.log_group' does not have attribute 'arn' for variable 'aws_cloudwatch_log_group.log_group.*.arn'
# example for aws_cloudwatch_log_stream
...
aws_cloudwatch_log_stream.log_stream: Destruction complete after 0s
Releasing state lock. This may take a few moments...
Error: Error applying plan:

1 error(s) occurred:

* output.cloudwatch_log_streams_arns: Resource 'aws_cloudwatch_log_stream.log_stream' does not have attribute 'arn' for variable 'aws_cloudwatch_log_stream.log_stream.*.arn'

Re-applying works with no issue.

To me it's kind of an additional issue on top of the rest, as the recommended fixes don't work.

I'm very concerned as @brikis98 by such issues, as I have many places in my deployed modules where such configurations (which I nevertheless considered safe), could blow out when changing resources count.

Any suggestion?
Is there a feature flag to prevent errors in output from popping until this is fixed? Couldn't have TF_OUTPUT_ERRORS introduced in #16204 working...

@danhart
Copy link

danhart commented Feb 21, 2018

@ebarault I have now hit the exact problems you mentioned. Re-applying does solve the issue, but this is less than ideal. Maybe raise this as a new issue and reference this one? I'm not sure if we have a better solution since November.

@brikis98
Copy link
Contributor Author

@apparentlymart Should we open a new issue for the problem @ebarault is seeing?

@jbardin
Copy link
Member

jbardin commented Feb 23, 2018

Hi @brikis98, that's a good idea.

This issue is related to how outputs and locals are evaluated, and I have a simpler test case I've been working with. I'll open the issue for others to find.

@jbardin
Copy link
Member

jbardin commented Feb 24, 2018

Hi @ebarault, @danhart,

I've opened #17425 about this specific issue.

@hopkinsth
Copy link

hopkinsth commented Jun 7, 2018

I am also using the official workaround for this issue, but in my case, I'm referring to a resource that may have count = 0 as a parameter of another resource that also may have count = 0.

Here's the relevant section of my config. In particular, I see an error from terraform v0.11.7 for the aws_ecs_task_definition.proxy resource.

variable target_type {
  default = "hostname"
}

resource "aws_ecs_task_definition" "proxy" {
  count                    = "${var.target_type == "hostname" ? 1 : 0}"
  family                   = "${var.app_name}-proxy"
  container_definitions    = "${data.template_file.task_definition.rendered}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
}

resource "aws_ecs_service" "proxy" {
  count      = "${var.target_type == "hostname" ? length(local.app_ports) : 0}"

  name            = "${var.app_name}-proxy-${local.app_ports[count.index]}"
  cluster         = "${var.ecs_cluster}"
  launch_type     = "FARGATE"
  task_definition = "${element(concat(aws_ecs_task_definition.proxy.*.id, list("")), 0)}"
  desired_count   = "${var.num_containers}"

  load_balancer {
    target_group_arn = <<END
${element(
  aws_lb_target_group.group.*.arn, 
  index(
    aws_lb_target_group.group.*.name, 
    "${var.app_name}-${var.environment}-${element(local.app_ports, count.index)}"
  )
)}
END

    container_name = "${var.app_name}-proxy"
    container_port = "${local.app_ports[count.index]}"
  }

  network_configuration {
    subnets         = ["${var.subnets}"]
    security_groups = ["${var.security_groups}"]
  }
}

When I run terraform plan, I see an error similar to another one reported in this issue:

Error: Error running plan: 1 error(s) occurred:

* module.my_module.aws_ecs_service.proxy: 1 error(s) occurred:

* module.my_module.aws_ecs_service.proxy: Resource 'aws_ecs_task_definition.proxy' does not have attribute 'id' for variable 'aws_ecs_task_definition.proxy.*.id

My plan now is to split the module where this code lives in two to avoid needing to use count in this way, but has anyone else seen this issue when referring to attributes of a maybe-count = 0 resource as an argument to another resource?

For the sake of completeness, here's the output of my terraform version:

Terraform v0.11.7
+ provider.aws v1.22.0
+ provider.template v1.0.0

@hopkinsth
Copy link

I solved my issue here by avoiding the template_file resource for the ECS task definition. A heredoc worked where template_file did not in this case.

@bitpavel
Copy link

@danhart The workaround is to use this ugly idiom:

output "s3" {
  value = "${element(concat(aws_s3_bucket.assets.*.id, list("")), 0)}"
}

Finding all the places where this needs to be fixed and updating the code is a mess :(

Here's another simple workaround which works for me:
output "s3" { value = "${join("", aws_s3_bucket.assets.*.id)}" }

@ghost
Copy link

ghost commented Oct 18, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Oct 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants