Using element with splat reference should scope dependency to selected resource #3449

Closed
kklipsch opened this issue Oct 8, 2015 · 95 comments · Fixed by #14135

Comments

kklipsch commented Oct 8, 2015

I'm trying to set up a multi-node cluster with attached EBS volumes. An example is below:

resource "aws_instance" "nodes" {
    instance_type = "${var.model}"
    key_name = "${var.ec2_keypair}"
    ami = "${lookup(var.zk_amis, var.region)}"
    count = "${var.node_count}"
    vpc_security_group_ids = ["${aws_security_group.default.id}"]
    subnet_id = "${lookup(var.subnet_ids, element(keys(var.subnet_ids), count.index))}"
    associate_public_ip_address = true
    user_data = "${file("cloud_init")}"
    tags {
        Name = "${var.cluster_name}-${count.index}"
    }
}

resource "aws_ebs_volume" "node-ebs" {
    count = "${var.node-count}"
    availability_zone = "${element(keys(var.subnet_ids), count.index)}"
    size = 100
    tags {
        Name = "${var.cluster_name}-ebs-${count.index}"
    }
}

resource "aws_volume_attachment" "node-attach" {
    count = "${var.node_count}"
    device_name = "/dev/xvdh"
    volume_id = "${element(aws_ebs_volume.node-ebs.*.id, count.index)}"
    instance_id = "${element(aws_instance.nodes.*.id, count.index)}"
}

If a change happens to a single node (for instance if a single ec2 instance is terminated) ALL of the aws_volume_attachments are recreated.

Clearly we would not want volume attachments to be removed in a production environment. Worse than that, in conjunction with #2957 you first must unmount these attachments before they can be recreated. This has the effect of making volume attachments only viable on brand new clusters.

danabr commented Oct 9, 2015

Confirmed. We have run into this issue as well. I think it has to do with dependencies not taking the "count" into account.

danabr commented Oct 9, 2015

I think this comes down to the fact that the state does not track which specific resource instance another instance depends on, only the resource as a whole. Here is an example:

"aws_volume_attachment.db_persistent_volume_attachment.0": {
                "type": "aws_volume_attachment",
                "depends_on": [
                    "aws_ebs_volume.db_volume",
                    "aws_instance.db_instance"
                ],
                "primary": {
                    "id": "vai-1795886726",
                    "attributes": {
                        "device_name": "/dev/sdb",
                        "id": "vai-1795886726",
                        "instance_id": "i-bb16c319",
                        "volume_id": "vol-cca36821"
                    }
                }
            }

When removing an aws_instance, you would have to find all aws_volume_attachments which happen to share the same "instance_id" attribute. But that would be provider, and perhaps even resource, specific.

However, this is not specific to AWS. It will occur any time you have two resources with count parameters, where one resource depends on the other. The right abstraction would be to depend on "aws_instance.db_instance.0" in this case. I don't know what the implications of that would be, though.
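
For illustration, referencing a specific index directly expresses that narrower dependency; it is essentially the duplication workaround discussed later in this thread (names reused from the state excerpt above):

resource "aws_volume_attachment" "db_persistent_volume_attachment_0" {
    device_name = "/dev/sdb"
    # Indexing into the resource directly ties this attachment to a single
    # instance/volume pair instead of the whole splat list.
    instance_id = "${aws_instance.db_instance.0.id}"
    volume_id   = "${aws_ebs_volume.db_volume.0.id}"
}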

danabr commented Oct 12, 2015

Turns out I was wrong. The "depends_on" attribute in the state file has nothing to do with this. Consider this diff:

-/+ aws_volume_attachment.persistent_volume_attachment.0
    device_name:  "/dev/sdb" => "/dev/sdb"
    force_detach: "" => "<computed>"
    instance_id:  "i-deb76479" => "${element(aws_instance.my_instance.*.id, count.index)}" (forces new resource)
    volume_id:    "vol-9ba36878" => "vol-9ba36878"

It seems like changing one element of the aws_instance.my_instance.*.id causes the entire "element" expression to be considered changed.

danabr commented Oct 13, 2015

Our current workaround is to duplicate the "aws_volume_attachment" resources, rather than using the element function.

danabr commented Oct 14, 2015

I dug further into this. It seems the expected behaviour broke with commit 7735847, which was introduced to fix issue #2744.

To me, it seems like you want the treatment of unknown values in splats to behave differently depending on the interpolation context. When you use formatlist, you want to treat the entire list as unknown if it contains any unknown value, but for element, you only care about whether a specific value in the list is unknown.
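
As a rough sketch of the two contexts, reusing the resources from the original report (the output here is hypothetical):

# formatlist consumes the whole splat list, so a single unknown id makes the
# entire interpolation unknown:
output "node_ids" {
    value = "${join(",", formatlist("id=%s", aws_instance.nodes.*.id))}"
}

# element only needs the one entry selected by count.index:
resource "aws_volume_attachment" "node-attach" {
    count = "${var.node_count}"
    device_name = "/dev/xvdh"
    volume_id = "${element(aws_ebs_volume.node-ebs.*.id, count.index)}"
    instance_id = "${element(aws_instance.nodes.*.id, count.index)}"
}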

I did a test where I introduced a new splat operator with the only difference being how it is treated if the list contains unknown values. It solves the problem, but having two splat operators is kind of confusing.

@mitchellh: Ideas?

phinze commented Oct 14, 2015

Thanks for the report @kklipsch and thanks for taking the time to do a deep dive, @danabr!

> To me, it seems like you want the treatment of unknown values in splats to behave differently depending on the interpolation context. When you use formatlist, you want to treat the entire list as unknown if it contains any unknown value, but for element, you only care about whether a specific value in the list is unknown.

Yep, I think this is the key insight. Switching this to core since it's not a provider-level bug.

In the meantime, duplicating aws_volume_attachments to avoid the usage of splat / element() is a valid workaround.

@phinze phinze added core and removed provider/aws labels Oct 14, 2015
@kklipsch (Author)

OK, thanks. Unfortunately, for our use case that very quickly becomes unwieldy, as we are running tens of nodes currently but want to be able to scale up to hundreds.

danabr commented Oct 15, 2015

@kklipsch: If you are OK with running a patched terraform for a while, and you don't rely on the formatlist behavior anywhere, you can just comment out the three lines at https://github.com/hashicorp/terraform/blob/master/terraform/interpolate.go#L466, and compile terraform yourself.

@pdakhane

@danabr and @phinze
I tried the workaround using Terraform constructs in the following manner, but it did not help. Could you please share more details on the workaround mentioned above ("In the meantime, duplicating aws_volume_attachments to avoid the usage of splat / element() is a valid workaround")?

resource "aws_instance" "appnodes" {
  instance_type = "${var.flavor_name}"
  ami = "${var.image_name}"
  key_name = "${var.key_name}"
  security_groups = ["${split(",", var.security_groups)}"]
  availability_zone = "${var.availability_zone}" 
  user_data = "${file("mount.sh")}"
  tags {
    Name = "${var.app_name}-${format("%02d", 1)}"
  }
}


resource "aws_volume_attachment" "ebsatt" {
  device_name = "/dev/sdh"
  volume_id = "${aws_ebs_volume.ebsvolumes.id}"
  instance_id = "${aws_instance.appnodes.id}"
}

resource "aws_ebs_volume" "ebsvolumes" {
  availability_zone = "${var.availability_zone}" 
  size = "${var.ebs_size}"
  type = "${var.ebs_type}"
}

resource "aws_instance" "app-nodes" {
  instance_type = "${var.flavor_name}"
  ami = "${var.image_name}"
  key_name = "${var.key_name}"
  security_groups = ["${split(",", var.security_groups)}"]
  availability_zone = "${var.availability_zone}" 
  user_data = "${file("mount.sh")}"
  tags {
    Name = "${var.app_name}-${format("%02d", 1)}"
  }
}


resource "aws_volume_attachment" "ebs_att" {
  device_name = "/dev/sdh"
  volume_id = "${aws_ebs_volume.ebs-volumes.id}"
  instance_id = "${aws_instance.app-nodes.id}"
}

resource "aws_ebs_volume" "ebs-volumes" {
  availability_zone = "${var.availability_zone}" 
  size = "${var.ebs_size}"
  type = "${var.ebs_type}"
}

danabr commented Oct 22, 2015

@pdakhane: Just take kklipsch's example, but instead of using a "count" attribute on the aws_volume_attachment resource, create multiple aws_volume_attachment resources referring directly to the instances and volumes. For example, if you have three instances:

resource "aws_volume_attachment" "persistent_volume_attachment_0" {
    device_name = "/dev/sdb"
    instance_id = "${aws_instance.instance.0.id}"
    volume_id = "${aws_ebs_volume.volume.0.id}"
}

resource "aws_volume_attachment" "persistent_volume_attachment_1" {
    device_name = "/dev/sdb"
    instance_id = "${aws_instance.instance.1.id}"
    volume_id = "${aws_ebs_volume.volume.1.id}"
}

resource "aws_volume_attachment" "persistent_volume_attachment_2" {
    device_name = "/dev/sdb"
    instance_id = "${aws_instance.instance.2.id}"
    volume_id = "${aws_ebs_volume.volume.2.id}"
}

This only works if you have a small number of nodes, though, and are OK with using the same number of instances in all environments.

@mberhault

@phinze pointed to this issue as potentially related to mine.

Here is my config (redacted for readability):

resource "aws_instance" "cockroach" {
  tags {
    Name = "${var.key_name}-${count.index}"
  }
  count = "${var.num_instances}"
  ...
}

resource "null_resource" "cockroach-runner" {
  count = "${var.num_instances}"
  connection {
    ...
    host = "${element(aws_instance.cockroach.*.public_ip, count.index)}"
  }

  triggers {
    instance_ids = "${element(aws_instance.cockroach.*.id, count.index)}"
  }

   provisioner "remote-exec" {
     ....
  }
}

The basic idea is that every instance gets a "runner" attached that does binary deployment and other things. I'm using a null_resource to break a dependency cycle with ELB addresses used by the runner.

The first time I bring up an instance, everything works fine: each instance gets created, then the null_resource runs properly on each.
However, when I terminate an arbitrary instance through the EC2 console (e.g. destroying instance 1), all null_resources get rerun.

Here's the log of terraform plan after terminating an instance:

~ aws_elb.elb
    instances.#: "" => "<computed>"

+ aws_instance.cockroach.1
    ami:                        "" => "ami-1c552a76"
    availability_zone:          "" => "us-east-1b"
    ebs_block_device.#:         "" => "<computed>"
    ephemeral_block_device.#:   "" => "<computed>"
    instance_type:              "" => "t2.medium"
    key_name:                   "" => "cockroach-marc"
    placement_group:            "" => "<computed>"
    private_dns:                "" => "<computed>"
    private_ip:                 "" => "<computed>"
    public_dns:                 "" => "<computed>"
    public_ip:                  "" => "<computed>"
    root_block_device.#:        "" => "<computed>"
    security_groups.#:          "" => "1"
    security_groups.2129892981: "" => "cockroach-marc-security-group"
    source_dest_check:          "" => "1"
    subnet_id:                  "" => "<computed>"
    tags.#:                     "" => "1"
    tags.Name:                  "" => "cockroach-marc-1"
    tenancy:                    "" => "<computed>"
    vpc_security_group_ids.#:   "" => "<computed>"

-/+ null_resource.cockroach-runner.0
    triggers.#:            "1" => "<computed>" (forces new resource)
    triggers.instance_ids: "i-21867290" => ""

-/+ null_resource.cockroach-runner.1
    triggers.#:            "1" => "<computed>" (forces new resource)
    triggers.instance_ids: "i-fd85714c" => ""

-/+ null_resource.cockroach-runner.2
    triggers.#:            "1" => "<computed>" (forces new resource)
    triggers.instance_ids: "i-20867291" => ""

I was expecting only "null_resource.cockroach-runner.1" to be updated, but it seems that 0 and 2 changed as well.

phinze commented Dec 15, 2015

Re-titling this to indicate the nature of the core issue here. We'll get this looked at soon!

@phinze phinze changed the title aws_volume_attachment recreates attachments unnecessarily Using element with splat reference should scope dependency to selected resource Dec 15, 2015
bflad commented Jan 4, 2016

Just pinging here since we just ran into this issue as well.

phinze commented Jan 13, 2016

Okay, I've just consolidated a few other issue threads that were expressions of this bug into this one.

My apologies to all the users who have been hitting this - this is now in my list of top priority core bugs to get fixed soon.

As I alluded to with the re-title, this issue comes down to the fact that Terraform core is currently unaware that ${element(some.splat.*.reference)} is a dependency on a single element from the splat. It simply sees "there is a splat reference in this interpolation" and therefore believes--incorrectly--that it needs to consider every element in the list when calculating whether the overall value of the interpolation is computed.

The most direct solution would be to "just make it work for element()". In other words, add special-casing into Terraform's interpolation handling of splats that would look for a surrounding element() and use different logic for the computed calculation when it is found.

This is probably not the right way to go as it is (a) difficult to implement "context-awareness" into that part of the codebase, and (b) a brittle solution that sets a bad precedent of special casing certain functions in the core.

Because of this, the core team thinks the best way forward is to add first-class list indexing into the interpolation language. This would promote the behavior of element() to a language feature (likely square-bracket notation) and give the core interpolation code a rich enough context to be able to implement the correct computed-value scoping we need to fix this bug.

I've got a spike of first-class-indexing started, and I'll keep this thread updated with my progress.
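
For reference, the square-bracket indexing that later comments in this thread use looks roughly like this when applied to the original example:

resource "aws_volume_attachment" "node-attach" {
    count = "${var.node_count}"
    device_name = "/dev/xvdh"
    # First-class indexing selects a single list element, so only that
    # element's computed/unknown status matters:
    volume_id = "${aws_ebs_volume.node-ebs.*.id[count.index]}"
    instance_id = "${aws_instance.nodes.*.id[count.index]}"
}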

@phinze phinze self-assigned this Jan 13, 2016
@justinclayton

💯

bflad commented Jan 13, 2016

@phinze thank you so much for the detailed response and the ongoing effort! 🎉

rgs1 commented Feb 6, 2016

Thanks for the report @phinze - is there a WIP branch available to follow along?

jkinred commented Feb 7, 2016

Keen to see this one resolved. Quite limiting for those of us using count with instances and template_file to generate userdata.

Does anyone know of a workaround?

@sigmunau

I created a new issue #14536 with details of my problem

@dannytrigo

I have a similar issue passing a splat list to a module, even if I access the elements inside the module with [] syntax. Does passing a list into a module impact whether it's considered 'unknown' as a whole?

@jnahelou

@apparentlymart I have the same issue as @dannytrigo :'(

Here is my sample:

.
├── inputs.tfvars
├── instance
│   └── main.tf
├── main.tf

main.tf

variable "shortnames" {
  type = "list"
}

module "generic_linux" {
  source     = "./instance/"
  shortnames = "${var.shortnames}"
}

resource "aws_ebs_volume" "data" {
  count             = "${length(var.shortnames)}"
  availability_zone = "eu-west-1a"
  size              = "5"
  type              = "gp2"
}

resource "aws_volume_attachment" "data_ebs_att" {
  count       = "${length(var.shortnames)}"
  device_name = "/dev/sdc"
  volume_id   = "${aws_ebs_volume.data.*.id[count.index]}"
  instance_id = "${module.generic_linux.instances_ids[count.index]}"
}

My module code is:
instance/main.tf

variable "shortnames" {
  type        = "list"
  description = "list of shortname"
}

resource "aws_instance" "instances" {
  count = "${length(var.shortnames)}"

  instance_type          = "t2.micro"
  key_name               = "formation-hpc"
  ami                    = "ami-xxxxxxxx"
  vpc_security_group_ids = ["sg-xxxxxxxx"]
  subnet_id              = "subnet-xxxxxxxx"

  tags {
    Name = "${var.shortnames[count.index]}-${count.index}"
  }
}

output "instances_ids" {
  value = "${aws_instance.instances.*.id}"
}

Using instance_id = "${module.generic_linux.instances_ids[count.index]}" forces a new resource:

-/+ aws_volume_attachment.data_ebs_att[0] (new resource required)
      id:                               "vai-764663169" => <computed> (forces new resource)
      device_name:                      "/dev/sdc" => "/dev/sdc"
      force_detach:                     "" => <computed>
      instance_id:                      "i-02759cd6c3590764f" => "${module.generic_linux.instances_ids[count.index]}" (forces new resource)
      skip_destroy:                     "" => <computed>
      volume_id:                        "vol-096815f03a512625c" => "vol-096815f03a512625c"

Is there a workaround to add nodes to a cluster of undefined size, based on a generic instance module, without recreating each dependent resource?

loalf commented Mar 2, 2018

I do have the same issue. It's a bit shocking that this issue has been open for 2 years and hasn't been fixed already. It'd be great if someone had a workaround for this.

loalf commented Mar 2, 2018

I just happened to find a workaround-ish using lifecycle ignore_changes. So, in @jnahelou's example, the amended terraform script would look like:

resource "aws_volume_attachment" "data_ebs_att" {
  count       = "${length(var.shortnames)}"
  device_name = "/dev/sdc"
  volume_id   = "${aws_ebs_volume.data.*.id[count.index]}"
  instance_id = "${module.generic_linux.instances_ids[count.index]}"

  lifecycle {
    ignore_changes = ["instance_id"]
  }
}

armenr commented Mar 5, 2018

I've done something similar, @loalf, but it feels as though that really shouldn't be necessary. Given that Terraform is intentionally declarative, I can see how it's ended up being this way.

In my case, I dynamically allocate instances in round-robin fashion to whatever variable number of subnets I have. BUT, when you change the number of subnets you have provisioned in a given VPC, it can dangerously trigger the recreation of your EC2's, so I've done something similar to what you have.

Check this:

resource "aws_instance" "ec2_instance" {
  ami                     = "${lookup(var.aws_machine_images, "${var.ubuntu_version},${var.aws_region}")}"
  instance_type           = "${var.instance_type}"
  count                   = "${var.total_instances}"
  disable_api_termination = "${var.enable_instance_protection}"

  # TODO: Fix this!
  # Changing the number of subnets will trigger resource recreation
  # ergo, the lifecycle manager
  subnet_id = "${element(var.subnet_ids, count.index)}"

  key_name                    = "${var.key_pair_id}"
  vpc_security_group_ids      = ["${var.security_group_ids}"]
  associate_public_ip_address = "${var.associate_public_ip_address}"

  root_block_device {
    volume_type           = "${var.root_volume_type}"
    volume_size           = "${var.root_volume_size_gb}"
    delete_on_termination = "${var.storage_delete_on_termination}"
  }

  tags {
    Name        = "${var.total_instances > 1 ? format("%s-%02d-%s", var.instance_name, (count.index + 1), var.environment) : format("%s-%s", var.instance_name, var.environment)}"
    ServerGroup = "${var.instance_name}-${var.environment}"
    ServerName  = "${var.instance_name}${count.index}"
    Environment = "${var.environment}"
  }

  lifecycle {
    ignore_changes = ["subnet_ids"]
  }
}

misham commented Mar 5, 2018

@armenr would your solution create additional ec2 instances or remove extra ones when the count of subnets changes?

Or is this designed to always keep the number of ec2 instances static after initial creation?

bereti commented Mar 14, 2018

Please check this option:
#14357

Instead of "element" use the [] option.

armenr commented Mar 20, 2018

@misham - Good question! It will KEEP existing instances in the subnets where they reside, and add instances when you add a subnet to your list of subnets.

From what I recall, if I issue a destroy on a specific subnet, the EC2's get destroyed also.

zeroedin commented May 17, 2018

I recently upgraded Terraform:

Terraform v0.11.7
+ provider.aws v1.18.0
+ provider.template v1.0.0

I've tried the syntax suggested by @apparentlymart:

# Create AWS Instances
resource "aws_instance" "web" {
  count                         = "${var.count}"
  ami                           = "${var.aws_ami}"
  instance_type                 = "${var.aws_instance_type}"
  associate_public_ip_address   = "${var.aws_public_ip}"
  ...
}

# Attach Instances to Application Load Balancer
resource "aws_alb_target_group_attachment" "web" {
  count = "${var.count}"
  target_group_arn = "${var.aws_alb_target_group_arn}"
  # target_id = "${element(aws_instance.web.*.id, count.index)}"
  target_id = "${aws_instance.web.*.id[count.index]}"
  port = "${var.aws_alb_target_group_port}"
}

However, when I issue the command:

terraform plan --destroy --var-file=staging.tfvars -target=aws_alb_target_group_attachment.web[2] -target=aws_instance.web[2]

or just

terraform plan --destroy --var-file=staging.tfvars -target=aws_instance.web[2]

Terraform wants to destroy all aws_alb_target_group_attachments:

Terraform will perform the following actions:

  - aws_alb_target_group_attachment.web[0]

  - aws_alb_target_group_attachment.web[1]

  - aws_alb_target_group_attachment.web[2]

  - aws_alb_target_group_attachment.web[3]

  - aws_instance.web[2]


Plan: 0 to add, 0 to change, 5 to destroy.

I can properly remove just the aws_alb_target_group_attachment:

terraform plan --destroy --var-file=staging.tfvars -target=aws_alb_target_group_attachment.web[2]

However, if I follow that up with a destroy of the instance, it will still want to remove all of the other remaining target group attachments.

Is the approach wrong or is there still a bug here?

miry commented Feb 8, 2019

I still have the same problem.

Example:

resource "aws_instance" "masters" {
  count = "3"
  ami = "${var.ami}"
}

resource "null_resource" "outindex" {
  count = "3"
  triggers {
    cluster_instance = "${aws_instance.masters.*.id[count.index]}"
  }

  provisioner "local-exec" {
    command = "date"
  }
  lifecycle { create_before_destroy = true }
}

When I try to update the first instance with a new AMI, it first updates ALL instances, then starts executing the null resource.

 $ terraform plan -target="null_resource.outindex[0]"
-/+ aws_instance.masters[0] (new resource required)
      ami:               "ami-xxxx" => "ami-yyy" (forces new resource)

-/+ aws_instance.masters[1] (new resource required)
      ami:               "ami-xxxx" => "ami-yyy" (forces new resource)

-/+ aws_instance.masters[2] (new resource required)
      ami:               "ami-xxxx" => "ami-yyy" (forces new resource)

+ null_resource.outindex[0]

I expected to see changes only for the first instance.

Environment:

 $ terraform version
Terraform v0.11.11
+ provider.aws v1.57.0
+ provider.external v1.0.0
+ provider.local v1.1.0
+ provider.null v2.0.0
+ provider.template v2.0.0

OS: MacOS

UPDATE:

Currently, to fix this, I do:

$ terraform plan -target="null_resource.outindex[0]" -target="aws_instance.masters[0]"
-/+ aws_instance.masters[0] (new resource required)
      ami:               "ami-xxxx" => "ami-yyy" (forces new resource)

+ null_resource.outindex[0]

lonegunmanb added a commit to lonegunmanb/tf-ucloud-gb-sample that referenced this issue Mar 12, 2019
you cannot use count.index with a list variable to get a resource id; it forces new resources, that is, it deletes all old resources and re-creates them
zygisa commented Apr 23, 2019

I see that this is closed but I'm still experiencing the same issue in v0.11.10. Is this expected?

dekimsey commented Jul 1, 2019

I noticed this issue appears to still be happening in Terraform v0.11.14. Could this be because we are using a module under the hood to create the EC2 instances? Incrementing our count from 7 => 8 causes all volume attachments 1-7 to be re-attached.

module "elk-elasticsearch-node" {
  source = "./app-cluster-static"
}
# ./app-cluster-static/main.tf
module "this" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "~> 1.19.0"
  ...
}
-/+ aws_volume_attachment.elk-elasticsearch-node[7] (new resource required)
      id:                                        "vai-2048368739" => <computed> (forces new resource)
      device_name:                               "/dev/sdf" => "/dev/sdf"
      instance_id:                               "i-08493c7837712a7ea" => "${module.elk-elasticsearch-node.instance_ids[count.index]}" (forces new resource)
      volume_id:                                 "vol-0c9a19f4f4ce4dfd6" => "vol-0c9a19f4f4ce4dfd6"

  + aws_volume_attachment.elk-elasticsearch-node[8]
      id:                                        <computed>
      device_name:                               "/dev/sdf"
      instance_id:                               "${module.elk-elasticsearch-node.instance_ids[count.index]}"
      volume_id:                                 "${aws_ebs_volume.elk-elasticsearch-node.*.id[count.index]}"
$ terraform -v
Terraform v0.11.14
+ provider.aws v2.13.0
+ provider.azuread v0.3.1
+ provider.null v2.1.0
+ provider.random v2.1.0
+ provider.template v2.1.0
+ provider.tls v2.0.1

ghost commented Jul 24, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Jul 24, 2019