Error when instance changed that has EBS volume attached #2957

Closed · bloopletech opened this issue Aug 7, 2015 · 81 comments

@bloopletech commented Aug 7, 2015

This is the specific error I get from terraform:

aws_volume_attachment.admin_rundeck: Destroying...
aws_volume_attachment.admin_rundeck: Error: 1 error(s) occurred:

* Error waiting for Volume (<vol id>) to detach from Instance: <instance id>
Error applying plan:

3 error(s) occurred:

* Error waiting for Volume (<vol id>) to detach from Instance: <instance id>
* aws_instance.admin_rundeck: diffs didn't match during apply. This is a bug with Terraform and should be reported.
* aws_volume_attachment.admin_rundeck: diffs didn't match during apply. This is a bug with Terraform and should be reported.

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

We are building out some infrastructure in EC2 using Terraform (v0.6.0). I'm currently working out our persistent storage setup. The strategy I'm planning is to have the root volume of every instance be ephemeral, and to move all persistent data to a separate EBS volume (one persistent volume per instance). We want this to be as automated as possible, of course.

Here is a relevant excerpt from our terraform config:

resource "aws_instance" "admin_rundeck" {
  ami = "${var.aws_ami_rundeck}"
  instance_type = "${var.aws_instance_type}"
  subnet_id = "${aws_subnet.admin_private.id}"
  vpc_security_group_ids = ["${aws_security_group.base.id}", "${aws_security_group.admin_rundeck.id}"]
  key_name = "Administration"

  root_block_device {
    delete_on_termination = false
  }

  tags {
    Name = "admin-rundeck-01"
    Role = "rundeck"
    Application = "rundeck"
    Project = "Administration"
  }
}

resource "aws_ebs_volume" "admin_rundeck" {
  size = 500
  availability_zone = "${var.default_aws_az}"
  snapshot_id = "snap-66fc2258"
  tags = {
    Name = "Rundeck Data Volume"
  }
}

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id = "${aws_ebs_volume.admin_rundeck.id}"

  depends_on = ["aws_route53_record.admin_rundeck"]

  connection {
    host = "admin-rundeck-01.<domain name>"
    bastion_host = "${aws_instance.admin_jumpbox.public_ip}"
    timeout = "1m"
    key_file = "~/.ssh/admin.pem"
    user = "ubuntu"
  }

  provisioner "remote-exec" {
    script = "mount.sh"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo mkdir -m 2775 /data/rundeck",
      "sudo mkdir /data/rundeck/data /data/rundeck/projects && sudo chown -R rundeck:rundeck /data/rundeck",
      "sudo service rundeckd restart"
    ]
  }
}

And mount.sh:

#!/bin/bash

while [ ! -e /dev/xvdf ]; do sleep 1; done

fstab_string='/dev/xvdf /data ext4 defaults,nofail,nobootwait 0 2'
if ! grep -q -F "$fstab_string" /etc/fstab; then
  echo "$fstab_string" | sudo tee -a /etc/fstab
fi

sudo mkdir -p /data && sudo mount -t ext4 /dev/xvdf /data

As you can see, this:

  • Provisions an instance to run Rundeck (http://rundeck.org/)
  • Provisions an EBS volume based off of a snapshot. The snapshot in this case is just an empty ext4 partition.
  • Attaches the volume to the instance
  • Mounts the volume inside the instance, and then creates some directories to store data in

This works fine the first time it's run. But any time we:

  • make a change to the instance configuration (e.g. change the value of var.aws_ami_rundeck) or
  • make a change to the provisioner config of the volume attachment resource

Terraform then tries to detach the existing volume from the instance, and this step fails every time. I believe this is because you are meant to unmount the EBS volume from inside the instance before detaching it. The problem is, I can't work out how to get Terraform to unmount the volume inside the instance before it tries to detach the volume.

It's almost like I need a provisioner to run before the resource is created, or a provisioner to run on destroy (obviously #386 comes to mind).

This feels like it would be a common problem for anyone working with persistent EBS volumes using terraform, but my googling hasn't really found anyone even having this problem.

Am I simply doing it wrong? I'm not worried about how I get there specifically; I just want to be able to provision persistent EBS volumes, and then attach and detach those volumes from my instances in an automated fashion.

@jarias commented Aug 13, 2015

Having the same issue here.

@febbraro commented Aug 17, 2015

I'm also having this issue. I have to detach the volume manually in the AWS Console for Terraform to complete my apply operation.

@tobyclemson commented Aug 19, 2015

I too am having this problem. Would it be enough to destroy the instance rather than trying to destroy the volume association?

@danabr commented Aug 28, 2015

We're also having the same issue.

One solution is to stop the instance that has mounted the volume before running terraform apply. From the AWS CLI documentation:
"Make sure to unmount any file systems on the device within your operating system before detaching the volume. Failure to do so results in the volume being stuck in a busy state while detaching."

This might be what we are seeing here.

@james-s-nduka commented Sep 3, 2015

This bug has become quite critical to us. Is anyone looking into this currently?

@Pryz (Contributor) commented Sep 8, 2015

Same issue here. Any update? Thanks

@danabr commented Sep 9, 2015

One solution would be to stop the associated instance before removing the volume attachment. Perhaps this is too intrusive to do automatically, though.

@ryedin commented Sep 24, 2015

same issue... and I don't think udev helps here (does udev publish an event when a device is attempting to detach?)

EDIT: tried adding force_detach option... no dice

@bitoiu commented Sep 28, 2015

Same issue here 😢

@JesperTerkelsen commented Sep 30, 2015

I guess terraform should terminate instances before removing attachments by default on a full terraform destroy?

@simonluijk commented Sep 30, 2015

@JesperTerkelsen As long as your application can shutdown gracefully within the 20 seconds given by AWS that makes sense.

@nimbusscale commented Sep 30, 2015

Me too!

@j0nesin commented Oct 23, 2015

I also needed to persist EBS volumes between instance re-creates and experienced this problem when trying to use volume attachments. My workaround is to drop the aws_volume_attachment resources and have each instance use the AWS CLI at boot time to self-attach the volume it is paired with. When the instance is re-created, terraform first destroys the instance, which detaches the volume and makes it available for the next instance coming up.

In the instance user-data, include the following template script (elasticsearch_mount_vol.sh):

INSTANCE_ID=`curl http://169.254.169.254/latest/meta-data/instance-id`

# wait for ebs volume to be attached
while :
do
    # self-attach ebs volume
    aws --region us-east-1 ec2 attach-volume --volume-id ${volume_id} --instance-id $INSTANCE_ID --device ${device_name}

    if lsblk | grep ${lsblk_name}; then
        echo "attached"
        break
    else
        sleep 5
    fi
done

# create fs if needed
if file -s ${device_name} | grep "${device_name}: data"; then
    echo "creating fs"
    mkfs -t ext4 ${device_name}
fi

# mount it
mkdir ${mount_point}
echo "${device_name}       ${mount_point}   ext4    defaults,nofail  0 2" >> /etc/fstab
echo "mounting"
mount -a

And the corresponding Terraform config:

resource "aws_ebs_volume" "elasticsearch_master" {
    count = 3
    availability_zone = "${lookup(var.azs, count.index)}"
    size = 8
    type = "gp2"
    tags {
        Name = "elasticsearch_master_az${count.index}.${var.env_name}"
    }
}

resource "template_file" "elasticsearch_mount_vol_sh" {
    filename = "${path.module}/elasticsearch_mount_vol.sh"
    count = 3
    vars {
        volume_id = "${element(aws_ebs_volume.elasticsearch_master.*.id, count.index)}"
        lsblk_name = "xvdf"
        device_name = "/dev/xvdf"
        mount_point = "/esvolume"
    }
}
resource "aws_instance" "elasticsearch_master" {
    count = 3
    ...
    user_data = <<SCRIPT
#!/bin/bash

# Attach and Mount ES EBS volume
${element(template_file.elasticsearch_mount_vol_sh.*.rendered, count.index)}

SCRIPT
}

@jimconner commented Nov 9, 2015

Same issue here - it would be nice if terraform had support for 'deprovisioners' so that we could execute some steps (such as a shutdown -h now) before machine destruction is attempted. We did find that if we ran terraform taint on the instance before terraform destroy, the destroy completed successfully, so we'll use that as a workaround for now.

jimconner pushed a commit to alphagov/paas-alpha-tsuru-terraform that referenced this issue Nov 9, 2015

Jim Conner
Updated readme with extra step for destruction.
Due to a [bug in Terraform](hashicorp/terraform#2957)
removal of disk attachments is attempted whilst the volume is still in use.
To work around this bug, it is necessary to first `taint` the instance before
destroying it. When the instance is tainted, terraform doesn't wait for the
volume to be unused before destruction.

@jniesen commented Nov 14, 2015

I have a related issue with an instance and EBS volume. I think a solution to my problem may fix this as well. With version 0.6.3, when destroying, it seems that the volume attachment is always destroyed before the instance.

consul_keys.ami: Refreshing state... (ID: consul)
aws_security_group.elb_sg: Refreshing state... (ID: sg-xxxx)
aws_ebs_volume.jenkins_master_data: Refreshing state... (ID: vol-xxxx)
aws_security_group.jenkins_sg: Refreshing state... (ID: sg-xxxx)
aws_instance.jenkins_master: Refreshing state... (ID: i-xxxx)
aws_elb.jenkins_elb: Refreshing state... (ID: jniesen-jenkins-master-elb)
aws_volume_attachment.jenkins_master_data_mount: Refreshing state... (ID: vai-xxxx)
aws_route53_record.jenkins: Refreshing state... (ID: xxxx)
aws_volume_attachment.jenkins_master_data_mount: Destroying...
aws_route53_record.jenkins: Destroying...
aws_route53_record.jenkins: Destruction complete
aws_elb.jenkins_elb: Destroying...
aws_elb.jenkins_elb: Destruction complete
Error applying plan:

1 error(s) occurred:

* aws_volume_attachment.jenkins_master_data_mount: Error waiting for Volume (vol-xxxx) to detach from Instance: i-xxxx

I thought that I could get around this by having a systemd unit stop the process using the mounted EBS volume and then unmount it whenever the instance receives a halt or shutdown. The problem is that this never happens before the EBS volume destroy is attempted. I think if the order could be forced, and I could have the instance destroyed before the volume, things would go more smoothly.

@j0nesin commented Nov 14, 2015

If you use 'depends_on' in the instance definition to depend on the ebs volume, then the destroy sequence will destroy the instance before trying to destroy the volume.
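
For reference, a minimal sketch of that suggestion, reusing the resource names from the original config; this is a hypothetical illustration, and whether it helps with the attachment itself is discussed in the next comment:

resource "aws_instance" "admin_rundeck" {
  ami           = "${var.aws_ami_rundeck}"
  instance_type = "${var.aws_instance_type}"

  # Explicit dependency on the volume: on destroy, Terraform then
  # removes the instance before it touches the EBS volume itself.
  depends_on = ["aws_ebs_volume.admin_rundeck"]
}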

@jniesen commented Nov 14, 2015

The error comes when destroying the volume_attachment, which should just detach the volume. I mis-spoke in my last paragraph: I can't make the instance depend on the attachment explicitly, because the attachment already depends on the instance implicitly (I'm referencing the instance's ID).

@james-masson commented Nov 24, 2015

+1 agree with @jniesen

A persistent data disk, separate from OS/instance would be a great feature, if it worked!

Creation of related aws_ebs_volume, aws_instance and aws_volume_attachment resources work fine.

Any apply that involves the re-creation of the aws_instance hangs, as the aws_volume_attachment implicitly depends on the aws_instance it references and is destroyed first - causing the volume detach to hang.

For this to work in an elegant fashion, the VM would have to be destroyed first, to get a clean unmount.

@opokhvalit commented Dec 28, 2015

Got the same problem. The taint workaround works fine, thanks @jimconner

@ghost commented Dec 29, 2015

+1 to a fix. If the attached EBS volume is in use by the OS, say by a daemon process (e.g., Docker), then some mechanism has to be provided by Terraform to allow OS-level calls for a clean service stop and unmount. Some of the ideas listed herein are possible workarounds, but not tenable long-term solutions.

@sudochop commented Dec 30, 2015

+1 Same problem here. Thanks for the workaround @jimconner

@arthurschreiber commented Jan 13, 2016

I'm also running into this issue. If both the aws_instance as well as the linked aws_volume_attachment are scheduled to be deleted, the instance needs to be deleted first.

@arthurschreiber commented Jan 13, 2016

See #4643 for a similar problem, and the feature request in #622 which would provide an easy fix for this.

@mitchellh (Member) commented Nov 17, 2016

This is pretty much the same as #2761, I'm sure there are other places this is being tracked too... going to close this one. (The reference here will link them, too)

@redbaron commented Nov 18, 2016

@mitchellh, arguably this issue has the bigger "community" and should be considered the main point of contact for tracking all the dependency problems that can't be expressed using the simplistic graph model TF is currently using.

#2761 is a valid issue too, but it has 5 comments and 9 subscribers; it's a strange choice to keep that one and close this.

@carterjones commented Jan 3, 2017

I know this thread was closed in favor of #2761, but given that that issue is still open, I wanted to leave this here for anyone else still experiencing this particular issue.

I was able to set skip_destroy to true on the volume attachment to solve this issue.
Details here: https://www.terraform.io/docs/providers/aws/r/volume_attachment.html#skip_destroy

Note: in order for it to work, I had to do the following:

  1. set skip_destroy to true on the volume attachment
  2. run terraform apply
  3. make the other changes to the instance that caused it to be terminated/recreated (changing the AMI in my case)
  4. run terraform apply again

Leaving this here in case anyone else finds it useful.
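
A minimal sketch of that setting, applied to the attachment from the original config (skip_destroy is documented at the link above):

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  # Don't call DetachVolume on destroy; just drop the attachment
  # from state and let the volume follow the instance's lifecycle.
  skip_destroy = true
}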

@mpalmer commented Oct 5, 2017

I can't get the above workaround to do the trick using 0.10.6. Looks like whatever bug was being exploited to make this work got closed.

@Gary-Armstrong commented Oct 5, 2017

I'm still only provisioning ephemerals in TF.

In fact, I specify four of them for every instance, every time. I then have some Ruby/Chef code that determines how many are really there (0-4) and does the needful to partition them, create an LVM stripe, and mount it as a single ext4 filesystem.

I still use Chef to configure all EBS volumes from creation to filesystem mount. Works great. EBS volumes persist unless defined otherwise. Mentally assigning all volume management to the OS arena has gotten me where I want to be.

@exolab commented Oct 9, 2017

This is still an issue 26 months after the issue was first created.

@c4milo (Contributor) commented Oct 9, 2017

@exolab, it is not. You need to use destroy-time provisioners in order to unmount the EBS volume.

@exolab commented Oct 9, 2017

Sorry if I am a bit daft. How so?

Is this what you are suggesting?

provisioner "remote-exec" {
    inline = ["umount -A"]

    when   = "destroy"
  }
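
For anyone trying this, here is a minimal sketch of a destroy-time provisioner on the attachment, reusing the device, mount point, and service from the original config (connection details omitted); it assumes a Terraform version that supports when = "destroy" (0.9+):

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  # Runs over SSH before Terraform calls DetachVolume, so the
  # filesystem is no longer busy when the detach is attempted.
  provisioner "remote-exec" {
    when = "destroy"
    inline = [
      "sudo service rundeckd stop",
      "sudo umount /data"
    ]
  }
}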

@Mykolaichenko commented Oct 25, 2017

Like @mpalmer, the skip_destroy fix is not working for me using Terraform 0.10.6 😞

@GarrisonD commented Dec 15, 2017

The skip_destroy fix does not work using Terraform 0.11.1 😢

@smastrorocco commented Feb 21, 2018

+1

@Fazered commented Mar 5, 2018

Still an issue (and a big issue for us) in v0.11.3

@jangrewe commented Mar 19, 2018

Still an issue in v0.11.4

@devsecops-dba commented Aug 20, 2018

Terraform v0.11.7 -- same issue with the volume attachment when running destroy;
skip_destroy = true on the volume attachment resource is not helping either - destroy keeps trying.
I went ahead and force-detached from the console - the destroy then moved forward.
Is there a default timeout in TF? The destroy kept running, trying to detach the EBS volume, until I Ctrl-C'd out of it.

@mmacdermaid commented Aug 28, 2018

On Terraform v0.11.7 I was able to get around this by creating the volume attachment with

force_detach = true

If you created it without force_detach set to true, it will still fail. I had to terminate the instance, allow the volume attachment to be edited or recreated with force_detach, and then all subsequent detaches worked for me.
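
A minimal sketch of that attachment, again using the resource names from the original config; note that force-detaching a mounted volume can lose unflushed data:

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  # Ask AWS to force the detach on destroy even if the device is
  # still mounted inside the instance.
  force_detach = true
}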

@davidvuong commented Oct 30, 2018

Using force_detach = true worked for me as well (v0.11.7).

Originally I created the volume without force_detach, so I had to go manually force detach in the AWS console, then delete the volume (in Terraform) and re-create it (also in Terraform) before it worked.

@JasonGilholme commented Dec 2, 2018

Still an issue.

Is there any issue with using force_detach? I'm assuming that processes could still be trying to use the volume. Is there a way to stop the instance prior to detaching the volume and then terminate it?

@aaronpi commented Jun 12, 2019

I know this issue is closed, but as an example workaround for people finding this (and in reply to @JasonGilholme's question above), I'll post what I've done. I have a volume I want to persist between machine rebuilds (it gets rebuilt from a snapshot if deleted, but is otherwise persisted). I grab the old instance ID in TF, then use a local-exec provisioner (I can't use remote-exec with how direct access to the machine is gated) to call the AWS CLI and shut down the machine the volume is being detached from, before the machine and the volume attachment are destroyed and rebuilt:

//data source to get previous instance id for TF workaround below
data "aws_instance" "example_previous_instance" {
  filter {
    name = "tag:Name"
    values = ["${var.example_instance_values}"]
  }
}

//volume attachment
resource "aws_volume_attachment" "example_volume_attachment" {
  device_name = "/dev/xvdf"
  volume_id   = "${aws_ebs_volume.example_volume.id}"
  instance_id = "${aws_instance.example_instance.id}"
  //below is a workaround for TF not detaching volumes correctly on rebuilds.
  //additionally the 10 second wait is too short for detachment and force_detach is ineffective currently
  //so we're using a workaround: using the AWS CLI to gracefully shutdown the previous instance before detachment and instance destruction
  provisioner "local-exec" {
    when   = "destroy"
    command = "ENV=${var.env} aws ec2 stop-instances --instance-ids ${data.aws_instance.example_previous_instance.id}"
  }
}