
Error when instance changed that has EBS volume attached #2957

Closed
bloopletech opened this issue Aug 7, 2015 · 82 comments

Comments

@bloopletech commented Aug 7, 2015

This is the specific error I get from terraform:

aws_volume_attachment.admin_rundeck: Destroying...
aws_volume_attachment.admin_rundeck: Error: 1 error(s) occurred:

* Error waiting for Volume (<vol id>) to detach from Instance: <instance id>
Error applying plan:

3 error(s) occurred:

* Error waiting for Volume (<vol id>) to detach from Instance: <instance id>
* aws_instance.admin_rundeck: diffs didn't match during apply. This is a bug with Terraform and should be reported.
* aws_volume_attachment.admin_rundeck: diffs didn't match during apply. This is a bug with Terraform and should be reported.

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

We are building out some infrastructure in EC2 using terraform (v0.6.0). I'm currently working out our persistent storage setup. The strategy I'm planning is to have the root volume of every instance be ephemeral, and to move all persistent data to a separate EBS volume (one persistent volume per instance). We want this to be as automated as possible of course.

Here is a relevant excerpt from our terraform config:

resource "aws_instance" "admin_rundeck" {
  ami = "${var.aws_ami_rundeck}"
  instance_type = "${var.aws_instance_type}"
  subnet_id = "${aws_subnet.admin_private.id}"
  vpc_security_group_ids = ["${aws_security_group.base.id}", "${aws_security_group.admin_rundeck.id}"]
  key_name = "Administration"

  root_block_device {
    delete_on_termination = false
  }

  tags {
    Name = "admin-rundeck-01"
    Role = "rundeck"
    Application = "rundeck"
    Project = "Administration"
  }
}

resource "aws_ebs_volume" "admin_rundeck" {
  size = 500
  availability_zone = "${var.default_aws_az}"
  snapshot_id = "snap-66fc2258"
  tags = {
    Name = "Rundeck Data Volume"
  }
}

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id = "${aws_ebs_volume.admin_rundeck.id}"

  depends_on = "aws_route53_record.admin_rundeck"

  connection {
    host = "admin-rundeck-01.<domain name>"
    bastion_host = "${aws_instance.admin_jumpbox.public_ip}"
    timeout = "1m"
    key_file = "~/.ssh/admin.pem"
    user = "ubuntu"
  }

  provisioner "remote-exec" {
    script = "mount.sh"
  }

  provisioner "remote-exec" {
    inline = [
      "sudo mkdir -m 2775 /data/rundeck",
      "sudo mkdir /data/rundeck/data /data/rundeck/projects && sudo chown -R rundeck:rundeck /data/rundeck",
      "sudo service rundeckd restart"
    ]
  }
}

And mount.sh:

#!/bin/bash

while [ ! -e /dev/xvdf ]; do sleep 1; done

fstab_string='/dev/xvdf /data ext4 defaults,nofail,nobootwait 0 2'
# append the entry only if it isn't already present in /etc/fstab
if ! grep -q -F "$fstab_string" /etc/fstab; then
  echo "$fstab_string" | sudo tee -a /etc/fstab
fi

sudo mkdir -p /data && sudo mount -t ext4 /dev/xvdf /data

As you can see, this:

  • Provisions an instance to run Rundeck (http://rundeck.org/)
  • Provisions an EBS volume based off of a snapshot. The snapshot in this case is just an empty ext4 partition.
  • Attaches the volume to the instance
  • Mounts the volume inside the instance, and then creates some directories to store data in

This works fine the first time it's run. But any time we:

  • make a change to the instance configuration (i.e. change the value of var.aws_ami_rundeck) or
  • make a change to the provisioner config of the volume attachment resource

Terraform then tries to detach the existing volume from the instance, and that step fails every time. I believe this is because you are meant to unmount the EBS volume from inside the instance before detaching it. The problem is that I can't work out how to get Terraform to unmount the volume inside the instance before it tries to detach the volume.

It's almost like I need a provisioner to run before the resource is created, or a provisioner to run on destroy (obviously #386 comes to mind).

This feels like it would be a common problem for anyone working with persistent EBS volumes using terraform, but my googling hasn't really found anyone even having this problem.

Am I simply doing it wrong? I'm not worried about how I get there specifically; I just want to be able to provision persistent EBS volumes and then attach and detach them from my instances in an automated fashion.
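For anyone landing here later: once destroy-time provisioners (the #386 feature mentioned above) became available in later Terraform releases, the unmount-before-detach step asked for here can be sketched roughly as below. This is a minimal sketch only, assuming the volume is mounted at /data and reusing the connection details from the config above:

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  connection {
    host         = "admin-rundeck-01.<domain name>"
    bastion_host = "${aws_instance.admin_jumpbox.public_ip}"
    user         = "ubuntu"
  }

  # Stop the service holding the mount, then unmount, before Terraform
  # issues the EC2 detach call for this attachment.
  provisioner "remote-exec" {
    when = "destroy"
    inline = [
      "sudo service rundeckd stop",
      "sudo umount /data"
    ]
  }
}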

@jarias commented Aug 13, 2015

Having the same issue here.

@febbraro commented Aug 17, 2015

I'm also having this issue. I have to detach the volume manually in the AWS console for Terraform to complete my apply operation.

@tobyclemson commented Aug 19, 2015

I too am having this problem. Would it be enough to destroy the instance rather than trying to destroy the volume association?

@danabr commented Aug 28, 2015

We're also having the same issue.

One solution is to stop the instance that has mounted the volume before running terraform apply. From the AWS CLI documentation:
"Make sure to unmount any file systems on the device within your operating system before detaching the volume. Failure to do so results in the volume being stuck in a busy state while detaching."

This might be what we are seeing here.

@james-s-nduka commented Sep 3, 2015

This bug has become quite critical to us. Is anyone looking into this currently?

@Pryz (Contributor) commented Sep 8, 2015

Same issue here. Any update? Thanks

@danabr commented Sep 9, 2015

One solution would be to stop the associated instance before removing the volume attachment. Perhaps this is too intrusive to do automatically, though.

@ryedin commented Sep 24, 2015

same issue... and I don't think udev helps here (does udev publish an event when a device is attempting to detach?)

EDIT: tried adding force_detach option... no dice

@bitoiu commented Sep 28, 2015

Same issue here 😢

@JesperTerkelsen commented Sep 30, 2015

I guess Terraform should terminate instances before removing attachments by default on a full terraform destroy?

@simonluijk commented Sep 30, 2015

@JesperTerkelsen As long as your application can shut down gracefully within the 20 seconds given by AWS, that makes sense.

@nimbusscale commented Sep 30, 2015

Me too!

@j0nesin commented Oct 23, 2015

I also needed to persist EBS volumes between instance re-creates and experienced this problem when trying to use volume attachments. My workaround is to drop the aws_volume_attachment resources and have each instance use the AWS CLI at boot time to self-attach the volume it is paired with. When the instance is re-created, Terraform first destroys the instance, which detaches the volume and makes it available for the next instance coming up.

In the instance user_data, include the following template script (elasticsearch_mount_vol.sh):

INSTANCE_ID=`curl http://169.254.169.254/latest/meta-data/instance-id`

# wait for ebs volume to be attached
while :
do
    # self-attach ebs volume
    aws --region us-east-1 ec2 attach-volume --volume-id ${volume_id} --instance-id $INSTANCE_ID --device ${device_name}

    if lsblk | grep ${lsblk_name}; then
        echo "attached"
        break
    else
        sleep 5
    fi
done

# create fs if needed
if file -s ${device_name} | grep "${device_name}: data"; then
    echo "creating fs"
    mkfs -t ext4 ${device_name}
fi

# mount it
mkdir ${mount_point}
echo "${device_name}       ${mount_point}   ext4    defaults,nofail  0 2" >> /etc/fstab
echo "mounting"
mount -a

And the corresponding Terraform config:

resource "aws_ebs_volume" "elasticsearch_master" {
    count = 3
    availability_zone = "${lookup(var.azs, count.index)}"
    size = 8
    type = "gp2"
    tags {
        Name = "elasticsearch_master_az${count.index}.${var.env_name}"
    }
}

resource "template_file" "elasticsearch_mount_vol_sh" {
    filename = "${path.module}/elasticsearch_mount_vol.sh"
    count = 3
    vars {
        volume_id = "${element(aws_ebs_volume.elasticsearch_master.*.id, count.index)}"
        lsblk_name = "xvdf"
        device_name = "/dev/xvdf"
        mount_point = "/esvolume"
    }
}
resource "aws_instance" "elasticsearch_master" {
    count = 3
    ...
    user_data = <<SCRIPT
#!/bin/bash

# Attach and Mount ES EBS volume
${element(template_file.elasticsearch_mount_vol_sh.*.rendered, count.index)}

SCRIPT
}

@jimconner commented Nov 9, 2015

Same issue here - it would be nice if Terraform had support for 'deprovisioners' so that we could execute some steps (such as a shutdown -h now) before machine destruction is attempted. We did find that if we ran terraform taint on the instance before terraform destroy, the destruction completed successfully, so we'll use that as a workaround for now.

jimconner pushed a commit to alphagov/paas-alpha-tsuru-terraform that referenced this issue Nov 9, 2015, and another on Nov 10, 2015:
Due to a [bug in Terraform](hashicorp/terraform#2957)
removal of disk attachments is attempted whilst the volume is still in use.
To work around this bug, it is necessary to first `taint` the instance before
destroying it. When the instance is tainted, terraform doesn't wait for the
volume to be unused before destruction.
@jniesen commented Nov 14, 2015

I have a related issue with an instance and an EBS volume. I think a solution to my problem may fix this as well. With version 0.6.3, when destroying, it seems that the volume attachment is always destroyed before the instance.

consul_keys.ami: Refreshing state... (ID: consul)
aws_security_group.elb_sg: Refreshing state... (ID: sg-xxxx)
aws_ebs_volume.jenkins_master_data: Refreshing state... (ID: vol-xxxx)
aws_security_group.jenkins_sg: Refreshing state... (ID: sg-xxxx)
aws_instance.jenkins_master: Refreshing state... (ID: i-xxxx)
aws_elb.jenkins_elb: Refreshing state... (ID: jniesen-jenkins-master-elb)
aws_volume_attachment.jenkins_master_data_mount: Refreshing state... (ID: vai-xxxx)
aws_route53_record.jenkins: Refreshing state... (ID: xxxx)
aws_volume_attachment.jenkins_master_data_mount: Destroying...
aws_route53_record.jenkins: Destroying...
aws_route53_record.jenkins: Destruction complete
aws_elb.jenkins_elb: Destroying...
aws_elb.jenkins_elb: Destruction complete
Error applying plan:

1 error(s) occurred:

* aws_volume_attachment.jenkins_master_data_mount: Error waiting for Volume (vol-xxxx) to detach from Instance: i-xxxx

I thought that I could get around this by having a systemd unit stop the process using the mounted EBS volume and then unmount it whenever the instance receives a halt or shutdown. The problem is that this never happens before the volume attachment destroy is attempted. I think if the order could be forced, so that the instance is destroyed before the volume attachment, things would go more smoothly.

@j0nesin commented Nov 14, 2015

If you use 'depends_on' in the instance definition to depend on the ebs volume, then the destroy sequence will destroy the instance before trying to destroy the volume.
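A minimal sketch of that ordering, using the resource names from the original report (a sketch only, not the full instance config):

resource "aws_instance" "admin_rundeck" {
  ami           = "${var.aws_ami_rundeck}"
  instance_type = "${var.aws_instance_type}"

  # Create the instance after the EBS volume; destroy order is the reverse,
  # so the instance is terminated before Terraform touches the volume.
  depends_on = ["aws_ebs_volume.admin_rundeck"]
}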

@jniesen commented Nov 14, 2015

The error comes when destroying the volume_attachment, which should just detach the volume. I misspoke in my last paragraph. I can't make the instance depend on the attachment explicitly, because the attachment already depends on the instance implicitly since I'm referencing the instance's ID.

@james-masson commented Nov 24, 2015

+1 agree with @jniesen

A persistent data disk, separate from the OS/instance, would be a great feature, if it worked!

Creation of the related aws_ebs_volume, aws_instance and aws_volume_attachment resources works fine.

Any apply that involves re-creation of the aws_instance hangs, as the aws_volume_attachment implicitly depends on the aws_instance it references and is destroyed first, causing the volume detach to hang.

For this to work in an elegant fashion, the VM would have to be destroyed first, to get a clean unmount.

@opokhvalit commented Dec 28, 2015

Got the same problem. The taint workaround works fine, thanks @jimconner.

@ghost commented Dec 29, 2015

+1 to a fix. If the attached EBS volume is in use by the OS, say by a daemon process (e.g., Docker), then some mechanism has to be provided by Terraform to allow OS-level calls for a clean service stop and unmount. Some of the ideas listed here are possible workarounds, but not tenable long-term solutions.

@sudochop commented Dec 30, 2015

+1 Same problem here. Thanks for the workaround @jimconner

@arthurschreiber commented Jan 13, 2016

I'm also running into this issue. If both the aws_instance and the linked aws_volume_attachment are scheduled to be deleted, the instance needs to be deleted first.

@arthurschreiber commented Jan 13, 2016

See #4643 for a similar problem, and the feature request in #622 which would provide an easy fix for this.

@mitchellh (Member) commented Nov 17, 2016

This is pretty much the same as #2761, I'm sure there are other places this is being tracked too... going to close this one. (The reference here will link them, too)

@redbaron commented Nov 18, 2016

@mitchellh, arguably this issue has the bigger community and should be considered the main point of contact for tracking all the dependency problems that can't be expressed using the simplistic graph model TF currently uses.

#2761 is a valid issue too, but it has only 5 comments and 9 subscribers; a strange choice to keep that one and close this.

@carterjones commented Jan 3, 2017

I know this thread was closed in favor of #2761, but given that that issue is still open, I wanted to leave this here for anyone else still experiencing this particular issue.

I was able to set skip_destroy to true on the volume attachment to solve this issue.
Details here: https://www.terraform.io/docs/providers/aws/r/volume_attachment.html#skip_destroy

Note: in order for it to work, I had to do the following:

  1. set skip_destroy to true on the volume attachment (see the sketch below)
  2. run terraform apply
  3. make the other changes to the instance that caused it to be terminated/recreated (changing the AMI in my case)
  4. run terraform apply again

Leaving this here in case anyone else finds it useful.
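A minimal sketch of step 1, using the resource names from the original report. skip_destroy tells the AWS provider not to attempt a detach at destroy time and to simply drop the attachment from state:

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  # Don't attempt an EC2 detach on destroy; just remove the attachment
  # from Terraform state.
  skip_destroy = true
}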

@mpalmer commented Oct 5, 2017

I can't get the above workaround to do the trick using 0.10.6. Looks like whatever bug was being exploited to make this work got closed.

@Gary-Armstrong commented Oct 5, 2017

I'm still only provisioning ephemeral volumes in TF.

In fact, I am specifying four of them for every instance, every time. I then have some Ruby/Chef that determines how many are really there (0-4) and does what's needed to partition, LVM-stripe, and mount them as a single ext4 filesystem.

I still use Chef to configure all EBS volumes from creation to filesystem mount. Works great. EBS volumes persist unless defined otherwise. Mentally assigning all volume management to the OS arena has gotten me where I want to be.

@exolab commented Oct 9, 2017

This is still an issue 26 months after the issue was first created.

@c4milo (Contributor) commented Oct 9, 2017

@exolab, it is not. You need to use destroy-time provisioners in order to unmount the EBS volume.

@exolab commented Oct 9, 2017

Sorry if I am a bit daft. How so?

Is this what you are suggesting?

provisioner "remote-exec" {
    inline = ["umount -A"]

    when   = "destroy"
  }
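Roughly that shape, assuming a Terraform version with destroy-time provisioner support, though umount is normally pointed at the mount point (after stopping whatever uses it) rather than run with -A. A minimal sketch, assuming the volume from the original report is mounted at /data:

  # Sketch only: stop whatever holds the mount open, then unmount the
  # mount point itself, before Terraform attempts the detach.
  provisioner "remote-exec" {
    when = "destroy"
    inline = [
      "sudo service rundeckd stop",
      "sudo umount /data"
    ]
  }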

@Mykolaichenko commented Oct 25, 2017

Like @mpalmer, the skip_destroy fix is not working for me with Terraform 0.10.6 😞

@GarrisonD commented Dec 15, 2017

The skip_destroy fix does not work with Terraform 0.11.1 😢

@smastrorocco commented Feb 21, 2018

+1

@Fazered commented Mar 5, 2018

Still an issue (and a big issue for us) in v0.11.3

@jangrewe commented Mar 19, 2018

Still an issue in v0.11.4

@devsecops-dba commented Aug 20, 2018

Terraform v0.11.7 -- I have the same issue with the volume attachment when running destroy.
skip_destroy = true on the volume attachment resource is not helping either; destroy keeps trying.
I went ahead and force-detached from the console; the destroy then moved forward.
Is there a default timeout for TF? The script kept running the destroy, trying to detach the EBS volume, until I Ctrl-C'd out of it.

@mmacdermaid commented Aug 28, 2018

On Terraform v0.11.7 I was able to get around this by creating the volume attachment with

force_detach = true

If you created it without force_detach set to true, it will still fail. I had to terminate the instance, allow the edit or recreation of the volume attachment to include force_detach, and then all subsequent detaches worked for me.
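A minimal sketch, again using the resource names from the original report. force_detach is an argument of aws_volume_attachment, but note the provider documentation warns that forcing a detach without a clean unmount risks data loss:

resource "aws_volume_attachment" "admin_rundeck" {
  device_name = "/dev/xvdf"
  instance_id = "${aws_instance.admin_rundeck.id}"
  volume_id   = "${aws_ebs_volume.admin_rundeck.id}"

  # Force the detach at destroy time; unflushed writes may be lost.
  force_detach = true
}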

@davidvuong commented Oct 30, 2018

Using force_detach = true worked for me as well (v0.11.7).

I originally created the volume without force_detach, so I had to manually force-detach it in the AWS console, then delete the volume (in Terraform) and re-create it (also in Terraform) before it worked.

@JasonGilholme commented Dec 2, 2018

Still an issue.

Are there any issues with using force_detach? I'm assuming that processes could still be trying to use the volume. Is there a way to stop the instance prior to detaching the volume, and then terminate it?

@aaronpi commented Jun 12, 2019

Still an issue.

Are there any issues with using force_detach? I'm assuming that processes could still be trying to use the volume. Is there a way to stop the instance prior to detaching the volume, and then terminate it?

I know this issue is closed, but as an example workaround for people finding this, I'll post what I've done. I have a volume I want to persist between machine rebuilds (it gets rebuilt from a snapshot if deleted, but otherwise persists). I grab the old instance ID in TF, then use a local-exec provisioner (I can't use remote-exec because of how direct access to the machine is gated) to call the AWS CLI to shut down the machine the volume is being detached from, before the destroy and rebuild of the machine and the volume attachment:

//data source to get previous instance id for TF workaround below
data "aws_instance" "example_previous_instance" {
  filter {
    name = "tag:Name"
    values = ["${var.example_instance_values}"]
  }
}

//volume attachment
resource "aws_volume_attachment" "example_volume_attachment" {
  device_name = "/dev/xvdf"
  volume_id   = "${aws_ebs_volume.example_volume.id}"
  instance_id = "${aws_instance.example_instance.id}"
  //below is a workaround for TF not detaching volumes correctly on rebuilds.
  //additionally the 10 second wait is too short for detachment and force_detach is ineffective currently
  //so we're using a workaround: using the AWS CLI to gracefully shutdown the previous instance before detachment and instance destruction
  provisioner "local-exec" {
    when   = "destroy"
    command = "ENV=${var.env} aws ec2 stop-instances --instance-ids ${data.aws_instance.example_previous_instance.id}"
  }
}

@ghost commented Jul 25, 2019

I'm going to lock this issue because it has been closed for 30 days. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@hashicorp locked and limited conversation to collaborators Jul 25, 2019