aws: Allow rolling updates for ASGs #1552

Closed
radeksimko opened this Issue Apr 16, 2015 · 80 comments

@radeksimko (Member) commented Apr 16, 2015

Once #1109 is fixed, I'd like to be able to use Terraform to actually roll out the updated launch configuration and do it carefully.

Whoever decides not to roll out the update and only change the LC association should still be allowed to do so.

Here's an example from CloudFormation:
http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html

How about something like this?

resource "aws_autoscaling_group" "test" {
  rolling_update_policy {
    max_batch_size = 1
    min_instances_in_service = 2
    pause_time = "PT0S"
    suspend_processes = ["launch", "terminate"]
    wait_on_resource_signals = true
  }
}

Then, if such a policy is defined, Terraform can use the Auto Scaling API to shut down each EC2 instance separately and let the ASG spin up a new one with the updated LC.

@pmoust (Contributor) commented Apr 16, 2015

👍

Our use case pretty much depends on this to be crystal clean. Otherwise rolling updates require some minimal external intervention, which I am working on making obsolete.

Huge +1 on this

@matthewford

👍

@jessem commented Apr 16, 2015

👍 Likewise. Using a python script to handle this, which makes it a bit clunky to keep things simple with terraform.

@chrisferry

👍

@phinze (Member) commented Apr 16, 2015

@radeksimko totally agreed that this is desirable behavior.

It's worth noting, though, that this is CloudFormation-specific behavior that's not exposed at the AutoScaling API level. I think the way we'd be able to achieve something similar would be to build some resources based on CodeDeploy:

https://docs.aws.amazon.com/codedeploy/latest/APIReference/Welcome.html

@johnrengelman (Contributor) commented Apr 18, 2015

My experience with CodeDeploy is that it's limited to installing software on running instances, so it can roll out updates to ASG instances, but it doesn't know how to do a roll-out by 'terminating' instances like CF does.
This would be an awesome feature, but I don't really think it's something that fits into Terraform's current model very well. Perhaps if there were a separate lifecycle-type hook that actions could be plugged into.

So something like:

resource 'aws_launch_configuration' 'main' {
}

resource 'aws_autoscaling_group' 'main' {
  lifecycle {
    on_update {
      properties ['launch_configuration']
      actions {
        action 'aws_autoscaling_group_terminate_instances' {
          batch_size = 1
        }
      }
    }
  }
}

These actions could then be a whole separate concept in Terraform.

@woodhull commented Apr 24, 2015

👍 just realized this. Did not realize that this was implemented by AWS as a CloudFormation primitive rather than an ASG primitive we could hook into.

@woodhull commented Apr 24, 2015

Is anyone experimenting with using other tools to hack around this terraform limitation? Even if we were to combine terraform with external tooling like http://docs.ansible.com/ec2_asg_module.html I'm not sure where we'd hook in. A local-exec provisioner seems like the right thing -- but those are only triggered on creation, not modification. Maybe as a placeholder before implementing a native rolling update solution, terraform could offer some sort of hook for launch configuration changes that we could use to trigger an external process?

Otherwise, I think we'll need to manage AMI version updates externally via ansible or some homegrown tool and then use terraform refresh to pull them in before doing plan or apply runs. It's all starting to drift away from the single-command infrastructure creation and mutation dream we had when starting to use terraform.

As part of our migration to terraform/packer away from a set of legacy tools, we had been planning a deployment model based on updating pre-baked AMIs created with packer on a rolling basis. Any other ideas for workarounds that we could use until terraform is able to do this sort of thing out of the box?
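One possible shape for the hook described above, sketched minimally (this assumes a Terraform version that supports null_resource with a triggers map; the resource names and the roll-asg.sh script are illustrative, not part of any real API):

resource "null_resource" "asg_roller" {
  # Re-run the provisioner whenever the launch configuration changes.
  triggers {
    launch_configuration = "${aws_launch_configuration.lc.name}"
  }

  # Hand the actual instance cycling off to an external script.
  provisioner "local-exec" {
    command = "./roll-asg.sh ${aws_autoscaling_group.asg.name}"
  }
}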

@johnrengelman (Contributor) commented Apr 24, 2015

I've been using bash around the AWS CLI. It would be awesome to implement tasks in Go against the awslabs library and then just call them from terraform, though.

@woodhull commented Apr 27, 2015

Over the weekend I wrote some code in Ruby to handle blue/green deploys for our particular use case, adding to our ever-growing wrapper of custom code needed to actually make terraform useful.

Rather than hooking into terraform events, it's ended up as a huge hack: it dynamically modifies the terraform plan to create the new ASG with the new AMI alongside the existing ASG. Then it uses Ruby to make sure the new ASG is up and working correctly before removing the old ASG, regenerates the terraform plan file to match the new state, and finally calls terraform refresh so that the tfstate matches the new reality we've created.

It would be great if this sort of workflow for mutating a running app when we've updated an AMI were built in, or if there were at least easy ways to hook into terraform to add custom behavior like this beyond learning Go and Terraform internals. In our case, even if terraform could handle the ASG operations for us, we'd still like to be able to run our quick, custom sanity-check script to make sure everything is working properly on the new instances before removing the old ASG from the pool.

@woodhull commented Apr 27, 2015

This feature might be the most straightforward (although awkwardly round-about) way to get rolling deploys working inside terraform: #1083

@ketzacoatl (Contributor) commented May 8, 2015

+1.. this would make a huge difference in some of my current workflows

@d4v3y0rk (Contributor) commented May 29, 2015

+1

@ajmath (Contributor) commented Jun 22, 2015

+1

@blewa commented Jul 24, 2015

+1

@nathanielks (Contributor) commented Jul 30, 2015

@woodhull that wrapper wouldn't be around somewhere we could take a looksee, would it? 😄

@woodhull commented Jul 31, 2015

@nathanielks I copy/pasted some fragments of the code here: https://gist.github.com/woodhull/c56cbd0a68cb9b3fd1f4

It wasn't designed for sharing or reusability, sorry about that! I hope it's enough of a code fragment to help.

@nathanielks (Contributor) commented Jul 31, 2015

@woodhull you're a gentleman, thanks for sharing!

@BrunoBonacci

+1

@saliceti (Contributor) commented Aug 24, 2015

+100

@ryan0x44 commented Aug 25, 2015

If anyone is trying to achieve Blue-Green deployments with AWS Auto Scaling and Terraform, I've made a note about how I did it here. Not sure if it's an ideal solution, but it's working for me and proving to be very reliable at the moment :)

@nathanielks (Contributor) commented Aug 25, 2015

Thanks @ryandjurovich!

@ernoaapa

👍

@farridav commented Oct 28, 2015

+1 Just about to look at how to do this. I would love to avoid writing a script to down and up instances to use the new Launch Configuration; will take a look at @ryandjurovich's script also :)

@robmorgan

👍

@titotp commented Nov 19, 2015

+1

@farridav commented Nov 19, 2015

I'm sure there is a genuine need for rolling updates to an ASG, so I won't detract from this too much, but for anyone reading this issue who is OK with blue/green deployment as opposed to rolling updates: I've done it successfully and cleanly using Terraform's built-in functionality, as suggested in this really useful post. Apparently it's how HashiCorp do their production rollouts :)

@ryan0x44 I'm afraid your gist link is dead
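For reference, the core of the blue/green pattern described in that post looks roughly like this (a minimal sketch; resource names, sizes, AMI variable, and availability zones are illustrative):

resource "aws_launch_configuration" "web" {
  # Omitting "name" lets Terraform generate a unique one, so the
  # replacement LC can coexist with the old one during the swap.
  image_id      = "${var.ami_id}"
  instance_type = "t2.micro"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "web" {
  # Interpolating the LC name forces a replacement ASG whenever the
  # launch configuration changes; create_before_destroy brings the
  # new ASG up before the old one is destroyed.
  name                 = "web-${aws_launch_configuration.web.name}"
  launch_configuration = "${aws_launch_configuration.web.name}"
  availability_zones   = ["us-east-1a", "us-east-1b"]
  min_size             = 2
  max_size             = 4
  load_balancers       = ["${aws_elb.web.name}"]

  # Wait for the new instances to pass ELB health checks before
  # considering the new ASG created.
  min_elb_capacity = 2

  lifecycle {
    create_before_destroy = true
  }
}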

@BrunoBonacci commented Nov 19, 2015

Hi @farridav,

Blue/green deployments are a better solution for stateless services (such as web servers and web services), but for stateful services (such as databases, search indexes, caches, and queues) it is often not possible or not advisable to do so.
Imagine a 10-node db cluster with 10TB of data: spinning up a completely new cluster will cause a full resync of the 10TB of data from the old cluster to the new cluster all at once. This might saturate the network link and cause denial of service, when maybe all you wanted was to increase the number of connections.
And the size of the data is not the only problem. Clusters which support ring or grid topologies have fewer problems growing/shrinking elastically, but master/slave topologies are more complicated, as you can only grow slaves and you have to have a procedure to step down the master and elect a slave.
A controlled rolling update in these types of situations is far simpler and has less impact (e.g. rolling only one data node at a time, waiting for data resync, then moving to the next).

@titotp commented Nov 19, 2015

@farridav as per the post https://groups.google.com/forum/#!msg/terraform-tool/7Gdhv1OAc80/iNQ93riiLwAJ

Everything seems to be working as expected:
1 - Updating the AMI creates a new launch config
2 - The autoscaling group is updated to the new launch config name
However, new machines are not launching automatically using the newly created LC/ASG. What could be the issue here?

@jedineeper (Contributor) commented Nov 19, 2015

@titotp check the activity history tab on the ASG; it should tell you why it's unable to launch the instances.

@titotp commented Nov 19, 2015

I am not seeing any activity on the autoscaling group

@ryan0x44 commented Nov 19, 2015

@farridav - the link should be working now, sorry!

@ckelner commented Dec 4, 2015

This looks like a wonderful feature! 👍

@rbowlby commented Jan 23, 2016

+1

I'd enjoy the simplicity of immutable infra: moving away from deploying new artifacts to existing instances, and instead just phasing in new instances/containers. Without a terraform construct to do so, it's a hard sell.

Additionally, having to hack together a solution that requires external scripting and/or more than one apply limits the ability to make use of Atlas. I would enjoy throwing out ansible deploy playbooks, aws code-deploy, fab, and all the complexity they introduce.

@pmoust (Contributor) commented Mar 2, 2016

@brikis98 Yeap, that's the way to go at the moment.

@maruina commented Mar 3, 2016

@brikis98 how do you associate an aws_autoscaling_policy to an ASG generated by CloudFormation?

The autoscaling_group_name is dynamically generated by CloudFormation itself and not exported by aws_cloudformation_stack; is this something that is done manually?

@brikis98 (Contributor) commented Mar 3, 2016

@maruina: You can add an output to the CloudFormation template and use it without any manual steps:

resource "aws_cloudformation_stack" "autoscaling_group" {
  name = "my-asg"
  template_body = <<EOF
{
  "Resources": {
    "MyAsg": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "AvailabilityZones": ["us-east-1a", "us-east-1b", "us-east-1d"],
        "LaunchConfigurationName": "${aws_launch_configuration.launch_configuration.name}",
        "MaxSize": "4",
        "MinSize": "2",
        "LoadBalancerNames": ["${aws_elb.elb.name}"],
        "TerminationPolicies": ["OldestLaunchConfiguration", "OldestInstance"],
        "HealthCheckType": "ELB"
      },
      "UpdatePolicy": {
        "AutoScalingRollingUpdate": {
          "MinInstancesInService": "2",
          "MaxBatchSize": "2",
          "PauseTime": "PT0S"
        }
      }
    }
  },
  "Outputs": {
    "AsgName": {
      "Description": "The name of the auto scaling group",
      "Value": {"Ref": "MyAsg"}
    }
  }
}
EOF
}

resource "aws_autoscaling_policy" "my_policy" {
  name = "my-policy"
  scaling_adjustment = 4
  adjustment_type = "ChangeInCapacity"
  cooldown = 300
  autoscaling_group_name = "${aws_cloudformation_stack.autoscaling_group.outputs.AsgName}"
}

Note the use of aws_cloudformation_stack.autoscaling_group.outputs.AsgName in the aws_autoscaling_policy.

@maruina commented Mar 3, 2016

Sweet, thank you! 👍

@rvangundy commented May 10, 2016

@brikis98 Did you manage to get around the issue related to #1109? I'm getting Cannot delete launch configuration <my-lc-name> because it is attached to <my-asg>. The LC update still manages to apply; it just can't delete the previous one, so I encounter an error each time...

@brikis98 (Contributor) commented May 10, 2016

@rvangundy Do you have create_before_destroy = true on your LC?
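For context, the arrangement being asked about looks roughly like this (a minimal sketch; the resource name and AMI variable are illustrative). Without it, Terraform tries to delete the old LC while the ASG still references it, which is exactly the error above:

resource "aws_launch_configuration" "lc" {
  # Use name_prefix (or omit the name entirely) so every revision
  # gets a unique name and can coexist with the previous one.
  name_prefix   = "my-lc-"
  image_id      = "${var.ami_id}"
  instance_type = "t2.micro"

  lifecycle {
    # Create the replacement LC and repoint the ASG before the old
    # LC is destroyed.
    create_before_destroy = true
  }
}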

@rvangundy commented May 10, 2016

@brikis98 Yeah, I sure do. What's weird is that as the CloudFormation stack is being modified, it's actively trying to destroy the previous LC as well:

module.my_service.aws_launch_configuration.lc: Creation complete
module.my_service.aws_launch_configuration.lc: Destroying...
module.my_service.aws_launch_configuration.lc: Destroying...
module.my_service.aws_launch_configuration.lc: Still destroying... (10s elapsed)
module.my_service.aws_launch_configuration.lc: Still destroying... (10s elapsed)
module.my_service.rolling_autoscaling_group.aws_cloudformation_stack.autoscaling_group: Still modifying... (10s elapsed)
module.my_service.aws_launch_configuration.lc: Still destroying... (18s elapsed)
module.my_service.aws_launch_configuration.lc: Still destroying... (20s elapsed)
module.my_service.aws_launch_configuration.lc: Still destroying... (20s elapsed)
module.my_service.rolling_autoscaling_group.aws_cloudformation_stack.autoscaling_group: Still modifying... (20s elapsed)
module.my_service.aws_launch_configuration.lc: Still destroying... (28s elapsed)
module.my_service.aws_launch_configuration.lc: Still destroying... (30s elapsed)

Note also that I've encapsulated my CloudFormation+ASG configuration in a module, so I may try to pull that out next...

@brikis98 (Contributor) commented May 10, 2016

@rvangundy Hmm, not sure. If you post your code, I can see if there are any obvious differences, but for the most part, the ASG setup with CloudFormation has worked reasonably well for me.

@rvangundy commented May 10, 2016

@brikis98 Fixed it! It was because I had externalized my CloudFormation template into a template_file resource, which decouples the dependency graph between the LC and the CloudFormation stack. I inlined it using <<EOF...EOF and that did the trick.

@brikis98 (Contributor) commented May 10, 2016

@rvangundy Ah, I have mine inlined with a heredoc, which is probably why I didn't hit that issue. Good to know, thanks!

@shuoy commented May 17, 2016

Regarding "Imagine a 10-node db cluster with 10TB of data: spinning up a completely new cluster will cause a full resync of the 10TB of data from the old cluster to the new cluster all at once. This might saturate the network link and cause denial of service, when maybe all you wanted was to increase the number of connections."

@BrunoBonacci, we are facing the same situation with rolling updates. Imagine we want to bump up one version of the software running on a data node: we need the kind of rolling update that is "in-place". It looks like a rolling update with TF is not going to get you there. Maybe we should consider something like Ansible to deal with that?

@BrunoBonacci commented May 18, 2016

@shuoy Certainly Ansible is a way to "script" this behaviour, but the idea of using terraform (at least for me) is to replace previous scripting tools such as Chef, Puppet, Ansible and so on.

I've seen different approaches to rolling updates around. Kubernetes lets you set a grace time to wait between the update of one machine and the next. This could certainly solve some of the issues, however it would only work for quite short grace times.
If you have 10TB to replicate in a cloud environment it could take a while, especially if you throttle the speed to avoid network saturation. Having a grace time of 10-20 hours wouldn't be reasonable.

I think the right approach would be to provide an attribute which can be set to choose whether the rolling update has to be performed automatically (based on a grace period) or in stages.

something like:

## just a suggestion
rolling_update = "auto"
grace_period = "3m"

which would wait 3 minutes between one successful update and the next.

The staged rolling update could work as follows:

## just a suggestion
rolling_update = "staged"
instances_updated = 1

In this case terraform would look at the ASG of, for example, 10 machines, and update just one.
The operator would then wait for the new instance to be fully in service before
bumping instances_updated = 2 and repeating terraform apply.
In this way the rolling update could be performed in stages, giving the app enough time
to replicate the necessary data and become fully operative.
Once all 10 instances have been updated, the update can be considered successful.

Again, this is just a suggestion to allow a declarative (non-scripted) approach to rolling updates,
which would work for stateful clusters (such as databases).

@shuoy commented May 18, 2016

@BrunoBonacci "If you have 10TB to replicate in a cloud environment it could take a while, especially if you throttle the speed to avoid network saturation. Having a grace time of 10-20 hours wouldn't be reasonable."
-- My point is this: for a node falling into this category, Chef/Ansible is exactly the tool you would use for an "in-place update"; destroying the old node and bringing up a new node (the immutable way) is exactly what you want to avoid.
-- In the in-place-update way (chef/ansible), you basically apt-get update your mysqld (which takes 1-2 minutes); the "destroy the old and bring up a new one" way (the immutable way) takes hours, unless you are using an EBS volume. But I think you meant the non-EBS way, or maybe I missed your point above.

Do we have scenarios where large volumes of data reside on ephemeral disk? Yes; e.g., for people using EC2 as their Hadoop cluster capacity, data is saved on ephemeral disk from a cost-saving perspective (EBS gives you extra cost).

So, in short, I think Terraform is great in certain scenarios, but it's not accurate that Chef/Ansible can be totally replaced, particularly in the update use case for stateful nodes.

@jdoss commented Jul 8, 2016

@brikis98 and others that are using the output example posted here: it seems to not work if you upgrade to 0.7.0-rc2. Here is the error that it kicks out:

* Resource 'aws_cloudformation_stack.web_autoscaling_group' does not have attribute 'outputs.AsgName' for variable 'aws_cloudformation_stack.web_autoscaling_group.outputs.AsgName'

I am still trying to get outputs working, but if anyone has any advice on how to get this working again with 0.7.x that would be awesome.

@brikis98 (Contributor) commented Jul 8, 2016

@jdoss You may want to file a separate bug against the 0.7 release for that. The changelog mentions that terraform_remote_state no longer uses the .output map. Perhaps they changed something similar on aws_cloudformation_stack but didn't document it?

@phinze (Member) commented Jul 8, 2016

@jdoss On first glance, it looks like that reference is using the pre-0.7 map dot-index notation. In 0.7, maps are indexed like aws_instance.tag["sometag"].

So try changing this line to use square-bracket indexing and see if that fixes it for you:

  autoscaling_group_name = "${aws_cloudformation_stack.autoscaling_group.outputs["AsgName"]}"

If that doesn't work I'd welcome a fresh bug report and we can dig in. 👍

@jdoss commented Jul 8, 2016

@phinze You da best! That was the issue. 😄

@moofish32 (Contributor) commented Jul 25, 2016

@rvangundy -- did you keep the lifecycle hook create_before_destroy = true or remove it? I think I am thrashing between this error and #3916

@rvangundy commented Jul 25, 2016

@moofish32 I have create_before_destroy = true on my aws_launch_configuration

@antonbabenko (Contributor) commented Aug 4, 2016

@brikis98 - Thank you for the piece of CloudFormation code which does rolling deployments. It does not work for me, because I use spot instances in my launch configuration, and when I specify MinInstancesInService greater than zero (because I really don't want to terminate all instances before rolling) I get this error from AWS:

Autoscaling rolling updates cannot be performed because the current launch configuration is using spot instances and MinInstancesInService is greater than zero.

I wonder if anyone has ideas on how to implement an ASG rolling update for spot instances using fewer moving parts?

My initial idea was to make a shell script which would work similarly to aws-ha-release.

I prefer to use just Terraform and CloudFormation and avoid implementing orchestration magic.

UPD: Setting MinInstancesInService=0 and MaxBatchSize=1 (which is smaller than the number of instances in the ASG) helped. I am happy.
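In template_body terms, those settings slot into the UpdatePolicy block of the earlier aws_cloudformation_stack example roughly like this (values taken from the UPD above; the rest of the template stays as in that example):

      "UpdatePolicy": {
        "AutoScalingRollingUpdate": {
          "MinInstancesInService": "0",
          "MaxBatchSize": "1"
        }
      }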

@radeksimko radeksimko self-assigned this Aug 4, 2016

@iadknet commented Oct 5, 2016

+1

@shaharmor commented Nov 14, 2016

Is there any plan to implement this in the near/far future?

@cultavix

+1

@w32-blaster (Contributor) commented Dec 9, 2016

+1

@kenoir commented Jan 25, 2017

+1

meatballhat added a commit to travis-ci/terraform-config that referenced this issue Feb 2, 2017

Switch to cloud formation ASGs
for automatic rollout of new launch configurations

See: hashicorp/terraform#1552 (comment)

@meatballhat meatballhat referenced this issue in travis-ci/terraform-config Feb 2, 2017

Merged

Switch to cloud formation ASGs #98

@sthulb (Contributor) commented Feb 21, 2017

@phinze How about implementing this as a data source? It doesn't seem like the worst way of implementing it using the current code structure.

I immediately regret this suggestion.

I don't mind implementing this feature, but I'm kind of at a loss on how to implement it in HCL. As the comment above would suggest, I thought about using the data source format, but linking it back to the ASG resource (something like UpdatePolicy = "${data.aws_asg_policy.foo}"); this concept doesn't seem horrid, but leaves room for interpretation.

Please advise on the most idiomatic way this could be represented in HCL.

@pseudobeer

+1

@andrleite

+1

@sudochop commented Mar 2, 2017

+1

@addisonbair

+1

@radeksimko (Member) commented Mar 16, 2017

Hi folks,
we do appreciate the +1's, but only if they don't generate notifications. 😅

It's more helpful for everyone to use reactions, as we can then sort issues by the number of 👍:
https://github.com/hashicorp/terraform/issues?q=is%3Aissue+is%3Aopen+sort%3Areactions-%2B1-desc+label%3Aprovider%2Faws

The 👍 reactions do count, and we're more than happy for people to use them; we prefer them over "+1" comments for the reasons mentioned.

Thanks.

@w32-blaster (Contributor) commented Mar 16, 2017

👍
:trollface:

@hashicorp hashicorp locked and limited conversation to collaborators Mar 16, 2017

@apparentlymart (Contributor) commented Mar 29, 2017

Hi everyone,

While this sort of sophisticated behavior would definitely be useful, we (the Terraform team) don't have any short-term plans to implement this ourselves, since we generally recommend that people with these needs consider other solutions such as EC2 Container Service and Nomad. Those either have, or are more likely to grow, sophisticated mechanisms for safe rollout, and are in a better position to do so given their ability to manage such multi-step state transitions.

We're trying to prune stale feature requests (that aren't likely to be addressed soon) by closing them. In this case we're currently leaning towards not implementing significant additional behaviors on top of what the EC2 API natively supports, so I'm going to close this.

@radeksimko (Member) commented Mar 30, 2017

After chatting with @apparentlymart privately, I just want to add a few more things.

We do not suggest everyone should use containers (nor that containers solve the problem entirely), and for those who prefer not to, there's a workaround - you can use aws_cloudformation_stack in a similar way to the folks at Travis.

Also, I'm tracking this issue in my own TODO list so as not to forget about it. I personally want to get this done, but it's currently pretty low on my list of priorities; PRs are certainly welcome.
