Waiting for cloud-config/user_data completion #4668

Closed
calvn opened this issue Jan 13, 2016 · 14 comments

Comments

@calvn

calvn commented Jan 13, 2016

On aws_instance (and presumably on other providers' instance resources), there doesn't seem to be a way to wait for cloud-config to finish before moving on to other resources. If a cloud-config runcmd creates a directory that a remote-exec provisioner then uses, the remote-exec fails because it runs as soon as the instance is created, not after its cloud-config has completed.

In CloudFormation, you can send a signal that is caught by ResourceSignal, which changes the status from Pending to Complete.

@justinclayton
Contributor

I'm also looking for a clean way to solve for this. I feel like there's something in the null_resource area that can be hacked up, but I'm not sure that's the best path forward.

@calvn
Author

calvn commented Jan 15, 2016

My "hack" at the moment around this is to treat a file as a resource signal, and have the remote-exec block the rest of the execution until that file exists.

cloud-init.yml:

runcmd:
  - mkdir -p /etc/consul.d
  - touch /tmp/signal

consul.tf:

resource "null_resource" "consul-config" {
  ...
  provisioner "remote-exec" {
    inline = [
      "while [ ! -f /tmp/signal ]; do sleep 2; done",
      ...
    ]
  }
  ...
}
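The bare `while` loop above never gives up if cloud-init fails before touching the file. A minimal sketch of the same polling idea with a deadline is shown here; `wait_for_file` and its default values are hypothetical names chosen for illustration, not part of Terraform or cloud-init.

```shell
#!/bin/sh
# Poll for a marker file, giving up after a deadline.
# wait_for_file PATH [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# (hypothetical helper; the defaults below are assumptions)
wait_for_file() {
  path=$1
  timeout_s=${2:-300}   # total seconds to wait before failing
  interval_s=${3:-2}    # seconds between checks, as in the loop above
  waited=0
  while [ ! -f "$path" ]; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "timed out after ${waited}s waiting for $path" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  return 0
}
```

Called as `wait_for_file /tmp/signal 300 2`, a non-zero return would make the remote-exec step fail instead of hanging indefinitely.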

@justinclayton
Contributor

Well, that’s the cleanest version of this ugly hack I’ve seen yet. :-)


@calvn
Author

calvn commented Jan 15, 2016

Thanks, hope that helps :)

@apparentlymart
Contributor

@cleung2010 could you share an example of how this looks in CloudFormation? I'm not too familiar with it so I'd like to try to understand a bit better how it solves this case and thus how/whether that solution might be used by Terraform.

@calvn
Author

calvn commented Jan 18, 2016

@apparentlymart There is an example on the use of cfn-signal here, and also in the official AWS guide for setting up a Consul cluster on ECS here

@apparentlymart
Contributor

Okay, so I think I'm understanding better the CloudFormation workflow:

  • CloudFormation is configured to start an EC2 instance with user-data that will cause cloud-init to eventually run the cfn-signal program.
  • The cfn-signal program calls SignalResource to tell CloudFormation that the initialization either succeeded or failed.
  • CloudFormation waits for that call and then uses it to decide what to do next.

The key difference between CloudFormation and Terraform here is that of course Terraform doesn't have a server that the instance can contact to signal its success. However, as you noticed you can use provisioners in conjunction with state outside of Terraform (in your case, a file showing up on disk) to approximate the same thing.

If we frame the problem as having the instance send a signal somewhere and having Terraform listen for that signal, then there are a number of different signalling mechanisms that Terraform could hypothetically support via provisioners, and which can be implemented in the meantime using remote-exec scripts:

  • Run Consul Agent on the instances, and ensure that the instances join a Consul cluster once they've successfully booted. Then the remote-exec script returns only once the instance shows up in the Consul registry, or once its checks are healthy.
  • Put an ELB in front of your instances, and then use a remote-exec script that polls the ELB's instance table until the instance in question switches to the InService state.
  • If you don't want to use an external service to transport the signal, you could put a FIFO (named pipe) in a predictable place on the filesystem in the AMI, and then make cloud-init write a byte to it. Then use a remote-exec script that reads from the FIFO. This is basically the same thing as your solution of polling for a file, except that the FIFO avoids the need to poll because FIFO operations block until both a writer and a reader are present. This is actually a two-way synchronization, unlike the other approaches here: Terraform's provisioner will block on the user-data write, and the user-data write itself will block on Terraform's provisioner.
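The FIFO variant in the last bullet can be sketched roughly as follows. The function names and the `ok` payload are hypothetical, and the FIFO path is whatever predictable location was baked into the AMI; this is only an illustration of the blocking behaviour, not a tested provisioning setup.

```shell
#!/bin/sh
# Instance side (end of cloud-init): write a short payload to the FIFO.
# The open for writing blocks until a reader has opened the FIFO too.
# signal_ready FIFO_PATH [PAYLOAD]   (hypothetical helper)
signal_ready() {
  printf '%s' "${2:-ok}" > "$1"
}

# Provisioner side (remote-exec): read from the FIFO and print the payload.
# The open for reading blocks until a writer shows up, so no polling is needed.
# await_ready FIFO_PATH              (hypothetical helper)
await_ready() {
  cat "$1"
}
```

The FIFO itself would be created once with `mkfifo` at image build time. Because neither side can proceed without the other, a timeout wrapper around `await_ready` is probably still worth having in case cloud-init never reaches the write.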

Alternatively, Terraform has an aws_cloudformation_stack resource which you can use to delegate the creation of instances to CloudFormation, and then you can use the cfn-signal mechanism; AFAIK the aws_cloudformation_stack resource is not considered complete until CloudFormation is satisfied that the stack is complete.

@lbernail

I found this issue because we have a similar need: migrating some existing CloudFormation templates to Terraform.

I think we will use the workaround with the null_resource, but instead of using a file on the server and a remote-exec to check for it, we will use an S3 key and local-exec (in some cases we do not have SSH access to the servers, and simply need to know that the service they provide is ready before continuing).

(I think CloudFormation also relies on S3: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-waitcondition.html)

Regardless of the backend where you store the information (S3, consul, dynamo,...), I think it would be very convenient to have a generic wait-signal mechanism which manages things like unique id (to identify the resource sending the signal), retries and timeouts for instance.
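As a rough sketch of what such a backend-agnostic wait could look like from a local-exec script: the helper below takes a unique id, a retry count, an interval, and a check command. `wait_for_signal`, `s3_signal_exists`, the `SIGNAL_BUCKET` variable, and the `signals/` key prefix are all hypothetical names; the S3 check assumes the AWS CLI is installed and uses its `s3api head-object` command.

```shell
#!/bin/sh
# wait_for_signal ID RETRIES INTERVAL CHECK_CMD...
# Runs CHECK_CMD with ID appended as its last argument until it succeeds,
# or fails after RETRIES attempts spaced INTERVAL seconds apart.
wait_for_signal() {
  id=$1
  retries=$2
  interval=$3
  shift 3
  attempt=1
  while [ "$attempt" -le "$retries" ]; do
    if "$@" "$id" > /dev/null 2>&1; then
      return 0
    fi
    if [ "$attempt" -lt "$retries" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  echo "no signal for $id after $retries attempts" >&2
  return 1
}

# Hypothetical S3-backed check: succeeds once the signal object exists.
# Assumes SIGNAL_BUCKET is set by the caller and AWS credentials are available.
s3_signal_exists() {
  aws s3api head-object --bucket "$SIGNAL_BUCKET" --key "signals/$1"
}
```

A call such as `wait_for_signal "$INSTANCE_ID" 60 5 s3_signal_exists` would then wait up to roughly five minutes; swapping in a Consul or DynamoDB check only means replacing the check function.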

@apparentlymart
Contributor

Thanks for opening this feature request @calvn, and thanks to everyone else for the great discussion. Sorry we let this sit here idle for so long.

After some reflection, it seems like this is not a feature that Terraform can easily support natively since it requires somewhere to send the notification that the instance has booted and Terraform is not a hosted service.

Therefore we (the Terraform team) recommend pursuing alternative approaches such as the ones I enumerated in my earlier comment above, each of which makes use of a specific system outside of Terraform to maintain the necessary state. Since we don't have any near-term plans to work on this, I'm going to close this as part of our effort to prune some stale issues that don't have short-term action plans.

Thanks again for the discussion here!

@alexlance

I use this:

provisioner "remote-exec" {
    inline = [
      "/bin/bash -c \"timeout 300 sed '/finished-user-data/q' <(tail -f /var/log/cloud-init-output.log)\""
    ]
}
  • The last line in the user_data.sh script runs a touch /tmp/finished-user-data
  • The user_data bash script has a set -euxo pipefail at the top; the -x traces each command into the log, which is how the touch line shows up for sed to match
  • So it won't get to touch the marker file if any part of the user_data failed
  • This method also prints out the cloud-init-output.log file to the screen (saves you having to ssh over to the instance to see why it failed booting)

@deeptechs

I had a similar problem. I was using "runcmd" to create a file and write some content to it, and I sometimes got errors if I didn't wait long enough. I didn't want to solve it by waiting in the instance creation script; as @justinclayton mentioned, that is not a clean solution.
I solved it by using "write_files" instead.

I no longer get "file not found" errors.

@roberthutto

roberthutto commented Sep 7, 2018

cloud-init has a status command with a --wait flag:

provisioner "remote-exec" {
  inline = [
    "cloud-init status --wait"
  ]
}

@pannadi

pannadi commented May 22, 2019

I think I have some issues with the ECS agent. Does anyone have documentation on how to use a CIS CentOS Linux 7 image to create an AMI with Docker and the ECS agent installed (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-install.html#ecs-agent-install-nonamazonlinux)? After I provide the AMI ID to CFN and deploy it, the instances aren't running the tasks.

@ghost

ghost commented Jul 25, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Jul 25, 2019