
Cannot start a rolling update on an unhealthy Managed Instance Group #10648


ghost commented Nov 29, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v1.0.11

Affected Resource(s)

  • google_compute_instance_group_manager
  • google_compute_region_instance_group_manager

Terraform Configuration Files

Please see this issue for example code.
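For reference, a minimal sketch of the kind of configuration involved might look like the following; the resource names, machine type, image, region, sizes, and health check here are illustrative assumptions, not the reporter's actual code:

```hcl
resource "google_compute_health_check" "app" {
  name = "app-health-check"

  http_health_check {
    port = 8080
  }
}

resource "google_compute_instance_template" "app" {
  name_prefix  = "app-"
  machine_type = "e2-medium"

  disk {
    source_image = "debian-cloud/debian-11"
    boot         = true
  }

  network_interface {
    network = "default"
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_region_instance_group_manager" "app" {
  name               = "app-mig"
  region             = "us-central1"
  base_instance_name = "app"
  target_size        = 3

  version {
    instance_template = google_compute_instance_template.app.id
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.app.id
    initial_delay_sec = 300
  }

  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = 3
    max_unavailable_fixed = 0
  }

  # Makes `terraform apply` block until the managed instances have been
  # created/updated and the group is stable -- the behavior discussed below.
  wait_for_instances = true
}
```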

Debug Output

Panic Output

n/a

Expected Behavior

I should be able to make code changes to start a new rolling update when the managed instance group has an existing operation in progress.

Use Case 1

I need to make configuration updates to my instance template which will require a rolling update; however, I initially make a mistake and the rolling update stalls indefinitely while waiting for the new node(s) to pass health checks. The MIG continues to auto-heal according to the policy, trying to get the MIG back into a healthy state. According to the Google docs for rolling updates, a rollback is performed by starting another rolling update. In order to fix the MIG, I need to be able to make the changes in terraform and I expect terraform to be able to start the rolling update.

Use Case 2

I have a deployed MIG in production, but one of my nodes starts failing, putting the MIG in an unhealthy state. The failure is caused by something external to the MIG which prevents auto-healing from resolving the issue as new nodes also fail. I need to be able to make code changes through terraform and start a rolling update to the unhealthy MIG.

In both of these use cases, terraform should be able to start a new rolling update regardless of the current state of the MIG, just like an operator would do manually with `gcloud compute instance-groups managed rolling-action start-update`.
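For illustration, that manual operation looks something like this; the group name, template, and region are placeholders:

```sh
gcloud compute instance-groups managed rolling-action start-update app-mig \
  --version=template=app-template-v2 \
  --region=us-central1
```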

Actual Behavior

Terraform times out during the apply phase, waiting for the MIG to become healthy.

Steps to Reproduce

Important Factoids

References

Please review the history in this issue that discusses the previous problem with even doing a plan against an unhealthy MIG.

ghost added the bug label Nov 29, 2021

ghost commented Nov 29, 2021

@ScottSuarez I'm tagging you in this issue since you resolved the last issue and probably have the most context already. Thanks for taking a look!

ScottSuarez (Collaborator) commented

This is what `wait_for_instances` does and how terraform works. Unfortunately there is very little we can do here.

The ask is contradictory. You want us to wait for the status. If terraform never acquires the desired status then we need to fail. If we allow it to pass then the field is not actuating anything and is just delaying deployment. I understand how this can be difficult, but I would suggest not using `wait_for_instances` and monitoring the status manually. To allow this to pass would contradict Terraform's deployment methodology.
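(As a rough sketch of the manual status monitoring suggested above; the group name and region are placeholders:)

```sh
# Reports whether the MIG is stable, i.e. has no create, delete,
# or update operations still in progress.
gcloud compute instance-groups managed describe app-mig \
  --region=us-central1 \
  --format="value(status.isStable)"
```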


ghost commented Nov 30, 2021

@ScottSuarez I gotcha, that makes more sense with how you are viewing wait_for_instances. The way I was viewing it was that it would wait for the rolling update triggered by the terraform run to complete, and that part is indeed working as expected. However, if a terraform run did not trigger a rolling update, I should still be able to let terraform start a rolling update, using wait_for_instances to wait for it to complete. For example, let's say I deploy a MIG using wait_for_instances and everything succeeds. Now let's consider my Use Case 2, where something external causes the MIG to reach an unhealthy state and start auto-healing. When I apply a hotfix through terraform, I am unable to deploy it because wait_for_instances will wait until the auto-healing is complete even though terraform did not trigger anything.

I think the issue is that even during the apply phase of terraform, wait_for_instances waits before taking any action rather than waiting afterwards for an action to complete. It would make more sense to me that I can continue controlling all of the changes through terraform, and when a rolling update is started, wait_for_instances will wait for it to complete. Stated another way, wait_for_instances is preventing terraform from starting a rolling update instead of letting terraform start the rolling update and waiting for it to complete.

I don't believe this expectation to be contradictory, and it still seems very aligned with how terraform operates and the normal workflow. Does that make more sense?

ScottSuarez (Collaborator) commented

Yes, that makes sense to me. I've put out a fix for your scenario.


ghost commented Dec 3, 2021

Thank you so much!


github-actions bot commented Jan 3, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Jan 3, 2022