
Cannot start a rolling update on an unhealthy Managed Instance Group #10648


ghost commented Nov 29, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v1.0.11

Affected Resource(s)

  • google_compute_instance_group_manager
  • google_compute_region_instance_group_manager

Terraform Configuration Files

Please see this issue for example code.
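For reference, a minimal sketch of the kind of configuration involved might look like the following; the resource names, machine type, image, region, sizes, and health check here are illustrative assumptions, not the reporter's actual code:

```hcl
resource "google_compute_health_check" "app" {
  name = "app-health-check"

  http_health_check {
    port = 8080
  }
}

resource "google_compute_instance_template" "app" {
  name_prefix  = "app-"
  machine_type = "e2-medium"

  disk {
    source_image = "debian-cloud/debian-11"
    boot         = true
  }

  network_interface {
    network = "default"
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_region_instance_group_manager" "app" {
  name               = "app-mig"
  region             = "us-central1"
  base_instance_name = "app"
  target_size        = 3

  version {
    instance_template = google_compute_instance_template.app.id
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.app.id
    initial_delay_sec = 300
  }

  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = 3
    max_unavailable_fixed = 0
  }

  # Makes `terraform apply` block until the managed instances have been
  # created/updated and the group is stable -- the behavior discussed below.
  wait_for_instances = true
}
```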

Debug Output

Panic Output

n/a

Expected Behavior

I should be able to make code changes to start a new rolling update when the managed instance group has an existing operation in progress.

Use Case 1

I need to make configuration updates to my instance template which will require a rolling update; however, I initially make a mistake and the rolling update stalls indefinitely while waiting for the new node(s) to pass health checks. The MIG continues to auto-heal according to the policy, trying to get the MIG back into a healthy state. According to the Google docs for rolling updates, a rollback is performed by starting another rolling update. In order to fix the MIG, I need to be able to make the changes in terraform and I expect terraform to be able to start the rolling update.

Use Case 2

I have a deployed MIG in production, but one of my nodes starts failing, putting the MIG in an unhealthy state. The failure is caused by something external to the MIG which prevents auto-healing from resolving the issue as new nodes also fail. I need to be able to make code changes through terraform and start a rolling update to the unhealthy MIG.

In both of these use cases, terraform should be able to start a new rolling update regardless of the current state of the MIG, just like an operator would do manually with `gcloud compute instance-groups managed rolling-action start-update`.
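For illustration, that manual operation looks something like this; the group name, template, and region are placeholders:

```sh
gcloud compute instance-groups managed rolling-action start-update app-mig \
  --version=template=app-template-v2 \
  --region=us-central1
```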

Actual Behavior

Terraform times out during the apply phase, waiting for the MIG to become healthy.

Steps to Reproduce

Important Factoids

References

Please review the history in this issue that discusses the previous problem with even doing a plan against an unhealthy MIG.

ghost added the bug label Nov 29, 2021

ghost commented Nov 29, 2021

@ScottSuarez I'm tagging you in this issue since you resolved the last issue and probably have the most context already. Thanks for taking a look!

ScottSuarez (Collaborator) commented

This is what `wait_for_instances` does and how terraform works. Unfortunately there is very little we can do here.

The ask is contradictory. You want us to wait for the status. If terraform never acquires the desired status then we need to fail. If we allow it to pass then the field is not actuating anything and is just delaying deployment. I understand how this can be difficult, but I would suggest not using `wait_for_instances` and monitoring the status manually. To allow this to pass would contradict Terraform's deployment methodology.
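(As a rough sketch of the manual status monitoring suggested above; the group name and region are placeholders:)

```sh
# Reports whether the MIG is stable, i.e. has no create, delete,
# or update operations still in progress.
gcloud compute instance-groups managed describe app-mig \
  --region=us-central1 \
  --format="value(status.isStable)"
```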


ghost commented Nov 30, 2021

@ScottSuarez I gotcha, that makes more sense with how you are viewing wait_for_instances. The way I was viewing it was that it would wait for the rolling update triggered by the terraform run to complete, and that part is indeed working as expected. However, if a terraform run did not trigger a rolling update, I should still be able to let terraform start a rolling update, using wait_for_instances to wait for it to complete. For example, let's say I deploy a MIG using wait_for_instances and everything succeeds. Now let's consider my Use Case 2, where something external causes the MIG to reach an unhealthy state and start auto-healing. When I apply a hotfix through terraform, I am unable to deploy it because wait_for_instances will wait until the auto-healing is complete even though terraform did not trigger anything.

I think the issue is that even during the apply phase of terraform, wait_for_instances waits before taking any action rather than waiting afterwards for an action to complete. It would make more sense to me that I can continue controlling all of the changes through terraform, and when a rolling update is started, wait_for_instances will wait for it to complete. Stated another way, wait_for_instances is preventing terraform from starting a rolling update instead of letting terraform start the rolling update and waiting for it to complete.

I don't believe this expectation to be contradictory, and it still seems very aligned with how terraform operates and the normal workflow. Does that make more sense?

ScottSuarez (Collaborator) commented

Yes, that makes sense to me. I've put out a fix for your scenario.


ghost commented Dec 3, 2021

Thank you so much!


github-actions bot commented Jan 3, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Jan 3, 2022