azurerm_kubernetes_cluster creation reports failure but actually succeeds, leaving state mismatched with reality #9342
Comments
I raised this with Azure too, at Azure/AKS#1972. The particular root cause that we saw was a bug at their end, and they are fixing it, so that's good news. But of course it would still be highly desirable if terraform could do better when it encounters such things. As I understand the conversation in that issue, "internal error" should be considered retriable - though I'm conscious that I am an expert on neither the terraform provider nor the Azure API... Hope this is helpful. Please feel free to hijack that Azure issue to continue the conversation with Azure if that would be useful - or to raise another - they seem friendly!
I should note that we are also seeing this issue with AzureRM Provider 3.28.0 and Terraform 1.2.9. As far as I understand, this is an issue at the Azure API level. This is the error response we get from Terraform; I assume it's the same issue:
Is there a suggested temporary workaround while we wait for an official API fix? So far the only workaround I found is
Update: Just a quick note that I managed to fix my issue. It seems the root cause for mine was that, because I had added IP restrictions to the inbound control plane, I also had to add the NAT outbound IP of the AKS nodes (which were attached to a subnet). Once I did that (and recreated the cluster), the terraform provider could create it with no issue.
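For anyone hitting the same symptom via API-server IP restrictions, a sketch of the relevant piece of configuration. All addresses here are placeholders, not values from this issue; on provider 3.28.0 this was the top-level `api_server_authorized_ip_ranges` argument (newer provider releases expose it under an `api_server_access_profile` block instead):

```hcl
resource "azurerm_kubernetes_cluster" "example" {
  # ... other arguments unchanged ...

  # The authorized ranges must include the egress (NAT outbound) IP that the
  # AKS nodes use to reach the control plane, or provisioning can fail.
  api_server_authorized_ip_ranges = [
    "203.0.113.0/24",  # operator / CI ranges (placeholder)
    "198.51.100.4/32", # NAT outbound IP of the node subnet (placeholder)
  ]
}
```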
Community Note
Terraform (and AzureRM Provider) Version
Terraform v0.13.5
Affected Resource(s)
azurerm_kubernetes_cluster
Debug Output
https://gist.github.com/dimbleby/c6792961f6f6e1f2d044328df953ca99
where the interesting bit is:
Panic Output
Expected Behaviour
Either the creation of the kubernetes cluster should completely succeed, or it should completely fail.
Actual Behaviour
Terraform reports failure, as above. But Azure actually goes on to create the cluster successfully. Now the terraform state is out of sync with reality, and further management fails.
Steps to Reproduce
terraform apply
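The report does not include the configuration used, but a minimal sketch of the kind of resource involved looks like the following. All names, locations, and sizes below are placeholders, not taken from this issue:

```hcl
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "example" {
  name     = "example-rg"
  location = "West Europe"
}

resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  dns_prefix          = "exampleaks"

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_D2_v2"
  }

  # A system-assigned identity; the failing permission grant appears to
  # relate to the identity AKS sets up for itself during creation.
  identity {
    type = "SystemAssigned"
  }
}
```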
Important Factoids
We don't hit this every time, suspect it's more likely to happen when creating multiple clusters simultaneously.
That client ID is AzureContainerService: so it looks as though Azure is granting itself permission to do things, and that permission has maybe not propagated fast enough.
As promised by the Message, AKS continues to retry: and the cluster creation does succeed. That then gets awkward: the terraform state says that there is no cluster, but there is, and so further attempts to create or otherwise manage it fail.
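One manual way out of that state, assuming the cluster Azure eventually created is otherwise correct, is to import it into the terraform state so that subsequent plans reconcile instead of failing. A sketch, with placeholder subscription, resource group, and cluster names:

```shell
terraform import azurerm_kubernetes_cluster.example \
  /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/example-rg/providers/Microsoft.ContainerService/managedClusters/example-aks
```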
Is this a case that the azurerm provider could reasonably detect and retry? Or should we be petitioning the Azure team to improve their API? (Reporting failure and then actually succeeding is not very helpful.) Or both, or something else...?
Thanks!