
azurerm_kubernetes_cluster creation reports failure but actually succeeds, leaving state mismatched with reality #9342

Open
dimbleby opened this issue Nov 16, 2020 · 6 comments

Comments

@dimbleby
Contributor

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureRM Provider) Version

Terraform v0.13.5

  • provider registry.terraform.io/hashicorp/azurerm v2.31.1

Affected Resource(s)

  • azurerm_kubernetes_cluster

Debug Output

https://gist.github.com/dimbleby/c6792961f6f6e1f2d044328df953ca99

where the interesting bit is

waiting for creation of Managed Kubernetes Cluster "simpl-cluster-integration-2-aks-k8s" (Resource Group "simpl-cluster-integration-2-aks"): Code="InternalOperationError" Message="AKS encountered an internal error while attempting the requested Creating operation. AKS will continuously retry the requested operation until successful or a retry timeout is hit. Check back to see if the operation requires resubmission. Correlation ID: d0f19ff8-d361-8447-7250-139174c4389d, Operation ID: 0c29feeb-a9d0-418f-b429-8b1e520e120a, Timestamp: 2020-11-16T14:23:54Z, Error: addonIdentityReconciler retry failed: autorest/azure: Service returned an error. Status=403 Code=\"AuthorizationFailed\" Message=\"The client '329189cb-bb7e-47b0-852d-9535eee6569f' with object id '329189cb-bb7e-47b0-852d-9535eee6569f' does not have authorization to perform action 'Microsoft.ManagedIdentity/userAssignedIdentities/write' over scope '/subscriptions/<redacted>/resourceGroups/MC_simpl-cluster-integration-2-aks_simpl-cluster-integration-2-aks-k8s_uksouth/providers/Microsoft.ManagedIdentity/userAssignedIdentities/omsagent-simpl-cluster-integration-2-aks-k8s' or the scope is invalid. If access was recently granted, please refresh your credentials.\"."

Panic Output

Expected Behaviour

Creation of the Kubernetes cluster should either completely succeed or completely fail.

Actual Behaviour

Terraform reports failure, as above, but Azure goes on to create the cluster successfully anyway. The Terraform state is now out of sync with reality, and further management fails.

Steps to Reproduce

  1. terraform apply
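
For illustration, the cluster we create is of roughly this shape (all names, sizes and regions below are placeholders rather than our real values; the omsagent identity in the error suggests the OMS agent add-on is relevant, so it is enabled here, using azurerm 2.x syntax):

resource "azurerm_resource_group" "example" {
  name     = "example-aks"
  location = "UK South"
}

resource "azurerm_log_analytics_workspace" "example" {
  name                = "example-law"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  sku                 = "PerGB2018"
}

resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks-k8s"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  dns_prefix          = "exampleaks"

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_D2_v2"
  }

  # A system-assigned identity: the AKS platform then creates further
  # managed identities (e.g. for omsagent) in the MC_* resource group.
  identity {
    type = "SystemAssigned"
  }

  # Monitoring add-on, azurerm 2.x syntax.
  addon_profile {
    oms_agent {
      enabled                    = true
      log_analytics_workspace_id = azurerm_log_analytics_workspace.example.id
    }
  }
}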

Important Factoids

We don't hit this every time; we suspect it's more likely to happen when creating multiple clusters simultaneously.

That client ID is AzureContainerService:

Get-AzureADObjectByObjectId -ObjectIDs 329189cb-bb7e-47b0-852d-9535eee6569f

ObjectId                             AppId                                DisplayName
--------                             -----                                -----------
329189cb-bb7e-47b0-852d-9535eee6569f 7319c514-987d-4e9b-ac3d-d38c4f427f4c AzureContainerService

So it looks as though Azure is granting itself permission to do things, and that grant has perhaps not propagated fast enough.

As promised by the Message, AKS continues to retry, and the cluster creation does eventually succeed. That then gets awkward: the Terraform state says that there is no cluster, but there is, and so further attempts to create or otherwise manage it fail.

Is this a case that the azurerm provider could reasonably detect and retry? Or should we be petitioning the Azure team to improve their API (reporting failure and then actually succeeding is not very helpful)? Or both, or something else...?

Thanks!

@dimbleby
Contributor Author

I raised this with Azure too, at Azure/AKS#1972

The particular root cause that we saw was a bug at their end, and they are fixing it. So that's good news.

But of course it would still be highly desirable if Terraform could do better when it encounters such things. As I understand the conversation in that issue, "internal error" should be considered retriable - though I'm conscious that I am an expert in neither the Terraform provider nor the Azure API...

Hope this is helpful. Please feel free to hijack that Azure issue to continue the conversation with Azure if that would be useful - or to raise another - they seem friendly!

@mechastorm

mechastorm commented Oct 18, 2022

I should note that we are also seeing this issue with AzureRM Provider 3.28.0 and Terraform 1.2.9. As far as I understand, this is an issue at the Azure API level.

This is the error response we get from Terraform; I assume it's the same issue:

Error: waiting for creation of Cluster: (Managed Cluster Name "myenv-aks" / Resource Group "myenv"): Code="CreateVMSSAgentPoolFailed" Message="Unable to establish connection from agents to Kubernetes API server, please see https://aka.ms/aks-required-ports-and-addresses for more information. Details: Code=\"VMExtensionProvisioningError\" Message=\"VM has reported a failure when processing extension 'vmssCSE'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=51\\n[stdout]\\n{ \\\"ExitCode\\\": \\\"51\\\", \\\"Output\\\": \\\".azmk8s.io 443\\\\n+ '[' 124 -eq 0 ']'\\\\n+ '[' 49 -eq 50 ']'\\\\n+ sleep 1\\\\n+ for i in $(seq 1 $r)\\\\n+ timeout 10 nc -vz myenv-8ed50328.hcp.eastus.azmk8s.io 443\\\\n+ '[' 124 -eq 0 ']'\\\\n+ '[' 50 -eq 50 ']'\\\\n+ return 1\\\\n+ ret=1\\\\n++ date '+%F %T.%3N'\\\\n+ local 'endTime=2022-10-18 05:18:57.288'\\\\n++ jq -n --arg Timestamp '2022-10-18 05:09:48.107' --arg OperationId '2022-10-18 05:18:57.288' --arg Version 1.23 --arg TaskName AKS.CSE.apiserverNC --arg EventLevel Informational --arg Message 'Completed: retrycmd_if_failure 50 1 10 nc -vz myenv-8ed50328.hcp.eastus.azmk8s.io 443' --arg EventPid 0 --arg EventTid 0 '{Timestamp: $Timestamp, OperationId: $OperationId, Version: $Version, TaskName: $TaskName, EventLevel: $EventLevel, Message: $Message, EventPid: $EventPid, EventTid: $EventTid}'\\\\n+ json_string='{\\\\n \\\\\\\"Timestamp\\\\\\\": \\\\\\\"2022-10-18 05:09:48.107\\\\\\\",\\\\n \\\\\\\"OperationId\\\\\\\": \\\\\\\"2022-10-18 05:18:57.288\\\\\\\",\\\\n \\\\\\\"Version\\\\\\\": \\\\\\\"1.23\\\\\\\",\\\\n \\\\\\\"TaskName\\\\\\\": \\\\\\\"AKS.CSE.apiserverNC\\\\\\\",\\\\n \\\\\\\"EventLevel\\\\\\\": \\\\\\\"Informational\\\\\\\",\\\\n \\\\\\\"Message\\\\\\\": \\\\\\\"Completed: retrycmd_if_failure 50 1 10 nc -vz myenv-8ed50328.hcp.eastus.azmk8s.io 443\\\\\\\",\\\\n \\\\\\\"EventPid\\\\\\\": \\\\\\\"0\\\\\\\",\\\\n \\\\\\\"EventTid\\\\\\\": \\\\\\\"0\\\\\\\"\\\\n}'\\\\n+ echo '{' '\\\\\\\"Timestamp\\\\\\\":' '\\\\\\\"2022-10-18' '05:09:48.107\\\\\\\",' '\\\\\\\"OperationId\\\\\\\":' '\\\\\\\"2022-10-18' '05:18:57.288\\\\\\\",' '\\\\\\\"Version\\\\\\\":' '\\\\\\\"1.23\\\\\\\",' '\\\\\\\"TaskName\\\\\\\":' '\\\\\\\"AKS.CSE.apiserverNC\\\\\\\",' '\\\\\\\"EventLevel\\\\\\\":' '\\\\\\\"Informational\\\\\\\",' '\\\\\\\"Message\\\\\\\":' '\\\\\\\"Completed:' retrycmd_if_failure 50 1 10 nc -vz myenv-8ed50328.hcp.eastus.azmk8s.io '443\\\\\\\",' '\\\\\\\"EventPid\\\\\\\":' '\\\\\\\"0\\\\\\\",' '\\\\\\\"EventTid\\\\\\\":' '\\\\\\\"0\\\\\\\"' '}'\\\\n+ '[' 1 '!=' 0 ']'\\\\n+ return 1\\\\n+ nc -vz myenv-8ed50328.hcp.eastus.azmk8s.io 443\\\\nnc: connect to myenv-8ed50328.hcp.eastus.azmk8s.io port 443 (tcp) failed: Connection timed out\\\\n\\\\nreal\\\\t2m10.373s\\\\nuser\\\\t0m0.000s\\\\nsys\\\\t0m0.002s\\\\n+ VALIDATION_ERR=51\\\\n+ [[ ubuntu != \\\\\\\\m\\\\\\\\a\\\\\\\\r\\\\\\\\i\\\\\\\\n\\\\\\\\e\\\\\\\\r ]]\\\\n++ date\\\\n+ echo 'Recreating man-db auto-update flag file and kicking off man-db update process at Tue Oct 18 05:21:07 UTC 2022'\\\\nRecreating man-db auto-update flag file and kicking off man-db update process at Tue Oct 18 05:21:07 UTC 2022\\\\n+ createManDbAutoUpdateFlagFile\\\\n+ touch /var/lib/man-db/auto-update\\\\n+ REBOOTREQUIRED=false\\\\n+ false\\\\n+ [[ UBUNTU == UBUNTU ]]\\\\n+ /usr/lib/apt/apt.systemd.daily\\\\n+ echo 'Custom script finished. API server connection check code:' 51\\\\nCustom script finished. 
API server connection check code: 51\\\\n+ aptmarkWALinuxAgent unhold\\\\n++ date\\\\n++ date\\\\n+ /usr/bin/mandb\\\\n++ hostname\\\\n++ hostname\\\\n+ echo Tue Oct 18 05:21:07 UTC 2022,aks-nodepool-35512790-vmss000000, startAptmarkWALinuxAgent unhold\\\\nTue Oct 18 05:21:07 UTC 2022,aks-nodepool-35512790-vmss000000, startAptmarkWALinuxAgent unhold\\\\n+ wait_for_apt_locks\\\\n+ fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock /var/cache/apt/archives/lock\\\\n+ echo Tue Oct 18 05:21:07 UTC 2022,aks-nodepool-35512790-vmss000000, endcustomscript\\\\n+ mkdir -p /opt/azure/containers\\\\n+ touch /opt/azure/containers/provision.complete\\\\n+ exit 51\\\\n+ retrycmd_if_failure 120 5 25 apt-mark unhold walinuxagent\\\\n+ r=120\\\\n+ w=5\\\\n+ t=25\\\\n+ shift\\\\n+ shift\\\\n+ shift\\\\n++ seq 1 120\\\\n+ for i in $(seq 1 $r)\\\\n+ timeout 25 apt-mark unhold walinuxagent\\\", \\\"Error\\\": \\\"\\\", \\\"ExecDuration\\\": \\\"715\\\", \\\"KernelStartTime\\\": \\\"Tue 2022-10-18 05:08:42 UTC\\\", \\\"CSEStartTime\\\": \\\"Tue Oct 18 05:09:12 UTC 2022\\\", \\\"GuestAgentStartTime\\\": \\\"Tue 2022-10-18 05:09:05 UTC\\\", \\\"SystemdSummary\\\": \\\"Startup finished in 724ms (firmware) + 12.343s (loader) + 5.528s (kernel) + 31.855s (userspace) = 50.451s\\\\ngraphical.target reached after 19.924s in userspace\\\", \\\"BootDatapoints\\\": { \\\"KernelStartTime\\\": \\\"Tue 2022-10-18 05:08:42 UTC\\\", \\\"CSEStartTime\\\": \\\"Tue Oct 18 05:09:12 UTC 2022\\\", \\\"GuestAgentStartTime\\\": \\\"Tue 2022-10-18 05:09:05 UTC\\\", \\\"KubeletStartTime\\\": \\\"Tue 2022-10-18 05:09:47 UTC\\\" } }\\n\\n[stderr]\\n\\\"\\r\\n\\r\\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot \""

Is there a suggested temporary workaround while we wait for an official API fix? So far the only workaround I have found is:

  • Encounter the error on the first apply
  • Run a terraform import command to import the cluster (see the example below)
  • The next apply then succeeds
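
For illustration, the import step looks like this (the resource address, subscription ID and names are placeholders; the second argument is the full Azure resource ID of the cluster):

terraform import azurerm_kubernetes_cluster.example /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/myenv/providers/Microsoft.ContainerService/managedClusters/myenv-aks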

Update

Just a quick note that I managed to fix my issue. The root cause for mine was that, because I had added an IP restriction on the inbound control plane, I also had to add the NAT outbound IP of the AKS nodes (which were attached to a subnet). Once I did that (and recreated the cluster), the Terraform provider could create it with no issue.
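
In configuration terms that amounts to something like the following sketch, assuming the top-level api_server_authorized_ip_ranges argument (still available in azurerm 3.28); the addresses below are illustrative:

resource "azurerm_kubernetes_cluster" "example" {
  # ... other arguments as before ...

  # The authorized ranges must include the nodes' own outbound (NAT) IP,
  # otherwise the vmssCSE extension cannot reach the API server on 443.
  api_server_authorized_ip_ranges = [
    "203.0.113.10/32", # operator / CI range (illustrative)
    "198.51.100.7/32", # NAT outbound IP of the AKS subnet (illustrative)
  ]
}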
