
azurerm_kubernetes_cluster creation reports failure but actually succeeds, leaving state mismatched with reality #9342

Open
dimbleby opened this issue Nov 16, 2020 · 6 comments

Comments

@dimbleby
Contributor

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureRM Provider) Version

Terraform v0.13.5

  • provider registry.terraform.io/hashicorp/azurerm v2.31.1

Affected Resource(s)

  • azurerm_kubernetes_cluster

Debug Output

https://gist.github.com/dimbleby/c6792961f6f6e1f2d044328df953ca99

where the interesting bit is

waiting for creation of Managed Kubernetes Cluster "simpl-cluster-integration-2-aks-k8s" (Resource Group "simpl-cluster-integration-2-aks"): Code="InternalOperationError" Message="AKS encountered an internal error while attempting the requested Creating operation. AKS will continuously retry the requested operation until successful or a retry timeout is hit. Check back to see if the operation requires resubmission. Correlation ID: d0f19ff8-d361-8447-7250-139174c4389d, Operation ID: 0c29feeb-a9d0-418f-b429-8b1e520e120a, Timestamp: 2020-11-16T14:23:54Z, Error: addonIdentityReconciler retry failed: autorest/azure: Service returned an error. Status=403 Code=\"AuthorizationFailed\" Message=\"The client '329189cb-bb7e-47b0-852d-9535eee6569f' with object id '329189cb-bb7e-47b0-852d-9535eee6569f' does not have authorization to perform action 'Microsoft.ManagedIdentity/userAssignedIdentities/write' over scope '/subscriptions/<redacted>/resourceGroups/MC_simpl-cluster-integration-2-aks_simpl-cluster-integration-2-aks-k8s_uksouth/providers/Microsoft.ManagedIdentity/userAssignedIdentities/omsagent-simpl-cluster-integration-2-aks-k8s' or the scope is invalid. If access was recently granted, please refresh your credentials.\"."

Panic Output

Expected Behaviour

Creation of the Kubernetes cluster should either completely succeed or completely fail.

Actual Behaviour

Terraform reports failure, as above, but Azure goes on to create the cluster successfully anyway. The Terraform state is now out of sync with reality, and further management fails.

Steps to Reproduce

  1. terraform apply
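
For illustration, the cluster we create is of roughly this shape (all names, sizes and regions below are placeholders rather than our real values; the omsagent identity in the error suggests the OMS agent add-on is relevant, so it is enabled here, using azurerm 2.x syntax):

resource "azurerm_resource_group" "example" {
  name     = "example-aks"
  location = "UK South"
}

resource "azurerm_log_analytics_workspace" "example" {
  name                = "example-law"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  sku                 = "PerGB2018"
}

resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks-k8s"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  dns_prefix          = "exampleaks"

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_D2_v2"
  }

  # A system-assigned identity: the AKS platform then creates further
  # managed identities (e.g. for omsagent) in the MC_* resource group.
  identity {
    type = "SystemAssigned"
  }

  # Monitoring add-on, azurerm 2.x syntax.
  addon_profile {
    oms_agent {
      enabled                    = true
      log_analytics_workspace_id = azurerm_log_analytics_workspace.example.id
    }
  }
}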

Important Factoids

We don't hit this every time; we suspect it's more likely to happen when creating multiple clusters simultaneously.

That client ID is AzureContainerService:

Get-AzureADObjectByObjectId -ObjectIDs 329189cb-bb7e-47b0-852d-9535eee6569f

ObjectId                             AppId                                DisplayName
--------                             -----                                -----------
329189cb-bb7e-47b0-852d-9535eee6569f 7319c514-987d-4e9b-ac3d-d38c4f427f4c AzureContainerService

So it looks as though Azure is granting itself permission to do things, and that grant has perhaps not propagated fast enough.

As promised by the Message, AKS continues to retry, and the cluster creation does eventually succeed. That then gets awkward: the Terraform state says that there is no cluster, but there is, and so further attempts to create or otherwise manage it fail.

Is this a case that the azurerm provider could reasonably detect and retry? Or should we be petitioning the Azure team to improve their API (reporting failure and then actually succeeding is not very helpful)? Or both, or something else...?

Thanks!

@dimbleby
Contributor Author

I raised this with Azure too, at Azure/AKS#1972

The particular root cause that we saw was a bug at their end, and they are fixing it. So that's good news.

But of course it would still be highly desirable if Terraform could do better when it encounters such things. As I understand the conversation in that issue, "internal error" should be considered retriable - though I'm conscious that I am an expert in neither the Terraform provider nor the Azure API...

Hope this is helpful. Please feel free to hijack that Azure issue to continue the conversation with Azure if that would be useful - or to raise another - they seem friendly!

@mechastorm

mechastorm commented Oct 18, 2022

I should note that we are also seeing this issue with AzureRM Provider 3.28.0 and Terraform 1.2.9. As far as I understand, this is an issue at the Azure API level.

This is the error response we get from Terraform; I assume it's the same issue:

Error: waiting for creation of Cluster: (Managed Cluster Name "myenv-aks" / Resource Group "myenv"): Code="CreateVMSSAgentPoolFailed" Message="Unable to establish connection from agents to Kubernetes API server, please see https://aka.ms/aks-required-ports-and-addresses for more information. Details: Code=\"VMExtensionProvisioningError\" Message=\"VM has reported a failure when processing extension 'vmssCSE'. Error message: \\\"Enable failed: failed to execute command: command terminated with exit status=51\\n[stdout]\\n{ \\\"ExitCode\\\": \\\"51\\\", \\\"Output\\\": \\\".azmk8s.io 443\\\\n+ '[' 124 -eq 0 ']'\\\\n+ '[' 49 -eq 50 ']'\\\\n+ sleep 1\\\\n+ for i in $(seq 1 $r)\\\\n+ timeout 10 nc -vz myenv-8ed50328.hcp.eastus.azmk8s.io 443\\\\n+ '[' 124 -eq 0 ']'\\\\n+ '[' 50 -eq 50 ']'\\\\n+ return 1\\\\n+ ret=1\\\\n++ date '+%F %T.%3N'\\\\n+ local 'endTime=2022-10-18 05:18:57.288'\\\\n++ jq -n --arg Timestamp '2022-10-18 05:09:48.107' --arg OperationId '2022-10-18 05:18:57.288' --arg Version 1.23 --arg TaskName AKS.CSE.apiserverNC --arg EventLevel Informational --arg Message 'Completed: retrycmd_if_failure 50 1 10 nc -vz myenv-8ed50328.hcp.eastus.azmk8s.io 443' --arg EventPid 0 --arg EventTid 0 '{Timestamp: $Timestamp, OperationId: $OperationId, Version: $Version, TaskName: $TaskName, EventLevel: $EventLevel, Message: $Message, EventPid: $EventPid, EventTid: $EventTid}'\\\\n+ json_string='{\\\\n \\\\\\\"Timestamp\\\\\\\": \\\\\\\"2022-10-18 05:09:48.107\\\\\\\",\\\\n \\\\\\\"OperationId\\\\\\\": \\\\\\\"2022-10-18 05:18:57.288\\\\\\\",\\\\n \\\\\\\"Version\\\\\\\": \\\\\\\"1.23\\\\\\\",\\\\n \\\\\\\"TaskName\\\\\\\": \\\\\\\"AKS.CSE.apiserverNC\\\\\\\",\\\\n \\\\\\\"EventLevel\\\\\\\": \\\\\\\"Informational\\\\\\\",\\\\n \\\\\\\"Message\\\\\\\": \\\\\\\"Completed: retrycmd_if_failure 50 1 10 nc -vz myenv-8ed50328.hcp.eastus.azmk8s.io 443\\\\\\\",\\\\n \\\\\\\"EventPid\\\\\\\": \\\\\\\"0\\\\\\\",\\\\n \\\\\\\"EventTid\\\\\\\": \\\\\\\"0\\\\\\\"\\\\n}'\\\\n+ echo '{' '\\\\\\\"Timestamp\\\\\\\":' '\\\\\\\"2022-10-18' '05:09:48.107\\\\\\\",' '\\\\\\\"OperationId\\\\\\\":' '\\\\\\\"2022-10-18' '05:18:57.288\\\\\\\",' '\\\\\\\"Version\\\\\\\":' '\\\\\\\"1.23\\\\\\\",' '\\\\\\\"TaskName\\\\\\\":' '\\\\\\\"AKS.CSE.apiserverNC\\\\\\\",' '\\\\\\\"EventLevel\\\\\\\":' '\\\\\\\"Informational\\\\\\\",' '\\\\\\\"Message\\\\\\\":' '\\\\\\\"Completed:' retrycmd_if_failure 50 1 10 nc -vz myenv-8ed50328.hcp.eastus.azmk8s.io '443\\\\\\\",' '\\\\\\\"EventPid\\\\\\\":' '\\\\\\\"0\\\\\\\",' '\\\\\\\"EventTid\\\\\\\":' '\\\\\\\"0\\\\\\\"' '}'\\\\n+ '[' 1 '!=' 0 ']'\\\\n+ return 1\\\\n+ nc -vz myenv-8ed50328.hcp.eastus.azmk8s.io 443\\\\nnc: connect to myenv-8ed50328.hcp.eastus.azmk8s.io port 443 (tcp) failed: Connection timed out\\\\n\\\\nreal\\\\t2m10.373s\\\\nuser\\\\t0m0.000s\\\\nsys\\\\t0m0.002s\\\\n+ VALIDATION_ERR=51\\\\n+ [[ ubuntu != \\\\\\\\m\\\\\\\\a\\\\\\\\r\\\\\\\\i\\\\\\\\n\\\\\\\\e\\\\\\\\r ]]\\\\n++ date\\\\n+ echo 'Recreating man-db auto-update flag file and kicking off man-db update process at Tue Oct 18 05:21:07 UTC 2022'\\\\nRecreating man-db auto-update flag file and kicking off man-db update process at Tue Oct 18 05:21:07 UTC 2022\\\\n+ createManDbAutoUpdateFlagFile\\\\n+ touch /var/lib/man-db/auto-update\\\\n+ REBOOTREQUIRED=false\\\\n+ false\\\\n+ [[ UBUNTU == UBUNTU ]]\\\\n+ /usr/lib/apt/apt.systemd.daily\\\\n+ echo 'Custom script finished. API server connection check code:' 51\\\\nCustom script finished. 
API server connection check code: 51\\\\n+ aptmarkWALinuxAgent unhold\\\\n++ date\\\\n++ date\\\\n+ /usr/bin/mandb\\\\n++ hostname\\\\n++ hostname\\\\n+ echo Tue Oct 18 05:21:07 UTC 2022,aks-nodepool-35512790-vmss000000, startAptmarkWALinuxAgent unhold\\\\nTue Oct 18 05:21:07 UTC 2022,aks-nodepool-35512790-vmss000000, startAptmarkWALinuxAgent unhold\\\\n+ wait_for_apt_locks\\\\n+ fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock /var/cache/apt/archives/lock\\\\n+ echo Tue Oct 18 05:21:07 UTC 2022,aks-nodepool-35512790-vmss000000, endcustomscript\\\\n+ mkdir -p /opt/azure/containers\\\\n+ touch /opt/azure/containers/provision.complete\\\\n+ exit 51\\\\n+ retrycmd_if_failure 120 5 25 apt-mark unhold walinuxagent\\\\n+ r=120\\\\n+ w=5\\\\n+ t=25\\\\n+ shift\\\\n+ shift\\\\n+ shift\\\\n++ seq 1 120\\\\n+ for i in $(seq 1 $r)\\\\n+ timeout 25 apt-mark unhold walinuxagent\\\", \\\"Error\\\": \\\"\\\", \\\"ExecDuration\\\": \\\"715\\\", \\\"KernelStartTime\\\": \\\"Tue 2022-10-18 05:08:42 UTC\\\", \\\"CSEStartTime\\\": \\\"Tue Oct 18 05:09:12 UTC 2022\\\", \\\"GuestAgentStartTime\\\": \\\"Tue 2022-10-18 05:09:05 UTC\\\", \\\"SystemdSummary\\\": \\\"Startup finished in 724ms (firmware) + 12.343s (loader) + 5.528s (kernel) + 31.855s (userspace) = 50.451s\\\\ngraphical.target reached after 19.924s in userspace\\\", \\\"BootDatapoints\\\": { \\\"KernelStartTime\\\": \\\"Tue 2022-10-18 05:08:42 UTC\\\", \\\"CSEStartTime\\\": \\\"Tue Oct 18 05:09:12 UTC 2022\\\", \\\"GuestAgentStartTime\\\": \\\"Tue 2022-10-18 05:09:05 UTC\\\", \\\"KubeletStartTime\\\": \\\"Tue 2022-10-18 05:09:47 UTC\\\" } }\\n\\n[stderr]\\n\\\"\\r\\n\\r\\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot \""

Is there a suggested temporary workaround while we wait for an official API fix? So far the only workaround I have found is:

  • Encounter the error on the first apply
  • Run a terraform import command to import the cluster (see the example below)
  • The next apply then succeeds
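
For illustration, the import step looks like this (the resource address, subscription ID and names are placeholders; the second argument is the full Azure resource ID of the cluster):

terraform import azurerm_kubernetes_cluster.example /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/myenv/providers/Microsoft.ContainerService/managedClusters/myenv-aks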

Update

Just a quick note that I managed to fix my issue. The root cause for mine was that, because I had added an IP restriction on the inbound control plane, I also had to add the NAT outbound IP of the AKS nodes (which were attached to a subnet). Once I did that (and recreated the cluster), the Terraform provider could create it with no issue.
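
In configuration terms that amounts to something like the following sketch, assuming the top-level api_server_authorized_ip_ranges argument (still available in azurerm 3.28); the addresses below are illustrative:

resource "azurerm_kubernetes_cluster" "example" {
  # ... other arguments as before ...

  # The authorized ranges must include the nodes' own outbound (NAT) IP,
  # otherwise the vmssCSE extension cannot reach the API server on 443.
  api_server_authorized_ip_ranges = [
    "203.0.113.10/32", # operator / CI range (illustrative)
    "198.51.100.7/32", # NAT outbound IP of the AKS subnet (illustrative)
  ]
}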
