
CI: azure aks deploy sometimes fail to find the resources group #8989

Closed
wainersm opened this issue Feb 1, 2024 · 0 comments · Fixed by #8994 or #9007
Labels
area/ci (Issues affecting the continuous integration), bug (Incorrect behaviour), needs-review (Needs to be assessed by the team)

Comments

wainersm (Contributor) commented Feb 1, 2024

Recently I've noticed some CI jobs that rely on AKS provisioning failing because the Azure resource group (which is created on demand before the cluster) is not found.

For example, in https://github.com/kata-containers/kata-containers/actions/runs/7724334694/job/21074189970?pr=8839 and https://github.com/kata-containers/kata-containers/actions/runs/7722583170/job/21076489139?pr=8974 you will see an error similar to:

ERROR: (ResourceGroupNotFound) Resource group 'kataCI-k8s-8839-27c908f10757-qemu-ubuntu-amd64-s' could not be found.
Code: ResourceGroupNotFound
Message: Resource group 'kataCI-k8s-8839-27c908f10757-qemu-ubuntu-amd64-s' could not be found.
{
  "id": "/subscriptions/***/resourceGroups/kataCI-k8s-8839-27c908f10757-qemu-ubuntu-amd64-s",
  "location": "eastus2",
  "managedBy": null,
  "name": "kataCI-k8s-8839-27c908f10757-qemu-ubuntu-amd64-s",
  "properties": {
    "provisioningState": "Succeeded"
  },
  "tags": null,
  "type": "Microsoft.Resources/resourceGroups"
}
WARNING: The behavior of this command has been altered by the following extension: aks-preview
WARNING: SSH key files '/home/runner/.ssh/id_rsa' and '/home/runner/.ssh/id_rsa.pub' have been generated under ~/.ssh to allow SSH access to the VM. If using machines without permanent storage like Azure Cloud Shell without an attached file share, back up your keys to a safe location
ERROR: (ResourceGroupNotFound) Create or update resource group node-kataCI-k8s-8839-27c908f10757-qemu-ubuntu-amd64-s failed. autorest/azure: Service returned an error. Status=404 Code="ResourceGroupNotFound" Message="Resource group 'node-kataCI-k8s-8839-27c908f10757-qemu-ubuntu-amd64-s' could not be found."
Code: ResourceGroupNotFound
Message: Create or update resource group node-kataCI-k8s-8839-27c908f10757-qemu-ubuntu-amd64-s failed. autorest/azure: Service returned an error. Status=404 Code="ResourceGroupNotFound" Message="Resource group 'node-kataCI-k8s-8839-27c908f10757-qemu-ubuntu-amd64-s' could not be found."
Exception Details:	(ResourceGroupNotFound) Resource group 'node-kataCI-k8s-8839-27c908f10757-qemu-ubuntu-amd64-s' could not be found.
	Code: ResourceGroupNotFound
	Message: Resource group 'node-kataCI-k8s-8839-27c908f10757-qemu-ubuntu-amd64-s' could not be found.
Error: Process completed with exit code 1.

At https://github.com/kata-containers/kata-containers/blob/main/tests/gha-run-k8s-common.sh#L68 the creation of the aforementioned resource group is requested, but the script doesn't wait for the resource to effectively exist. That might be the problem.
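
A minimal sketch of what such a wait could look like, assuming the script uses the Azure CLI with the resource group name held in a variable (RG_NAME and LOCATION are hypothetical names, not the script's actual variables):

# Hypothetical variable names; the real code lives in tests/gha-run-k8s-common.sh.
# Request the resource group, then block until Azure reports it as created,
# before any subsequent `az aks create` call tries to use it.
az group create --name "$RG_NAME" --location "$LOCATION"
az group wait --name "$RG_NAME" --created --timeout 300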

wainersm added the bug, needs-review, and area/ci labels Feb 1, 2024
wainersm added a commit to wainersm/kata-containers that referenced this issue Feb 1, 2024
To provision k8s on Azure (AKS), a temporary resource group has to be created first. The script sends the request to create it but doesn't wait for the operation to finish, so sometimes it tries to use a resource group that doesn't exist yet and bails out.

This adds some `az group wait` checkpoints. Even on deletion we want to ensure there won't be dangling resource groups.

Fixes kata-containers#8989
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
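
As a sketch of the deletion-side checkpoint described above (again with a hypothetical variable name), the same `az group wait` command can block until the group is really gone:

# Hypothetical variable name; shown only to illustrate the checkpoint idea.
# Request deletion without blocking, then wait until the group is removed so
# the CI run leaves no dangling resource groups behind.
az group delete --name "$RG_NAME" --yes --no-wait
az group wait --name "$RG_NAME" --deleted --timeout 600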
@katacontainersbot katacontainersbot moved this from To do to In progress in Issue backlog Feb 1, 2024
sprt added a commit to sprt/kata-containers that referenced this issue Feb 1, 2024
This addresses an internal AKS issue that intermittently prevents clusters from getting created. The fix has been rolled out to eastus but not yet to eastus2, so we unblock the CI by switching regions. There are no downsides in general.

This supersedes kata-containers#8990.

Fixes: kata-containers#8989

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
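
For context, the region switch boils down to passing eastus instead of eastus2 as the AKS location; a rough sketch with hypothetical variable names, not the actual CI change:

# Hypothetical variable names; the real change lives in the CI scripts.
LOCATION="eastus"   # previously "eastus2"
az aks create --resource-group "$RG_NAME" --name "$CLUSTER_NAME" --location "$LOCATION" --generate-ssh-keys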
wainersm added a commit to wainersm/kata-containers that referenced this issue Feb 2, 2024
delete_cluster() has tried to delete the az resource group regardless of whether it exists. In some cases the result of that operation is ignored, i.e., it fails with resource group not found, but the log messages get a little dirty. So let's delete the RG only if it exists.

Fixes kata-containers#8989
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
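
A minimal sketch of guarding the deletion as described above (hypothetical variable name, not the script's actual code):

# Only request deletion when Azure reports that the group exists, keeping the
# logs free of ResourceGroupNotFound noise.
if [ "$(az group exists --name "$RG_NAME")" = "true" ]; then
    az group delete --name "$RG_NAME" --yes --no-wait
fi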
c3d pushed a commit to c3d/kata-containers that referenced this issue Feb 23, 2024
This addresses an internal AKS issue that intermittently prevents clusters from getting created. The fix has been rolled out to eastus but not yet to eastus2, so we unblock the CI by switching regions. There are no downsides in general.

This supersedes kata-containers#8990.

Fixes: kata-containers#8989

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
c3d pushed a commit to c3d/kata-containers that referenced this issue Feb 23, 2024
delete_cluster() has tried to delete the az resource group regardless of whether it exists. In some cases the result of that operation is ignored, i.e., it fails with resource group not found, but the log messages get a little dirty. So let's delete the RG only if it exists.

Fixes kata-containers#8989
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>