
tests: list the current k8s pods #8772

Merged

Conversation

danmihai1
Contributor

@danmihai1 danmihai1 commented Jan 4, 2024

Log the list of the current pods between tests because these pods might be related to cluster nodes occasionally running out of memory.

Fixes: #8769
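
For context, the change amounts to logging the current pod list between tests. A minimal sketch of the idea, in the style of the bash test helpers (the helper name and exact placement are assumptions, not the code from this PR):

# Hypothetical helper: dump the pods currently running on the cluster, so
# that any leaked/zombie pods show up in the CI logs between tests.
k8s_print_all_pods() {
    echo "Current pods on the cluster:"
    kubectl get pods --all-namespaces -o wide || true
}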

@katacontainersbot katacontainersbot added the size/small Small and simple task label Jan 4, 2024
@danmihai1 danmihai1 added wip Work in Progress (PR incomplete - needs more work or rework) ok-to-test labels Jan 4, 2024
@katacontainersbot katacontainersbot added size/large Task of significant size and removed size/small Small and simple task labels Jan 4, 2024
@stevenhorsman
Member

Hey @danmihai1 - I might have misunderstood something, but it looks like wait=true is already the default value, per https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#delete ?

@danmihai1
Contributor Author

danmihai1 commented Jan 4, 2024

Hey @danmihai1 - I might have misunderstood something, but it looks like wait=true is already the default value, per https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#delete ?

Thanks Steve! I didn't notice the Default column in that doc when I looked at it yesterday.

It looks like "delete" returns without waiting for some of the K8s YAML files that I tested. Unfortunately --wait=true doesn't change that behavior. For example, my 3 VMs below are still around after deleting my Deployment. I will keep looking for a way to avoid those temporary zombie VMs - unless you already know what the solution is.

k delete --wait=true -f appsv1deployment.yaml && k get pods
deployment.apps "expose-test-deployment" deleted
NAME                                                    READY   STATUS
expose-test-deployment-5d5f68bb97-bln6q                 1/1     Terminating
expose-test-deployment-5d5f68bb97-s5nrl                 1/1     Terminating
expose-test-deployment-5d5f68bb97-xbspg                 1/1     Terminating

This delete behavior is the same with vanilla runc containers, not just with Kata. My YAML file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: expose-test-deployment
  labels:
    name: expose-test-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      name: nginx
  template:
    metadata:
      labels:
        name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80

@stevenhorsman
Member

It looks like "delete" returns without waiting for some of the K8s YAML files that I tested. Unfortunately --wait=true doesn't change that behavior. For example, my 3 VMs below are still around after deleting my Deployment.

Thanks a lot for the recreate steps; they let me have a play. I think what I've seen so far is that kubectl delete waits for the deployment resource to be deleted, which seems to work, but not for the pods:

# kubectl delete -f deploy.yaml && kubectl get deployment &&  kubectl get pods
deployment.apps "expose-test-deployment" deleted
No resources found in default namespace.
NAME                                      READY   STATUS        RESTARTS   AGE
expose-test-deployment-5d5f68bb97-2kbkz   1/1     Terminating   0          109s
expose-test-deployment-5d5f68bb97-glpq6   1/1     Terminating   0          109s
expose-test-deployment-5d5f68bb97-mmf5r   1/1     Terminating   0          109s

I'll have a play and see if I can find a nicer way to wait for the pods as well, and if I do, I'll let you know.

@stevenhorsman
Member

So, on a first attempt, the only way I can see to wait for the deletion of the pods in a deployment is to use the deployment's selector. For example, the deployment above has the selector name: nginx, so I can do the following:

kubectl delete -f deploy.yaml && kubectl wait --for delete pod --selector=name=nginx && kubectl get pods
deployment.apps "expose-test-deployment" deleted
pod/expose-test-deployment-5d5f68bb97-bp6z9 condition met
pod/expose-test-deployment-5d5f68bb97-dlm5f condition met
pod/expose-test-deployment-5d5f68bb97-fscjk condition met
No resources found in default namespace.

That's admittedly not super helpful for the generic case, where we don't necessarily know the labels, but from a quick scan through the PR, it looks like most of the cases are directly deleting the pod, which should just wait OK, IIUC?
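
For the generic case, one option would be to read the selector back from the deployment object before deleting it and feed that into kubectl wait. A rough sketch of such a helper (hypothetical, not part of this PR; it assumes a single-deployment YAML file, a matchLabels-style selector, and that jq is available):

# Delete a deployment described by a YAML file and block until its pods are gone.
delete_deployment_and_wait() {
    local yaml_file="$1"
    local deployment selector

    # Name of the deployment defined in the YAML file.
    deployment=$(kubectl get -f "${yaml_file}" -o jsonpath='{.metadata.name}')

    # Flatten .spec.selector.matchLabels into "key1=value1,key2=value2".
    selector=$(kubectl get deployment "${deployment}" -o json \
        | jq -r '.spec.selector.matchLabels | to_entries | map("\(.key)=\(.value)") | join(",")')

    kubectl delete -f "${yaml_file}"

    # Should return promptly if the matching pods are already gone.
    kubectl wait --for=delete pod --selector="${selector}" --timeout=120s
}

With the example above, this would be called as delete_deployment_and_wait appsv1deployment.yaml.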

@danmihai1
Contributor Author

That sounds very promising, thanks! I will give this a try when I can come back to it - hopefully in the next couple of days.

Log the list of the current pods between tests because these pods
might be related to cluster nodes occasionally running out of memory.

Fixes: kata-containers#8769

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
@katacontainersbot katacontainersbot added size/tiny Smallest and simplest task and removed size/large Task of significant size labels Jan 5, 2024
@danmihai1 danmihai1 changed the title tests: wait in "kubectl delete" tests: list the current k8s pods Jan 6, 2024
@danmihai1 danmihai1 removed the wip Work in Progress (PR incomplete - needs more work or rework) label Jan 6, 2024
@danmihai1
Contributor Author

/test

@danmihai1
Contributor Author

danmihai1 commented Jan 8, 2024

It would be nice to see some evidence that Kata CI is dealing with leaked pods (short-term or longer-term) before making relatively intrusive changes to wait for pod termination. I didn't find such evidence here, so maybe my guess about zombie pods running the node out of memory was incorrect.

So, I think we could proceed in one of the following ways:

  1. Merge this PR and review the new log output when the K8s tests will run out of memory again in the future.
  2. Abandon this PR. (maybe it does more harm than good?!?)

I vote for (1), but I wouldn't mind (2) either.

@stevenhorsman
Member

So, I think we could proceed in one of the following ways:

  1. Merge this PR and review the new log output when the K8s tests will run out of memory again in the future.
  2. Abandon this PR. (maybe it does more harm than good?!?)

I vote for (1), but I wouldn't mind (2) either.

I feel like getting more information is a good plan if we don't quite understand what's going wrong. I don't think the extra logging is excessive or harmful, so I'm okay with it.

Member

@stevenhorsman stevenhorsman left a comment

LGTM - as discussed in the comments. Thanks

@danmihai1 danmihai1 merged commit de61b4d into kata-containers:main Jan 9, 2024
210 of 325 checks passed
@danmihai1 danmihai1 deleted the danmihai1/wait-for-delete branch January 21, 2024 16:07
Labels
ok-to-test size/tiny Smallest and simplest task
Development

Successfully merging this pull request may close these issues.

sporadic kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (cbl-mariner, clh) failures
4 participants