
tests: list the current k8s pods #8772

Merged

Conversation

danmihai1
Contributor

@danmihai1 danmihai1 commented Jan 4, 2024

Log the list of the current pods between tests because these pods might be related to cluster nodes occasionally running out of memory.

Fixes: #8769
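
For context, the change amounts to logging the current pod list between tests. A minimal sketch of the idea, in the style of the bash test helpers (the helper name and exact placement are assumptions, not the code from this PR):

# Hypothetical helper: dump the pods currently running on the cluster, so
# that any leaked/zombie pods show up in the CI logs between tests.
k8s_print_all_pods() {
    echo "Current pods on the cluster:"
    kubectl get pods --all-namespaces -o wide || true
}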

@katacontainersbot katacontainersbot added the size/small Small and simple task label Jan 4, 2024
@danmihai1 danmihai1 added wip Work in Progress (PR incomplete - needs more work or rework) ok-to-test labels Jan 4, 2024
@katacontainersbot katacontainersbot added size/large Task of significant size and removed size/small Small and simple task labels Jan 4, 2024
@stevenhorsman
Member

Hey @danmihai1 - I might have misunderstood something, but it looks like wait=true is already the default value, per https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#delete ?

@danmihai1
Contributor Author

danmihai1 commented Jan 4, 2024

Hey @danmihai1 - I might have misunderstood something, but it looks like wait=true is already the default value, per https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#delete ?

Thanks Steve! I didn't notice the Default column in that doc when I looked at it yesterday.

It looks like "delete" returns without waiting for some of the K8s YAML files that I tested. Unfortunately --wait=true doesn't change that behavior. For example, my 3 VMs below are still around after deleting my Deployment. I will keep looking for a way to avoid those temporary zombie VMs - unless you already know what the solution is.

k delete --wait=true -f appsv1deployment.yaml && k get pods
deployment.apps "expose-test-deployment" deleted
NAME                                                    READY   STATUS
expose-test-deployment-5d5f68bb97-bln6q                 1/1     Terminating
expose-test-deployment-5d5f68bb97-s5nrl                 1/1     Terminating
expose-test-deployment-5d5f68bb97-xbspg                 1/1     Terminating

This delete behavior is the same with vanilla runc containers, not just with Kata. My YAML file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: expose-test-deployment
  labels:
    name: expose-test-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      name: nginx
  template:
    metadata:
      labels:
        name: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80

@stevenhorsman
Member

It looks like "delete" returns without waiting for some of the K8s YAML files that I tested. Unfortunately --wait=true doesn't change that behavior. For example, my 3 VMs below are still around after deleting my Deployment.

Thanks a lot for the recreate steps; they let me have a play. I think what I've seen so far is that kubectl delete waits for the deployment resource to be deleted, which seems to work, but not for the pods:

# kubectl delete -f deploy.yaml && kubectl get deployment &&  kubectl get pods
deployment.apps "expose-test-deployment" deleted
No resources found in default namespace.
NAME                                      READY   STATUS        RESTARTS   AGE
expose-test-deployment-5d5f68bb97-2kbkz   1/1     Terminating   0          109s
expose-test-deployment-5d5f68bb97-glpq6   1/1     Terminating   0          109s
expose-test-deployment-5d5f68bb97-mmf5r   1/1     Terminating   0          109s

I'll have a play and see if I can find a nicer way to wait for the pods as well, and if I do, I'll let you know.

@stevenhorsman
Member

So, on a first attempt, the only way I can see to wait for the deletion of the pods in a deployment is to use the deployment's selector. For example, the deployment above has the selector name: nginx, so I can do the following:

kubectl delete -f deploy.yaml && kubectl wait --for delete pod --selector=name=nginx && kubectl get pods
deployment.apps "expose-test-deployment" deleted
pod/expose-test-deployment-5d5f68bb97-bp6z9 condition met
pod/expose-test-deployment-5d5f68bb97-dlm5f condition met
pod/expose-test-deployment-5d5f68bb97-fscjk condition met
No resources found in default namespace.

That's admittedly not super helpful for the generic case, where we don't necessarily know the labels, but from a quick scan through the PR, it looks like most of the cases are directly deleting the pod, which should just wait OK, IIUC?
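
For the generic case, one option would be to read the selector back from the deployment object before deleting it and feed that into kubectl wait. A rough sketch of such a helper (hypothetical, not part of this PR; it assumes a single-deployment YAML file, a matchLabels-style selector, and that jq is available):

# Delete a deployment described by a YAML file and block until its pods are gone.
delete_deployment_and_wait() {
    local yaml_file="$1"
    local deployment selector

    # Name of the deployment defined in the YAML file.
    deployment=$(kubectl get -f "${yaml_file}" -o jsonpath='{.metadata.name}')

    # Flatten .spec.selector.matchLabels into "key1=value1,key2=value2".
    selector=$(kubectl get deployment "${deployment}" -o json \
        | jq -r '.spec.selector.matchLabels | to_entries | map("\(.key)=\(.value)") | join(",")')

    kubectl delete -f "${yaml_file}"

    # Should return promptly if the matching pods are already gone.
    kubectl wait --for=delete pod --selector="${selector}" --timeout=120s
}

With the example above, this would be called as delete_deployment_and_wait appsv1deployment.yaml.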

@danmihai1
Contributor Author

That sounds very promising, thanks! I will give this a try when I can come back to it - hopefully in the next couple of days.

Log the list of the current pods between tests because these pods
might be related to cluster nodes occasionally running out of memory.

Fixes: kata-containers#8769

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
@katacontainersbot katacontainersbot added size/tiny Smallest and simplest task and removed size/large Task of significant size labels Jan 5, 2024
@danmihai1 danmihai1 changed the title tests: wait in "kubectl delete" tests: list the current k8s pods Jan 6, 2024
@danmihai1 danmihai1 removed the wip Work in Progress (PR incomplete - needs more work or rework) label Jan 6, 2024
@danmihai1
Contributor Author

/test

@danmihai1
Contributor Author

danmihai1 commented Jan 8, 2024

It would be nice to see some evidence that Kata CI is dealing with leaked pods (short-term or longer-term) before making relatively intrusive changes to wait for pod termination. I didn't find such evidence here, so maybe my guess about zombie pods running the node out of memory was incorrect.

So, I think we could proceed in one of the following ways:

  1. Merge this PR and review the new log output when the K8s tests will run out of memory again in the future.
  2. Abandon this PR. (maybe it does more harm than good?!?)

I vote for (1), but I wouldn't mind (2) either.

@stevenhorsman
Member

So, I think we could proceed in one of the following ways:

  1. Merge this PR and review the new log output when the K8s tests will run out of memory again in the future.
  2. Abandon this PR. (maybe it does more harm than good?!?)

I vote for (1), but I wouldn't mind (2) either.

I feel like getting more information is a good plan if we don't quite understand what's going wrong. I don't think the extra logging is excessive or harmful, so I'm okay with it.

Member

@stevenhorsman stevenhorsman left a comment

LGTM - as discussed in the comments. Thanks

@danmihai1 danmihai1 merged commit de61b4d into kata-containers:main Jan 9, 2024
210 of 325 checks passed
@danmihai1 danmihai1 deleted the danmihai1/wait-for-delete branch January 21, 2024 16:07
Labels
ok-to-test size/tiny Smallest and simplest task
Development

Successfully merging this pull request may close these issues.

sporadic kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (cbl-mariner, clh) failures
4 participants