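A question for the maintainers: after the mount flow creates the slave pod, should that slave pod stay alive until removeGPU is called? In my case the slave pod is killed shortly after it reaches Running, and the subsequent removeGPU then fails. What could be killing it? Was it evicted? Where is the best place to start troubleshooting? Thanks!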
sol-UniServer-R4900-G3:~/go/src/github.com/jason-gideon/GPUMounter/example$ kubectl -n gpu-pool describe pod gpu-pod-slave-pod-6ffc13
Name:             gpu-pod-slave-pod-6ffc13
Namespace:        gpu-pool
Priority:         0
Service Account:  default
Node:             software-dell-r740-015/10.115.0.253
Start Time:       Tue, 06 Dec 2022 18:46:36 +0800
Labels:           app=gpu-pool
Annotations:      cni.projectcalico.org/containerID: f3cbb407ae1601047a04a8e322b4eca80abd70df24f9de9e5f105586dd1d98fd
                  cni.projectcalico.org/podIP: 10.42.1.143/32
                  cni.projectcalico.org/podIPs: 10.42.1.143/32
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "", "ips": [ "10.42.1.143" ], "default": true, "dns": {} }]
                  k8s.v1.cni.cncf.io/networks-status: [{ "name": "", "ips": [ "10.42.1.143" ], "default": true, "dns": {} }]
Status:           Terminating (lasts <invalid>)
Termination Grace Period:  30s
IP:               10.42.1.143
IPs:
  IP:           10.42.1.143
Controlled By:  Pod/gpu-pod
Containers:
  gpu-container:
    Container ID:  docker://e7f1f51dd6c3996d93172e1f56b3f955042d6f15726b6fb71745eb2bb6499707
    Image:         alpine:latest
    Image ID:      docker-pullable://alpine@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      while true; do echo this is a gpu pool container; sleep 10;done
    State:          Running
      Started:      Tue, 06 Dec 2022 18:46:41 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jhkdp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-jhkdp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/hostname=software-dell-r740-015
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                    Age   From                          Message
  ----     ------                    ----  ----                          -------
  Normal   Scheduled                 30s   default-scheduler             Successfully assigned gpu-pool/gpu-pod-slave-pod-6ffc13 to software-dell-r740-015
  Warning  OwnerRefInvalidNamespace  30s   garbage-collector-controller  ownerRef [v1/Pod, namespace: gpu-pool, name: gpu-pod, uid: 6c482ef7-9acd-41ab-925e-101e166f75de] does not exist in namespace "gpu-pool"
  Normal   AddedInterface            28s   multus                        Add eth0 [10.42.1.143/32]
  Normal   Pulling                   28s   kubelet                       Pulling image "alpine:latest"
  Normal   Pulled                    26s   kubelet                       Successfully pulled image "alpine:latest" in 2.151216821s
  Normal   Created                   26s   kubelet                       Created container gpu-container
  Normal   Started                   25s   kubelet                       Started container gpu-container
  Normal   Killing                   10s   kubelet                       Stopping container gpu-container
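@jason-gideon There is indeed a known issue, #9, that still needs to be fixed: the slave pod's QoS class is currently BestEffort, so it may be evicted under high load.
However, judging from the events, this looks like the same problem as #19: Kubernetes v1.20+ does not allow an ownerRef to cross namespaces.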
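To illustrate the `OwnerRefInvalidNamespace` warning in the events above, here is a minimal Go sketch of the lookup rule as I understand it: an owner reference carries no namespace of its own, so the owner is searched for only in the dependent's namespace. The `object` and `ownerRef` types and the `ownerResolvable` helper are hypothetical stand-ins, not client-go or GPUMounter code, and the `default` namespace for `gpu-pod` is an assumption, since the thread does not say where the owner pod actually lives.

```go
package main

import "fmt"

// object is a hypothetical, minimal stand-in for a namespaced Kubernetes object.
type object struct {
	Namespace string
	Name      string
	UID       string
}

// ownerRef keeps only the owner-reference fields relevant here.
// Like the real owner reference, it has no namespace field.
type ownerRef struct {
	Kind string
	Name string
	UID  string
}

// ownerResolvable reports whether ref, set on dependent, points at an
// existing object. The search is restricted to the dependent's namespace.
func ownerResolvable(dependent object, ref ownerRef, cluster []object) bool {
	for _, o := range cluster {
		if o.Namespace == dependent.Namespace && o.Name == ref.Name && o.UID == ref.UID {
			return true
		}
	}
	return false
}

func main() {
	uid := "6c482ef7-9acd-41ab-925e-101e166f75de"
	// Assumption: the owner pod runs outside gpu-pool ("default" is made up).
	owner := object{Namespace: "default", Name: "gpu-pod", UID: uid}
	slave := object{Namespace: "gpu-pool", Name: "gpu-pod-slave-pod-6ffc13", UID: "slave-uid"}
	ref := ownerRef{Kind: "Pod", Name: "gpu-pod", UID: uid}

	cluster := []object{owner, slave}
	fmt.Println("cross-namespace owner resolvable:", ownerResolvable(slave, ref, cluster))

	// The same reference resolves once dependent and owner share a namespace.
	slave.Namespace = "default"
	fmt.Println("same-namespace owner resolvable:", ownerResolvable(slave, ref, cluster))
}
```

Under these assumptions the slave pod in `gpu-pool` has an owner that cannot be found, which would match the garbage collector deleting it about 20 seconds after it was scheduled.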
Got it, that makes sense. Thanks for the helpful answer!