Slave pod gets killed shortly after a successful mount, so the GPU cannot be unmounted #22

Closed
jason-gideon opened this issue Dec 6, 2022 · 2 comments
jason-gideon commented Dec 6, 2022

A question for the maintainer: after the mount flow creates the slave pod, is that slave pod supposed to keep running until removeGPU is called?
In my case the slave pod gets killed shortly after it starts running, and the subsequent removeGPU then fails. What could cause it to be killed? Any ideas? Was it evicted? Where should I start troubleshooting? Thanks!

sol-UniServer-R4900-G3:~/go/src/github.com/jason-gideon/GPUMounter/example$ kubectl -n gpu-pool describe pod gpu-pod-slave-pod-6ffc13
Name:                      gpu-pod-slave-pod-6ffc13
Namespace:                 gpu-pool
Priority:                  0
Service Account:           default
Node:                      software-dell-r740-015/10.115.0.253
Start Time:                Tue, 06 Dec 2022 18:46:36 +0800
Labels:                    app=gpu-pool
Annotations:               cni.projectcalico.org/containerID: f3cbb407ae1601047a04a8e322b4eca80abd70df24f9de9e5f105586dd1d98fd
                           cni.projectcalico.org/podIP: 10.42.1.143/32
                           cni.projectcalico.org/podIPs: 10.42.1.143/32
                           k8s.v1.cni.cncf.io/network-status:
                             [{
                                 "name": "",
                                 "ips": [
                                     "10.42.1.143"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
                           k8s.v1.cni.cncf.io/networks-status:
                             [{
                                 "name": "",
                                 "ips": [
                                     "10.42.1.143"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
Status:                    Terminating (lasts <invalid>)
Termination Grace Period:  30s
IP:                        10.42.1.143
IPs:
  IP:           10.42.1.143
Controlled By:  Pod/gpu-pod
Containers:
  gpu-container:
    Container ID:  docker://e7f1f51dd6c3996d93172e1f56b3f955042d6f15726b6fb71745eb2bb6499707
    Image:         alpine:latest
    Image ID:      docker-pullable://alpine@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      while true; do echo this is a gpu pool container; sleep 10;done
    State:          Running
      Started:      Tue, 06 Dec 2022 18:46:41 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jhkdp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-jhkdp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/hostname=software-dell-r740-015
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                    Age   From                          Message
  ----     ------                    ----  ----                          -------
  Normal   Scheduled                 30s   default-scheduler             Successfully assigned gpu-pool/gpu-pod-slave-pod-6ffc13 to software-dell-r740-015
  Warning  OwnerRefInvalidNamespace  30s   garbage-collector-controller  ownerRef [v1/Pod, namespace: gpu-pool, name: gpu-pod, uid: 6c482ef7-9acd-41ab-925e-101e166f75de] does not exist in namespace "gpu-pool"
  Normal   AddedInterface            28s   multus                        Add eth0 [10.42.1.143/32]
  Normal   Pulling                   28s   kubelet                       Pulling image "alpine:latest"
  Normal   Pulled                    26s   kubelet                       Successfully pulled image "alpine:latest" in 2.151216821s
  Normal   Created                   26s   kubelet                       Created container gpu-container
  Normal   Started                   25s   kubelet                       Started container gpu-container
  Normal   Killing                   10s   kubelet                       Stopping container gpu-container
@pokerfaceSad (Owner) commented

@jason-gideon
There is indeed a known issue still to be fixed, #9: the slave pod's QoS class is currently BestEffort, so it may be evicted under high load.

Judging from the events, though, this looks like the same problem as #19: Kubernetes v1.20+ no longer allows an ownerRef to cross namespaces.
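
For reference, here is a minimal sketch (not GPUMounter's actual code) of how a slave pod could be created so that it avoids both problems: explicit, equal requests and limits put it in the Guaranteed QoS class instead of BestEffort, and the ownerReference is only attached when the owner pod lives in the same namespace, as Kubernetes v1.20+ requires. The clientset setup, the helper name createSlavePod, and the CPU/memory values are assumptions for illustration.

```go
// Package slavepod: a sketch of creating a GPU slave pod with Guaranteed QoS
// and a same-namespace ownerReference. Not the project's real implementation.
package slavepod

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createSlavePod (hypothetical helper) creates a slave pod on the same node as
// the owner pod, pinned to the gpu-pool namespace used in the output above.
func createSlavePod(ctx context.Context, cs kubernetes.Interface, owner *corev1.Pod) (*corev1.Pod, error) {
	slave := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      owner.Name + "-slave-pod",
			Namespace: "gpu-pool",
			Labels:    map[string]string{"app": "gpu-pool"},
		},
		Spec: corev1.PodSpec{
			NodeSelector: map[string]string{"kubernetes.io/hostname": owner.Spec.NodeName},
			Containers: []corev1.Container{{
				Name:    "gpu-container",
				Image:   "alpine:latest",
				Command: []string{"/bin/sh", "-c", "while true; do sleep 10; done"},
				Resources: corev1.ResourceRequirements{
					// Equal requests and limits for every container resource
					// => Guaranteed QoS, so the pod is not the first eviction
					// candidate under node pressure (issue #9).
					Limits: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("100m"),
						corev1.ResourceMemory: resource.MustParse("64Mi"),
						"nvidia.com/gpu":      resource.MustParse("1"),
					},
					Requests: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("100m"),
						corev1.ResourceMemory: resource.MustParse("64Mi"),
						"nvidia.com/gpu":      resource.MustParse("1"),
					},
				},
			}},
		},
	}

	// ownerReferences may only point at an object in the same namespace.
	// On v1.20+ a cross-namespace ownerRef triggers the OwnerRefInvalidNamespace
	// event seen above and the garbage collector deletes the slave pod.
	if owner.Namespace == slave.Namespace {
		slave.OwnerReferences = []metav1.OwnerReference{
			*metav1.NewControllerRef(owner, corev1.SchemeGroupVersion.WithKind("Pod")),
		}
	}

	return cs.CoreV1().Pods(slave.Namespace).Create(ctx, slave, metav1.CreateOptions{})
}
```

With something like this, the slave pod would be evicted only after BestEffort and Burstable pods, and, when the owner pod is in a different namespace, its lifetime would have to be managed by the mounter itself rather than by an ownerRef.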

@jason-gideon (Author) commented

Got it, understood now. Thanks a lot for the helpful reply!
