Slave pod gets killed shortly after a successful mount, so the GPU cannot be unmounted #22

Closed
jason-gideon opened this issue Dec 6, 2022 · 2 comments
jason-gideon commented Dec 6, 2022

A question for the maintainer: after the mount flow creates the slave pod, is that slave pod supposed to keep running until removeGPU is called?
In my case the slave pod gets killed shortly after it starts running, and the subsequent removeGPU then fails. What could cause it to be killed? Any ideas? Was it evicted? Where should I start troubleshooting? Thanks!

sol-UniServer-R4900-G3:~/go/src/github.com/jason-gideon/GPUMounter/example$ kubectl -n gpu-pool describe pod gpu-pod-slave-pod-6ffc13
Name:                      gpu-pod-slave-pod-6ffc13
Namespace:                 gpu-pool
Priority:                  0
Service Account:           default
Node:                      software-dell-r740-015/10.115.0.253
Start Time:                Tue, 06 Dec 2022 18:46:36 +0800
Labels:                    app=gpu-pool
Annotations:               cni.projectcalico.org/containerID: f3cbb407ae1601047a04a8e322b4eca80abd70df24f9de9e5f105586dd1d98fd
                           cni.projectcalico.org/podIP: 10.42.1.143/32
                           cni.projectcalico.org/podIPs: 10.42.1.143/32
                           k8s.v1.cni.cncf.io/network-status:
                             [{
                                 "name": "",
                                 "ips": [
                                     "10.42.1.143"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
                           k8s.v1.cni.cncf.io/networks-status:
                             [{
                                 "name": "",
                                 "ips": [
                                     "10.42.1.143"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
Status:                    Terminating (lasts <invalid>)
Termination Grace Period:  30s
IP:                        10.42.1.143
IPs:
  IP:           10.42.1.143
Controlled By:  Pod/gpu-pod
Containers:
  gpu-container:
    Container ID:  docker://e7f1f51dd6c3996d93172e1f56b3f955042d6f15726b6fb71745eb2bb6499707
    Image:         alpine:latest
    Image ID:      docker-pullable://alpine@sha256:8914eb54f968791faf6a8638949e480fef81e697984fba772b3976835194c6d4
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      while true; do echo this is a gpu pool container; sleep 10;done
    State:          Running
      Started:      Tue, 06 Dec 2022 18:46:41 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jhkdp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-jhkdp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/hostname=software-dell-r740-015
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                    Age   From                          Message
  ----     ------                    ----  ----                          -------
  Normal   Scheduled                 30s   default-scheduler             Successfully assigned gpu-pool/gpu-pod-slave-pod-6ffc13 to software-dell-r740-015
  Warning  OwnerRefInvalidNamespace  30s   garbage-collector-controller  ownerRef [v1/Pod, namespace: gpu-pool, name: gpu-pod, uid: 6c482ef7-9acd-41ab-925e-101e166f75de] does not exist in namespace "gpu-pool"
  Normal   AddedInterface            28s   multus                        Add eth0 [10.42.1.143/32]
  Normal   Pulling                   28s   kubelet                       Pulling image "alpine:latest"
  Normal   Pulled                    26s   kubelet                       Successfully pulled image "alpine:latest" in 2.151216821s
  Normal   Created                   26s   kubelet                       Created container gpu-container
  Normal   Started                   25s   kubelet                       Started container gpu-container
  Normal   Killing                   10s   kubelet                       Stopping container gpu-container
@pokerfaceSad (Owner) commented

@jason-gideon
There is indeed a known issue still to be fixed, #9: the slave pod's QoS class is currently BestEffort, so it may be evicted under high load.

Judging from the events, though, this looks like the same problem as #19: Kubernetes v1.20+ no longer allows an ownerRef to cross namespaces.
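
For reference, here is a minimal sketch (not GPUMounter's actual code) of how a slave pod could be created so that it avoids both problems: explicit, equal requests and limits put it in the Guaranteed QoS class instead of BestEffort, and the ownerReference is only attached when the owner pod lives in the same namespace, as Kubernetes v1.20+ requires. The clientset setup, the helper name createSlavePod, and the CPU/memory values are assumptions for illustration.

```go
// Package slavepod: a sketch of creating a GPU slave pod with Guaranteed QoS
// and a same-namespace ownerReference. Not the project's real implementation.
package slavepod

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createSlavePod (hypothetical helper) creates a slave pod on the same node as
// the owner pod, pinned to the gpu-pool namespace used in the output above.
func createSlavePod(ctx context.Context, cs kubernetes.Interface, owner *corev1.Pod) (*corev1.Pod, error) {
	slave := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      owner.Name + "-slave-pod",
			Namespace: "gpu-pool",
			Labels:    map[string]string{"app": "gpu-pool"},
		},
		Spec: corev1.PodSpec{
			NodeSelector: map[string]string{"kubernetes.io/hostname": owner.Spec.NodeName},
			Containers: []corev1.Container{{
				Name:    "gpu-container",
				Image:   "alpine:latest",
				Command: []string{"/bin/sh", "-c", "while true; do sleep 10; done"},
				Resources: corev1.ResourceRequirements{
					// Equal requests and limits for every container resource
					// => Guaranteed QoS, so the pod is not the first eviction
					// candidate under node pressure (issue #9).
					Limits: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("100m"),
						corev1.ResourceMemory: resource.MustParse("64Mi"),
						"nvidia.com/gpu":      resource.MustParse("1"),
					},
					Requests: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("100m"),
						corev1.ResourceMemory: resource.MustParse("64Mi"),
						"nvidia.com/gpu":      resource.MustParse("1"),
					},
				},
			}},
		},
	}

	// ownerReferences may only point at an object in the same namespace.
	// On v1.20+ a cross-namespace ownerRef triggers the OwnerRefInvalidNamespace
	// event seen above and the garbage collector deletes the slave pod.
	if owner.Namespace == slave.Namespace {
		slave.OwnerReferences = []metav1.OwnerReference{
			*metav1.NewControllerRef(owner, corev1.SchemeGroupVersion.WithKind("Pod")),
		}
	}

	return cs.CoreV1().Pods(slave.Namespace).Create(ctx, slave, metav1.CreateOptions{})
}
```

With something like this, the slave pod would be evicted only after BestEffort and Burstable pods, and, when the owner pod is in a different namespace, its lifetime would have to be managed by the mounter itself rather than by an ownerRef.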

@jason-gideon (Author) commented

Got it, understood now. Thanks a lot for the helpful reply!
