ubuntu 22.04: pods get killed when using nvidia-runtime and pod resources differ between limits and requests #7334

Closed
maaft opened this issue Apr 21, 2023 · 1 comment

maaft commented Apr 21, 2023

Not sure who's responsible here (cri, k3s, or nvidia-device-plugin?).

Related issue for k8s-device-plugin: https://gitlab.com/nvidia/kubernetes/device-plugin/-/issues/7

The bug is not present on Ubuntu 20.04!

tl;dr: The error occurs whenever any resource requests and limits differ, e.g. requests.cpu = 1 and limits.cpu = 2 -> the pod gets killed. It doesn't matter whether the NVIDIA GPU resource is actually requested or not.
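
Since a mismatch between requests and limits is what takes a pod out of the Guaranteed QoS class, one quick check (my assumption that the QoS difference is what matters here; nothing in this issue confirms it) is to compare the QoS class of the two test pods from the repro steps below:

```sh
# requests == limits for every container -> "Guaranteed"; any mismatch -> "Burstable"
# (assumption: this QoS difference is what triggers the kill -- not confirmed here)
kubectl get pod test test2 -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
```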

Steps to reproduce

0. install Ubuntu 22.04 on the GPU node
1. nodes needed: 1 node with a GPU, 1 node without
2. install the k3s server on the node without a GPU
3. install the k3s agent on the node with the GPU (a minimal install sketch follows this list)
4. install the nvidia-device-plugin using the RuntimeClass and DaemonSet manifest below
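
For steps 2 and 3, a minimal sketch of the k3s install, assuming the standard https://get.k3s.io install script (the server address and node token are placeholders):

```sh
# on the server node (no GPU): install the k3s server
curl -sfL https://get.k3s.io | sh -

# read the join token on the server
sudo cat /var/lib/rancher/k3s/server/node-token

# on the agent node (GPU): join the cluster (<server-ip> and <node-token> are placeholders)
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<node-token> sh -
```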

```yaml
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: "nvidia"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
          name: nvidia-device-plugin-ctr
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```
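
To apply the device-plugin manifest above (step 4) and confirm the plugin pod comes up, something like this should work (the manifest filename here is just a placeholder):

```sh
# apply the RuntimeClass + DaemonSet above (filename is a placeholder)
kubectl apply -f nvidia-device-plugin.yaml

# the plugin pod should end up Running on the GPU node
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide
```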

5. verify that the GPU is allocatable with `kubectl describe node agent`
6. create the following test pod and observe that it is killed after roughly 5 seconds (the termination grace period); a way to watch this is sketched after step 7:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  restartPolicy: "Never"
  runtimeClassName: "nvidia"
  terminationGracePeriodSeconds: 5
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:12.1.0-base-ubuntu18.04"
    command:
      - "sleep"
    args:
      - "infinity"
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 1Gi
```

7. create the following pod and observe that it is not killed and keeps running indefinitely:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test2
spec:
  restartPolicy: "Never"
  runtimeClassName: "nvidia"
  terminationGracePeriodSeconds: 5
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:12.1.0-base-ubuntu18.04"
    command:
      - "sleep"
    args:
      - "infinity"
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
```
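
One way to observe the difference between the two pods (steps 6 and 7), using the pod names from the manifests above:

```sh
# watch both pods: "test" is terminated after ~5 s, "test2" keeps running
kubectl get pods test test2 -w

# inspect why "test" was terminated (container state and recent events)
kubectl describe pod test
kubectl get events --field-selector involvedObject.name=test
```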

Would be awesome if anyone could reproduce this!

brandond (Contributor) commented

Duplicate of #7130 - but as discussed there, it appears that something in the nvidia device plugin is doing this, not K3s or containerd.
