
Add GPU support #3257

Closed
wants to merge 6 commits into from

Conversation

samos123

@samos123 samos123 commented May 30, 2023

WIP, still need to add podman support

This implements GPU support as discussed in #3164

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 30, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: samos123
Once this PR has been reviewed and has the lgtm label, please assign bentheelder for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the area/provider/docker Issues or PRs related to docker label May 30, 2023
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 30, 2023
@samos123 samos123 changed the title from "Add GPU support for docker" to "Add GPU support" May 30, 2023
@samos123 samos123 marked this pull request as draft May 30, 2023 07:00
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2023
@@ -118,6 +118,10 @@ type Node struct {
// binded to a host Port
ExtraPortMappings []PortMapping `yaml:"extraPortMappings,omitempty" json:"extraPortMappings,omitempty"`

// GPUs allows to access GPU devices from the kind node. Setting this to
// "all" will pass all the available GPUs to the kind node.
Gpus string `yaml:"gpus,omitempty" json:"gpus,omitempty"`
Member

@neolit123 neolit123 May 30, 2023


use GPUs string?

All letters in the acronym should have the same case, using the appropriate case for the situation

https://github.com/zecke/Kubernetes/blob/master/docs/devel/api-conventions.md#naming-conventions

Member


+1, also this field should be validated (with "all" being the only valid value currently), and we should put a note that in the future we'll look at supporting specifying specific devices.

Member


Commented more below and on the issue.

Author


done

@@ -98,6 +98,10 @@ type Node struct {
// binded to a host Port
ExtraPortMappings []PortMapping

// GPUs allows to access GPU devices from the kind node. Setting this to
// "all" will pass all the available GPUs to the kind node.
Gpus string
Member


same

Author


done

@@ -285,6 +285,20 @@ nodes:

**Note**: Kubernetes versions are expressed as x.y.z, where x is the major version, y is the minor version, and z is the patch version, following [Semantic Versioning](https://semver.org/) terminology. For more information, see [Kubernetes Release Versioning.](https://github.com/kubernetes/sig-release/blob/master/release-engineering/versioning.md#kubernetes-release-versioning)

### GPU Support

Kind nodes can utilize GPUs by setting the following:
Member


Suggested change:
- Kind nodes can utilize GPUs by setting the following:
+ Kind nodes can utilize GPU devices from the host, by setting the following:

aligning with the API godoc comment

Author


done
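
For context, this is roughly how the `gpus` field proposed in this PR would have been used in a cluster config (the PR was ultimately closed, so this is illustrative only and not a supported kind option):

kind create cluster --config - <<EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  # field proposed in this PR; validation only accepted "all"
  gpus: "all"
EOF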

@@ -255,6 +255,10 @@ func runArgsForNode(node *config.Node, clusterIPFamily config.ClusterIPFamily, n
args = append(args, "-e", "KUBECONFIG=/etc/kubernetes/admin.conf")
}

if len(node.Gpus) > 0 {
args = append(args, fmt.Sprintf("--gpus=%v", node.Gpus))
Member


let's not plumb this directly since the values are incompatible across backends

instead we can have a more structured format in the internal type (for now gpus true/false) and we need validation on the external type.

Author


I added validation to ensure that right now only "all" can be passed, so it's now safe to use node.Gpus directly in the docker provisioner. I prefer to keep the code simpler and pass it through instead of adding another layer of validation in the docker provisioning code. Happy to add additional validation in docker/provision.go if you have a strong opinion on it.

For podman we cannot plumb it through directly, so I will work out a way to convert node.Gpus into a podman-compatible format. I don't have podman installed myself.

@samos123 samos123 mentioned this pull request May 30, 2023
Only 'all' will be supported for now
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 1, 2023
@samos123
Author

samos123 commented Jun 1, 2023

One interesting thing I had to do was run the following after the kind cluster came up:

helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=false

docker exec -ti kind-control-plane bash
systemctl restart containerd
ln -s /sbin/ldconfig /sbin/ldconfig.real

So I'm thinking there might need to be a change to the base image to include that symlink. This could also be due to my system being on Arch Linux.

@k8s-ci-robot k8s-ci-robot added the area/provider/podman Issues or PRs related to podman label Jun 1, 2023
@samos123 samos123 marked this pull request as ready for review June 1, 2023 06:42
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 1, 2023
@samos123 samos123 requested a review from BenTheElder June 1, 2023 06:42
@k8s-ci-robot
Contributor

@samos123: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
pull-kind-conformance-parallel-dual-stack-ipv4-ipv6 | cb9b34b | link | true | /test pull-kind-conformance-parallel-dual-stack-ipv4-ipv6

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@klueska

klueska commented Jun 1, 2023

Sorry to be a party pooper, but I do not believe this is the right approach for adding GPU support to kind.

The --gpus flag in docker has always been a bit of a hack, and not something that we (NVIDIA) would like to support long term. The proper way to support GPUs (and any device for that matter) in kind is to expose the --device flag through the kind config file rather than the specialized --gpus flag.

With the upcoming CDI support that will be available in Docker 25, having the --device flag exposed will be sufficient to get equivalent functionality that the --gpus flag provides today (and more).

Once this is available in docker, both of these will work in a unified way:

podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L

@klueska

klueska commented Jun 26, 2023

It occurred to me over the weekend that we can actually enable GPU support in kind today by leveraging a feature we added to the nvidia-container-toolkit quite some time ago:

Read list of GPU devices from volume mounts instead of NVIDIA_VISIBLE_DEVICES

It relies on passing the list of GPUs you want to inject as volume mounts rather than via an environment variable.

Steps to enable this:

  1. Add nvidia as your default runtime in /etc/docker/daemon.json
  2. Restart docker (as necessary)
  3. Set accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml
  4. Add the following to any kind nodes you want to have access to all GPUs in the system:
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all

I've tested this in my local environment and it works as expected. The only caveat is that it only works for passing all GPUs into a node (i.e. I can't pick and choose some GPUs for one node and another set of GPUs for another node).

In general the nvidia-visible-devices-as-volume-mount feature allows you to do fine-grained injection of a subset of GPUs into a container. However, kind runs all of its nodes in --privileged containers, meaning that there is no way to prevent it from seeing all GPUs (even if you tell it you only want it to see a subset of the GPUs). If / when kind ever supports running worker nodes without --privileged this restriction should go away.
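
For steps 1 and 3 above, a minimal configuration sketch for a typical Linux host (paths assume a standard nvidia-container-toolkit install; adjust for your distro):

# 1. /etc/docker/daemon.json should end up containing something like:
#    {
#      "default-runtime": "nvidia",
#      "runtimes": {
#        "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }
#      }
#    }
# 2. restart docker so the new default runtime takes effect
sudo systemctl restart docker
# 3. turn on the volume-mounts device list feature
sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' \
  /etc/nvidia-container-runtime/config.toml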

@grokspeed

Steps to enable this:

  1. Add nvidia as your default runtime in /etc/docker/daemon.json
    [...]

Thanks @klueska. I have a Linux server at home, but I have been running K8s training for my team using kind on their W11 laptops. I would like to dive into a Kubeflow/Notebooks-with-GPUs scenario (I know, overkill if not for the training purpose; I could have just used WSL2+CUDA). Would you be able to translate the above to a Docker Desktop on Windows environment?

The daemon config is accessible via the Docker Desktop UI, but I need some help with the nvidia-container-runtime/config.toml location on Windows. Though, as I read the NVIDIA pages on the container runtime and toolkit, I came away thinking they may not be applicable to WSL2, since:

docker run -it --rm --gpus all ubuntu nvidia-smi

seems to work after just installing CUDA support. But Kubeflow seems different, in that the NVIDIA runtime is still needed for the new containers I would be building... if I understand the container runtime concept correctly?

@samos123
Author

samos123 commented Aug 20, 2023

Edit: I posted a full tutorial on how to configure Kind + GPU support here: https://www.substratus.ai/blog/kind-with-gpus

I confirmed that the steps provided by @klueska worked, except for one minor issue. The only thing I had to do was run this:

docker exec -ti kind-control-plane ln -s /sbin/ldconfig /sbin/ldconfig.real

Should that symlink be included in the base image?

Here were the full steps I used to verify:

nvidia_config_file="/etc/nvidia-container-runtime/config.toml"
if [ -e "${nvidia_config_file}" ]; then
  sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' "${nvidia_config_file}"
fi

kind create cluster --name kind --config - <<EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  image: kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72
  extraPortMappings:
  - containerPort: 30080
    hostPort: 30080
  # part of GPU workaround
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
EOF

docker exec -ti kind-control-plane ln -s /sbin/ldconfig /sbin/ldconfig.real || true

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator --set driver.enabled=false
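
(Not part of the original steps: one way to sanity-check the result, once the gpu-operator pods are running and the node advertises the usual nvidia.com/gpu resource, is a throwaway pod that runs nvidia-smi; the image tag below is only an example.)

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04  # example tag; pick one compatible with your driver
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# once the pod has completed:
kubectl logs gpu-smoke-test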

@samos123 samos123 closed this Aug 20, 2023
samos123 added a commit to substratusai/runbooks that referenced this pull request Aug 21, 2023
@klueska

klueska commented Aug 21, 2023

@samos123 You shouldn't need to create that symlink. Can you show me the contents of your /etc/nvidia-container-runtime/config.toml file?

@samos123
Author

@klueska content of my file: https://gist.github.com/samos123/cc816b91a7a03651c71441e0949c3bb6

Note that I don't seem to be the only one hitting this issue: NVIDIA/nvidia-docker#614 (comment). That's the source of the workaround.

@xussof

xussof commented Sep 12, 2023

Following @samos123's steps, I can't fully deploy nvidia-device-plugin-daemonset and nvidia-operator-validator when I have more than one node deployed, using this kind config:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: siv-dev-gpu
nodes:
  - role: control-plane
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"        
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP

  - role: worker
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev

  - role: worker
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all

This is the error I get:


Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

But when I deploy it on a kind cluster with a control-plane and a single worker, it works.

@samos123
Author

I think that's a limitation of the approach. Maybe @klueska has another cool workaround to solve that? :D

@klueska

klueska commented Sep 12, 2023

Only one of your workers has:

      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all

@xussof

xussof commented Sep 12, 2023

Only one of your workers has:

      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all

Yes, because I only want one node to have a GPU; I want to emulate that the others don't have one. Still, it fails if I add the hostPath to the other worker as well:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: siv-dev-gpu
nodes:
  - role: control-plane
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"        
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP

  - role: worker
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
        
  - role: worker
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all


Error from the nvidia-device-plugin-daemonset:

NVIDIA_DRIVER_ROOT=/ CONTAINER_DRIVER_ROOT=/host Starting nvidia-device-plugin I0912 18:33:00.988910 1 main.go:154] Starting FS watcher. E0912 18:33:00.988987 1 main.go:123] failed to create FS watcher: too many open files

Error on nvidia-operator-validator:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

@xussof

xussof commented Sep 12, 2023

Never mind, it was a problem with my sysctl limits, related to this topic:
NVIDIA/gpu-operator#441
The solution was to increase the maximum:
NVIDIA/gpu-operator#441 (comment)

Now it works with 2 worker nodes! Thanks
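
(For anyone else hitting the "failed to create FS watcher: too many open files" error above: the fix referenced in that issue boils down to raising the host's inotify limits; the exact values below are just commonly used examples.)

sudo sysctl -w fs.inotify.max_user_watches=524288
sudo sysctl -w fs.inotify.max_user_instances=512
# persist across reboots
echo 'fs.inotify.max_user_watches=524288' | sudo tee -a /etc/sysctl.conf
echo 'fs.inotify.max_user_instances=512'  | sudo tee -a /etc/sysctl.conf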

@cceyda

cceyda commented Nov 20, 2023

I had to downgrade the NVIDIA driver from 545 to 535 for it to work, based on the compatibility notes for the gpu-operator:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/release-notes.html
Then I followed the tutorial in #3257 (comment). (Thank you)

@jiangxiaobin96

jiangxiaobin96 commented Dec 19, 2023

It occurred to me over the weekend that we can actually enable GPU support in kind today by leveraging a feature we added to the nvidia-container-toolkit quite some time ago:

Read list of GPU devices from volume mounts instead of NVIDIA_VISIBLE_DEVICES

It relies on passing the list of GPUs you want to inject as volume mounts rather than via an environment variable.

Steps to enable this:

  1. Add nvidia as your default runtime in /etc/docker/daemon.json
  2. Restart docker (as necessary)
  3. Set accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml
  4. Add the following to any kind nodes you want to have access to all GPUs in the system:
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all

I've tested this in my local environment and it works as expected. The only caveat is that it only works for passing all GPUs into a node (i.e. I can't pick and choose some GPUs for one node and another set of GPUs for another node).

In general the nvidia-visible-devices-as-volume-mount feature allows you to do fine-grained injection of a subset of GPUs into a container. However, kind runs all of its nodes in --privileged containers, meaning that there is no way to prevent it from seeing all GPUs (even if you tell it you only want it to see a subset of the GPUs). If / when kind ever supports running worker nodes without --privileged this restriction should go away.

Hello, I followed this comment and hit a new error:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: mount error: stat failed: /proc/driver/nvidia/capabilities: no such file or directory: unknown

How can I solve this "/proc/driver/nvidia/capabilities: no such file or directory" problem?

@xihajun

xihajun commented Dec 20, 2023

Can anyone help with this issue?

Creating cluster "new-cluster" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✓ Preparing nodes 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✗ Joining worker nodes 🚜
Deleted nodes: ["new-cluster-control-plane" "new-cluster-worker" "new-cluster-worker2"]
ERROR: failed to create cluster: failed to join node with kubeadm: command "docker exec --privileged new-cluster-worker kubeadm join --config /kind/kubeadm.conf --skip-phases=preflight --v=6" failed with error: exit status 1
Command Output: I1220 00:26:42.732281     158 join.go:412] [preflight] found NodeName empty; using OS hostname as NodeName
I1220 00:26:42.732373     158 joinconfiguration.go:76] loading configuration from "/kind/kubeadm.conf"
I1220 00:26:42.734232     158 controlplaneprepare.go:225] [download-certs] Skipping certs download
I1220 00:26:42.734258     158 join.go:529] [preflight] Discovering cluster-info
I1220 00:26:42.734283     158 token.go:80] [discovery] Created cluster-info discovery client, requesting info from "new-cluster-control-plane:6443"
I1220 00:26:42.754681     158 round_trippers.go:553] GET https://new-cluster-control-plane:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s 200 OK in 19 milliseconds

Solved it; it was out of RAM.

But now got

│ Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!                  │

@jiangxiaobin96

Can anyone help with this issue?

Creating cluster "new-cluster" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✓ Preparing nodes 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✗ Joining worker nodes 🚜
Deleted nodes: ["new-cluster-control-plane" "new-cluster-worker" "new-cluster-worker2"]
ERROR: failed to create cluster: failed to join node with kubeadm: command "docker exec --privileged new-cluster-worker kubeadm join --config /kind/kubeadm.conf --skip-phases=preflight --v=6" failed with error: exit status 1
Command Output: I1220 00:26:42.732281     158 join.go:412] [preflight] found NodeName empty; using OS hostname as NodeName
I1220 00:26:42.732373     158 joinconfiguration.go:76] loading configuration from "/kind/kubeadm.conf"
I1220 00:26:42.734232     158 controlplaneprepare.go:225] [download-certs] Skipping certs download
I1220 00:26:42.734258     158 join.go:529] [preflight] Discovering cluster-info
I1220 00:26:42.734283     158 token.go:80] [discovery] Created cluster-info discovery client, requesting info from "new-cluster-control-plane:6443"
I1220 00:26:42.754681     158 round_trippers.go:553] GET https://new-cluster-control-plane:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s 200 OK in 19 milliseconds

Solved it; it was out of RAM.

But now got

│ Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!                  │

Your CUDA version and driver version do not match.
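
(A quick way to check the mismatch, assuming nvidia-smi is available on the host: its banner prints both the driver version and the highest CUDA version that driver supports, which must be at least as new as the CUDA runtime used by the workload.)

# host: the banner shows "Driver Version: ..." and "CUDA Version: ..." (max supported by the driver)
nvidia-smi
# workload side: check the CUDA toolkit/runtime version it was built against, if nvcc is present
nvcc --version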

creatorrr added a commit to creatorrr/skypilot that referenced this pull request Jan 2, 2024
Add nvidia-driver host map to kind node. Edited `generate_kind_config.py` to add:
```yaml
  # part of GPU workaround
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
```


For more details, see: kubernetes-sigs/kind#3257 (comment)
@alexeadem

alexeadem commented Jan 30, 2024

Steps to enable this:

  1. Add nvidia as your default runtime in /etc/docker/daemon.json
    [...]

Thanks @klueska. I have a Linux server at home, but I have been running K8s training for my team using kind on their W11 laptops. I would like to dive into a Kubeflow/Notebooks-with-GPUs scenario (I know, overkill if not for the training purpose; I could have just used WSL2+CUDA). Would you be able to translate the above to a Docker Desktop on Windows environment?

The daemon config is accessible via the Docker Desktop UI, but I need some help with the nvidia-container-runtime/config.toml location on Windows. Though, as I read the NVIDIA pages on the container runtime and toolkit, I came away thinking they may not be applicable to WSL2, since:

docker run -it --rm --gpus all ubuntu nvidia-smi

seems to work after just installing CUDA support. But Kubeflow seems different, in that the NVIDIA runtime is still needed for the new containers I would be building... if I understand the container runtime concept correctly?

You may want to look at this comment on running Kubeflow, gpu-operator, and WSL2 in qbo with kind images, or directly in kind.
