
Add GPU support #3257

Closed
wants to merge 6 commits into from

Conversation

samos123

@samos123 samos123 commented May 30, 2023

WIP, still need to add podman support

This implements GPU support as discussed in #3164

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 30, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: samos123
Once this PR has been reviewed and has the lgtm label, please assign bentheelder for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the area/provider/docker Issues or PRs related to docker label May 30, 2023
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 30, 2023
@samos123 samos123 changed the title from "Add GPU support for docker" to "Add GPU support" May 30, 2023
@samos123 samos123 marked this pull request as draft May 30, 2023 07:00
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 30, 2023
@@ -118,6 +118,10 @@ type Node struct {
// binded to a host Port
ExtraPortMappings []PortMapping `yaml:"extraPortMappings,omitempty" json:"extraPortMappings,omitempty"`

// GPUs allows to access GPU devices from the kind node. Setting this to
// "all" will pass all the available GPUs to the kind node.
Gpus string `yaml:"gpus,omitempty" json:"gpus,omitempty"`
Member

@neolit123 neolit123 May 30, 2023


use GPUs string?

All letters in the acronym should have the same case, using the appropriate case for the situation

https://github.com/zecke/Kubernetes/blob/master/docs/devel/api-conventions.md#naming-conventions

Member


+1, also this field should be validated (with "all" being the only valid value currently), and we should put a note that in the future we'll look at supporting specifying specific devices.

Member


Commented more below and on the issue.

Author


done

@@ -98,6 +98,10 @@ type Node struct {
// binded to a host Port
ExtraPortMappings []PortMapping

// GPUs allows to access GPU devices from the kind node. Setting this to
// "all" will pass all the available GPUs to the kind node.
Gpus string
Member


same

Author


done

@@ -285,6 +285,20 @@ nodes:

**Note**: Kubernetes versions are expressed as x.y.z, where x is the major version, y is the minor version, and z is the patch version, following [Semantic Versioning](https://semver.org/) terminology. For more information, see [Kubernetes Release Versioning.](https://github.com/kubernetes/sig-release/blob/master/release-engineering/versioning.md#kubernetes-release-versioning)

### GPU Support

Kind nodes can utilize GPUs by setting the following:
Member


Suggested change:
- Kind nodes can utilize GPUs by setting the following:
+ Kind nodes can utilize GPU devices from the host, by setting the following:

aligning with the API godoc comment

Author


done
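
For context, this is roughly how the `gpus` field proposed in this PR would have been used in a cluster config (the PR was ultimately closed, so this is illustrative only and not a supported kind option):

kind create cluster --config - <<EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  # field proposed in this PR; validation only accepted "all"
  gpus: "all"
EOF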

@@ -255,6 +255,10 @@ func runArgsForNode(node *config.Node, clusterIPFamily config.ClusterIPFamily, n
args = append(args, "-e", "KUBECONFIG=/etc/kubernetes/admin.conf")
}

if len(node.Gpus) > 0 {
args = append(args, fmt.Sprintf("--gpus=%v", node.Gpus))
Member


let's not plumb this directly since the values are incompatible across backends

instead we can have a more structured format in the internal type (for now gpus true/false) and we need validation on the external type.

Author


I added validation to ensure that right now only "all" can be passed, so it's now safe to use node.Gpus directly in the docker provisioner. I prefer to keep the code simpler and pass it through instead of adding another layer of validation in the docker provisioning code. Happy to add additional validation in docker/provision.go if you have a strong opinion on it.

For podman we cannot plumb it through directly, so I will work out a way to convert node.Gpus into a podman-compatible format. I don't have podman installed myself.

@samos123 samos123 mentioned this pull request May 30, 2023
Only 'all' will be supported for now
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 1, 2023
@samos123
Author

samos123 commented Jun 1, 2023

One interesting thing I had to do was run the following after the kind cluster came up:

helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=false

docker exec -ti kind-control-plane bash
systemctl restart containerd
ln -s /sbin/ldconfig /sbin/ldconfig.real

So I'm thinking there might need to be a change to the base image to include that symlink. This could also be due to my system being on Arch Linux.

@k8s-ci-robot k8s-ci-robot added the area/provider/podman Issues or PRs related to podman label Jun 1, 2023
@samos123 samos123 marked this pull request as ready for review June 1, 2023 06:42
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 1, 2023
@samos123 samos123 requested a review from BenTheElder June 1, 2023 06:42
@k8s-ci-robot
Contributor

@samos123: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
pull-kind-conformance-parallel-dual-stack-ipv4-ipv6 | cb9b34b | link | true | /test pull-kind-conformance-parallel-dual-stack-ipv4-ipv6

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@klueska

klueska commented Jun 1, 2023

Sorry to be a party pooper, but I do not believe this is the right approach for adding GPU support to kind.

The --gpus flag in docker has always been a bit of a hack, and not something that we (NVIDIA) would like to support long term. The proper way to support GPUs (and any device for that matter) in kind is to expose the --device flag through the kind config file rather than the specialized --gpus flag.

With the upcoming CDI support that will be available in Docker 25, having the --device flag exposed will be sufficient to get equivalent functionality that the --gpus flag provides today (and more).

Once this is available in docker, both of these will work in a unified way:

podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L

@klueska

klueska commented Jun 26, 2023

It occurred to me over the weekend that we can actually enable GPU support in kind today by leveraging a feature we added to the nvidia-container-toolkit quite some time ago:

Read list of GPU devices from volume mounts instead of NVIDIA_VISIBLE_DEVICES

It relies on passing the list of GPUs you want to inject as volume mounts rather than via an environment variable.

Steps to enable this:

  1. Add nvidia as your default runtime in /etc/docker/daemon.json
  2. Restart docker (as necessary)
  3. Set accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml
  4. Add the following to any kind nodes you want to have access to all GPUs in the system:
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all

I've tested this in my local environment and it works as expected. The only caveat is that it only works for passing all GPUs into a node (i.e. I can't pick and choose some GPUs for one node and another set of GPUs for another node).

In general the nvidia-visible-devices-as-volume-mount feature allows you to do fine-grained injection of a subset of GPUs into a container. However, kind runs all of its nodes in --privileged containers, meaning that there is no way to prevent it from seeing all GPUs (even if you tell it you only want it to see a subset of the GPUs). If / when kind ever supports running worker nodes without --privileged this restriction should go away.
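
For steps 1 and 3 above, a minimal configuration sketch for a typical Linux host (paths assume a standard nvidia-container-toolkit install; adjust for your distro):

# 1. /etc/docker/daemon.json should end up containing something like:
#    {
#      "default-runtime": "nvidia",
#      "runtimes": {
#        "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] }
#      }
#    }
# 2. restart docker so the new default runtime takes effect
sudo systemctl restart docker
# 3. turn on the volume-mounts device list feature
sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' \
  /etc/nvidia-container-runtime/config.toml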

@grokspeed

Steps to enable this:

  1. Add nvidia as your default runtime in /etc/docker/daemon.json
    [...]

Thanks @klueska. I have a Linux server at home, but I have been running K8s training for my team using kind on their W11 laptops. I would like to dive into a Kubeflow/Notebooks-with-GPUs scenario (I know, overkill if not for the training purpose; I could have just used WSL2+CUDA). Would you be able to translate the above to a Docker Desktop on Windows environment?

The daemon config is accessible via the Docker Desktop UI, but I need some help with the nvidia-container-runtime/config.toml location on Windows. Though, as I read the NVIDIA pages on the container runtime and toolkit, I came away thinking they may not be applicable to WSL2, since:

docker run -it --rm --gpus all ubuntu nvidia-smi

seems to work after just installing CUDA support. But Kubeflow seems different, in that the NVIDIA runtime is still needed for the new containers I would be building... if I understand the container runtime concept correctly?

@samos123
Author

samos123 commented Aug 20, 2023

Edit: I posted a full tutorial on how to configure Kind + GPU support here: https://www.substratus.ai/blog/kind-with-gpus

I confirmed that the steps provided by @klueska worked, except for one minor issue. The only thing I had to do was run this:

docker exec -ti kind-control-plane ln -s /sbin/ldconfig /sbin/ldconfig.real

Should that symlink be included in the base image?

Here were the full steps I used to verify:

nvidia_config_file="/etc/nvidia-container-runtime/config.toml"
if [ -e "${nvidia_config_file}" ]; then
  sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' "${nvidia_config_file}"
fi

kind create cluster --name kind --config - <<EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  image: kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72
  extraPortMappings:
  - containerPort: 30080
    hostPort: 30080
  # part of GPU workaround
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
EOF

docker exec -ti kind-control-plane ln -s /sbin/ldconfig /sbin/ldconfig.real || true

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator --set driver.enabled=false
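
(Not part of the original steps: one way to sanity-check the result, once the gpu-operator pods are running and the node advertises the usual nvidia.com/gpu resource, is a throwaway pod that runs nvidia-smi; the image tag below is only an example.)

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04  # example tag; pick one compatible with your driver
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# once the pod has completed:
kubectl logs gpu-smoke-test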

@samos123 samos123 closed this Aug 20, 2023
samos123 added a commit to substratusai/runbooks that referenced this pull request Aug 21, 2023
@klueska

klueska commented Aug 21, 2023

@samos123 You shouldn't need to create that symlink. Can you show me the contents of your /etc/nvidia-container-runtime/config.toml file?

@samos123
Author

@klueska content of my file: https://gist.github.com/samos123/cc816b91a7a03651c71441e0949c3bb6

Note that I don't seem to be the only one hitting this issue: NVIDIA/nvidia-docker#614 (comment). That's the source of the workaround.

@xussof

xussof commented Sep 12, 2023

Following @samos123's steps, I can't fully deploy nvidia-device-plugin-daemonset and nvidia-operator-validator when I have more than one node deployed, using this kind config:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: siv-dev-gpu
nodes:
  - role: control-plane
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"        
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP

  - role: worker
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev

  - role: worker
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all

This is the error I get:


Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

But when I deploy it on a kind cluster with a control-plane and a single worker, it works.

@samos123
Author

I think that's a limitation of the approach. Maybe @klueska has another cool workaround to solve that? :D

@klueska

klueska commented Sep 12, 2023

Only one of your workers has:

      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all

@xussof

xussof commented Sep 12, 2023

Only one of your workers has:

      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all

Yes, because I only want one node to have a GPU; I want to emulate that the others don't have one. Still, it fails if I add the hostPath to the other worker as well:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: siv-dev-gpu
nodes:
  - role: control-plane
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"        
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP

  - role: worker
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
        
  - role: worker
    image: kindest/node:v1.26.0
    extraMounts:
      - hostPath: /data/string-in-video/repos/git/github/string-in-video
        containerPath: /data/string-in-video/repos/git/github/string-in-video
      - hostPath: /data/string-in-video/files/dev
        containerPath: /data/string-in-video/files/dev
      - hostPath: /data/string-in-video/volumes/dev
        containerPath: /data/string-in-video/volumes/dev
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all


Error from the nvidia-device-plugin-daemonset:

NVIDIA_DRIVER_ROOT=/ CONTAINER_DRIVER_ROOT=/host Starting nvidia-device-plugin I0912 18:33:00.988910 1 main.go:154] Starting FS watcher. E0912 18:33:00.988987 1 main.go:123] failed to create FS watcher: too many open files

Error on nvidia-operator-validator:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

@xussof

xussof commented Sep 12, 2023

Never mind, it was a problem with my sysctl limits, related to this topic:
NVIDIA/gpu-operator#441
The solution was to increase the maximum:
NVIDIA/gpu-operator#441 (comment)

Now it works with 2 worker nodes! Thanks
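
(For anyone else hitting the "failed to create FS watcher: too many open files" error above: the fix referenced in that issue boils down to raising the host's inotify limits; the exact values below are just commonly used examples.)

sudo sysctl -w fs.inotify.max_user_watches=524288
sudo sysctl -w fs.inotify.max_user_instances=512
# persist across reboots
echo 'fs.inotify.max_user_watches=524288' | sudo tee -a /etc/sysctl.conf
echo 'fs.inotify.max_user_instances=512'  | sudo tee -a /etc/sysctl.conf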

@cceyda

cceyda commented Nov 20, 2023

I had to downgrade the NVIDIA driver from 545 to 535 for it to work, based on the compatibility notes for the gpu-operator:
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/release-notes.html
Then I followed the tutorial in #3257 (comment). (Thank you)

@jiangxiaobin96

jiangxiaobin96 commented Dec 19, 2023

It occurred to me over the weekend that we can actually enable GPU support in kind today by leveraging a feature we added to the nvidia-container-toolkit quite some time ago:

Read list of GPU devices from volume mounts instead of NVIDIA_VISIBLE_DEVICES

It relies on passing the list of GPUs you want to inject as volume mounts rather than via an environment variable.

Steps to enable this:

  1. Add nvidia as your default runtime in /etc/docker/daemon.json
  2. Restart docker (as necessary)
  3. Set accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml
  4. Add the following to any kind nodes you want to have access to all GPUs in the system:
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all

I've tested this in my local environment and it works as expected. The only caveat is that it only works for passing all GPUs into a node (i.e. I can't pick and choose some GPUs for one node and another set of GPUs for another node).

In general the nvidia-visible-devices-as-volume-mount feature allows you to do fine-grained injection of a subset of GPUs into a container. However, kind runs all of its nodes in --privileged containers, meaning that there is no way to prevent it from seeing all GPUs (even if you tell it you only want it to see a subset of the GPUs). If / when kind ever supports running worker nodes without --privileged this restriction should go away.

Hello, I followed this comment and hit a new error:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: mount error: stat failed: /proc/driver/nvidia/capabilities: no such file or directory: unknown

How can I solve this "/proc/driver/nvidia/capabilities: no such file or directory" problem?

@xihajun

xihajun commented Dec 20, 2023

Can anyone help with this issue?

Creating cluster "new-cluster" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✓ Preparing nodes 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✗ Joining worker nodes 🚜
Deleted nodes: ["new-cluster-control-plane" "new-cluster-worker" "new-cluster-worker2"]
ERROR: failed to create cluster: failed to join node with kubeadm: command "docker exec --privileged new-cluster-worker kubeadm join --config /kind/kubeadm.conf --skip-phases=preflight --v=6" failed with error: exit status 1
Command Output: I1220 00:26:42.732281     158 join.go:412] [preflight] found NodeName empty; using OS hostname as NodeName
I1220 00:26:42.732373     158 joinconfiguration.go:76] loading configuration from "/kind/kubeadm.conf"
I1220 00:26:42.734232     158 controlplaneprepare.go:225] [download-certs] Skipping certs download
I1220 00:26:42.734258     158 join.go:529] [preflight] Discovering cluster-info
I1220 00:26:42.734283     158 token.go:80] [discovery] Created cluster-info discovery client, requesting info from "new-cluster-control-plane:6443"
I1220 00:26:42.754681     158 round_trippers.go:553] GET https://new-cluster-control-plane:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s 200 OK in 19 milliseconds

Solved it; it was out of RAM.

But now got

│ Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!                  │

@jiangxiaobin96

Can anyone help with this issue?

Creating cluster "new-cluster" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✓ Preparing nodes 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✗ Joining worker nodes 🚜
Deleted nodes: ["new-cluster-control-plane" "new-cluster-worker" "new-cluster-worker2"]
ERROR: failed to create cluster: failed to join node with kubeadm: command "docker exec --privileged new-cluster-worker kubeadm join --config /kind/kubeadm.conf --skip-phases=preflight --v=6" failed with error: exit status 1
Command Output: I1220 00:26:42.732281     158 join.go:412] [preflight] found NodeName empty; using OS hostname as NodeName
I1220 00:26:42.732373     158 joinconfiguration.go:76] loading configuration from "/kind/kubeadm.conf"
I1220 00:26:42.734232     158 controlplaneprepare.go:225] [download-certs] Skipping certs download
I1220 00:26:42.734258     158 join.go:529] [preflight] Discovering cluster-info
I1220 00:26:42.734283     158 token.go:80] [discovery] Created cluster-info discovery client, requesting info from "new-cluster-control-plane:6443"
I1220 00:26:42.754681     158 round_trippers.go:553] GET https://new-cluster-control-plane:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s 200 OK in 19 milliseconds

Solved it; it was out of RAM.

But now got

│ Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!                  │

Your CUDA version and driver version do not match.
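
(A quick way to check the mismatch, assuming nvidia-smi is available on the host: its banner prints both the driver version and the highest CUDA version that driver supports, which must be at least as new as the CUDA runtime used by the workload.)

# host: the banner shows "Driver Version: ..." and "CUDA Version: ..." (max supported by the driver)
nvidia-smi
# workload side: check the CUDA toolkit/runtime version it was built against, if nvcc is present
nvcc --version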

creatorrr added a commit to creatorrr/skypilot that referenced this pull request Jan 2, 2024
Add nvidia-driver host map to kind node. Edited `generate_kind_config.py` to add:
```yaml
  # part of GPU workaround
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
```


For more details, see: kubernetes-sigs/kind#3257 (comment)
@alexeadem

alexeadem commented Jan 30, 2024

Steps to enable this:

  1. Add nvidia as your default runtime in /etc/docker/daemon.json
    [...]

Thanks @klueska. I have a Linux server at home, but I have been running K8s training for my team using kind on their W11 laptops. I would like to dive into a Kubeflow/Notebooks-with-GPUs scenario (I know, overkill if not for the training purpose; I could have just used WSL2+CUDA). Would you be able to translate the above to a Docker Desktop on Windows environment?

The daemon config is accessible via the Docker Desktop UI, but I need some help with the nvidia-container-runtime/config.toml location on Windows. Though, as I read the NVIDIA pages on the container runtime and toolkit, I came away thinking they may not be applicable to WSL2, since:

docker run -it --rm --gpus all ubuntu nvidia-smi

seems to work after just installing CUDA support. But Kubeflow seems different, in that the NVIDIA runtime is still needed for the new containers I would be building... if I understand the container runtime concept correctly?

You may want to look at this comment on running Kubeflow, gpu-operator, and WSL2 in qbo with kind images, or directly in kind.
