Add API for CDI --devices flag in Docker and Podman for mapping GPUs #3290

lukeogg · 2023-06-28T19:49:50Z

This PR adds support for passing GPU parameters to Nvidia Container Toolkit through the CDI specification. Although there is a way to map all GPUs to a single node with device mounts, more granularity is desired. This PR will support mapping various device combinations to different nodes as needed.

Would resolve Issue 3164

Adds API for devices with a list of strings in the format specified in the CDI specification.

  devices:
  - "nvidia.com/gpu=0"
  - "nvidia.com/gpu=1"

Uses the CDI validation package to validate values passed in the devices API.
Supports both docker and podman
This relies on the upstream PR for Docker being merged and released - presumably in Docker v25.
Checks the docker version if devices have been specified.
Add documentation for mapping GPUs.

All GPUs mapped to a single control-plane:

{{< codeFromInline lang="yaml" >}}
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  devices:
  - "nvidia.com/gpu=all"

Specific GPUs mapped to specific worker nodes based on index:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  devices:
  - "nvidia.com/gpu=0"
- role: worker
  devices:
  - "nvidia.com/gpu=1"

k8s-ci-robot · 2023-06-28T19:49:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lukeogg
Once this PR has been reviewed and has the lgtm label, please assign bentheelder for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2023-06-28T19:49:59Z

Hi @lukeogg. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

go.mod

stmcginnis

/ok-to-test

Looks like this should be fine to podman, and I think the docker version handling makes sense.

Some nice markdown cleanup too. ;)

pkg/internal/apis/config/validate_test.go

elezar

LGTM.

Some minor comments on the CDI device and documentation side of things.

pkg/internal/apis/config/types.go

elezar · 2023-07-04T13:00:02Z

pkg/internal/apis/config/validate.go

@@ -192,6 +196,17 @@ func validatePortMappings(portMappings []PortMapping) error {
 	return nil
 }

+func validateDevices(devices []string) error {
+	for _, device := range devices {
+		device := strings.TrimSpace(device)


You should be able to use "github.com/container-orchestrated-devices/container-device-interface/pkg/parser" and call parser.IsQualifiedName(device) here. This is used in docker/cli#4084 and significantly reduces the dependencies pulled in.

It should also be fine to allow podman and / or docker to perform this validation though.

I'll give that a try.

Please update to what Evan suggests to reduce the dependencies pulled in.

@BenTheElder What do you think?

@klueska I am using parser.ParseQualifiedName(device) instead of parser.ParseQualifiedName(device). The results are the same except you get the additional error message about why your device string is invalid. I think this is preferable. ParseQualifiedName() is called in IsQualifiedName().

This doesn't change the dependencies at all. There are tests on github.com/container-orchestrated-devices/container-device-interface/pkg/parser pulling in the additional packages. Maybe we should put the tests in a different package? @elezar wdyt?

I can update the tests in that package to use a different import for comparison or to be in the parser_test package if require. As a matter of interest, which package that is equivalent to require would be preferred?

I don't personally have a preference - I haven't found any with fewer dependencies. I was thinking use the parser_test package to keep things very lightweight.

I have created cncf-tags/container-device-interface#149 with an update to the test package. This should prevent github.com/stretchr/testify/require from getting pulled in.

Sorry, a few things came up back in the real world and here with #3277 etc., the dependency discussion wound up forked in https://github.com/kubernetes-sigs/kind/pull/3290/files#r1304873501

I'm going to try to get #3335 after which when this PR is rebased and go mod tidy it should be clearer what we're actually talking about for the new dependencies.

See linked comment for stance on deps.

Currently for example we have a lot of CRI inspired types with in-tree dependency-free code instead, to avoid creating dep hell for our users. That may not be so reasonable here.

site/content/docs/user/configuration.md

elezar · 2023-07-04T13:06:13Z

site/content/docs/user/configuration.md

+    containerPath: /var/run/nvidia-container-devices/all
+{{< /codeFromInline >}}
+
+Note: this method only support adding `all` GPUs to a single node. If you want to add specific GPUs to specific nodes, you will need to use the `devices` API.


I don't know whether this is strictly true. Using /var/run/nvidia-container-devices/0 or /var/run/nvidia-container-devices/{{DEVICE_UUID}} should allow individual devices to be made available.

cc @klueska

As mentioned in #3257 (comment), we can't achieve any finer granularity than all (regardless of whether we do volume mounts or use CDI) because the node's container is started with --privileged. Even if you tell it to selectively inject just a single GPU, the node will be able to see all of them because --privileged pulls in all device nodes under /dev.

Doing some testing today and this does allow access to all GPUs. I'll update the documentation. It is fairly convenient with Kind, however, as you don't need to inject drivers or install them on the node images.

klueska

In general, this seems reasonable. The only real issue I see is the limitation around not being able to inject individual GPUs because of the --privileged flag.

I'm assuming this will trip up many users thinking they should be able to selectively give different nodes access to different GPUs (based on their CDI device name), but then suddenly all nodes have access to all GPUs.

klueska · 2023-07-17T20:36:36Z

pkg/cluster/internal/providers/docker/provision.go

+	// Append CDI device args (used for GPU support)
+	if len(node.CDIDevices) > 0 {
+		// Check for docker > 25
+		ver := Version()
+		if ver != "dev" || strings.Split(ver, ".")[0] < "25" {
+			return nil, errors.Errorf("using devices api in kind requires Docker >= v25, but found %q", ver)
+		}
+
+		// Append args for each device
+		for _, device := range node.CDIDevices {
+			args = append(args, "--device", strings.TrimSpace(device))
+		}
+	}
+


Will also allow passing "standard" devices via the CDIDevices setting? It looks like there is a new validate() call that will prevent this though. Is the validate call made sometime before this is done here?

Yes, validation is called as part of the initial passing for the node configuration. parser.ParseQualifiedName(device) as part of this.

klueska · 2023-07-17T20:39:54Z

pkg/internal/apis/config/validate.go

@@ -192,6 +196,17 @@ func validatePortMappings(portMappings []PortMapping) error {
 	return nil
 }

+func validateDevices(devices []string) error {
+	for _, device := range devices {
+		device := strings.TrimSpace(device)


Please update to what Evan suggests to reduce the dependencies pulled in.

klueska · 2023-07-17T20:43:41Z

site/content/docs/user/configuration.md

+- role: control-plane
+- role: worker
+  devices:
+  - "nvidia.com/gpu=0"


Note that this will not give you access to just GPU 0, but rather all GPUs on the system.

This is because --privileged is passed to the docker call that creates the node, thus giving it access to all devices under /dev (including all GPU devices).

Individual GPU injection will only be supported once if/when kind is able to start nodes without --privileged.

Ok, this reduces the value of adding an API for CDI devices in my opinion.

To be clear, this is the same for all other mechanisms too. The --privileged flag will mean that any container will have access to all devices and the only value add is that the driver libraries are also injected.

Ok, this reduces the value of adding an API for CDI devices in my opinion.

This isn't true. The use of the device mounts uses our "legacy" stack which has a number of shortcomings that using CDI addresses. It also means that users can access devices from other vendors such as Intel or their own device / resource definitions.

the only value add is that the driver libraries are also injected.

I don't supposed we can just trigger driver library injection directly in a simple way and reduce the API surface to "yes I'd like the GPU drivers"?

I spent today putting this together:
https://github.com/klueska/kind-with-gpus-examples

It should be a good starting point for using GPUs with kind independent of this PR

I've also pinged the docker maintainers that helped us push CDI through to see what their thoughts are here:
https://dockercommunity.slack.com/archives/C04MZQZJE94/p1712086101448089

Was pointed to this on that thread:
moby/moby#39702 (comment)

If moby/moby#47663 becomes available we'll have to version detect to use it to avoid breaking all of the users on older docker installs, and I'll be pretty wary of the branching codepath behaving differently vs --privileged.

We have only once required a newer docker version (when we started to require cgroupns=private be available) which caused a lot of consternation even though that change was aimed at mitigating the runc misc issues that caused kind to very flaky/broken ...

It would've been nice when kind was being built, but now we have to consider that people undoubtedly have test environments that depend on being in a --privileged node :/

klueska · 2023-07-17T20:44:17Z

site/content/docs/user/configuration.md

+    containerPath: /var/run/nvidia-container-devices/all
+{{< /codeFromInline >}}
+
+Note: this method only support adding `all` GPUs to a single node. If you want to add specific GPUs to specific nodes, you will need to use the `devices` API.


This limitation of all is true even in the CDI case.

Ok, this reduces the value of adding an API for CDI devices in my opinion.

Same here.

k8s-ci-robot · 2023-07-17T20:59:05Z

@lukeogg: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kind-verify	`d2a13ba`	link	true	`/test pull-kind-verify`
pull-kind-conformance-parallel-dual-stack-ipv4-ipv6	`d2a13ba`	link	true	`/test pull-kind-conformance-parallel-dual-stack-ipv4-ipv6`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

elezar

Thanks @lukeogg. Some minor comments / suggestions from my side.

elezar · 2023-08-04T14:14:15Z

pkg/cluster/internal/providers/docker/provision.go

+		if ver != "dev" || strings.Split(ver, ".")[0] < "25" {
+			return nil, errors.Errorf("using devices api in kind requires Docker >= v25, but found %q", ver)
+		}


Note: This also requires that the Daemon be configured in experimental mode.

What about dropping this check entirely and relying on the error reported by the Docker daemon. Should be something along the lines of "no driver for "cdi"". Or is this UX to confusing?

I prefer this approach, rather than make kind to have awareness of the provider versions

elezar · 2023-08-04T14:18:11Z

pkg/cluster/internal/providers/docker/util.go

@@ -33,6 +33,16 @@ func IsAvailable() bool {
 	return strings.HasPrefix(lines[0], "Docker version")
 }

+// Version gets the version of docker available on the system
+func Version() string {
+	cmd := exec.Command("docker", "version", "--format", "'{{.Server.Version}}'")


If we use docker info we can also extract information as to whether experimental is enabled.

Does changing this function to assertCDISupported() or something similar make sense here? We could then check the combination of server version and experimental to see whether it is supported.

elezar · 2023-08-04T14:19:53Z

pkg/cluster/internal/providers/podman/provision.go

@@ -212,6 +212,13 @@ func runArgsForNode(node *config.Node, clusterIPFamily config.ClusterIPFamily, n
 		args...,
 	)

+	// Append CDI device args (used for GPU support)
+	if len(node.CDIDevices) > 0 {
+		for _, device := range node.CDIDevices {


Is a compatibility check similar to that for docker also required here? CDI support was only really added with Podman 4.1.0. As I've stated before, I think it's also fine to rely on the error returned by the CLI itself, but that may not be the nicest UX.

elezar · 2023-08-04T14:23:35Z

pkg/cluster/internal/providers/podman/provision.go

+	// Append CDI device args (used for GPU support)
+	if len(node.CDIDevices) > 0 {
+		for _, device := range node.CDIDevices {
+			args = append(args, "--device", strings.TrimSpace(device))


Instead of having to remember to call strings.TrimSpace whereever the elements of node.CDIDevices are used, can we do that somewhere once?

elezar · 2023-08-04T14:24:31Z

site/content/docs/user/configuration.md

+
+As a pre-requisite you install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host.
+
+Using `devices` for GPU support requires Docker v25 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For Nvidia GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support)


Suggested change

Using `devices` for GPU support requires Docker v25 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For Nvidia GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support)

Using `devices` for GPU support requires Docker v25 or Podman v4.1.0 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For NVIDIA GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support)

elezar · 2023-08-04T14:25:36Z

site/content/docs/user/configuration.md

+
+Using `devices` for GPU support requires Docker v25 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For Nvidia GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support)
+
+GPU devices can be mapped to Kind node copntainers with the devices API:


Question: Is it Kind or KinD?

KIND or kind. #3290 (comment)

elezar · 2023-08-04T14:26:11Z

site/content/docs/user/configuration.md

+
+Steps to enable this:
+
+1. Add nvidia as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`


Suggested change

1. Add nvidia as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`

1. Add `nvidia` as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`

elezar · 2023-08-04T14:29:44Z

site/content/docs/user/configuration.md

+- role: control-plane
+- role: worker
+  devices:
+  - "nvidia.com/gpu=0"


Ok, this reduces the value of adding an API for CDI devices in my opinion.

This isn't true. The use of the device mounts uses our "legacy" stack which has a number of shortcomings that using CDI addresses. It also means that users can access devices from other vendors such as Intel or their own device / resource definitions.

elezar · 2023-08-04T14:30:04Z

site/content/docs/user/configuration.md

 worker or control-plane (in HA mode),
-KIND runs `kubeadm join` which can be configured using the 
+KIND runs `kubeadm join` which can be configured using the


Kind vs KinD vs KIND?

The docs consistently use "KIND" or "kind" but not "KinD". We're not using the acronym anymore as it may be podman or in the future something else. Generally just kind like the CLI, but sometimes "KIND" or kind (code formatted) is less ambiguously the tool vs the english word.

We should put a note in https://kind.sigs.k8s.io/docs/contributing/development/#documentation

BenTheElder · 2023-08-24T21:13:39Z

go.mod

@@ -5,6 +5,7 @@ go 1.16
 require (
 	github.com/BurntSushi/toml v1.0.0
 	github.com/alessio/shellescape v1.4.1
+	github.com/container-orchestrated-devices/container-device-interface v0.6.0


regarding dependency management, I'm working on #3335 to switch the modules to 1.17+, which will enable https://go.dev/ref/mod#graph-pruning and make the transitive dependencies clear.

just glancing at the go.sum changes, there's a concerningly large set of deps, but we can't tell with pre 1.17 go.mod files what is actually in the binary.

It remains important that kind the go libraries and CLI are trivially embeddable with minimal dependencies which also must meet kubernetes / CNCF's (allowed licenses) dependency standards. We have a number of users using kind directly in test tools etc and we don't want to bring in more dependencies.

https://github.com/cncf-tags/container-device-interface/blob/main/go.mod has a lot of deps for what I would expect to just be a spec/parser.

lukeogg · 2023-08-30T16:21:06Z

I will get back to this in the next week or so. Been out for a bit.

k8s-ci-robot · 2024-04-02T06:34:43Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Add api for Docker --devices flag

0486254

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 28, 2023

k8s-ci-robot requested a review from aojea June 28, 2023 19:49

k8s-ci-robot added the area/provider/docker Issues or PRs related to docker label Jun 28, 2023

k8s-ci-robot requested a review from neolit123 June 28, 2023 19:49

k8s-ci-robot added area/provider/podman Issues or PRs related to podman needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 28, 2023

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 28, 2023

BenTheElder reviewed Jun 29, 2023

View reviewed changes

go.mod Outdated Show resolved Hide resolved

lukeogg added 2 commits June 29, 2023 14:52

fix: remove dependancy on pkg for validation

5ff36e4

fix: remove device string validation test

830a0dd

stmcginnis reviewed Jun 30, 2023

View reviewed changes

pkg/internal/apis/config/validate_test.go Outdated Show resolved Hide resolved

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 30, 2023

elezar reviewed Jul 4, 2023

View reviewed changes

lukeogg added 4 commits July 17, 2023 13:41

feat: Add in validation based on CDI package

0d20235

fix: Change name of Devices to CDIDevices

6e4314d

fix: add "dev" Docker version for testsing

b125982

docs: A few updates to devices doc

d2a13ba

lukeogg requested review from BenTheElder, elezar, stmcginnis and klueska July 17, 2023 20:32

klueska reviewed Jul 17, 2023

View reviewed changes

elezar mentioned this pull request Aug 4, 2023

Rename test package to parser_test cncf-tags/container-device-interface#149

Closed

elezar reviewed Aug 4, 2023

View reviewed changes

BenTheElder reviewed Aug 24, 2023

View reviewed changes

elezar mentioned this pull request Sep 21, 2023

REQUEST: New membership for elezar kubernetes/org#4472

Closed

9 tasks

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2024

klueska mentioned this pull request Apr 2, 2024

new security-opt: privileged-without-host-devices (for safe DinD with Kata) moby/moby#39702

Closed


		As a pre-requisite you install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host.

		Using `devices` for GPU support requires Docker v25 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For Nvidia GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support)


		Using `devices` for GPU support requires Docker v25 or later. A [CDI specification](https://github.com/container-orchestrated-devices/container-device-interface) will need to be generated for your device. For Nvidia GPU devices see notes [here.](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-device-interface-cdi-support)

		GPU devices can be mapped to Kind node copntainers with the devices API:


		Steps to enable this:

		1. Add nvidia as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`

	1. Add nvidia as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`
	1. Add `nvidia` as your default runtime in `/etc/docker/daemon.json` If you have the [NVIDIA Container Toolkit installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) this can be done with: `sudo nvidia-ctk runtime configure --runtime=docker --set-as-default`

Add API for CDI --devices flag in Docker and Podman for mapping GPUs #3290

Are you sure you want to change the base?

Add API for CDI --devices flag in Docker and Podman for mapping GPUs #3290

Conversation

lukeogg commented Jun 28, 2023

k8s-ci-robot commented Jun 28, 2023

k8s-ci-robot commented Jun 28, 2023

stmcginnis left a comment

Choose a reason for hiding this comment

elezar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

klueska left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lukeogg Jul 24, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lukeogg Jul 24, 2023 • edited Loading

Choose a reason for hiding this comment

k8s-ci-robot commented Jul 17, 2023 • edited Loading

elezar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BenTheElder Aug 24, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lukeogg commented Aug 30, 2023

k8s-ci-robot commented Apr 2, 2024

lukeogg Jul 24, 2023 •

edited

Loading

lukeogg Jul 24, 2023 •

edited

Loading

k8s-ci-robot commented Jul 17, 2023 •

edited

Loading

BenTheElder Aug 24, 2023 •

edited

Loading