
Pre-install nvidia container runtime + drivers on GPU instances #11628

Merged — 9 commits merged into kubernetes:master on Sep 11, 2021

Conversation

@olemarkus (Member) commented on May 30, 2021:

  • install the nvidia container runtime
  • install nvidia drivers
  • make it opt-in, to avoid clashing with those who prebake images or use the GPU operator or similar (see the sketch below)
  • add the device plugin as an addon
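
For reference, a minimal sketch of what the opt-in API surface might look like. Type and field names here are illustrative guesses, not necessarily what this PR adds; as discussed later in the thread, the setting ends up nested under the containerd config.

```go
package kops

// NvidiaGPUConfig is an illustrative sketch of the opt-in knob described
// above; the real kOps field names may differ.
type NvidiaGPUConfig struct {
	// Enabled turns on installation of the nvidia container runtime,
	// drivers, and device plugin addon on GPU instances.
	Enabled *bool `json:"enabled,omitempty"`
	// DriverPackage optionally overrides the driver package that is
	// installed (hypothetical field, for illustration only).
	DriverPackage string `json:"driverPackage,omitempty"`
}
```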

@k8s-ci-robot added the do-not-merge/work-in-progress, cncf-cla: yes, size/L, and area/nodeup labels on May 30, 2021
@olemarkus (Member, Author):

Worth mentioning that we override (or at least modify) the containerd configuration override. kOps only knows whether the instance will have GPUs during nodeup, while cloudup sets the config override to the kOps default config by default (hence the containerd config override is always set to the assumed "final" config).

@olemarkus force-pushed the gpu-runtime branch 2 times, most recently from 9798157 to 57fc757, on May 30, 2021 09:36
@k8s-ci-robot added the size/XL label and removed the size/L label on May 30, 2021
@olemarkus requested a review from hakman on May 30, 2021 09:43
@k8s-ci-robot added the needs-rebase label on May 31, 2021
@hakman (Member) commented on Jun 9, 2021:

The idea is good in general; we should make this easy for people.

In particular, here are some of my thoughts:

  • I think the vision for nodeup is to become dumb, as dumb as possible. This is why we moved the containerd logic out of it. I think you can call it defensive programming, but it's probably a longer topic that would be good for tomorrow's office hours.
  • Maybe add this as a new option of the container runtime, allowing the runtime to be overridden to something other than runc. Describing instance types on every node just for this purpose seems wasteful. It can also be done much earlier, when generating the cloud model.
  • The APT source task might be better as an APT key task plus a separate file task for the repo file.
  • The APT list is focused on Ubuntu; there is a different list for debian10, for example.
  • The APT key should be sent as a file asset URL, same as the containerd package.
  • The containerd config is currently done outside nodeup. The reason it didn't change was to avoid moving logic into nodeup. Again, a topic for office hours. Either it should move completely into nodeup, or it should be managed outside it as we do now.

@olemarkus (Member, Author):

* I think the vision for nodeup is to become dumb, as dumb as possible. This is why we moved the containerd logic out of it. I think you can call it defensive programming, but it's probably a longer topic that would be good for tomorrow's office hours.

* Maybe add this as a new option of the container runtime, allowing the runtime to be overridden to something other than runc. Describing instance types on every node just for this purpose seems wasteful. It can also be done much earlier, when generating the cloud model.

The above has a challenge if someone creates a mixed instance group with both GPU and non-GPU instance types. I am not sure why anyone would do that, but it is possible, and if the nvidia runtime is used on a non-GPU instance, it will break.

That being said, I guess we could do some hardware probing instead. It just seemed more reliable to describe the instance type.

* The APT source task might be better as an APT key task plus a separate file task for the repo file.

* The APT list is focused on Ubuntu; there is a different list for debian10, for example.

Yeah, I am aware. I plan on adding other distros as needed and, in the interim, solving this with some distro checks and documentation.

* The APT key should be sent as a file asset URL, same as the containerd package.

Thanks. I didn't like the way I did this one. A file asset URL is a much better idea.

* The containerd config is currently done outside nodeup. The reason it didn't change was to avoid moving logic into nodeup. Again, a topic for office hours. Either it should move completely into nodeup, or it should be managed outside it as we do now.

If we find a way to do this in cloudup, I'd also like to see if the final config can be put in the config aux. I don't like using the override flag as a carry mechanism from cloudup to nodeup.

@hakman (Member) commented on Jun 9, 2021:

* I think the vision for nodeup is to become dumb, as dumb as possible. This is why we moved the containerd logic out of it. I think you can call it defensive programming, but it's probably a longer topic that would be good for tomorrow's office hours.

* Maybe add this as a new option of the container runtime, allowing the runtime to be overridden to something other than runc. Describing instance types on every node just for this purpose seems wasteful. It can also be done much earlier, when generating the cloud model.

The above has a challenge if someone creates a mixed instance group with both GPU and non-GPU instance types. I am not sure why anyone would do that, but it is possible, and if the nvidia runtime is used on a non-GPU instance, it will break.

That being said, I guess we could do some hardware probing instead. It just seemed more reliable to describe the instance type.

No need for that; we have validation. If we can validate that an instance group does not mix ARM and AMD instance types, we can validate GPU instances the same way.
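
For illustration only, a rough sketch of that kind of instance group check; the helper and package names are hypothetical, not actual kOps code:

```go
package validation

import "fmt"

// validateNoMixedGPU is a hypothetical sketch: reject instance groups that
// mix GPU and non-GPU instance types, mirroring the existing mixed
// ARM/AMD architecture validation. hasGPU would be backed by instance-type
// data (for example from DescribeInstanceTypes).
func validateNoMixedGPU(instanceTypes []string, hasGPU func(string) bool) error {
	gpuCount := 0
	for _, it := range instanceTypes {
		if hasGPU(it) {
			gpuCount++
		}
	}
	if gpuCount != 0 && gpuCount != len(instanceTypes) {
		return fmt.Errorf("instance group mixes GPU and non-GPU instance types")
	}
	return nil
}
```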

* The APT source task might be better as an APT key task plus a separate file task for the repo file.

* The APT list is focused on Ubuntu; there is a different list for debian10, for example.

Yeah, I am aware. I plan on adding other distros as needed and, in the interim, solving this with some distro checks and documentation.

I meant that the code checks for the Debian family instead of specifically for Ubuntu.

* The APT key should be sent as a file asset URL, same as the containerd package.

Thanks. I didn't like the way I did this one. A file asset URL is a much better idea.

👍

* The containerd config is currently done outside nodeup. The reason it didn't change was to avoid moving logic into nodeup. Again, a topic for office hours. Either it should move completely into nodeup, or it should be managed outside it as we do now.

If we find a way to do this in cloudup, I'd also like to see if the final config can be put in the config aux. I don't like using the override flag as a carry mechanism from cloudup to nodeup.

I agree, but to keep this on track we can do it later, when the config aux is merged. I agree that aux would be a better place for it.

@olemarkus (Member, Author):

So if we don't allow mixing GPU and non-GPU instance types, building the config in cloudup should be fine.
I would need to send some setting to nodeup telling it to do those package installs. Again, this seems like something worth putting in the config aux.

@hakman (Member) commented on Jun 9, 2021:

A prerequisite would be to allow configuring containerd per IG; I'm not sure that's possible now. Once it is, the package installs can just be done in the ContainerRuntime builder. For now, package installs would still be hardcoded based on OS type, I guess.
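
As a rough illustration of what "hardcoded based on OS type" could look like in the builder; the helper, signature, and package names are hypothetical, not from this PR:

```go
package containerd

import "fmt"

// nvidiaPackages is a hypothetical helper: pick the nvidia driver and
// container-runtime packages to install based on the node's OS family.
func nvidiaPackages(osFamily string) ([]string, error) {
	switch osFamily {
	case "ubuntu":
		// Illustrative package names; real ones would likely be configurable.
		return []string{"nvidia-headless-460-server", "nvidia-container-runtime"}, nil
	default:
		return nil, fmt.Errorf("nvidia packages not yet handled for %s", osFamily)
	}
}
```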

@olemarkus (Member, Author):

Right. I think that is a challenge with how the nodeup config is built. The containerd config should probably be built from the IG and cluster spec and then written to one of the configs, rather than being written back to the IG/cluster spec. But we are a long way away from that.

I think continuing on the current path makes sense for now. Nothing here is exposed to the user, so things can be moved later on (maybe even before 1.22 GA).

@k8s-ci-robot k8s-ci-robot added area/api and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 29, 2021
@olemarkus (Member, Author):

I didn't do many of the APT changes. Using assets was challenging, as they are arch-dependent, which doesn't make sense in this case. It could also run into runtime issues if we use hashes, since upstream can change keys, and it is the TLS cert we trust here anyway.

@olemarkus (Member, Author):

I'll defer the device plugin addon to another PR.

@olemarkus changed the title from "[WIP] Pre-install nvidia container runtime + drivers on GPU instances" to "Pre-install nvidia container runtime + drivers on GPU instances" on Jun 29, 2021
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Jun 29, 2021
@johngmyers (Member):

Determining the arch of an instance and whether or not it has a GPU appears to be firmly in the bailiwick of nodeup. I don't see why there's an objection to putting such logic in there. Making the admin configure this through the API is just causing them grief.

I do not understand or agree with this desire to make nodeup dumb. It should do the things it is well suited for and not do the things it is not well suited for.

@hakman (Member) commented on Jun 30, 2021:

GPU support is something that very few operators need or use. Checking every node in every cluster for that capability just for this is not exactly ideal. Moreover, this has to work on the various supported platforms, not just AWS.

@johngmyers (Member):

Can't nodeup do the equivalent of lspci or some such?
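
For what it's worth, a hardware probe along those lines could be as simple as scanning sysfs for NVIDIA's PCI vendor ID; a sketch, not code from this PR:

```go
package gpu

import (
	"os"
	"path/filepath"
	"strings"
)

// hasNvidiaPCIDevice is a sketch of the lspci-style probe suggested above:
// scan /sys/bus/pci/devices and look for NVIDIA's PCI vendor ID (0x10de).
func hasNvidiaPCIDevice() (bool, error) {
	vendorFiles, err := filepath.Glob("/sys/bus/pci/devices/*/vendor")
	if err != nil {
		return false, err
	}
	for _, path := range vendorFiles {
		b, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		if strings.TrimSpace(string(b)) == "0x10de" {
			return true, nil
		}
	}
	return false, nil
}
```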

ig.Spec.NodeLabels = make(map[string]string)
}
ig.Spec.NodeLabels["kops.k8s.io/gpu"] = "1"
ig.Spec.Taints = append(ig.Spec.Taints, "nvidia.com/gpu:NoSchedule")
Review comment (Member) on the diff above:

I think this is a behavioral change? i.e. workloads that previously ran on GPU instances will need to add this toleration to remain schedulable?

Or is there a webhook or similar that will add this automatically?
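
For context, a workload tolerating that taint would need something along these lines; a sketch using the upstream corev1 types, not code from this PR:

```go
package example

import corev1 "k8s.io/api/core/v1"

// gpuToleration matches the "nvidia.com/gpu:NoSchedule" taint that this PR
// adds to GPU instance groups.
var gpuToleration = corev1.Toleration{
	Key:      "nvidia.com/gpu",
	Operator: corev1.TolerationOpExists,
	Effect:   corev1.TaintEffectNoSchedule,
}
```

Pods would carry this in spec.tolerations, typically alongside an nvidia.com/gpu resource request.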

@olemarkus (Member, Author) replied:

Good catch. I think this one should additionally be gated by NvidiaGPU.enabled, so that one has explicitly opted in to the kOps way of doing things. I'll add docs to match as well.

@@ -765,6 +765,7 @@ func addNodeupPermissions(p *Policy, enableHookSupport bool) {
addASLifecyclePolicies(p, enableHookSupport)
p.unconditionalAction.Insert(
"ec2:DescribeInstances", // aws.go
"ec2:DescribeInstanceTypes",
Review comment (Member) on the diff above:

I was originally a little sad that we couldn't get this from the metadata or DescribeInstances. But I came around to your point of view: instance types are generic data that won't really be sensitive, and if there's one permission we should get rid of, it's probably DescribeInstances :-)
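
For illustration, checking GPU capability via that permission might look roughly like this with aws-sdk-go v1; a sketch, not necessarily how this PR implements it:

```go
package gpu

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// instanceTypeHasGPU reports whether DescribeInstanceTypes lists GPU
// hardware for the given instance type.
func instanceTypeHasGPU(svc ec2iface.EC2API, instanceType string) (bool, error) {
	out, err := svc.DescribeInstanceTypes(&ec2.DescribeInstanceTypesInput{
		InstanceTypes: []*string{aws.String(instanceType)},
	})
	if err != nil {
		return false, err
	}
	for _, it := range out.InstanceTypes {
		if it.GpuInfo != nil && len(it.GpuInfo.Gpus) > 0 {
			return true, nil
		}
	}
	return false, nil
}
```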

@justinsb (Member) commented on Sep 5, 2021:

I think this looks great, and as we agreed in office hours we can nest it under containerd.

My one remaining concern is the automatic taint: I'm worried that we'll break upgrading clusters that use GPU instances until users add that toleration. Or is there some other mechanism at play here, e.g. everyone is already using this toleration, or there's a webhook that adds it?

@olemarkus (Member, Author):

These concerns should be addressed now.

@justinsb (Member):

Thanks @olemarkus

This lgtm now and we did agree on this at our last discussion; I'm going to approve but hold until after office hours.

/approve
/lgtm
/hold

@k8s-ci-robot added the do-not-merge/hold label on Sep 10, 2021
@k8s-ci-robot added the lgtm label on Sep 10, 2021
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: justinsb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Sep 10, 2021
@hakman (Member) commented on Sep 10, 2021:

/lgtm

@olemarkus (Member, Author):

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Sep 11, 2021
@k8s-ci-robot merged commit 1b431b4 into kubernetes:master on Sep 11, 2021
@k8s-ci-robot modified the milestones: v1.22, v1.23 on Sep 11, 2021
k8s-ci-robot added a commit that referenced this pull request on Sep 16, 2021: Automated cherry pick of #11628: Add nvidia configuration to the api (…628-origin-release-1.22)
Labels: approved, area/addons, area/api, area/documentation, area/nodeup, area/provider/aws, cncf-cla: yes, lgtm, size/XXL