-
Notifications
You must be signed in to change notification settings - Fork 201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] GPU support #655
[WIP] GPU support #655
Conversation
Hi @SubhasmitaSw. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
f777d9f
to
b12ccd7
Compare
7aca1ba
to
0ec3e07
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, lets chat via zoom about the comments.
api/v1beta1/gcpmachine_types.go
Outdated
type AcceleratorConfig struct { | ||
// Type is the type of the GPU accelerator to be used for the GCP machine. | ||
// +required | ||
Type string `json:"acceleratorType"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed having a new type for the accelerator type, and then constants with the allowed values. Like DiskType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably also use // +kubebuilder:validation:Enum
api/v1beta1/gcpmachine_types.go
Outdated
|
||
const ( | ||
// DefaultAcceleratorType is the default type of GPU acclerator to be used for the GCP machine if not specified. | ||
DefaultAcceleratorType = "nvidia-tesla-k80" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to have a default accelerator type? Or should be make sure the user chooses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we have a default accelerator type as nvidia-tesla-t4
does that affect the code while reconciling here in the code base?
Like if we make the default as nvidia-tesla-k80
here then what will be the difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say if as user is choosing to add an accelerator it should be required they say which type instead of defaulting. So marking it required and having no default.
api/v1beta1/gcpmachine_types.go
Outdated
// Count is the number of accelerators to be used for the GCP machine. | ||
// Defaults to 1. | ||
// +optional | ||
Count int64 `json:"acceleratorCount,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably use // +kubebuilder:default:=1
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few other points:
- We should set a minimum value (
// +kubebuilder:validation:Minimum
) - Do we need this to be int64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the current case the validation of accelerators during the instance creation does not allow int
to be used, further if we change the static validation function to a conditional check
during reconciliation we can use int
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a principle, our API doesn't have to match the GGCP exactly, so we have use different data types if needed (as long as we can caste without loosing data).
api/v1beta1/gcpmachine_types.go
Outdated
@@ -102,6 +135,10 @@ type GCPMachineSpec struct { | |||
// +optional | |||
AdditionalMetadata []MetadataItem `json:"additionalMetadata,omitempty"` | |||
|
|||
// BuildName is the name of the build to use for the GCP instance. | |||
// +optional | |||
BuildName string `json:"buildName,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whats the purpose of BuildName
? I think we probably don't need this.
cloud/scope/machine.go
Outdated
ClusterGetter cloud.ClusterGetter | ||
Machine *clusterv1.Machine | ||
GCPMachine *infrav1.GCPMachine | ||
AcceleratorConfig *infrav1.AcceleratorConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AcceleratorConfig
is already available via the GCPMachine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can access it by params.GCPMachine.Spec.AcceleratorConfig
cloud/scope/machine.go
Outdated
@@ -61,6 +62,15 @@ func NewMachineScope(params MachineScopeParams) (*MachineScope, error) { | |||
return nil, errors.New("gcp machine is required when creating a MachineScope") | |||
} | |||
|
|||
if params.AcceleratorConfig == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are using the // +kubebuilder:default
tags then we won't need this. If there is custom defaulting that can't be done via this tag then we can using the defaulting webhook:
cloud/scope/machine.go
Outdated
ClusterGetter cloud.ClusterGetter | ||
Machine *clusterv1.Machine | ||
GCPMachine *infrav1.GCPMachine | ||
AcceleratorConfig *infrav1.AcceleratorConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above
controllers/gcpmachine_controller.go
Outdated
} | ||
|
||
// Supported GPU types. | ||
var ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be handled via the API, see previous comment around the new type for "accelerator type"
6954d6b
to
b008844
Compare
8bb605f
to
b008844
Compare
797adcf
to
350b655
Compare
350b655
to
e623d64
Compare
I am picking this up again after the holiday break. |
// +kubebuilder:validation:Enum=TERMINATE;MIGRATE | ||
// +kubebuilder:default=MIGRATE | ||
// +optional | ||
OnHostMaintenance string `json:"onHostMaintenance,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you've created a typed string, this field should be using that typed string
OnHostMaintenance string `json:"onHostMaintenance,omitempty"` | |
OnHostMaintenance OnHostMaintenance `json:"onHostMaintenance,omitempty"` |
Also, and I'm not sure how much this project will agree, Kube API conventions would recommend that the constant values here are Terminate
and Migrate
, matching the PascalCase convention for enums, rather than matching the cloud provider values
Not sure what the maintainers prefer but thought I should mention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this @JoelSpeed...especially considering the conversation on the confidential compute PR.
Which reminds me i need to get the image-builder PR fixed/merged....
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
/remove-lifecycle rotten |
- Add AcceleratorConfig - Add Default Accelerator types - Add webhooks for the new types - Add validations for the new types - Add conversion between different API types - Add GCPMachine CRD - Add GPU Cluster template - Update package - Add cluster template for GPU instance - Add unit tests for accelerator config - Documentation for GPU enabled cluster - E2E test for GPU enabled cluster - Add standby e2e test Signed-off-by: Aniruddha Basak <codewithaniruddha@gmail.com>
d33fd97
to
244060d
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: SubhasmitaSw The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@SubhasmitaSw: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
/remove-lifecycle rotten I will pick up the image building side so that we can get this merged. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen |
@richardcase: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What type of PR is this?
/kind api-change
//kind feature
What this PR does / why we need it:
In support to accommodate the API adjustments needed to manage GPU acceleration of the GCP instances in CAPG.
Special notes for your reviewer:
This is a WIP PR, additional controller changes to be added successively.
TODOs:
Release note:
cc @richardcase @dims @cpanato