[WIP] GPU support #655

SubhasmitaSw · 2022-07-14T06:54:52Z

What type of PR is this?
/kind api-change
/kind feature

What this PR does / why we need it:
Adds the API changes needed to manage GPU acceleration for GCP instances in CAPG.

Special notes for your reviewer:
This is a WIP PR; additional controller changes will be added in follow-up commits.

TODOs:

squashed commits
includes documentation
adds unit tests

Release note:

NONE

cc @richardcase @dims @cpanato

k8s-ci-robot · 2022-07-14T06:55:00Z

Hi @SubhasmitaSw. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

api/v1beta1/gcpmachine_types.go

controllers/gcpmachine_controller.go

richardcase

Great work, let's chat via Zoom about the comments.

richardcase · 2022-07-19T13:35:54Z

api/v1beta1/gcpmachine_types.go

+type AcceleratorConfig struct {
+	// Type is the type of the GPU accelerator to be used for the GCP machine.
+	// +required
+	Type string `json:"acceleratorType"`


We discussed having a new type for the accelerator type, and then constants with the allowed values. Like DiskType.

We should probably also use // +kubebuilder:validation:Enum
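A minimal sketch of the typed-string approach suggested above, not the code that landed in the PR. The type name `AcceleratorType`, the constant names, and the `IsSupported` helper are hypothetical; the two enum values are simply the accelerator models mentioned elsewhere in this thread, and the real list would need to come from what GCP offers.

```go
package main

import "fmt"

// AcceleratorType is a hypothetical typed string for GPU accelerator models,
// following the same pattern the review attributes to DiskType.
// The enum marker below restricts the CRD schema to the listed values.
// +kubebuilder:validation:Enum=nvidia-tesla-k80;nvidia-tesla-t4
type AcceleratorType string

const (
	// Illustrative constants; the names are assumptions.
	AcceleratorTypeNvidiaTeslaK80 AcceleratorType = "nvidia-tesla-k80"
	AcceleratorTypeNvidiaTeslaT4  AcceleratorType = "nvidia-tesla-t4"
)

// AcceleratorConfig uses the typed string instead of a bare string.
type AcceleratorConfig struct {
	// Type is the type of the GPU accelerator to be used for the GCP machine.
	// +required
	Type AcceleratorType `json:"acceleratorType"`
}

// IsSupported mirrors the enum check in Go, e.g. for unit tests or a webhook.
func IsSupported(t AcceleratorType) bool {
	switch t {
	case AcceleratorTypeNvidiaTeslaK80, AcceleratorTypeNvidiaTeslaT4:
		return true
	}
	return false
}

func main() {
	cfg := AcceleratorConfig{Type: AcceleratorTypeNvidiaTeslaT4}
	fmt.Println(cfg.Type, IsSupported(cfg.Type))
	fmt.Println(IsSupported("made-up-gpu"))
}
```

With the marker in place, the API server would reject unlisted values at admission time, so the controller would not need its own list of supported GPU types.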

richardcase · 2022-07-19T13:38:30Z

api/v1beta1/gcpmachine_types.go

+
+const (
+	// DefaultAcceleratorType is the default type of GPU acclerator to be used for the GCP machine if not specified.
+	DefaultAcceleratorType = "nvidia-tesla-k80"


Do we want to have a default accelerator type? Or should we make sure the user chooses?

Since we have nvidia-tesla-t4 as a default accelerator type, does that affect the reconciliation code here in the code base?
For example, if we made the default nvidia-tesla-k80 here, what would the difference be?

I would say that if a user chooses to add an accelerator, they should be required to say which type rather than relying on a default. So mark it required and have no default.

richardcase · 2022-07-19T13:38:48Z

api/v1beta1/gcpmachine_types.go

+	// Count is the number of accelerators to be used for the GCP machine.
+	// Defaults to 1.
+	// +optional
+	Count int64 `json:"acceleratorCount,omitempty"`


We can probably use // +kubebuilder:default:=1 here.

A few other points:

We should set a minimum value (// +kubebuilder:validation:Minimum)

Do we need this to be int64?

In the current code, the accelerator validation during instance creation does not allow int to be used. If we change the static validation function to a conditional check during reconciliation, we can use int.

As a principle, our API doesn't have to match the GCP API exactly, so we can use different data types if needed (as long as we can cast without losing data).
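The points about defaulting, a minimum value, and casting can be sketched together as follows. This is an illustration under assumptions: `int32` is one possible narrower type (the thread does not settle on one), `toGCPCount` is a hypothetical helper, and the GCP client is assumed to expect an `int64` count.

```go
package main

import "fmt"

// AcceleratorConfig shows only the Count field for this illustration.
type AcceleratorConfig struct {
	// Count is the number of accelerators to be used for the GCP machine.
	// +kubebuilder:default:=1
	// +kubebuilder:validation:Minimum=1
	// +optional
	Count int32 `json:"acceleratorCount,omitempty"`
}

// toGCPCount (hypothetical) widens the API value to int64 for the cloud call.
// Widening int32 to int64 never loses data, which is the condition raised
// in the review for using a different type than the provider API.
func toGCPCount(c AcceleratorConfig) (int64, error) {
	if c.Count < 1 {
		return 0, fmt.Errorf("accelerator count must be at least 1, got %d", c.Count)
	}
	return int64(c.Count), nil
}

func main() {
	n, err := toGCPCount(AcceleratorConfig{Count: 2})
	fmt.Println(n, err)

	// A zero count only reaches this point if schema defaulting did not run.
	_, err = toGCPCount(AcceleratorConfig{})
	fmt.Println(err)
}
```

The runtime check duplicates the schema minimum on purpose: the markers only take effect once the CRD is regenerated and installed, so objects built in unit tests bypass them.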

richardcase · 2022-07-19T13:45:35Z

api/v1beta1/gcpmachine_types.go

@@ -102,6 +135,10 @@ type GCPMachineSpec struct {
 	// +optional
 	AdditionalMetadata []MetadataItem `json:"additionalMetadata,omitempty"`

+	// BuildName is the name of the build to use for the GCP instance.
+	// +optional
+	BuildName string `json:"buildName,omitempty"`


What's the purpose of BuildName? I think we probably don't need this.

richardcase · 2022-07-19T13:46:32Z

cloud/scope/machine.go

+	ClusterGetter     cloud.ClusterGetter
+	Machine           *clusterv1.Machine
+	GCPMachine        *infrav1.GCPMachine
+	AcceleratorConfig *infrav1.AcceleratorConfig


The AcceleratorConfig is already available via the GCPMachine.

Yes, we can access it via params.GCPMachine.Spec.AcceleratorConfig.
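A rough sketch of what dropping the extra field could look like. The types below are simplified stand-ins for the `infrav1` and scope types, and the `AcceleratorConfig()` accessor is a hypothetical convenience, not something the PR is known to contain.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the infrav1 API types.
type AcceleratorConfig struct {
	Type  string
	Count int64
}

type GCPMachineSpec struct {
	AcceleratorConfig *AcceleratorConfig
}

type GCPMachine struct {
	Spec GCPMachineSpec
}

// MachineScopeParams no longer carries a separate AcceleratorConfig field.
type MachineScopeParams struct {
	GCPMachine *GCPMachine
}

type MachineScope struct {
	GCPMachine *GCPMachine
}

func NewMachineScope(params MachineScopeParams) (*MachineScope, error) {
	if params.GCPMachine == nil {
		return nil, errors.New("gcp machine is required when creating a MachineScope")
	}
	return &MachineScope{GCPMachine: params.GCPMachine}, nil
}

// AcceleratorConfig (hypothetical accessor) reads the value from the spec,
// so there is a single source of truth. It returns nil when no GPU is requested.
func (m *MachineScope) AcceleratorConfig() *AcceleratorConfig {
	return m.GCPMachine.Spec.AcceleratorConfig
}

func main() {
	machine := &GCPMachine{Spec: GCPMachineSpec{
		AcceleratorConfig: &AcceleratorConfig{Type: "nvidia-tesla-t4", Count: 1},
	}}
	scope, err := NewMachineScope(MachineScopeParams{GCPMachine: machine})
	if err != nil {
		panic(err)
	}
	fmt.Println(scope.AcceleratorConfig().Type)
}
```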

richardcase · 2022-07-19T13:49:01Z

cloud/scope/machine.go

@@ -61,6 +62,15 @@ func NewMachineScope(params MachineScopeParams) (*MachineScope, error) {
 		return nil, errors.New("gcp machine is required when creating a MachineScope")
 	}

+	if params.AcceleratorConfig == nil {


If we are using the // +kubebuilder:default tags then we won't need this. If there is custom defaulting that can't be done via this tag, then we can use the defaulting webhook:

https://book.kubebuilder.io/reference/webhook-overview.html

https://github.com/kubernetes-sigs/cluster-api-provider-gcp/blob/main/api/v1beta1/gcpmachine_webhook.go#L99
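To illustrate the kind of defaulting a static tag cannot express, here is a sketch of a `Default()` method with a cross-field rule. Everything in it is an assumption made for illustration: the types are simplified stand-ins, the rule (prefer TERMINATE when an accelerator is attached, since GPU instances cannot live-migrate) is not taken from the PR, and it presumes the field has no static default tag. In the real provider the method would be invoked through the webhook machinery rather than called directly.

```go
package main

import "fmt"

// Simplified stand-ins for the infrav1 API types; only the fields needed
// for this illustration are included.
type AcceleratorConfig struct {
	Type  string
	Count int64
}

type GCPMachineSpec struct {
	AcceleratorConfig *AcceleratorConfig
	OnHostMaintenance string
}

type GCPMachine struct {
	Spec GCPMachineSpec
}

// Default fills in values that depend on other fields. Machines without an
// accelerator are left untouched.
func (m *GCPMachine) Default() {
	if m.Spec.AcceleratorConfig == nil {
		return
	}
	if m.Spec.AcceleratorConfig.Count == 0 {
		m.Spec.AcceleratorConfig.Count = 1
	}
	// Hypothetical cross-field rule: only applied when the user left the
	// maintenance policy empty.
	if m.Spec.OnHostMaintenance == "" {
		m.Spec.OnHostMaintenance = "TERMINATE"
	}
}

func main() {
	m := &GCPMachine{Spec: GCPMachineSpec{
		AcceleratorConfig: &AcceleratorConfig{Type: "nvidia-tesla-t4"},
	}}
	m.Default()
	fmt.Println(m.Spec.AcceleratorConfig.Count, m.Spec.OnHostMaintenance)
}
```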

richardcase · 2022-07-19T13:49:22Z

cloud/scope/machine.go

+	ClusterGetter     cloud.ClusterGetter
+	Machine           *clusterv1.Machine
+	GCPMachine        *infrav1.GCPMachine
+	AcceleratorConfig *infrav1.AcceleratorConfig


Same comment as above

richardcase · 2022-07-19T13:50:10Z

controllers/gcpmachine_controller.go

 }

+// Supported GPU types.
+var (


This will be handled via the API, see previous comment around the new type for "accelerator type"

richardcase · 2023-01-03T10:46:18Z

I am picking this up again after the holiday break.

JoelSpeed · 2023-01-26T10:46:52Z

api/v1beta1/gcpmachine_types.go

+	// +kubebuilder:validation:Enum=TERMINATE;MIGRATE
+	// +kubebuilder:default=MIGRATE
+	// +optional
+	OnHostMaintenance string `json:"onHostMaintenance,omitempty"`


As you've created a typed string, this field should be using that typed string

Suggested change

-	OnHostMaintenance string `json:"onHostMaintenance,omitempty"`
+	OnHostMaintenance OnHostMaintenance `json:"onHostMaintenance,omitempty"`

Also, and I'm not sure how much this project will agree, Kube API conventions would recommend that the constant values here are Terminate and Migrate, matching the PascalCase convention for enums, rather than matching the cloud provider values

Not sure what the maintainers prefer but thought I should mention

Thanks for raising this @JoelSpeed...especially considering the conversation on the confidential compute PR.

Which reminds me, I need to get the image-builder PR fixed/merged.
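For reference, the PascalCase suggestion from this thread could look roughly like the sketch below. The constant names and the `toGCPValue` helper are hypothetical, and whether the project adopts this convention was left open in the discussion; the upper-case strings are the provider values already listed in the enum marker of the diff.

```go
package main

import "fmt"

// OnHostMaintenance is the typed string used by the field.
// +kubebuilder:validation:Enum=Migrate;Terminate
type OnHostMaintenance string

const (
	// PascalCase API values, as Kube API conventions recommend for enums.
	OnHostMaintenanceMigrate   OnHostMaintenance = "Migrate"
	OnHostMaintenanceTerminate OnHostMaintenance = "Terminate"
)

// toGCPValue (hypothetical) translates the API value into the string the
// cloud provider expects, so the two vocabularies stay decoupled.
func toGCPValue(p OnHostMaintenance) (string, error) {
	switch p {
	case OnHostMaintenanceMigrate:
		return "MIGRATE", nil
	case OnHostMaintenanceTerminate:
		return "TERMINATE", nil
	}
	return "", fmt.Errorf("unsupported onHostMaintenance value %q", p)
}

func main() {
	v, err := toGCPValue(OnHostMaintenanceTerminate)
	fmt.Println(v, err)
}
```

The trade-off is one small translation function in exchange for an API surface that does not change if the provider renames its values.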

k8s-triage-robot · 2023-04-26T11:42:48Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2023-05-26T11:59:10Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

richardcase · 2023-06-01T09:03:16Z

/remove-lifecycle rotten

- Add AcceleratorConfig
- Add Default Accelerator types
- Add webhooks for the new types
- Add validations for the new types
- Add conversion between different API types
- Add GCPMachine CRD
- Add GPU Cluster template
- Update package
- Add cluster template for GPU instance
- Add unit tests for accelerator config
- Documentation for GPU enabled cluster
- E2E test for GPU enabled cluster
- Add standby e2e test

Signed-off-by: Aniruddha Basak <codewithaniruddha@gmail.com>

k8s-ci-robot · 2023-10-12T04:20:46Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SubhasmitaSw
Once this PR has been reviewed and has the lgtm label, please ask for approval from richardcase. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2023-10-12T04:21:27Z

@SubhasmitaSw: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cluster-api-provider-gcp-apidiff	`244060d`	link	false	`/test pull-cluster-api-provider-gcp-apidiff`
pull-cluster-api-provider-gcp-test	`244060d`	link	true	`/test pull-cluster-api-provider-gcp-test`
pull-cluster-api-provider-gcp-build	`244060d`	link	true	`/test pull-cluster-api-provider-gcp-build`
pull-cluster-api-provider-gcp-verify	`244060d`	link	true	`/test pull-cluster-api-provider-gcp-verify`
pull-cluster-api-provider-gcp-e2e-test	`244060d`	link	true	`/test pull-cluster-api-provider-gcp-e2e-test`
pull-cluster-api-provider-gcp-make	`244060d`	link	true	`/test pull-cluster-api-provider-gcp-make`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

k8s-triage-robot · 2024-01-22T07:24:53Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-02-21T07:45:32Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

richardcase · 2024-02-21T10:30:40Z

/remove-lifecycle rotten

I will pick up the image building side so that we can get this merged.

k8s-triage-robot · 2024-03-22T11:27:52Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2024-03-22T11:27:56Z

@k8s-triage-robot: Closed this PR.

richardcase · 2024-03-22T14:00:42Z

/reopen

k8s-ci-robot · 2024-03-22T14:00:47Z

@richardcase: Reopened this PR.

k8s-triage-robot · 2024-04-21T14:08:54Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2024-04-21T14:08:59Z

@k8s-triage-robot: Closed this PR.

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 14, 2022

k8s-ci-robot requested review from cpanato and sbueringer July 14, 2022 06:55

SubhasmitaSw force-pushed the gcp-gpu-api branch 2 times, most recently from f777d9f to b12ccd7 Compare July 14, 2022 09:22

aniruddha2000 reviewed Jul 14, 2022

api/v1beta1/gcpmachine_types.go

api/v1beta1/gcpmachine_types.go (outdated)

controllers/gcpmachine_controller.go (outdated)

SubhasmitaSw force-pushed the gcp-gpu-api branch from 7aca1ba to 0ec3e07 Compare July 15, 2022 14:42

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 19, 2022

richardcase requested changes Jul 19, 2022

k8s-ci-robot assigned richardcase Jul 19, 2022

aniruddha2000 force-pushed the gcp-gpu-api branch from 6954d6b to b008844 Compare July 22, 2022 17:09

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 22, 2022

SubhasmitaSw force-pushed the gcp-gpu-api branch from 8bb605f to b008844 Compare July 22, 2022 17:58

SubhasmitaSw force-pushed the gcp-gpu-api branch from 797adcf to 350b655 Compare July 22, 2022 18:04

aniruddha2000 force-pushed the gcp-gpu-api branch from 350b655 to e623d64 Compare July 22, 2022 18:18

k8s-ci-robot removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 22, 2022

eranco74 mentioned this pull request Jan 25, 2023

Add support for confidential compute #809

Merged

3 tasks

JoelSpeed reviewed Jan 26, 2023

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2023

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 26, 2023

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 1, 2023

aniruddha2000 force-pushed the gcp-gpu-api branch from d33fd97 to 244060d Compare October 12, 2023 04:20

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 21, 2024

k8s-ci-robot closed this Mar 22, 2024

k8s-ci-robot reopened this Mar 22, 2024

k8s-ci-robot closed this Apr 21, 2024

nicolas2bonfils mentioned this pull request Aug 23, 2024

GPU support #289

Open
