[WIP] GPU support #655

Closed
wants to merge 1 commit into from

Contributor

@SubhasmitaSw SubhasmitaSw commented Jul 14, 2022

What type of PR is this?
/kind api-change
/kind feature

What this PR does / why we need it:
Adds the API changes needed to manage GPU acceleration for GCP instances in CAPG.

Special notes for your reviewer:
This is a WIP PR; additional controller changes will be added in follow-up commits.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

cc @richardcase @dims @cpanato

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 14, 2022
@k8s-ci-robot
Contributor

Hi @SubhasmitaSw. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 14, 2022
@SubhasmitaSw SubhasmitaSw force-pushed the gcp-gpu-api branch 2 times, most recently from f777d9f to b12ccd7 Compare July 14, 2022 09:22
api/v1beta1/gcpmachine_types.go Show resolved Hide resolved
api/v1beta1/gcpmachine_types.go Outdated Show resolved Hide resolved
controllers/gcpmachine_controller.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 19, 2022
Member

@richardcase richardcase left a comment

Great work, let's chat via Zoom about the comments.

type AcceleratorConfig struct {
// Type is the type of the GPU accelerator to be used for the GCP machine.
// +required
Type string `json:"acceleratorType"`
Member

We discussed having a new type for the accelerator type, and then constants with the allowed values. Like DiskType.

Member

We should probably also use // +kubebuilder:validation:Enum
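
A minimal sketch of what the two comments above suggest: a dedicated string type for the accelerator type, constants for the allowed values, and an Enum marker on the type. The names `AcceleratorType`, `AcceleratorTypeTeslaK80`, `AcceleratorTypeTeslaT4` and `IsSupported` are assumptions for illustration, and the two values are only the ones mentioned in this thread, not a complete list.

```go
package main

import "fmt"

// AcceleratorType is a typed string for the GPU model (hypothetical name),
// following the same pattern the review mentions for DiskType.
// The enum below lists only the values mentioned in this thread.
// +kubebuilder:validation:Enum=nvidia-tesla-k80;nvidia-tesla-t4
type AcceleratorType string

const (
	// AcceleratorTypeTeslaK80 selects the nvidia-tesla-k80 accelerator.
	AcceleratorTypeTeslaK80 AcceleratorType = "nvidia-tesla-k80"
	// AcceleratorTypeTeslaT4 selects the nvidia-tesla-t4 accelerator.
	AcceleratorTypeTeslaT4 AcceleratorType = "nvidia-tesla-t4"
)

// IsSupported mirrors the enum check in plain Go so it can be unit tested
// without a running API server (hypothetical helper).
func (t AcceleratorType) IsSupported() bool {
	switch t {
	case AcceleratorTypeTeslaK80, AcceleratorTypeTeslaT4:
		return true
	}
	return false
}

func main() {
	fmt.Println(AcceleratorTypeTeslaT4.IsSupported())
	fmt.Println(AcceleratorType("made-up-gpu").IsSupported())
}
```

With the marker on the type, the generated CRD schema should reject unknown values at admission time, so a separate list of supported GPU types in the controller would no longer be needed.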


const (
// DefaultAcceleratorType is the default type of GPU accelerator to be used for the GCP machine if not specified.
DefaultAcceleratorType = "nvidia-tesla-k80"
Member

Do we want to have a default accelerator type? Or should we make sure the user chooses?

Contributor

Since we have nvidia-tesla-t4 as a default accelerator type, does that affect the reconciliation code here in the code base?
For example, if we make nvidia-tesla-k80 the default here, what would the difference be?

Member

I would say that if a user chooses to add an accelerator, they should be required to say which type instead of relying on a default. So: mark it required and have no default.
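
A rough sketch of the "required, no default" option, assuming the accelerator block itself stays optional on the machine spec. `validateAccelerator` is a hypothetical helper, not code from this PR; it only illustrates the rule that an accelerator without a type is rejected rather than defaulted.

```go
package main

import (
	"errors"
	"fmt"
)

// AcceleratorConfig mirrors the struct under review, with Type marked
// required and DefaultAcceleratorType dropped.
type AcceleratorConfig struct {
	// Type is the type of the GPU accelerator to be used for the GCP machine.
	// +kubebuilder:validation:Required
	Type string `json:"acceleratorType"`

	// Count is the number of accelerators to be used for the GCP machine.
	// +optional
	Count int64 `json:"acceleratorCount,omitempty"`
}

// validateAccelerator (hypothetical) rejects a configured accelerator that
// does not name a type. A nil config means no accelerator was requested.
func validateAccelerator(cfg *AcceleratorConfig) error {
	if cfg == nil {
		return nil
	}
	if cfg.Type == "" {
		return errors.New("acceleratorType is required when an accelerator is configured")
	}
	return nil
}

func main() {
	fmt.Println(validateAccelerator(nil))
	fmt.Println(validateAccelerator(&AcceleratorConfig{Count: 1}))
	fmt.Println(validateAccelerator(&AcceleratorConfig{Type: "nvidia-tesla-t4"}))
}
```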

// Count is the number of accelerators to be used for the GCP machine.
// Defaults to 1.
// +optional
Count int64 `json:"acceleratorCount,omitempty"`
Member

We can probably use // +kubebuilder:default:=1 here.

Member

A few other points:

  • We should set a minimum value (// +kubebuilder:validation:Minimum)
  • Do we need this to be int64?
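
The suggestions for Count could be combined roughly as follows. This is a sketch under assumptions: the narrower `int32` type is one possible answer to the int64 question, and `normalizeCount` is a hypothetical helper that imitates in plain Go what the default and Minimum markers would do in the generated CRD schema.

```go
package main

import "fmt"

// AcceleratorConfig shows only the Count field under discussion.
type AcceleratorConfig struct {
	// Count is the number of accelerators to be used for the GCP machine.
	// +kubebuilder:default:=1
	// +kubebuilder:validation:Minimum=1
	// +optional
	Count int32 `json:"acceleratorCount,omitempty"`
}

// normalizeCount (hypothetical) imitates the schema behaviour for unit tests:
// an unset (zero) count becomes 1, and anything below 1 is rejected.
func normalizeCount(c int32) (int32, error) {
	if c == 0 {
		return 1, nil
	}
	if c < 1 {
		return 0, fmt.Errorf("acceleratorCount must be at least 1, got %d", c)
	}
	return c, nil
}

func main() {
	cfg := AcceleratorConfig{}
	n, err := normalizeCount(cfg.Count)
	fmt.Println(n, err)
}
```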

Contributor Author

Currently, the validation of accelerators during instance creation does not allow int to be used. If we change the static validation function to a conditional check during reconciliation, we can use int.

Member

As a principle, our API doesn't have to match the GCP API exactly, so we can use different data types if needed (as long as we can cast without losing data).
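
To illustrate the casting point, here is a small sketch that keeps a narrower count in the CAPG API type and widens it only when building the cloud request. `gcpAccelerator` is a local stand-in for the GCP compute type (assumed here to carry an int64 count), and `toGCP` and `countFromGCP` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
)

// gcpAccelerator is a local stand-in for the cloud provider's request type.
type gcpAccelerator struct {
	AcceleratorCount int64
	AcceleratorType  string
}

// AcceleratorConfig is the CAPG-side type, using int32 for the count.
type AcceleratorConfig struct {
	Type  string
	Count int32
}

// toGCP widens int32 to int64, which can never lose data.
func toGCP(cfg AcceleratorConfig) gcpAccelerator {
	return gcpAccelerator{
		AcceleratorCount: int64(cfg.Count),
		AcceleratorType:  cfg.Type,
	}
}

// countFromGCP narrows int64 to int32 and refuses values that would overflow,
// so the conversion back is also lossless or fails loudly.
func countFromGCP(n int64) (int32, error) {
	if n < math.MinInt32 || n > math.MaxInt32 {
		return 0, fmt.Errorf("accelerator count %d does not fit in int32", n)
	}
	return int32(n), nil
}

func main() {
	out := toGCP(AcceleratorConfig{Type: "nvidia-tesla-t4", Count: 2})
	fmt.Println(out.AcceleratorCount, out.AcceleratorType)
}
```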

api/v1beta1/gcpmachine_types.go Outdated Show resolved Hide resolved
api/v1beta1/gcpmachine_types.go Show resolved Hide resolved
@@ -102,6 +135,10 @@ type GCPMachineSpec struct {
// +optional
AdditionalMetadata []MetadataItem `json:"additionalMetadata,omitempty"`

// BuildName is the name of the build to use for the GCP instance.
// +optional
BuildName string `json:"buildName,omitempty"`
Member

What's the purpose of BuildName? I think we probably don't need this.

ClusterGetter cloud.ClusterGetter
Machine *clusterv1.Machine
GCPMachine *infrav1.GCPMachine
AcceleratorConfig *infrav1.AcceleratorConfig
Member

The AcceleratorConfig is already available via the GCPMachine

Contributor

Yes, we can access it via params.GCPMachine.Spec.AcceleratorConfig.
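
A simplified sketch of that point, using local stand-in types rather than the real infrav1 and scope packages: the scope params carry only the GCPMachine, and the accelerator settings are read from its spec. The `Accelerator` accessor is a hypothetical convenience method, not part of this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the infrav1 types; only the fields needed here are shown.
type AcceleratorConfig struct {
	Type  string
	Count int64
}

type GCPMachineSpec struct {
	AcceleratorConfig *AcceleratorConfig
}

type GCPMachine struct {
	Spec GCPMachineSpec
}

// MachineScopeParams no longer needs its own AcceleratorConfig field.
type MachineScopeParams struct {
	GCPMachine *GCPMachine
}

type MachineScope struct {
	GCPMachine *GCPMachine
}

func NewMachineScope(params MachineScopeParams) (*MachineScope, error) {
	if params.GCPMachine == nil {
		return nil, errors.New("gcp machine is required when creating a MachineScope")
	}
	return &MachineScope{GCPMachine: params.GCPMachine}, nil
}

// Accelerator (hypothetical) returns the config from the machine spec.
// It is nil when no accelerator was requested.
func (m *MachineScope) Accelerator() *AcceleratorConfig {
	return m.GCPMachine.Spec.AcceleratorConfig
}

func main() {
	machine := &GCPMachine{Spec: GCPMachineSpec{
		AcceleratorConfig: &AcceleratorConfig{Type: "nvidia-tesla-t4", Count: 1},
	}}
	scope, err := NewMachineScope(MachineScopeParams{GCPMachine: machine})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(scope.Accelerator().Type)
}
```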

@@ -61,6 +62,15 @@ func NewMachineScope(params MachineScopeParams) (*MachineScope, error) {
return nil, errors.New("gcp machine is required when creating a MachineScope")
}

if params.AcceleratorConfig == nil {
Member

If we are using the // +kubebuilder:default tags then we won't need this. If there is custom defaulting that can't be done via this tag, then we can use the defaulting webhook.
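
As a sketch of defaulting that a static marker cannot express, the method below picks a value that depends on another field. The types are local stand-ins, the rule itself (prefer TERMINATE when an accelerator is attached) is an illustrative assumption, and the sketch assumes the field carries no static default marker, so an empty string means "unset". Older controller-runtime releases call a `Default()` method of this shape from a defaulting webhook; treat the exact interface as an assumption for the version in use.

```go
package main

import "fmt"

// Stand-ins for the infrav1 types; only the fields needed here are shown.
type AcceleratorConfig struct {
	Type  string
	Count int64
}

type GCPMachineSpec struct {
	AcceleratorConfig *AcceleratorConfig
	OnHostMaintenance string
}

type GCPMachine struct {
	Spec GCPMachineSpec
}

// Default fills OnHostMaintenance only when the user left it empty.
// The chosen value depends on whether an accelerator is configured,
// which is why a fixed +kubebuilder:default marker would not be enough.
func (m *GCPMachine) Default() {
	if m.Spec.OnHostMaintenance != "" {
		return // respect an explicit user choice
	}
	if m.Spec.AcceleratorConfig != nil {
		m.Spec.OnHostMaintenance = "TERMINATE"
		return
	}
	m.Spec.OnHostMaintenance = "MIGRATE"
}

func main() {
	gpu := &GCPMachine{Spec: GCPMachineSpec{
		AcceleratorConfig: &AcceleratorConfig{Type: "nvidia-tesla-t4", Count: 1},
	}}
	gpu.Default()
	fmt.Println(gpu.Spec.OnHostMaintenance)
}
```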

ClusterGetter cloud.ClusterGetter
Machine *clusterv1.Machine
GCPMachine *infrav1.GCPMachine
AcceleratorConfig *infrav1.AcceleratorConfig
Member

Same comment as above

}

// Supported GPU types.
var (
Member

This will be handled via the API, see previous comment around the new type for "accelerator type"

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 22, 2022
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 22, 2022
@k8s-ci-robot k8s-ci-robot removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 22, 2022
@richardcase
Member

I am picking this up again after the holiday break.

// +kubebuilder:validation:Enum=TERMINATE;MIGRATE
// +kubebuilder:default=MIGRATE
// +optional
OnHostMaintenance string `json:"onHostMaintenance,omitempty"`

As you've created a typed string, this field should be using that typed string

Suggested change
OnHostMaintenance string `json:"onHostMaintenance,omitempty"`
OnHostMaintenance OnHostMaintenance `json:"onHostMaintenance,omitempty"`

Also, and I'm not sure how much this project will agree, Kube API conventions would recommend that the constant values here are Terminate and Migrate, matching the PascalCase convention for enums, rather than matching the cloud provider values

Not sure what the maintainers prefer but thought I should mention
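
If the maintainers go with the PascalCase option, it could look roughly like this sketch: the API exposes `Terminate` and `Migrate`, and a small mapping produces the cloud provider's upper-case values when the instance is created. The constant names and the `toGCP` helper are assumptions for illustration.

```go
package main

import "fmt"

// OnHostMaintenance is the typed string used by the API field.
// +kubebuilder:validation:Enum=Terminate;Migrate
type OnHostMaintenance string

const (
	// OnHostMaintenanceTerminate stops the instance during host maintenance.
	OnHostMaintenanceTerminate OnHostMaintenance = "Terminate"
	// OnHostMaintenanceMigrate live-migrates the instance during host maintenance.
	OnHostMaintenanceMigrate OnHostMaintenance = "Migrate"
)

// toGCP (hypothetical) translates the API value into the provider's spelling.
func (o OnHostMaintenance) toGCP() (string, error) {
	switch o {
	case OnHostMaintenanceTerminate:
		return "TERMINATE", nil
	case OnHostMaintenanceMigrate:
		return "MIGRATE", nil
	}
	return "", fmt.Errorf("unsupported onHostMaintenance value %q", string(o))
}

func main() {
	v, err := OnHostMaintenanceMigrate.toGCP()
	fmt.Println(v, err)
}
```

The mapping keeps the user-facing API consistent with other Kubernetes enums, at the cost of one translation step in the controller.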

Member

Thanks for raising this @JoelSpeed... especially considering the conversation on the confidential compute PR.

Which reminds me, I need to get the image-builder PR fixed/merged....

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2023
@k8s-triage-robot

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 26, 2023
@richardcase
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 1, 2023
- Add AcceleratorConfig
- Add Default Accelerator types
- Add webhooks for the new types
- Add validations for the new types
- Add conversion between different API types
- Add GCPMachine CRD
- Add GPU Cluster template
- Update package
- Add cluster template for GPU instance
- Add unit tests for accelerator config
- Documentation for GPU enabled cluster
- E2E test for GPU enabled cluster
- Add standby e2e test

Signed-off-by: Aniruddha Basak <codewithaniruddha@gmail.com>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SubhasmitaSw
Once this PR has been reviewed and has the lgtm label, please ask for approval from richardcase. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

@SubhasmitaSw: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-gcp-apidiff | 244060d | link | false | /test pull-cluster-api-provider-gcp-apidiff |
| pull-cluster-api-provider-gcp-test | 244060d | link | true | /test pull-cluster-api-provider-gcp-test |
| pull-cluster-api-provider-gcp-build | 244060d | link | true | /test pull-cluster-api-provider-gcp-build |
| pull-cluster-api-provider-gcp-verify | 244060d | link | true | /test pull-cluster-api-provider-gcp-verify |
| pull-cluster-api-provider-gcp-e2e-test | 244060d | link | true | /test pull-cluster-api-provider-gcp-e2e-test |
| pull-cluster-api-provider-gcp-make | 244060d | link | true | /test pull-cluster-api-provider-gcp-make |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@k8s-triage-robot

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 21, 2024
@richardcase
Member

/remove-lifecycle rotten

I will pick up the image building side so that we can get this merged.

@k8s-triage-robot

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

@richardcase
Member

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Mar 22, 2024
@k8s-ci-robot
Contributor

@richardcase: Reopened this PR.

@k8s-triage-robot

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

@nicolas2bonfils nicolas2bonfils mentioned this pull request Aug 23, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
10 participants