Support processing resource types other than GPU #75

Merged · 7 commits · Jan 15, 2019
cmd/mpi-operator/main.go (15 additions & 5 deletions)
```diff
@@ -29,11 +29,13 @@ import (
 )
 
 var (
-	masterURL            string
-	kubeConfig           string
-	gpusPerNode          int
-	kubectlDeliveryImage string
-	namespace            string
+	masterURL              string
+	kubeConfig             string
+	gpusPerNode            int
+	processingUnitsPerNode int
+	processingResourceType string
+	kubectlDeliveryImage   string
+	namespace              string
 )
 
 func main() {
```
```diff
@@ -78,6 +80,8 @@ func main() {
 		kubeInformerFactory.Batch().V1().Jobs(),
 		kubeflowInformerFactory.Kubeflow().V1alpha1().MPIJobs(),
 		gpusPerNode,
+		processingUnitsPerNode,
+		processingResourceType,
 		kubectlDeliveryImage)
 
 	go kubeInformerFactory.Start(stopCh)
```
```diff
@@ -98,4 +102,10 @@ func init() {
 		"The maximum number of GPUs available per node. Note that this will be ignored if the GPU resources are explicitly specified in the MPIJob pod spec.")
 	flag.StringVar(&kubectlDeliveryImage, "kubectl-delivery-image", "", "The container image used to deliver the kubectl binary.")
 	flag.StringVar(&namespace, "namespace", "", "The namespace used to obtain the listers.")
+	flag.IntVar(
+		&processingUnitsPerNode,
+		"processing-units-per-node",
+		1,
+		"The maximum number of processing units available per node. Note that this will be ignored if the processing resources are explicitly specified in the MPIJob pod spec.")
+	flag.StringVar(&processingResourceType, "processing-resource-type", "nvidia.com/gpu", "The compute resource name, e.g. 'nvidia.com/gpu' or 'cpu'.")
 }
```
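The review thread below mentions that the new flag is validated by a `convertProcessingResourceType()` function. As a rough sketch of what such a check could look like (the accepted set here is illustrative, taken from the flag's help text; this is not the PR's actual implementation):

```go
package main

import "fmt"

// convertProcessingResourceType sketches the validation the review
// thread refers to: reject --processing-resource-type values the
// controller does not know how to schedule. Only the two values named
// in the flag's help text are accepted in this illustration.
func convertProcessingResourceType(t string) (string, error) {
	switch t {
	case "nvidia.com/gpu", "cpu":
		return t, nil
	default:
		return "", fmt.Errorf("unsupported processing resource type: %q", t)
	}
}

func main() {
	for _, t := range []string{"cpu", "memory"} {
		if _, err := convertProcessingResourceType(t); err != nil {
			fmt.Println(err)
		} else {
			fmt.Println(t, "accepted")
		}
	}
}
```

Failing fast in `init()`/`main()` keeps a typo in the flag from surfacing later as confusing scheduling behavior.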
deploy/0-crd.yaml (31 additions & 1 deletion)
```diff
@@ -18,7 +18,7 @@ spec:
       properties:
         spec:
           title: The MPIJob spec
-          description: Either `gpus` or `replicas` should be specified, but not both
+          description: Only one of `gpus`, `processingUnits`, or `replicas` should be specified
           oneOf:
           - properties:
               gpus:
```
The second hunk adds `slotsPerWorker` to each branch and begins the new `processingUnits` branch (a review thread is attached to the `processingUnits` line):

```diff
@@ -33,13 +33,43 @@ spec:
                 - type: integer
                   multipleOf: 8
                   minimum: 8
+              slotsPerWorker:
+                title: The number of slots per worker used in hostfile
+                description: Defaults to the number of processing units per worker
+                type: integer
+                minimum: 1
             required:
             - gpus
+          - properties:
+              processingUnits:
```
Member: Should you also add a check for `processingResourceType`?

Member (Author): That's part of main.go as a flag, since I thought it was more related to the spec of the cluster where the controller gets deployed, along with `processingUnitsPerNode`. The check is done as part of `convertProcessingResourceType()`. Or maybe this is more specific to MPIJob users (if they want to specify a custom resource) and I should move it to `MPIJobSpec`?

Member: Oh right. Based on the experience with the `gpusPerNode` flag, it seems easier for users to understand if we put the options in the CRD. Maybe take them out of the operator flags and just use the job spec?

Member (Author): I'll keep those flags for now for backwards compatibility and then add fields to `MPIJobSpec` that override the flags when specified. What do you think?

Member: OK, sounds good.
The hunk continues:

```diff
+                title: Total number of processing units
+                description: Valid values are 1, 2, 4, or any multiple of 8
+                oneOf:
+                - type: integer
+                  enum:
+                  - 1
+                  - 2
+                  - 4
+                - type: integer
+                  multipleOf: 8
+                  minimum: 8
+              slotsPerWorker:
+                title: The number of slots per worker used in hostfile
+                description: Defaults to the number of processing units per worker
+                type: integer
+                minimum: 1
+            required:
+            - processingUnits
           - properties:
               replicas:
                 title: Total number of replicas
                 description: The GPU resource limit should be specified for each replica
                 type: integer
                 minimum: 1
+              slotsPerWorker:
+                title: The number of slots per worker used in hostfile
+                description: Defaults to the number of processing units per worker
+                type: integer
+                minimum: 1
             required:
             - replicas
```
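Under the extended schema, a job that requests generic processing units rather than `gpus` might look like the following. This manifest is illustrative only — the name, image, and counts are made up, not taken from the PR:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: example-mpijob        # illustrative name
spec:
  processingUnits: 2          # satisfies the enum branch (1, 2, or 4)
  slotsPerWorker: 1           # optional; defaults to processing units per worker
  template:
    spec:
      containers:
      - name: example
        image: example/mpi-benchmark:latest   # illustrative image
```

Because the schema uses `oneOf`, adding `gpus` or `replicas` alongside `processingUnits` would make the manifest fail validation.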
pkg/apis/kubeflow/v1alpha1/types.go (14 additions & 3 deletions)
```diff
@@ -40,16 +40,27 @@ type MPIJobList struct {
 type MPIJobSpec struct {
 	// Specifies the desired number of GPUs the MPIJob should run on.
 	// Mutually exclusive with the `Replicas` field.
+	// Note that this is deprecated in favor of `ProcessingUnits` field.
 	// +optional
 	GPUs *int32 `json:"gpus,omitempty"`
 
+	// Specifies the desired number of processing units the MPIJob should run on.
+	// Mutually exclusive with the `Replicas` field.
+	// +optional
+	ProcessingUnits *int32 `json:"processingUnits,omitempty"`
+
+	// Specifies the number of slots per worker used in hostfile.
+	// Defaults to the number of processing units per worker.
+	// +optional
+	SlotsPerWorker *int32 `json:"slotsPerWorker,omitempty"`
+
 	// Run the launcher on the master.
-	// Optional: Default to false
+	// Defaults to false.
 	// +optional
 	LauncherOnMaster bool `json:"launcherOnMaster,omitempty"`
 
 	// Specifies the number of retries before marking this job failed.
-	// Defaults to 6
+	// Defaults to 6.
 	// +optional
 	BackoffLimit *int32 `json:"backoffLimit,omitempty"`
```
Expand All @@ -61,7 +72,7 @@ type MPIJobSpec struct {

// Specifies the desired number of replicas the MPIJob should run on.
// The `PodSpec` should specify the number of GPUs.
// Mutually exclusive with the `GPUs` field.
// Mutually exclusive with the `GPUs` or `ProcessingUnits` fields.
// +optional
Replicas *int32 `json:"replicas,omitempty"`

Expand Down
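The new `SlotsPerWorker` field documents its default as "the number of processing units per worker". A minimal sketch of that defaulting rule, with hypothetical names (this is not the operator's actual code):

```go
package main

import "fmt"

// slotsPerWorker sketches the defaulting described in the field's doc
// comment: an explicit SlotsPerWorker value wins; when it is unset
// (nil), fall back to the number of processing units assigned to each
// worker. Both parameter names here are illustrative.
func slotsPerWorker(slots *int32, processingUnitsPerWorker int32) int32 {
	if slots != nil {
		return *slots
	}
	return processingUnitsPerWorker
}

func main() {
	four := int32(4)
	fmt.Println(slotsPerWorker(nil, 2))   // unset: defaults to units per worker
	fmt.Println(slotsPerWorker(&four, 2)) // explicit value wins
}
```

The same pointer-means-optional pattern is what lets the spec fields override the operator flags, as agreed in the review thread above.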
pkg/apis/kubeflow/v1alpha1/zz_generated.deepcopy.go (10 additions)

Generated deepcopy code; not rendered by default.